[PATCH v16 06/17] Move member destructor_arg before member dataref
From: Xin Xiaohui <xiaohui@intel.com>

Then we can clear destructor_arg when __alloc_skb().

Signed-off-by: Xin Xiaohui <xiaohui@intel.com>
Signed-off-by: Zhao Yu <yzhao81...@gmail.com>
Reviewed-by: Jeff Dike <jd...@linux.intel.com>
---
 include/linux/skbuff.h |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 696e690..6e1e991 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -199,14 +199,15 @@ struct skb_shared_info {
 	struct sk_buff	*frag_list;
 	struct skb_shared_hwtstamps hwtstamps;
 
+	/* Intermediate layers must ensure that destructor_arg
+	 * remains valid until skb destructor */
+	void		*destructor_arg;
+
 	/*
 	 * Warning : all fields before dataref are cleared in __alloc_skb()
 	 */
 	atomic_t	dataref;
 
-	/* Intermediate layers must ensure that destructor_arg
-	 * remains valid until skb destructor */
-	void		*destructor_arg;
 	/* must be last field, see pskb_expand_head() */
 	skb_frag_t	frags[MAX_SKB_FRAGS];
 };
-- 
1.7.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH v16 05/17] Add a function to indicate if a device uses external buffers.
From: Xin Xiaohui <xiaohui@intel.com>

Signed-off-by: Xin Xiaohui <xiaohui@intel.com>
Signed-off-by: Zhao Yu <yzhao81...@gmail.com>
Reviewed-by: Jeff Dike <jd...@linux.intel.com>
---
 include/linux/netdevice.h |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 8dcf6de..f91d9bb 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1739,6 +1739,11 @@ extern gro_result_t	napi_gro_frags(struct napi_struct *napi);
 extern int netdev_mp_port_prep(struct net_device *dev,
 				struct mp_port *port);
 
+static inline bool dev_is_mpassthru(struct net_device *dev)
+{
+	return dev && dev->mp_port;
+}
+
 static inline void napi_free_frags(struct napi_struct *napi)
 {
 	kfree_skb(napi->skb);
-- 
1.7.3
[PATCH v16 08/17] Modify netdev_free_page() to release external buffer
From: Xin Xiaohui <xiaohui@intel.com>

Currently, it can get external buffers from the mp device.

Signed-off-by: Xin Xiaohui <xiaohui@intel.com>
Signed-off-by: Zhao Yu <yzhao81...@gmail.com>
Reviewed-by: Jeff Dike <jd...@linux.intel.com>
---
 include/linux/skbuff.h |    4 +++-
 net/core/skbuff.c      |   24 ++++++++++++++++++++++++
 2 files changed, 27 insertions(+), 1 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 6e1e991..6309ce6 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1586,9 +1586,11 @@ static inline struct page *netdev_alloc_page(struct net_device *dev)
 	return __netdev_alloc_page(dev, GFP_ATOMIC);
 }
 
+extern void __netdev_free_page(struct net_device *dev, struct page *page);
+
 static inline void netdev_free_page(struct net_device *dev, struct page *page)
 {
-	__free_page(page);
+	__netdev_free_page(dev, page);
 }
 
 /**
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b9858c7..d3ece5c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -298,6 +298,30 @@ struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(__netdev_alloc_page);
 
+void netdev_free_ext_page(struct net_device *dev, struct page *page)
+{
+	struct skb_ext_page *ext_page = NULL;
+	if (dev_is_mpassthru(dev) && dev->mp_port->hash) {
+		ext_page = dev->mp_port->hash(dev, page);
+		if (ext_page)
+			ext_page->dtor(ext_page);
+		else
+			__free_page(page);
+	}
+}
+EXPORT_SYMBOL(netdev_free_ext_page);
+
+void __netdev_free_page(struct net_device *dev, struct page *page)
+{
+	if (dev_is_mpassthru(dev)) {
+		netdev_free_ext_page(dev, page);
+		return;
+	}
+
+	__free_page(page);
+}
+EXPORT_SYMBOL(__netdev_free_page);
+
 void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
 		     int size)
 {
-- 
1.7.3
[PATCH v16 10/17] If device is in zero-copy mode first, bonding will fail.
From: Xin Xiaohui <xiaohui@intel.com>

If the device is already in zero-copy mode when enslaving is attempted,
we cannot handle it, so fail the enslave; that is what this patch does.
If the bonding device is created first, and one of the slaves is later
put into zero-copy mode, that is handled by the mp device: it first
checks whether all the slaves have the zero-copy capability. If not, it
fails too. Otherwise it puts all the slaves into zero-copy mode.

Signed-off-by: Xin Xiaohui <xiaohui@intel.com>
---
 drivers/net/bonding/bond_main.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 3b16f62..dfb6a2c 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1428,6 +1428,10 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 			   bond_dev->name);
 	}
 
+	/* if the device is in zero-copy mode before bonding, fail it. */
+	if (dev_is_mpassthru(slave_dev))
+		return -EBUSY;
+
 	/* already enslaved */
 	if (slave_dev->flags & IFF_SLAVE) {
 		pr_debug("Error, Device was already enslaved\n");
-- 
1.7.3
[PATCH v16 11/17] Add a hook to intercept external buffers from NIC driver.
From: Xin Xiaohui <xiaohui@intel.com>

The hook is called in __netif_receive_skb().

Signed-off-by: Xin Xiaohui <xiaohui@intel.com>
Signed-off-by: Zhao Yu <yzhao81...@gmail.com>
Reviewed-by: Jeff Dike <jd...@linux.intel.com>
---
 net/core/dev.c |   40 ++++++++++++++++++++++++++++++++++++++++
 1 files changed, 40 insertions(+), 0 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 84fbb83..bdad1c8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2814,6 +2814,40 @@ int __skb_bond_should_drop(struct sk_buff *skb, struct net_device *master)
 }
 EXPORT_SYMBOL(__skb_bond_should_drop);
 
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+/* Add a hook to intercept mediate passthru(zero-copy) packets,
+ * and insert it to the socket queue owned by mp_port specially.
+ */
+static inline struct sk_buff *handle_mpassthru(struct sk_buff *skb,
+					       struct packet_type **pt_prev,
+					       int *ret,
+					       struct net_device *orig_dev)
+{
+	struct mp_port *mp_port = NULL;
+	struct sock *sk = NULL;
+
+	if (!dev_is_mpassthru(skb->dev) && !dev_is_mpassthru(orig_dev))
+		return skb;
+	if (dev_is_mpassthru(skb->dev))
+		mp_port = skb->dev->mp_port;
+	else if (orig_dev->master == skb->dev && dev_is_mpassthru(orig_dev))
+		mp_port = orig_dev->mp_port;
+
+	if (*pt_prev) {
+		*ret = deliver_skb(skb, *pt_prev, orig_dev);
+		*pt_prev = NULL;
+	}
+
+	sk = mp_port->sock->sk;
+	skb_queue_tail(&sk->sk_receive_queue, skb);
+	sk->sk_state_change(sk);
+
+	return NULL;
+}
+#else
+#define handle_mpassthru(skb, pt_prev, ret, orig_dev) (skb)
+#endif
+
 static int __netif_receive_skb(struct sk_buff *skb)
 {
 	struct packet_type *ptype, *pt_prev;
@@ -2891,6 +2925,11 @@ static int __netif_receive_skb(struct sk_buff *skb)
 ncls:
 #endif
 
+	/* To intercept mediate passthru(zero-copy) packets here */
+	skb = handle_mpassthru(skb, &pt_prev, &ret, orig_dev);
+	if (!skb)
+		goto out;
+
 	/* Handle special case of bridge or macvlan */
 	rx_handler = rcu_dereference(skb->dev->rx_handler);
 	if (rx_handler) {
@@ -2983,6 +3022,7 @@ err:
 EXPORT_SYMBOL(netdev_mp_port_prep);
 #endif
 
+
 /**
  *	netif_receive_skb - process receive buffer from network
  *	@skb: buffer to process
-- 
1.7.3
[PATCH v16 13/17] Add mp (mediate passthru) device.
From: Xin Xiaohui <xiaohui@intel.com>

The patch adds the mp (mediate passthru) device, which is currently
based on the vhost-net backend driver and provides proto_ops to
send/receive guest buffer data from/to the guest virtio-net driver.
It also exports async functions which can be used by other drivers,
like macvtap, to utilize zero-copy too.

Signed-off-by: Xin Xiaohui <xiaohui@intel.com>
Signed-off-by: Zhao Yu <yzhao81...@gmail.com>
Reviewed-by: Jeff Dike <jd...@linux.intel.com>
---
 drivers/vhost/mpassthru.c | 1495 +++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 1495 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/mpassthru.c

diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
new file mode 100644
index 000..868200a
--- /dev/null
+++ b/drivers/vhost/mpassthru.c
@@ -0,0 +1,1495 @@
+/*
+ *  MPASSTHRU - Mediate passthrough device.
+ *  Copyright (C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ */
+
+#define DRV_NAME	"mpassthru"
+#define DRV_DESCRIPTION	"Mediate passthru device driver"
+#define DRV_COPYRIGHT	"(C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G"
+
+#include <linux/compat.h>
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/major.h>
+#include <linux/slab.h>
+#include <linux/smp_lock.h>
+#include <linux/poll.h>
+#include <linux/fcntl.h>
+#include <linux/init.h>
+#include <linux/aio.h>
+
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/miscdevice.h>
+#include <linux/ethtool.h>
+#include <linux/rtnetlink.h>
+#include <linux/if.h>
+#include <linux/if_arp.h>
+#include <linux/if_ether.h>
+#include <linux/crc32.h>
+#include <linux/nsproxy.h>
+#include <linux/uaccess.h>
+#include <linux/virtio_net.h>
+#include <linux/mpassthru.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+#include <net/rtnetlink.h>
+#include <net/sock.h>
+
+#include <asm/system.h>
+#include "../net/bonding/bonding.h"
+
+struct mp_struct {
+	struct mp_file		*mfile;
+	struct net_device	*dev;
+	struct page_pool	*pool;
+	struct socket		socket;
+	struct socket_wq	wq;
+	struct mm_struct	*mm;
+};
+
+struct mp_file {
+	atomic_t count;
+	struct mp_struct *mp;
+	struct net *net;
+};
+
+struct mp_sock {
+	struct sock		sk;
+	struct mp_struct	*mp;
+};
+
+/* The main function to allocate external buffers */
+static struct skb_ext_page *page_ctor(struct mp_port *port,
+				      struct sk_buff *skb,
+				      int npages)
+{
+	int i;
+	unsigned long flags;
+	struct page_pool *pool;
+	struct page_info *info = NULL;
+
+	if (npages != 1)
+		BUG();
+	pool = container_of(port, struct page_pool, port);
+
+	spin_lock_irqsave(&pool->read_lock, flags);
+	if (!list_empty(&pool->readq)) {
+		info = list_first_entry(&pool->readq, struct page_info, list);
+		list_del(&info->list);
+	}
+	spin_unlock_irqrestore(&pool->read_lock, flags);
+	if (!info)
+		return NULL;
+
+	for (i = 0; i < info->pnum; i++)
+		get_page(info->pages[i]);
+	info->skb = skb;
+	return &info->ext_page;
+}
+
+static struct page_info *mp_hash_lookup(struct page_pool *pool,
+					struct page *page);
+static struct page_info *mp_hash_delete(struct page_pool *pool,
+					struct page_info *info);
+
+static struct skb_ext_page *mp_lookup(struct net_device *dev,
+				      struct page *page)
+{
+	struct mp_struct *mp =
+		container_of(dev->mp_port->sock->sk, struct mp_sock, sk)->mp;
+	struct page_pool *pool = mp->pool;
+	struct page_info *info;
+
+	info = mp_hash_lookup(pool, page);
+	if (!info)
+		return NULL;
+	return &info->ext_page;
+}
+
+struct page_pool *page_pool_create(struct net_device *dev,
+				   struct socket *sock)
+{
+	struct page_pool *pool;
+	struct net_device *master;
+	struct slave *slave;
+	struct bonding *bond;
+	int i;
+	int rc;
+
+	pool = kzalloc(sizeof(*pool), GFP_KERNEL);
+	if (!pool)
+		return NULL;
+
+	/* How to deal with bonding device:
+	 * check if all the slaves are capable of zero-copy.
+	 * if not, fail.
[PATCH v16 16/17] An example how to modify NIC driver to use napi_gro_frags() interface
From: Xin Xiaohui <xiaohui@intel.com>

This example is made on the ixgbe driver. It provides the API
is_rx_buffer_mapped_as_page() to indicate whether the driver uses the
napi_gro_frags() interface or not. The example allocates 2 pages for
DMA for one ring descriptor using netdev_alloc_page(). When packets
come in, napi_gro_frags() is used to allocate an skb and to receive
the packets.

Signed-off-by: Xin Xiaohui <xiaohui@intel.com>
---
 drivers/net/ixgbe/ixgbe.h      |    3 +
 drivers/net/ixgbe/ixgbe_main.c |  169 +++++++++++++++++++++++++++++++--------
 2 files changed, 137 insertions(+), 35 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe.h b/drivers/net/ixgbe/ixgbe.h
index 9e15eb9..89367ca 100644
--- a/drivers/net/ixgbe/ixgbe.h
+++ b/drivers/net/ixgbe/ixgbe.h
@@ -131,6 +131,9 @@ struct ixgbe_rx_buffer {
 	struct page *page;
 	dma_addr_t page_dma;
 	unsigned int page_offset;
+	u16 mapped_as_page;
+	struct page *page_skb;
+	unsigned int page_skb_offset;
 };
 
 struct ixgbe_queue_stats {
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index e32af43..cd69080 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -1029,6 +1029,12 @@ static inline void ixgbe_release_rx_desc(struct ixgbe_hw *hw,
 	IXGBE_WRITE_REG(hw, IXGBE_RDT(rx_ring->reg_idx), val);
 }
 
+static bool is_rx_buffer_mapped_as_page(struct ixgbe_rx_buffer *bi,
+					struct net_device *dev)
+{
+	return true;
+}
+
 /**
  * ixgbe_alloc_rx_buffers - Replace used receive buffers; packet split
  * @adapter: address of board private structure
@@ -1045,13 +1051,17 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter,
 	i = rx_ring->next_to_use;
 	bi = &rx_ring->rx_buffer_info[i];
 
+
 	while (cleaned_count--) {
 		rx_desc = IXGBE_RX_DESC_ADV(*rx_ring, i);
 
+		bi->mapped_as_page =
+			is_rx_buffer_mapped_as_page(bi, adapter->netdev);
+
 		if (!bi->page_dma &&
 		    (rx_ring->flags & IXGBE_RING_RX_PS_ENABLED)) {
 			if (!bi->page) {
-				bi->page = alloc_page(GFP_ATOMIC);
+				bi->page = netdev_alloc_page(adapter->netdev);
 				if (!bi->page) {
 					adapter->alloc_rx_page_failed++;
 					goto no_buffers;
@@ -1068,7 +1078,7 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter,
 					    DMA_FROM_DEVICE);
 		}
 
-		if (!bi->skb) {
+		if (!bi->mapped_as_page && !bi->skb) {
 			struct sk_buff *skb;
 			/* netdev_alloc_skb reserves 32 bytes up front!! */
 			uint bufsz = rx_ring->rx_buf_len + SMP_CACHE_BYTES;
@@ -1088,6 +1098,19 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter,
 					rx_ring->rx_buf_len,
 					DMA_FROM_DEVICE);
 		}
+
+		if (bi->mapped_as_page && !bi->page_skb) {
+			bi->page_skb = netdev_alloc_page(adapter->netdev);
+			if (!bi->page_skb) {
+				adapter->alloc_rx_page_failed++;
+				goto no_buffers;
+			}
+			bi->page_skb_offset = 0;
+			bi->dma = dma_map_page(&pdev->dev, bi->page_skb,
+					bi->page_skb_offset,
+					(PAGE_SIZE / 2),
+					PCI_DMA_FROMDEVICE);
+		}
 		/* Refresh the desc even if buffer_addrs didn't change because
 		 * each write-back erases this info. */
 		if (rx_ring->flags & IXGBE_RING_RX_PS_ENABLED) {
@@ -1165,6 +1188,13 @@ struct ixgbe_rsc_cb {
 	bool delay_unmap;
 };
 
+static bool is_no_buffer(struct ixgbe_rx_buffer *rx_buffer_info)
+{
+	return ((!rx_buffer_info->skb ||
+		!rx_buffer_info->page_skb) &&
+		!rx_buffer_info->page);
+}
+
 #define IXGBE_RSC_CB(skb) ((struct ixgbe_rsc_cb *)(skb)->cb)
 
 static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
@@ -1174,6 +1204,7 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 	struct ixgbe_adapter *adapter = q_vector->adapter;
 	struct net_device *netdev = adapter->netdev;
 	struct pci_dev *pdev = adapter->pdev;
+	struct napi_struct *napi = &q_vector->napi;
 	union ixgbe_adv_rx_desc *rx_desc, *next_rxd;
 	struct ixgbe_rx_buffer *rx_buffer_info, *next_buffer;
 	struct sk_buff *skb;
@@ -1211,32 +1242,74 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 			len = le16_to_cpu(rx_desc->wb.upper.length);
[PATCH v16 17/17] An example how to alloc user buffer based on napi_gro_frags() interface.
From: Xin Xiaohui <xiaohui@intel.com>

This example is made on the ixgbe driver, which uses napi_gro_frags().
It can get buffers from the guest side directly using
netdev_alloc_page() and release guest buffers using netdev_free_page().

Signed-off-by: Xin Xiaohui <xiaohui@intel.com>
---
 drivers/net/ixgbe/ixgbe_main.c |   37 +++++++++++++++++++++++++++++++++----
 1 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index cd69080..807a51e 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -1032,7 +1032,14 @@ static inline void ixgbe_release_rx_desc(struct ixgbe_hw *hw,
 static bool is_rx_buffer_mapped_as_page(struct ixgbe_rx_buffer *bi,
 					struct net_device *dev)
 {
-	return true;
+	return dev_is_mpassthru(dev);
+}
+
+static u32 get_page_skb_offset(struct net_device *dev)
+{
+	if (!dev_is_mpassthru(dev))
+		return 0;
+	return dev->mp_port->vnet_hlen;
 }
 
 /**
@@ -1105,7 +1112,8 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter,
 				adapter->alloc_rx_page_failed++;
 				goto no_buffers;
 			}
-			bi->page_skb_offset = 0;
+			bi->page_skb_offset =
+				get_page_skb_offset(adapter->netdev);
 			bi->dma = dma_map_page(&pdev->dev, bi->page_skb,
 					bi->page_skb_offset,
 					(PAGE_SIZE / 2),
@@ -1242,8 +1250,10 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 			len = le16_to_cpu(rx_desc->wb.upper.length);
 		}
 
-		if (is_no_buffer(rx_buffer_info))
+		if (is_no_buffer(rx_buffer_info)) {
+			printk("no buffers\n");
 			break;
+		}
 
 		cleaned = true;
 
 		if (!rx_buffer_info->mapped_as_page) {
@@ -1305,6 +1315,11 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 					rx_buffer_info->page_skb,
 					rx_buffer_info->page_skb_offset,
 					len);
+			if (dev_is_mpassthru(netdev) &&
+					netdev->mp_port->hash)
+				skb_shinfo(skb)->destructor_arg =
+					netdev->mp_port->hash(netdev,
+						rx_buffer_info->page_skb);
 			rx_buffer_info->page_skb = NULL;
 			skb->len += len;
 			skb->data_len += len;
@@ -1322,7 +1337,8 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 					   upper_len);
 
 			if ((rx_ring->rx_buf_len > (PAGE_SIZE / 2)) ||
-			    (page_count(rx_buffer_info->page) != 1))
+			    (page_count(rx_buffer_info->page) != 1) ||
+			    dev_is_mpassthru(netdev))
 				rx_buffer_info->page = NULL;
 			else
 				get_page(rx_buffer_info->page);
@@ -6535,6 +6551,16 @@ static void ixgbe_netpoll(struct net_device *netdev)
 }
 #endif
 
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+static int ixgbe_ndo_mp_port_prep(struct net_device *dev, struct mp_port *port)
+{
+	port->hdr_len = 128;
+	port->data_len = 2048;
+	port->npages = 1;
+	return 0;
+}
+#endif
+
 static const struct net_device_ops ixgbe_netdev_ops = {
 	.ndo_open		= ixgbe_open,
 	.ndo_stop		= ixgbe_close,
@@ -6554,6 +6580,9 @@ static const struct net_device_ops ixgbe_netdev_ops = {
 	.ndo_set_vf_vlan	= ixgbe_ndo_set_vf_vlan,
 	.ndo_set_vf_tx_rate	= ixgbe_ndo_set_vf_bw,
 	.ndo_get_vf_config	= ixgbe_ndo_get_vf_config,
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+	.ndo_mp_port_prep	= ixgbe_ndo_mp_port_prep,
+#endif
 #ifdef CONFIG_NET_POLL_CONTROLLER
 	.ndo_poll_controller	= ixgbe_netpoll,
 #endif
-- 
1.7.3
[PATCH v16 00/17] Provide a zero-copy method on KVM virtio-net.
We provide a zero-copy method with which the driver side may get external
buffers for DMA. Here, "external" means the driver does not use kernel
space to allocate skb buffers. Currently the external buffers come from
the guest virtio-net driver.

The idea is simple: just pin the guest VM user space and then give the
host NIC driver the chance to DMA to it directly. The patches are based
on the vhost-net backend driver. We add a device which provides proto_ops
such as sendmsg/recvmsg to vhost-net, to send/recv directly to/from the
NIC driver. A KVM guest which uses the vhost-net backend may bind any
ethX interface in the host side to get copyless data transfer thru the
guest virtio-net frontend.

patch 01-11:	net core and kernel changes.
patch 12-14:	new device as an interface to manipulate external buffers.
patch 15:	for vhost-net.
patch 16:	an example of modifying a NIC driver to use napi_gro_frags().
patch 17:	an example of how to get guest buffers based on a driver
		which uses napi_gro_frags().

The guest virtio-net driver submits multiple requests thru the vhost-net
backend driver to the kernel, and the requests are queued and then
completed after the corresponding actions in h/w are done.

For read, user-space buffers are dispensed to the NIC driver for rx when
a page constructor API is invoked; that means NICs can allocate user
buffers from a page constructor. We add a hook in the netif_receive_skb()
function to intercept the incoming packets and notify the zero-copy
device.

For write, the zero-copy device may allocate a new host skb, put the
payload on skb_shinfo(skb)->frags, and copy the header to skb->data. The
request remains pending until the skb is transmitted by h/w.

We provide multiple submits and asynchronous notification to vhost-net
too.

Our goal is to improve the bandwidth and reduce the CPU usage. Exact
performance data will be provided later.
What we have not done yet:
   performance tuning

what we have done in v1:
   polish the RCU usage
   deal with write logging in asynchronous mode in vhost
   add notifier block for mp device
   rename page_ctor to mp_port in netdevice.h to make it look generic
   add mp_dev_change_flags() for mp device to change NIC state
   add CONFIG_VHOST_MPASSTHRU to limit the usage when the module is not loaded
   a small fix for missing dev_put() when failing
   use dynamic minor instead of a static minor number
   a __KERNEL__ protect for mp_get_sock()

what we have done in v2:
   remove most of the RCU usage, since the ctor pointer is only changed by
   the BIND/UNBIND ioctl, and during that time the NIC will be stopped to
   get good cleanup (all outstanding requests are finished), so the ctor
   pointer cannot be raced into a wrong situation
   replace struct vhost_notifier with struct kiocb; let the vhost-net
   backend alloc/free the kiocbs and transfer them via sendmsg/recvmsg
   use get_user_pages_fast() and set_page_dirty_lock() when reading
   add some comments for netdev_mp_port_prep() and handle_mpassthru()

what we have done in v3:
   the async write logging is rewritten
   a drafted synchronous write function for qemu live migration
   a limit for locked pages from get_user_pages_fast() to prevent DoS,
   using RLIMIT_MEMLOCK

what we have done in v4:
   add an iocb completion callback from vhost-net to queue iocbs in the
   mp device
   replace vq->receiver with mp_sock_data_ready()
   remove stuff in the mp device which accessed structures from vhost-net
   modify skb_reserve() to ignore host NIC driver reserved space
   rebase to the latest vhost tree
   split large patches into small pieces, especially for the net core part

what we have done in v5:
   address Arnd Bergmann's comments
	- remove the IFF_MPASSTHRU_EXCL flag in the mp device
	- add a CONFIG_COMPAT macro
	- remove the mp_release ops
   move dev_is_mpassthru() to an inline func
   fix a bug in memory relinquish
   apply to the current git (2.6.34-rc6) tree
what we have done in v6:
   move create_iocb() out of page_dtor, which may happen in interrupt
   context
	- this removes the potential issue of a lock taken in interrupt
	  context
   make the caches used by mp and vhost static, created/destroyed during
   the modules' init/exit functions
	- this allows multiple mp guests to be created at the same time

what we have done in v7:
   some cleanup to prepare to support PS mode

what we have done in v8:
   discard the modifications that pointed skb->data to the guest buffer
   directly; add code to modify the driver to support napi_gro_frags(),
   per Herbert's comments, to support PS mode
   add mergeable buffer support in the mp device
   add GSO/GRO support in the mp device
   address comments from Eric Dumazet about cache line and rcu usage

what we have done in v9:
   v8
[PATCH v16 15/17] Provides multiple submits and asynchronous notifications.
From: Xin Xiaohui <xiaohui@intel.com>

The vhost-net backend now only supports synchronous send/recv
operations. The patch provides multiple submits and asynchronous
notifications. This is needed for the zero-copy case.

Signed-off-by: Xin Xiaohui <xiaohui@intel.com>
---
 drivers/vhost/net.c   |  361 ++++++++++++++++++++++++++++++++++++++++++++----
 drivers/vhost/vhost.c |   78 ++++++++++
 drivers/vhost/vhost.h |   15 ++-
 3 files changed, 429 insertions(+), 25 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 7c80082..8ec4edf 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -24,6 +24,8 @@
 #include <linux/if_arp.h>
 #include <linux/if_tun.h>
 #include <linux/if_macvlan.h>
+#include <linux/mpassthru.h>
+#include <linux/aio.h>
 
 #include <net/sock.h>
 
@@ -32,6 +34,7 @@
 /* Max number of bytes transferred before requeueing the job.
  * Using this limit prevents one virtqueue from starving others. */
 #define VHOST_NET_WEIGHT 0x8
+static struct kmem_cache *notify_cache;
 
 enum {
 	VHOST_NET_VQ_RX = 0,
@@ -49,6 +52,7 @@ struct vhost_net {
 	struct vhost_dev dev;
 	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
 	struct vhost_poll poll[VHOST_NET_VQ_MAX];
+	struct kmem_cache *cache;
 	/* Tells us whether we are polling a socket for TX.
 	 * We only do this when socket buffer fills up.
	 * Protected by tx vq lock. */
@@ -109,11 +113,184 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
 	net->tx_poll_state = VHOST_NET_POLL_STARTED;
 }
 
+struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
+{
+	struct kiocb *iocb = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vq->notify_lock, flags);
+	if (!list_empty(&vq->notifier)) {
+		iocb = list_first_entry(&vq->notifier,
+				struct kiocb, ki_list);
+		list_del(&iocb->ki_list);
+	}
+	spin_unlock_irqrestore(&vq->notify_lock, flags);
+	return iocb;
+}
+
+static void handle_iocb(struct kiocb *iocb)
+{
+	struct vhost_virtqueue *vq = iocb->private;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vq->notify_lock, flags);
+	list_add_tail(&iocb->ki_list, &vq->notifier);
+	spin_unlock_irqrestore(&vq->notify_lock, flags);
+}
+
+static int is_async_vq(struct vhost_virtqueue *vq)
+{
+	return (vq->link_state == VHOST_VQ_LINK_ASYNC);
+}
+
+static void handle_async_rx_events_notify(struct vhost_net *net,
+					  struct vhost_virtqueue *vq,
+					  struct socket *sock)
+{
+	struct kiocb *iocb = NULL;
+	struct vhost_log *vq_log = NULL;
+	int rx_total_len = 0;
+	unsigned int head, log, in, out;
+	int size;
+
+	if (!is_async_vq(vq))
+		return;
+
+	if (sock->sk->sk_data_ready)
+		sock->sk->sk_data_ready(sock->sk, 0);
+
+	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
+		vq->log : NULL;
+
+	while ((iocb = notify_dequeue(vq)) != NULL) {
+		if (!iocb->ki_left) {
+			vhost_add_used_and_signal(&net->dev, vq,
+					iocb->ki_pos, iocb->ki_nbytes);
+			size = iocb->ki_nbytes;
+			head = iocb->ki_pos;
+			rx_total_len += iocb->ki_nbytes;
+
+			if (iocb->ki_dtor)
+				iocb->ki_dtor(iocb);
+			kmem_cache_free(net->cache, iocb);
+
+			/* when log is enabled, recomputing the log is needed,
+			 * since these buffers are in async queue, may not get
+			 * the log info before.
+			 */
+			if (unlikely(vq_log)) {
+				if (!log)
+					__vhost_get_vq_desc(&net->dev, vq,
+							vq->iov,
+							ARRAY_SIZE(vq->iov),
+							&out, &in, vq_log,
+							&log, head);
+				vhost_log_write(vq, vq_log, log, size);
+			}
+			if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
+				vhost_poll_queue(&vq->poll);
+				break;
+			}
+		} else {
+			int i = 0;
+			int count = iocb->ki_left;
+			int hc = count;
+			while (count--) {
+				if (iocb) {
+					vq->heads[i].id = iocb->ki_pos;
+					vq->heads[i].len = iocb->ki_nbytes;
+					size = iocb->ki_nbytes;
+					head
[PATCH v16 14/17] Add a kconfig entry and make entry for mp device.
From: Xin Xiaohui <xiaohui@intel.com>

Signed-off-by: Xin Xiaohui <xiaohui@intel.com>
Reviewed-by: Jeff Dike <jd...@linux.intel.com>
---
 drivers/vhost/Kconfig  |   10 ++++++++++
 drivers/vhost/Makefile |    2 ++
 2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index e4e2fd1..a6b8cbf 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -9,3 +9,13 @@ config VHOST_NET
 	  To compile this driver as a module, choose M here: the module will
 	  be called vhost_net.
 
+config MEDIATE_PASSTHRU
+	tristate "mediate passthru network driver (EXPERIMENTAL)"
+	depends on VHOST_NET
+	---help---
+	  zerocopy network I/O support; we call it mediate passthru to
+	  distinguish it from hardware passthru.
+
+	  To compile this driver as a module, choose M here: the module will
+	  be called mpassthru.
+
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 72dd020..c18b9fc 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -1,2 +1,4 @@
 obj-$(CONFIG_VHOST_NET) += vhost_net.o
 vhost_net-y := vhost.o net.o
+
+obj-$(CONFIG_MEDIATE_PASSTHRU) += mpassthru.o
-- 
1.7.3
[PATCH v16 12/17] Add header file for mp device.
From: Xin Xiaohui <xiaohui@intel.com>

Signed-off-by: Xin Xiaohui <xiaohui@intel.com>
Signed-off-by: Zhao Yu <yzhao81...@gmail.com>
Reviewed-by: Jeff Dike <jd...@linux.intel.com>
---
 include/linux/mpassthru.h |  133 +++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 133 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/mpassthru.h

diff --git a/include/linux/mpassthru.h b/include/linux/mpassthru.h
new file mode 100644
index 000..1115f55
--- /dev/null
+++ b/include/linux/mpassthru.h
@@ -0,0 +1,133 @@
+#ifndef __MPASSTHRU_H
+#define __MPASSTHRU_H
+
+#include <linux/types.h>
+#include <linux/if_ether.h>
+#include <linux/ioctl.h>
+
+/* ioctl defines */
+#define MPASSTHRU_BINDDEV		_IOW('M', 213, int)
+#define MPASSTHRU_UNBINDDEV		_IO('M', 214)
+#define MPASSTHRU_SET_MEM_LOCKED	_IOW('M', 215, unsigned long)
+#define MPASSTHRU_GET_MEM_LOCKED_NEED	_IOR('M', 216, unsigned long)
+
+#define COPY_THRESHOLD	(L1_CACHE_BYTES * 4)
+#define COPY_HDR_LEN	(L1_CACHE_BYTES < 64 ? 64 : L1_CACHE_BYTES)
+
+#define DEFAULT_NEED	((8192*2*2)*4096)
+
+struct frag {
+	u16	offset;
+	u16	size;
+};
+
+#define HASH_BUCKETS	(8192*2)
+
+struct page_info {
+	struct list_head	list;
+	struct page_info	*next;
+	struct page_info	*prev;
+	struct page		*pages[MAX_SKB_FRAGS];
+	struct sk_buff		*skb;
+	struct page_pool	*pool;
+
+	/* The pointer relayed to skb, to indicate
+	 * it's a external allocated skb or kernel
+	 */
+	struct skb_ext_page	ext_page;
+	/* flag to indicate read or write */
+#define INFO_READ	0
+#define INFO_WRITE	1
+	unsigned		flags;
+	/* exact number of locked pages */
+	unsigned		pnum;
+
+	/* The fields after that is for backend
+	 * driver, now for vhost-net.
+	 */
+	/* the kiocb structure related to */
+	struct kiocb		*iocb;
+	/* the ring descriptor index */
+	unsigned int		desc_pos;
+	/* the iovec coming from backend, we only
+	 * need few of them */
+	struct iovec		hdr[2];
+	struct iovec		iov[2];
+};
+
+struct page_pool {
+	/* the queue for rx side */
+	struct list_head	readq;
+	/* the lock to protect readq */
+	spinlock_t		read_lock;
+	/* record the original rlimit */
+	struct rlimit		o_rlim;
+	/* userspace wants to locked */
+	int			locked_pages;
+	/* currently locked pages */
+	int			cur_pages;
+	/* the memory locked before */
+	unsigned long		orig_locked_vm;
+	/* the device according to */
+	struct net_device	*dev;
+	/* the mp_port according to dev */
+	struct mp_port		port;
+	/* the hash_table list to find each locked page */
+	struct page_info	**hash_table;
+};
+
+static struct kmem_cache *ext_page_info_cache;
+
+#ifdef __KERNEL__
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+struct socket *mp_get_socket(struct file *);
+struct page_pool *page_pool_create(struct net_device *dev,
+				   struct socket *sock);
+int async_recvmsg(struct kiocb *iocb, struct page_pool *pool,
+		  struct iovec *iov, int count, int flags);
+int async_sendmsg(struct sock *sk, struct kiocb *iocb,
+		  struct page_pool *pool, struct iovec *iov,
+		  int count);
+void async_data_ready(struct sock *sk, struct page_pool *pool);
+void dev_change_state(struct net_device *dev);
+void page_pool_destroy(struct mm_struct *mm, struct page_pool *pool);
+#else
+#include <linux/err.h>
+#include <linux/errno.h>
+struct file;
+struct socket;
+static inline struct socket *mp_get_socket(struct file *f)
+{
+	return ERR_PTR(-EINVAL);
+}
+static inline struct page_pool *page_pool_create(struct net_device *dev,
+						 struct socket *sock)
+{
+	return ERR_PTR(-EINVAL);
+}
+static inline int async_recvmsg(struct kiocb *iocb, struct page_pool *pool,
+				struct iovec *iov, int count, int flags)
+{
+	return -EINVAL;
+}
+static inline int async_sendmsg(struct sock *sk, struct kiocb *iocb,
+				struct page_pool *pool, struct iovec *iov,
+				int count)
+{
+	return -EINVAL;
+}
+static inline void async_data_ready(struct sock *sk, struct page_pool *pool)
+{
+	return;
+}
+static inline void dev_change_state(struct net_device *dev)
+{
+	return;
+}
+static inline void page_pool_destroy(struct mm_struct *mm,
+				     struct page_pool *pool)
+{
+	return;
+}
+#endif /* CONFIG_MEDIATE_PASSTHRU */
+#endif /* __KERNEL__ */
+#endif /* __MPASSTHRU_H */
-- 
1.7.3
[PATCH v16 09/17] Don't recycle the skb if the device uses external buffers.
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 net/core/skbuff.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index d3ece5c..11833b4 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -550,6 +550,12 @@ bool skb_recycle_check(struct sk_buff *skb, int skb_size)
 	if (skb_shared(skb) || skb_cloned(skb))
 		return false;
 
+	/* if the device wants to do mediate passthru, the skb may
+	 * have an external buffer, so don't recycle it
+	 */
+	if (dev_is_mpassthru(skb->dev))
+		return false;
+
 	skb_release_head_state(skb);
 	shinfo = skb_shinfo(skb);
--
1.7.3
[PATCH v16 07/17] Modify netdev_alloc_page() to get external buffers
From: Xin Xiaohui xiaohui@intel.com

Currently, it can get external buffers from the mp device.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 net/core/skbuff.c |   27 +++++++++++++++++++++++++++
 1 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index c83b421..b9858c7 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -261,11 +261,38 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
 }
 EXPORT_SYMBOL(__netdev_alloc_skb);
 
+struct page *netdev_alloc_ext_pages(struct net_device *dev, int npages)
+{
+	struct mp_port *port;
+	struct skb_ext_page *ext_page = NULL;
+
+	port = dev->mp_port;
+	if (!port)
+		goto out;
+	ext_page = port->ctor(port, NULL, npages);
+	if (ext_page)
+		return ext_page->page;
+out:
+	return NULL;
+}
+EXPORT_SYMBOL(netdev_alloc_ext_pages);
+
+struct page *netdev_alloc_ext_page(struct net_device *dev)
+{
+	return netdev_alloc_ext_pages(dev, 1);
+}
+EXPORT_SYMBOL(netdev_alloc_ext_page);
+
 struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
 {
 	int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
 	struct page *page;
 
+	if (dev_is_mpassthru(dev))
+		return netdev_alloc_ext_page(dev);
+
 	page = alloc_pages_node(node, gfp_mask, 0);
 	return page;
 }
--
1.7.3
[PATCH v16 03/17] Add a ndo_mp_port_prep pointer to net_device_ops.
From: Xin Xiaohui xiaohui@intel.com

If the driver wants to allocate external buffers, it can export its
capabilities, such as the skb buffer header length and the page length
that can be DMA'd. The external buffer owner may utilize this.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/netdevice.h |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f6b1870..575777f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -723,6 +723,12 @@ struct netdev_rx_queue {
 * int (*ndo_set_vf_port)(struct net_device *dev, int vf,
 *			   struct nlattr *port[]);
 * int (*ndo_get_vf_port)(struct net_device *dev, int vf, struct sk_buff *skb);
+ *
+ * int (*ndo_mp_port_prep)(struct net_device *dev, struct mp_port *port);
+ *	If the driver wants to allocate external buffers, it can export
+ *	its capabilities, such as the skb buffer header length and the
+ *	page length that can be DMA'd. The external buffer owner may
+ *	utilize this.
 */
 #define HAVE_NET_DEVICE_OPS
 struct net_device_ops {
@@ -795,6 +801,10 @@ struct net_device_ops {
 	int			(*ndo_fcoe_get_wwn)(struct net_device *dev,
 						    u64 *wwn, int type);
 #endif
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+	int			(*ndo_mp_port_prep)(struct net_device *dev,
+						    struct mp_port *port);
+#endif
 };
--
1.7.3
[PATCH v16 01/17] Add a new structure for skb buffers from external sources.
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/skbuff.h |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 77eb60d..696e690 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -211,6 +211,15 @@ struct skb_shared_info {
 	skb_frag_t	frags[MAX_SKB_FRAGS];
 };
 
+/* This structure is for an skb whose pages may point to
+ * an external buffer, which is not allocated from kernel space.
+ * It also contains a destructor for itself.
+ */
+struct skb_ext_page {
+	struct page	*page;
+	void		(*dtor)(struct skb_ext_page *);
+};
+
 /* We divide dataref into two halves.  The higher 16 bits hold references
 * to the payload part of skb->data.  The lower 16 bits hold references to
 * the entire skb->data.  A clone of a headerless skb holds the length of
--
1.7.3
[PATCH v16 04/17] Add a function to let the external buffer owner query capabilities.
From: Xin Xiaohui xiaohui@intel.com

The external buffer owner can use these functions to get the
capabilities of the underlying NIC driver.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/netdevice.h |    2 ++
 net/core/dev.c            |   41 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 575777f..8dcf6de 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1736,6 +1736,8 @@ extern gro_result_t	napi_frags_finish(struct napi_struct *napi,
 					  gro_result_t ret);
 extern struct sk_buff *	napi_frags_skb(struct napi_struct *napi);
 extern gro_result_t	napi_gro_frags(struct napi_struct *napi);
+extern int		netdev_mp_port_prep(struct net_device *dev,
+					    struct mp_port *port);
 
 static inline void napi_free_frags(struct napi_struct *napi)
 {
diff --git a/net/core/dev.c b/net/core/dev.c
index 660dd41..84fbb83 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2942,6 +2942,47 @@ out:
 	return ret;
 }
 
+/* To support mediate passthru (zero-copy) with a NIC driver,
+ * we'd better query the NIC driver for the capability it can
+ * provide, especially for packet split mode. For now we only
+ * query for the header size and the payload a descriptor
+ * may carry.
+ * Currently it is only called by the mpassthru device.
+ */
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+int netdev_mp_port_prep(struct net_device *dev,
+			struct mp_port *port)
+{
+	int rc;
+	int npages, data_len;
+	const struct net_device_ops *ops = dev->netdev_ops;
+
+	if (ops->ndo_mp_port_prep) {
+		rc = ops->ndo_mp_port_prep(dev, port);
+		if (rc)
+			return rc;
+	} else
+		return -EINVAL;
+
+	if (port->hdr_len <= 0)
+		goto err;
+
+	npages = port->npages;
+	data_len = port->data_len;
+	if (npages <= 0 || npages > MAX_SKB_FRAGS ||
+	    (data_len < PAGE_SIZE * (npages - 1) ||
+	     data_len > PAGE_SIZE * npages))
+		goto err;
+
+	return 0;
+err:
+	dev_warn(&dev->dev, "invalid page constructor parameters\n");
+
+	return -EINVAL;
+}
+EXPORT_SYMBOL(netdev_mp_port_prep);
+#endif
+
 /**
 *	netif_receive_skb - process receive buffer from network
 *	@skb: buffer to process
--
1.7.3
[PATCH v16 02/17] Add a new struct for the device to manipulate external buffers.
From: Xin Xiaohui xiaohui@intel.com

Add a new field, named mp_port, to struct net_device. It's for mediate
passthru (zero-copy). It contains the capabilities of the net device
driver, a socket, and an external buffer creator; "external" means the
skb buffers belonging to the device may not be allocated from kernel
space.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/netdevice.h |   25 ++++++++++++++++++++++++-
 1 files changed, 24 insertions(+), 1 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 46c36ff..f6b1870 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -325,6 +325,28 @@ enum netdev_state_t {
 	__LINK_STATE_DORMANT,
 };
 
+/* The structure for mediate passthru (zero-copy). */
+struct mp_port {
+	/* the header len */
+	int			hdr_len;
+	/* the max payload len for one descriptor */
+	int			data_len;
+	/* the pages for DMA in one time */
+	int			npages;
+	/* the socket bound to */
+	struct socket		*sock;
+	/* the header len for virtio-net */
+	int			vnet_hlen;
+	/* the external buffer page creator */
+	struct skb_ext_page *(*ctor)(struct mp_port *,
+				     struct sk_buff *, int);
+	/* the hash function attached to find the corresponding
+	 * backend ring descriptor info for one external
+	 * buffer page.
+	 */
+	struct skb_ext_page *(*hash)(struct net_device *,
+				     struct page *);
+};
 
 /*
 *	This structure holds at boot time configured netdevice settings. They
@@ -1045,7 +1067,8 @@ struct net_device {
 	/* GARP */
 	struct garp_port	*garp_port;
-
+	/* mpassthru */
+	struct mp_port		*mp_port;
 	/* class/net/name entry */
 	struct device		dev;
 	/* space for optional device, statistics, and wireless sysfs groups */
--
1.7.3
Re: [PATCH v15 00/17] Provide a zero-copy method on KVM virtio-net.
From: Xin Xiaohui xiaohui@intel.com

> 2) The idea to key off of skb->dev in skb_release_data() is
>    fundamentally flawed since many actions can change skb->dev on you,
>    which will end up causing a leak of your external data areas.

How about this one? If the destructor_arg is not a good candidate,
then I have to add a new field in shinfo.

Thanks
Xiaohui

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 10ba67d..ad4636e 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -199,14 +199,15 @@ struct skb_shared_info {
 	struct sk_buff	*frag_list;
 	struct skb_shared_hwtstamps hwtstamps;
 
+	/* Intermediate layers must ensure that destructor_arg
+	 * remains valid until skb destructor */
+	void		*destructor_arg;
+
 	/*
 	 * Warning : all fields before dataref are cleared in __alloc_skb()
 	 */
 	atomic_t	dataref;
 
-	/* Intermediate layers must ensure that destructor_arg
-	 * remains valid until skb destructor */
-	void		*destructor_arg;
 	/* must be last field, see pskb_expand_head() */
 	skb_frag_t	frags[MAX_SKB_FRAGS];
 };
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index c83b421..eb040f4 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -343,6 +343,13 @@ static void skb_release_data(struct sk_buff *skb)
 		if (skb_has_frags(skb))
 			skb_drop_fraglist(skb);
 
+		if (skb_shinfo(skb)->destructor_arg) {
+			struct skb_ext_page *ext_page =
+				skb_shinfo(skb)->destructor_arg;
+			if (ext_page->dtor)
+				ext_page->dtor(ext_page);
+		}
+
 		kfree(skb->head);
 	}
 }
--
1.7.3
[PATCH v15 13/17] Add mp (mediate passthru) device.
From: Xin Xiaohui xiaohui@intel.com

The patch adds the mp (mediate passthru) device, which is currently
based on the vhost-net backend driver and provides proto_ops to
send/receive guest buffer data from/to the guest virtio-net driver.
It also exports async functions which can be used by other drivers,
like macvtap, to utilize zero-copy too.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 drivers/vhost/mpassthru.c | 1515 +++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 1515 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/mpassthru.c

diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
new file mode 100644
index 0000000..492430c
--- /dev/null
+++ b/drivers/vhost/mpassthru.c
@@ -0,0 +1,1515 @@
+/*
+ *  MPASSTHRU - Mediate passthrough device.
+ *  Copyright (C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ */
+
+#define DRV_NAME	"mpassthru"
+#define DRV_DESCRIPTION	"Mediate passthru device driver"
+#define DRV_COPYRIGHT	"(C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G"
+
+#include <linux/compat.h>
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/major.h>
+#include <linux/slab.h>
+#include <linux/smp_lock.h>
+#include <linux/poll.h>
+#include <linux/fcntl.h>
+#include <linux/init.h>
+#include <linux/aio.h>
+
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/miscdevice.h>
+#include <linux/ethtool.h>
+#include <linux/rtnetlink.h>
+#include <linux/if.h>
+#include <linux/if_arp.h>
+#include <linux/if_ether.h>
+#include <linux/crc32.h>
+#include <linux/nsproxy.h>
+#include <linux/uaccess.h>
+#include <linux/virtio_net.h>
+#include <linux/mpassthru.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+#include <net/rtnetlink.h>
+#include <net/sock.h>
+
+#include <asm/system.h>
+#include "../net/bonding/bonding.h"
+
+struct mp_struct {
+	struct mp_file		*mfile;
+	struct net_device	*dev;
+	struct page_pool	*pool;
+	struct socket		socket;
+	struct socket_wq	wq;
+	struct mm_struct	*mm;
+};
+
+struct mp_file {
+	atomic_t		count;
+	struct mp_struct	*mp;
+	struct net		*net;
+};
+
+struct mp_sock {
+	struct sock		sk;
+	struct mp_struct	*mp;
+};
+
+/* The main function to allocate external buffers */
+static struct skb_ext_page *page_ctor(struct mp_port *port,
+				      struct sk_buff *skb,
+				      int npages)
+{
+	int i;
+	unsigned long flags;
+	struct page_pool *pool;
+	struct page_info *info = NULL;
+
+	if (npages != 1)
+		BUG();
+	pool = container_of(port, struct page_pool, port);
+
+	spin_lock_irqsave(&pool->read_lock, flags);
+	if (!list_empty(&pool->readq)) {
+		info = list_first_entry(&pool->readq, struct page_info, list);
+		list_del(&info->list);
+	}
+	spin_unlock_irqrestore(&pool->read_lock, flags);
+	if (!info)
+		return NULL;
+
+	for (i = 0; i < info->pnum; i++)
+		get_page(info->pages[i]);
+	info->skb = skb;
+	return &info->ext_page;
+}
+
+static struct page_info *mp_hash_lookup(struct page_pool *pool,
+					struct page *page);
+static struct page_info *mp_hash_delete(struct page_pool *pool,
+					struct page_info *info);
+
+static struct skb_ext_page *mp_lookup(struct net_device *dev,
+				      struct page *page)
+{
+	struct mp_struct *mp =
+		container_of(dev->mp_port->sock->sk, struct mp_sock, sk)->mp;
+	struct page_pool *pool = mp->pool;
+	struct page_info *info;
+
+	info = mp_hash_lookup(pool, page);
+	if (!info)
+		return NULL;
+	return &info->ext_page;
+}
+
+struct page_pool *page_pool_create(struct net_device *dev,
+				   struct socket *sock)
+{
+	struct page_pool *pool;
+	struct net_device *master;
+	struct slave *slave;
+	struct bonding *bond;
+	int i;
+	int rc;
+
+	pool = kzalloc(sizeof(*pool), GFP_KERNEL);
+	if (!pool)
+		return NULL;
+
+	/* How to deal with a bonding device:
+	 * check if all the slaves are capable of zero-copy.
+	 * if not, fail.
[PATCH v15 17/17] An example of how to allocate user buffers based on the napi_gro_frags() interface.
From: Xin Xiaohui xiaohui@intel.com

This example is made on the ixgbe driver, which uses napi_gro_frags().
It can get buffers from the guest side directly using
netdev_alloc_page() and release guest buffers using netdev_free_page().

Signed-off-by: Xin Xiaohui xiaohui@intel.com
---
 drivers/net/ixgbe/ixgbe_main.c |   37 ++++++++++++++++++++++++++++++++-----
 1 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index a4a5263..9f5598b 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -1032,7 +1032,14 @@ static inline void ixgbe_release_rx_desc(struct ixgbe_hw *hw,
 static bool is_rx_buffer_mapped_as_page(struct ixgbe_rx_buffer *bi,
 					struct net_device *dev)
 {
-	return true;
+	return dev_is_mpassthru(dev);
+}
+
+static u32 get_page_skb_offset(struct net_device *dev)
+{
+	if (!dev_is_mpassthru(dev))
+		return 0;
+	return dev->mp_port->vnet_hlen;
 }
 
 /**
@@ -1105,7 +1112,8 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter,
 				adapter->alloc_rx_page_failed++;
 				goto no_buffers;
 			}
-			bi->page_skb_offset = 0;
+			bi->page_skb_offset =
+				get_page_skb_offset(adapter->netdev);
 			bi->dma = dma_map_page(&pdev->dev, bi->page_skb,
 					bi->page_skb_offset,
 					(PAGE_SIZE / 2),
@@ -1242,8 +1250,10 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 			len = le16_to_cpu(rx_desc->wb.upper.length);
 		}
 
-		if (is_no_buffer(rx_buffer_info))
+		if (is_no_buffer(rx_buffer_info)) {
+			printk("no buffers\n");
 			break;
+		}
 
 		cleaned = true;
 
 		if (!rx_buffer_info->mapped_as_page) {
@@ -1299,6 +1309,11 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 					rx_buffer_info->page_skb,
 					rx_buffer_info->page_skb_offset,
 					len);
+			if (dev_is_mpassthru(netdev) && netdev->mp_port->hash)
+				skb_shinfo(skb)->destructor_arg =
+					netdev->mp_port->hash(netdev,
+						rx_buffer_info->page_skb);
 			rx_buffer_info->page_skb = NULL;
 			skb->len += len;
 			skb->data_len += len;
@@ -1316,7 +1331,8 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 					   upper_len);
 
 			if ((rx_ring->rx_buf_len > (PAGE_SIZE / 2)) ||
-			    (page_count(rx_buffer_info->page) != 1))
+			    (page_count(rx_buffer_info->page) != 1) ||
+			    dev_is_mpassthru(netdev))
 				rx_buffer_info->page = NULL;
 			else
 				get_page(rx_buffer_info->page);
@@ -6529,6 +6545,16 @@ static void ixgbe_netpoll(struct net_device *netdev)
 }
 #endif
 
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+static int ixgbe_ndo_mp_port_prep(struct net_device *dev, struct mp_port *port)
+{
+	port->hdr_len = 128;
+	port->data_len = 2048;
+	port->npages = 1;
+	return 0;
+}
+#endif
+
 static const struct net_device_ops ixgbe_netdev_ops = {
 	.ndo_open		= ixgbe_open,
 	.ndo_stop		= ixgbe_close,
@@ -6548,6 +6574,9 @@ static const struct net_device_ops ixgbe_netdev_ops = {
 	.ndo_set_vf_vlan	= ixgbe_ndo_set_vf_vlan,
 	.ndo_set_vf_tx_rate	= ixgbe_ndo_set_vf_bw,
 	.ndo_get_vf_config	= ixgbe_ndo_get_vf_config,
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+	.ndo_mp_port_prep	= ixgbe_ndo_mp_port_prep,
+#endif
 #ifdef CONFIG_NET_POLL_CONTROLLER
 	.ndo_poll_controller	= ixgbe_netpoll,
 #endif
--
1.7.3
[PATCH v15 16/17] An example of how to modify a NIC driver to use the napi_gro_frags() interface
From: Xin Xiaohui xiaohui@intel.com This example is made on ixgbe driver. It provides API is_rx_buffer_mapped_as_page() to indicate if the driver use napi_gro_frags() interface or not. The example allocates 2 pages for DMA for one ring descriptor using netdev_alloc_page(). When packets is coming, using napi_gro_frags() to allocate skb and to receive the packets. Signed-off-by: Xin Xiaohui xiaohui@intel.com --- drivers/net/ixgbe/ixgbe.h |3 + drivers/net/ixgbe/ixgbe_main.c | 163 +++- 2 files changed, 131 insertions(+), 35 deletions(-) diff --git a/drivers/net/ixgbe/ixgbe.h b/drivers/net/ixgbe/ixgbe.h index 9e15eb9..89367ca 100644 --- a/drivers/net/ixgbe/ixgbe.h +++ b/drivers/net/ixgbe/ixgbe.h @@ -131,6 +131,9 @@ struct ixgbe_rx_buffer { struct page *page; dma_addr_t page_dma; unsigned int page_offset; + u16 mapped_as_page; + struct page *page_skb; + unsigned int page_skb_offset; }; struct ixgbe_queue_stats { diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c index e32af43..a4a5263 100644 --- a/drivers/net/ixgbe/ixgbe_main.c +++ b/drivers/net/ixgbe/ixgbe_main.c @@ -1029,6 +1029,12 @@ static inline void ixgbe_release_rx_desc(struct ixgbe_hw *hw, IXGBE_WRITE_REG(hw, IXGBE_RDT(rx_ring-reg_idx), val); } +static bool is_rx_buffer_mapped_as_page(struct ixgbe_rx_buffer *bi, + struct net_device *dev) +{ + return true; +} + /** * ixgbe_alloc_rx_buffers - Replace used receive buffers; packet split * @adapter: address of board private structure @@ -1045,13 +1051,17 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter, i = rx_ring-next_to_use; bi = rx_ring-rx_buffer_info[i]; + while (cleaned_count--) { rx_desc = IXGBE_RX_DESC_ADV(*rx_ring, i); + bi-mapped_as_page = + is_rx_buffer_mapped_as_page(bi, adapter-netdev); + if (!bi-page_dma (rx_ring-flags IXGBE_RING_RX_PS_ENABLED)) { if (!bi-page) { - bi-page = alloc_page(GFP_ATOMIC); + bi-page = netdev_alloc_page(adapter-netdev); if (!bi-page) { adapter-alloc_rx_page_failed++; goto no_buffers; 
@@ -1068,7 +1078,7 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter, DMA_FROM_DEVICE); } - if (!bi-skb) { + if (!bi-mapped_as_page !bi-skb) { struct sk_buff *skb; /* netdev_alloc_skb reserves 32 bytes up front!! */ uint bufsz = rx_ring-rx_buf_len + SMP_CACHE_BYTES; @@ -1088,6 +1098,19 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter, rx_ring-rx_buf_len, DMA_FROM_DEVICE); } + + if (bi-mapped_as_page !bi-page_skb) { + bi-page_skb = netdev_alloc_page(adapter-netdev); + if (!bi-page_skb) { + adapter-alloc_rx_page_failed++; + goto no_buffers; + } + bi-page_skb_offset = 0; + bi-dma = dma_map_page(pdev-dev, bi-page_skb, + bi-page_skb_offset, + (PAGE_SIZE / 2), + PCI_DMA_FROMDEVICE); + } /* Refresh the desc even if buffer_addrs didn't change because * each write-back erases this info. */ if (rx_ring-flags IXGBE_RING_RX_PS_ENABLED) { @@ -1165,6 +1188,13 @@ struct ixgbe_rsc_cb { bool delay_unmap; }; +static bool is_no_buffer(struct ixgbe_rx_buffer *rx_buffer_info) +{ + return (!rx_buffer_info-skb || + !rx_buffer_info-page_skb) + !rx_buffer_info-page; +} + #define IXGBE_RSC_CB(skb) ((struct ixgbe_rsc_cb *)(skb)-cb) static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector, @@ -1174,6 +1204,7 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector, struct ixgbe_adapter *adapter = q_vector-adapter; struct net_device *netdev = adapter-netdev; struct pci_dev *pdev = adapter-pdev; + struct napi_struct *napi = q_vector-napi; union ixgbe_adv_rx_desc *rx_desc, *next_rxd; struct ixgbe_rx_buffer *rx_buffer_info, *next_buffer; struct sk_buff *skb; @@ -1211,32 +1242,68 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector, len = le16_to_cpu(rx_desc-wb.upper.length);
[PATCH v15 00/17] Provide a zero-copy method on KVM virtio-net.
We provide a zero-copy method with which the driver side may get external buffers for DMA. Here "external" means the driver doesn't use kernel space to allocate skb buffers. Currently the external buffers come from the guest virtio-net driver.

The idea is simple: just pin the guest VM user space and then let the host NIC driver have the chance to DMA directly into it. The patches are based on the vhost-net backend driver. We add a device which provides proto_ops such as sendmsg/recvmsg to vhost-net to send/recv directly to/from the NIC driver. A KVM guest which uses the vhost-net backend may bind any ethX interface on the host side to get copyless data transfer through the guest virtio-net frontend.

patch 01-11:	net core and kernel changes.
patch 12-14:	new device as interface to manipulate external buffers.
patch 15:	for vhost-net.
patch 16:	an example of modifying a NIC driver to use napi_gro_frags().
patch 17:	an example of how to get guest buffers based on a driver which uses napi_gro_frags().

The guest virtio-net driver submits multiple requests through the vhost-net backend driver to the kernel, and the requests are queued and then completed after the corresponding actions in h/w are done.

For reads, user-space buffers are dispensed to the NIC driver for rx when a page constructor API is invoked; this means NICs can allocate user buffers from a page constructor. We add a hook in the netif_receive_skb() function to intercept the incoming packets and notify the zero-copy device.

For writes, the zero-copy device may allocate a new host skb, put the payload on skb_shinfo(skb)->frags, and copy the header to skb->data. The request remains pending until the skb is transmitted by h/w.

We provide multiple submits and asynchronous notification to vhost-net too.

Our goal is to improve the bandwidth and reduce the CPU usage. Exact performance data will be provided later.
What we have not done yet: Performance tuning what we have done in v1: polish the RCU usage deal with write logging in asynchroush mode in vhost add notifier block for mp device rename page_ctor to mp_port in netdevice.h to make it looks generic add mp_dev_change_flags() for mp device to change NIC state add CONIFG_VHOST_MPASSTHRU to limit the usage when module is not load a small fix for missing dev_put when fail using dynamic minor instead of static minor number a __KERNEL__ protect to mp_get_sock() what we have done in v2: remove most of the RCU usage, since the ctor pointer is only changed by BIND/UNBIND ioctl, and during that time, NIC will be stopped to get good cleanup(all outstanding requests are finished), so the ctor pointer cannot be raced into wrong situation. Remove the struct vhost_notifier with struct kiocb. Let vhost-net backend to alloc/free the kiocb and transfer them via sendmsg/recvmsg. use get_user_pages_fast() and set_page_dirty_lock() when read. Add some comments for netdev_mp_port_prep() and handle_mpassthru(). what we have done in v3: the async write logging is rewritten a drafted synchronous write function for qemu live migration a limit for locked pages from get_user_pages_fast() to prevent Dos by using RLIMIT_MEMLOCK what we have done in v4: add iocb completion callback from vhost-net to queue iocb in mp device replace vq-receiver by mp_sock_data_ready() remove stuff in mp device which access structures from vhost-net modify skb_reserve() to ignore host NIC driver reserved space rebase to the latest vhost tree split large patches into small pieces, especially for net core part. what we have done in v5: address Arnd Bergmann's comments -remove IFF_MPASSTHRU_EXCL flag in mp device -Add CONFIG_COMPAT macro -remove mp_release ops move dev_is_mpassthru() as inline func fix a bug in memory relinquish Apply to current git (2.6.34-rc6) tree. 
what we have done in v6:
   move create_iocb() out of page_dtor, which may happen in interrupt context
	-This removes the potential issue of a lock being taken in interrupt
	 context
   make the caches used by mp, vhost static, created/destroyed during
   module init/exit functions.
	-This makes it possible to create multiple mp guests at the same time.

what we have done in v7:
   some cleanup prepared to support PS mode

what we have done in v8:
   discard the modifications to point skb->data to the guest buffer directly.
   Add code to modify the driver to support napi_gro_frags() with Herbert's
   comments.
   To support PS mode. Add mergeable buffer support in mp device.
   Add GSO/GRO support in mp device.
   Address comments from Eric Dumazet about cache line and rcu usage.

what we have done in v9: v8
[PATCH v15 07/17] Modify netdev_alloc_page() to get external buffer
From: Xin Xiaohui xiaohui@intel.com

Currently, it can get external buffers from the mp device.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 net/core/skbuff.c |   27 +++++++++++++++++++++++++++
 1 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 68e197e..a1018bd 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -261,11 +261,38 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
 }
 EXPORT_SYMBOL(__netdev_alloc_skb);
 
+struct page *netdev_alloc_ext_pages(struct net_device *dev, int npages)
+{
+	struct mp_port *port;
+	struct skb_ext_page *ext_page = NULL;
+
+	port = dev->mp_port;
+	if (!port)
+		goto out;
+	ext_page = port->ctor(port, NULL, npages);
+	if (ext_page)
+		return ext_page->page;
+out:
+	return NULL;
+}
+EXPORT_SYMBOL(netdev_alloc_ext_pages);
+
+struct page *netdev_alloc_ext_page(struct net_device *dev)
+{
+	return netdev_alloc_ext_pages(dev, 1);
+}
+EXPORT_SYMBOL(netdev_alloc_ext_page);
+
 struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
 {
 	int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
 	struct page *page;
 
+	if (dev_is_mpassthru(dev))
+		return netdev_alloc_ext_page(dev);
+
 	page = alloc_pages_node(node, gfp_mask, 0);
 	return page;
 }
-- 
1.7.3
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v15 10/17] If device is in zero-copy mode first, bonding will fail.
From: Xin Xiaohui xiaohui@intel.com

If the device is in this zero-copy mode first, we cannot handle this, so fail it. This patch is for this. If bonding is created first, and one of the devices is later put into zero-copy mode, this will be handled by the mp device. It will first check if all the slaves have the zero-copy capability. If not, fail too. Otherwise make all the slaves enter zero-copy mode.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
---
 drivers/net/bonding/bond_main.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 3b16f62..dfb6a2c 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1428,6 +1428,10 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 			   bond_dev->name);
 	}
 
+	/* if the device is in zero-copy mode before bonding, fail it. */
+	if (dev_is_mpassthru(slave_dev))
+		return -EBUSY;
+
 	/* already enslaved */
 	if (slave_dev->flags & IFF_SLAVE) {
 		pr_debug("Error, Device was already enslaved\n");
-- 
1.7.3
[PATCH v15 12/17] Add header file for mp device.
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/mpassthru.h |  133 +++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 133 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/mpassthru.h

diff --git a/include/linux/mpassthru.h b/include/linux/mpassthru.h
new file mode 100644
index 000..1115f55
--- /dev/null
+++ b/include/linux/mpassthru.h
@@ -0,0 +1,133 @@
+#ifndef __MPASSTHRU_H
+#define __MPASSTHRU_H
+
+#include <linux/types.h>
+#include <linux/if_ether.h>
+#include <linux/ioctl.h>
+
+/* ioctl defines */
+#define MPASSTHRU_BINDDEV		_IOW('M', 213, int)
+#define MPASSTHRU_UNBINDDEV		_IO('M', 214)
+#define MPASSTHRU_SET_MEM_LOCKED	_IOW('M', 215, unsigned long)
+#define MPASSTHRU_GET_MEM_LOCKED_NEED	_IOR('M', 216, unsigned long)
+
+#define COPY_THRESHOLD	(L1_CACHE_BYTES * 4)
+#define COPY_HDR_LEN	(L1_CACHE_BYTES < 64 ? 64 : L1_CACHE_BYTES)
+
+#define DEFAULT_NEED	((8192*2*2)*4096)
+
+struct frag {
+	u16	offset;
+	u16	size;
+};
+
+#define HASH_BUCKETS	(8192*2)
+
+struct page_info {
+	struct list_head	list;
+	struct page_info	*next;
+	struct page_info	*prev;
+	struct page		*pages[MAX_SKB_FRAGS];
+	struct sk_buff		*skb;
+	struct page_pool	*pool;
+
+	/* The pointer relayed to the skb, to indicate
+	 * whether it's an externally allocated skb or from kernel
+	 */
+	struct skb_ext_page	ext_page;
+	/* flag to indicate read or write */
+#define INFO_READ	0
+#define INFO_WRITE	1
+	unsigned		flags;
+	/* exact number of locked pages */
+	unsigned		pnum;
+
+	/* The fields after that are for the backend
+	 * driver, now for vhost-net.
+	 */
+	/* the kiocb structure related to */
+	struct kiocb		*iocb;
+	/* the ring descriptor index */
+	unsigned int		desc_pos;
+	/* the iovecs coming from the backend, we only
+	 * need a few of them */
+	struct iovec		hdr[2];
+	struct iovec		iov[2];
+};
+
+struct page_pool {
+	/* the queue for rx side */
+	struct list_head	readq;
+	/* the lock to protect readq */
+	spinlock_t		read_lock;
+	/* record the original rlimit */
+	struct rlimit		o_rlim;
+	/* pages userspace wants locked */
+	int			locked_pages;
+	/* currently locked pages */
+	int			cur_pages;
+	/* the memory locked before */
+	unsigned long		orig_locked_vm;
+	/* the device this pool is attached to */
+	struct net_device	*dev;
+	/* the mp_port according to dev */
+	struct mp_port		port;
+	/* the hash_table list to find each locked page */
+	struct page_info	**hash_table;
+};
+
+static struct kmem_cache *ext_page_info_cache;
+
+#ifdef __KERNEL__
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+struct socket *mp_get_socket(struct file *);
+struct page_pool *page_pool_create(struct net_device *dev,
+				   struct socket *sock);
+int async_recvmsg(struct kiocb *iocb, struct page_pool *pool,
+		  struct iovec *iov, int count, int flags);
+int async_sendmsg(struct sock *sk, struct kiocb *iocb,
+		  struct page_pool *pool, struct iovec *iov,
+		  int count);
+void async_data_ready(struct sock *sk, struct page_pool *pool);
+void dev_change_state(struct net_device *dev);
+void page_pool_destroy(struct mm_struct *mm, struct page_pool *pool);
+#else
+#include <linux/err.h>
+#include <linux/errno.h>
+struct file;
+struct socket;
+static inline struct socket *mp_get_socket(struct file *f)
+{
+	return ERR_PTR(-EINVAL);
+}
+static inline struct page_pool *page_pool_create(struct net_device *dev,
+						 struct socket *sock)
+{
+	return ERR_PTR(-EINVAL);
+}
+static inline int async_recvmsg(struct kiocb *iocb, struct page_pool *pool,
+				struct iovec *iov, int count, int flags)
+{
+	return -EINVAL;
+}
+static inline int async_sendmsg(struct sock *sk, struct kiocb *iocb,
+				struct page_pool *pool, struct iovec *iov,
+				int count)
+{
+	return -EINVAL;
+}
+static inline void async_data_ready(struct sock *sk, struct page_pool *pool)
+{
+	return;
+}
+static inline void dev_change_state(struct net_device *dev)
+{
+	return;
+}
+static inline void page_pool_destroy(struct mm_struct *mm,
+				     struct page_pool *pool)
+{
+	return;
+}
+#endif /* CONFIG_MEDIATE_PASSTHRU */
+#endif /* __KERNEL__ */
+#endif /* __MPASSTHRU_H */
-- 
1.7.3
[PATCH v15 14/17] Add a kconfig entry and make entry for mp device.
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 drivers/vhost/Kconfig  |   10 ++++++++++
 drivers/vhost/Makefile |    2 ++
 2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index e4e2fd1..a6b8cbf 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -9,3 +9,13 @@ config VHOST_NET
 	  To compile this driver as a module, choose M here: the module will
 	  be called vhost_net.
 
+config MEDIATE_PASSTHRU
+	tristate "mediate passthru network driver (EXPERIMENTAL)"
+	depends on VHOST_NET
+	---help---
+	  Zero-copy network I/O support; we call it "mediate passthru" to
+	  distinguish it from hardware passthru.
+
+	  To compile this driver as a module, choose M here: the module will
+	  be called mpassthru.
+
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 72dd020..c18b9fc 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -1,2 +1,4 @@
 obj-$(CONFIG_VHOST_NET) += vhost_net.o
 vhost_net-y := vhost.o net.o
+
+obj-$(CONFIG_MEDIATE_PASSTHRU) += mpassthru.o
-- 
1.7.3
[PATCH v15 15/17] Provides multiple submits and asynchronous notifications.
From: Xin Xiaohui xiaohui@intel.com

The vhost-net backend now only supports synchronous send/recv operations. The patch provides multiple submits and asynchronous notifications. This is needed for the zero-copy case.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
---
 drivers/vhost/net.c   |  355 +++++++++++++++++++++++++++++++++++++++++++++----
 drivers/vhost/vhost.c |   78 +++++++++--
 drivers/vhost/vhost.h |   15 ++-
 3 files changed, 423 insertions(+), 25 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 7c80082..17c599a 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -24,6 +24,8 @@
 #include <linux/if_arp.h>
 #include <linux/if_tun.h>
 #include <linux/if_macvlan.h>
+#include <linux/mpassthru.h>
+#include <linux/aio.h>
 
 #include <net/sock.h>
 
@@ -32,6 +34,7 @@
 /* Max number of bytes transferred before requeueing the job.
  * Using this limit prevents one virtqueue from starving others. */
 #define VHOST_NET_WEIGHT 0x80000
+static struct kmem_cache *notify_cache;
 
 enum {
 	VHOST_NET_VQ_RX = 0,
@@ -49,6 +52,7 @@ struct vhost_net {
 	struct vhost_dev dev;
 	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
 	struct vhost_poll poll[VHOST_NET_VQ_MAX];
+	struct kmem_cache *cache;
 	/* Tells us whether we are polling a socket for TX.
 	 * We only do this when socket buffer fills up.
 	 * Protected by tx vq lock. */
@@ -109,11 +113,184 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
 	net->tx_poll_state = VHOST_NET_POLL_STARTED;
 }
 
+struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
+{
+	struct kiocb *iocb = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vq->notify_lock, flags);
+	if (!list_empty(&vq->notifier)) {
+		iocb = list_first_entry(&vq->notifier,
+				struct kiocb, ki_list);
+		list_del(&iocb->ki_list);
+	}
+	spin_unlock_irqrestore(&vq->notify_lock, flags);
+	return iocb;
+}
+
+static void handle_iocb(struct kiocb *iocb)
+{
+	struct vhost_virtqueue *vq = iocb->private;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vq->notify_lock, flags);
+	list_add_tail(&iocb->ki_list, &vq->notifier);
+	spin_unlock_irqrestore(&vq->notify_lock, flags);
+}
+
+static int is_async_vq(struct vhost_virtqueue *vq)
+{
+	return (vq->link_state == VHOST_VQ_LINK_ASYNC);
+}
+
+static void handle_async_rx_events_notify(struct vhost_net *net,
+					  struct vhost_virtqueue *vq,
+					  struct socket *sock)
+{
+	struct kiocb *iocb = NULL;
+	struct vhost_log *vq_log = NULL;
+	int rx_total_len = 0;
+	unsigned int head, log, in, out;
+	int size;
+
+	if (!is_async_vq(vq))
+		return;
+
+	if (sock->sk->sk_data_ready)
+		sock->sk->sk_data_ready(sock->sk, 0);
+
+	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
+		vq->log : NULL;
+
+	while ((iocb = notify_dequeue(vq)) != NULL) {
+		if (!iocb->ki_left) {
+			vhost_add_used_and_signal(&net->dev, vq,
+					iocb->ki_pos, iocb->ki_nbytes);
+			size = iocb->ki_nbytes;
+			head = iocb->ki_pos;
+			rx_total_len += iocb->ki_nbytes;
+
+			if (iocb->ki_dtor)
+				iocb->ki_dtor(iocb);
+			kmem_cache_free(net->cache, iocb);
+
+			/* when log is enabled, recomputing the log is needed,
+			 * since these buffers are in async queue, may not get
+			 * the log info before.
+			 */
+			if (unlikely(vq_log)) {
+				if (!log)
+					__vhost_get_vq_desc(&net->dev, vq,
+							vq->iov,
+							ARRAY_SIZE(vq->iov),
+							&out, &in, vq_log,
+							&log, head);
+				vhost_log_write(vq, vq_log, log, size);
+			}
+			if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
+				vhost_poll_queue(&vq->poll);
+				break;
+			}
+		} else {
+			int i = 0;
+			int count = iocb->ki_left;
+			int hc = count;
+			while (count--) {
+				if (iocb) {
+					vq->heads[i].id = iocb->ki_pos;
+					vq->heads[i].len = iocb->ki_nbytes;
+					size = iocb->ki_nbytes;
+					head
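The notifier list in the patch above is a simple producer/consumer FIFO: completed iocbs are appended under `vq->notify_lock` by `handle_iocb()` and drained one at a time by `notify_dequeue()`. A minimal userspace sketch of the same pattern, with a pthread mutex standing in for the spinlock and an illustrative node type (names are assumptions, not from the patch):

```c
#include <pthread.h>
#include <stddef.h>

/* A lock-protected singly linked FIFO, mirroring the
 * handle_iocb() / notify_dequeue() pair in the patch. */
struct notify_node {
	struct notify_node *next;
};

static struct notify_node *head, *tail;
static pthread_mutex_t notify_lock = PTHREAD_MUTEX_INITIALIZER;

/* producer side: append a completed request to the tail */
void notify_enqueue(struct notify_node *n)
{
	pthread_mutex_lock(&notify_lock);
	n->next = NULL;
	if (tail)
		tail->next = n;
	else
		head = n;
	tail = n;
	pthread_mutex_unlock(&notify_lock);
}

/* consumer side: detach one request from the head, or NULL if empty */
struct notify_node *notify_dequeue(void)
{
	struct notify_node *n;

	pthread_mutex_lock(&notify_lock);
	n = head;
	if (n) {
		head = n->next;
		if (!head)
			tail = NULL;
	}
	pthread_mutex_unlock(&notify_lock);
	return n;
}
```

Dequeuing one node per lock acquisition (instead of splicing the whole list) is what lets the kernel code bound work per pass with the `VHOST_NET_WEIGHT` check.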
[PATCH v15 11/17] Add a hook to intercept external buffers from NIC driver.
From: Xin Xiaohui xiaohui@intel.com

The hook is called in __netif_receive_skb().

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 net/core/dev.c |   40 ++++++++++++++++++++++++++++++++++++++++
 1 files changed, 40 insertions(+), 0 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 84fbb83..bdad1c8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2814,6 +2814,40 @@ int __skb_bond_should_drop(struct sk_buff *skb, struct net_device *master)
 }
 EXPORT_SYMBOL(__skb_bond_should_drop);
 
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+/* Add a hook to intercept mediate passthru(zero-copy) packets,
+ * and insert it to the socket queue owned by mp_port specially.
+ */
+static inline struct sk_buff *handle_mpassthru(struct sk_buff *skb,
+					       struct packet_type **pt_prev,
+					       int *ret,
+					       struct net_device *orig_dev)
+{
+	struct mp_port *mp_port = NULL;
+	struct sock *sk = NULL;
+
+	if (!dev_is_mpassthru(skb->dev) && !dev_is_mpassthru(orig_dev))
+		return skb;
+	if (dev_is_mpassthru(skb->dev))
+		mp_port = skb->dev->mp_port;
+	else if (orig_dev->master == skb->dev && dev_is_mpassthru(orig_dev))
+		mp_port = orig_dev->mp_port;
+
+	if (*pt_prev) {
+		*ret = deliver_skb(skb, *pt_prev, orig_dev);
+		*pt_prev = NULL;
+	}
+
+	sk = mp_port->sock->sk;
+	skb_queue_tail(&sk->sk_receive_queue, skb);
+	sk->sk_state_change(sk);
+
+	return NULL;
+}
+#else
+#define handle_mpassthru(skb, pt_prev, ret, orig_dev) (skb)
+#endif
+
 static int __netif_receive_skb(struct sk_buff *skb)
 {
 	struct packet_type *ptype, *pt_prev;
@@ -2891,6 +2925,11 @@ static int __netif_receive_skb(struct sk_buff *skb)
 ncls:
 #endif
 
+	/* To intercept mediate passthru(zero-copy) packets here */
+	skb = handle_mpassthru(skb, &pt_prev, &ret, orig_dev);
+	if (!skb)
+		goto out;
+
 	/* Handle special case of bridge or macvlan */
 	rx_handler = rcu_dereference(skb->dev->rx_handler);
 	if (rx_handler) {
@@ -2983,6 +3022,7 @@ err:
 EXPORT_SYMBOL(netdev_mp_port_prep);
 #endif
 
+
 /**
  *	netif_receive_skb - process receive buffer from network
  *	@skb: buffer to process
-- 
1.7.3
[PATCH v15 09/17] Don't do skb recycle, if device use external buffer.
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 net/core/skbuff.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3d81113..075f4c5 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -557,6 +557,12 @@ bool skb_recycle_check(struct sk_buff *skb, int skb_size)
 	if (skb_shared(skb) || skb_cloned(skb))
 		return false;
 
+	/* if the device wants to do mediate passthru, the skb may
+	 * get an external buffer, so don't recycle
+	 */
+	if (dev_is_mpassthru(skb->dev))
+		return 0;
+
 	skb_release_head_state(skb);
 
 	shinfo = skb_shinfo(skb);
-- 
1.7.3
[PATCH v15 08/17] Modify netdev_free_page() to release external buffer
From: Xin Xiaohui xiaohui@intel.com

Currently, it can get external buffers from the mp device.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/skbuff.h |    4 +++-
 net/core/skbuff.c      |   24 ++++++++++++++++++++++++
 2 files changed, 27 insertions(+), 1 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 6e1e991..6309ce6 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1586,9 +1586,11 @@ static inline struct page *netdev_alloc_page(struct net_device *dev)
 	return __netdev_alloc_page(dev, GFP_ATOMIC);
 }
 
+extern void __netdev_free_page(struct net_device *dev, struct page *page);
+
 static inline void netdev_free_page(struct net_device *dev, struct page *page)
 {
-	__free_page(page);
+	__netdev_free_page(dev, page);
 }
 
 /**
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index a1018bd..3d81113 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -298,6 +298,30 @@ struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(__netdev_alloc_page);
 
+void netdev_free_ext_page(struct net_device *dev, struct page *page)
+{
+	struct skb_ext_page *ext_page = NULL;
+	if (dev_is_mpassthru(dev) && dev->mp_port->hash) {
+		ext_page = dev->mp_port->hash(dev, page);
+		if (ext_page)
+			ext_page->dtor(ext_page);
+		else
+			__free_page(page);
+	}
+}
+EXPORT_SYMBOL(netdev_free_ext_page);
+
+void __netdev_free_page(struct net_device *dev, struct page *page)
+{
+	if (dev_is_mpassthru(dev)) {
+		netdev_free_ext_page(dev, page);
+		return;
+	}
+
+	__free_page(page);
+}
+EXPORT_SYMBOL(__netdev_free_page);
+
 void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
 		     int size)
 {
-- 
1.7.3
[PATCH v15 06/17] Use callback to deal with skb_release_data() specially.
From: Xin Xiaohui xiaohui@intel.com

If the buffer is external, then use the callback to destruct buffers.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/skbuff.h |    7 ++++---
 net/core/skbuff.c      |    7 +++++++
 2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 696e690..6e1e991 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -199,14 +199,15 @@ struct skb_shared_info {
 	struct sk_buff	*frag_list;
 	struct skb_shared_hwtstamps hwtstamps;
 
+	/* Intermediate layers must ensure that destructor_arg
+	 * remains valid until skb destructor */
+	void		*destructor_arg;
+
 	/*
 	 * Warning : all fields before dataref are cleared in __alloc_skb()
 	 */
 	atomic_t	dataref;
 
-	/* Intermediate layers must ensure that destructor_arg
-	 * remains valid until skb destructor */
-	void		*destructor_arg;
 	/* must be last field, see pskb_expand_head() */
 	skb_frag_t	frags[MAX_SKB_FRAGS];
 };
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index c83b421..68e197e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -343,6 +343,13 @@ static void skb_release_data(struct sk_buff *skb)
 		if (skb_has_frags(skb))
 			skb_drop_fraglist(skb);
 
+		if (skb->dev && dev_is_mpassthru(skb->dev)) {
+			struct skb_ext_page *ext_page =
+				skb_shinfo(skb)->destructor_arg;
+			if (ext_page && ext_page->dtor)
+				ext_page->dtor(ext_page);
+		}
+
 		kfree(skb->head);
 	}
 }
-- 
1.7.3
[PATCH v15 05/17] Add a function to indicate if device use external buffer.
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/netdevice.h |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 8dcf6de..f91d9bb 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1739,6 +1739,11 @@ extern gro_result_t	napi_gro_frags(struct napi_struct *napi);
 extern int		netdev_mp_port_prep(struct net_device *dev,
 					    struct mp_port *port);
 
+static inline bool dev_is_mpassthru(struct net_device *dev)
+{
+	return dev && dev->mp_port;
+}
+
 static inline void napi_free_frags(struct napi_struct *napi)
 {
 	kfree_skb(napi->skb);
-- 
1.7.3
[PATCH v15 03/17] Add a ndo_mp_port_prep pointer to net_device_ops.
From: Xin Xiaohui xiaohui@intel.com

If the driver wants to allocate external buffers, it can export its capability, such as the skb buffer header length, the page length that can be DMA'd, etc. The external buffers owner may utilize this.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/netdevice.h |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f6b1870..575777f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -723,6 +723,12 @@ struct netdev_rx_queue {
  * int (*ndo_set_vf_port)(struct net_device *dev, int vf,
  *			  struct nlattr *port[]);
  * int (*ndo_get_vf_port)(struct net_device *dev, int vf, struct sk_buff *skb);
+ *
+ * int (*ndo_mp_port_prep)(struct net_device *dev, struct mp_port *port);
+ *	If the driver wants to allocate external buffers,
+ *	it can export its capability, such as the skb
+ *	buffer header length, the page length that can be DMA'd, etc.
+ *	The external buffers owner may utilize this.
  */
 #define HAVE_NET_DEVICE_OPS
 struct net_device_ops {
@@ -795,6 +801,10 @@ struct net_device_ops {
 	int			(*ndo_fcoe_get_wwn)(struct net_device *dev,
 						    u64 *wwn, int type);
 #endif
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+	int			(*ndo_mp_port_prep)(struct net_device *dev,
+						    struct mp_port *port);
+#endif
 };
 
 /*
-- 
1.7.3
[PATCH v15 01/17] Add a new structure for skb buffer from external.
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/skbuff.h |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 77eb60d..696e690 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -211,6 +211,15 @@ struct skb_shared_info {
 	skb_frag_t	frags[MAX_SKB_FRAGS];
 };
 
+/* The structure is for an skb whose pages may point to
+ * an external buffer, which is not allocated from kernel space.
+ * It also contains a destructor for itself.
+ */
+struct skb_ext_page {
+	struct page	*page;
+	void		(*dtor)(struct skb_ext_page *);
+};
+
 /* We divide dataref into two halves.  The higher 16 bits hold references
  * to the payload part of skb->data.  The lower 16 bits hold references to
  * the entire skb->data.  A clone of a headerless skb holds the length of
-- 
1.7.3
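The `skb_ext_page` pattern above attaches an owner-supplied destructor to an externally allocated buffer, so the generic free path can hand the buffer back to its owner instead of calling `__free_page()`. A minimal userspace sketch of the same idea, with purely illustrative names (`ext_buf`, `release_buf` are not from the patch):

```c
/* A buffer descriptor that carries its own destructor, mirroring
 * struct skb_ext_page: { page, dtor }. */
struct ext_buf {
	void *page;                     /* the externally owned buffer */
	void (*dtor)(struct ext_buf *); /* owner-provided release hook */
};

static int owner_release_count;

/* the owner's destructor: reclaim the buffer instead of freeing it */
static void owner_dtor(struct ext_buf *b)
{
	(void)b;
	owner_release_count++;
}

/* analogue of the skb_release_data() hook: call the dtor if attached,
 * otherwise the caller would fall back to the normal free path */
void release_buf(struct ext_buf *b)
{
	if (b && b->dtor)
		b->dtor(b);
}
```

The key design point is that the free path never needs to know who owns the buffer; ownership travels with the descriptor.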
Re: [PATCH v14 06/17] Use callback to deal with skb_release_data() specially.
From: Xin Xiaohui xiaohui@intel.com

> Hmm, I suggest you read the comment two lines above.
> If destructor_arg is now cleared each time we allocate a new skb, then,
> please move it before dataref in shinfo structure, so that the following
> memset() does the job efficiently...
> Something like :
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index e6ba898..2dca504 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -195,6 +195,9 @@ struct skb_shared_info {
>  	__be32		ip6_frag_id;
>  	__u8		tx_flags;
>  	struct sk_buff	*frag_list;
> +	/* Intermediate layers must ensure that destructor_arg
> +	 * remains valid until skb destructor */
> +	void		*destructor_arg;
>  	struct skb_shared_hwtstamps hwtstamps;
>
>  	/*
> @@ -202,9 +205,6 @@ struct skb_shared_info {
>  	 */
>  	atomic_t	dataref;
>
> -	/* Intermediate layers must ensure that destructor_arg
> -	 * remains valid until skb destructor */
> -	void		*destructor_arg;
>  	/* must be last field, see pskb_expand_head() */
>  	skb_frag_t	frags[MAX_SKB_FRAGS];
>  };

Will that affect the cache line? Or, we can move the line that clears destructor_arg to the end of __alloc_skb(). It would look like the following; which one do you prefer?

Thanks
Xiaohui

---
 net/core/skbuff.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index c83b421..df852f2 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -224,6 +224,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 		child->fclone = SKB_FCLONE_UNAVAILABLE;
 	}
+	shinfo->destructor_arg = NULL;
 out:
 	return skb;
 nodata:
@@ -343,6 +344,13 @@ static void skb_release_data(struct sk_buff *skb)
 		if (skb_has_frags(skb))
 			skb_drop_fraglist(skb);
 
+		if (skb->dev && dev_is_mpassthru(skb->dev)) {
+			struct skb_ext_page *ext_page =
+				skb_shinfo(skb)->destructor_arg;
+			if (ext_page && ext_page->dtor)
+				ext_page->dtor(ext_page);
+		}
+
 		kfree(skb->head);
 	}
 }
-- 
1.7.3
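The exchange above hinges on how `__alloc_skb()` clears `skb_shared_info`: one `memset()` zeroes everything up to (but not including) `dataref`, so a field placed before the `dataref` marker is cleared for free, while one placed after keeps stale memory. The sketch below demonstrates the `offsetof()`-bounded memset idiom; the struct is an illustrative stand-in, not the real `skb_shared_info`.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative layout: destructor_arg sits before the dataref marker,
 * as in the suggested field move. */
struct shared_info {
	int nr_frags;
	void *destructor_arg;	/* before dataref: zeroed by the memset */
	int dataref;		/* marker: everything before it is cleared */
	long frags[4];		/* after the marker: not touched */
};

/* the __alloc_skb()-style bulk clear: one memset up to the marker */
void init_shared_info(struct shared_info *s)
{
	memset(s, 0, offsetof(struct shared_info, dataref));
	s->dataref = 1;
}
```

This is also why the alternative (an explicit `shinfo->destructor_arg = NULL;` after the memset) works but costs an extra store, and why Eric's concern in the quoted mail is only about which cache line the moved field lands on.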
[PATCH v14 05/17] Add a function to indicate if device use external buffer.
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/netdevice.h |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 8dcf6de..f91d9bb 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1739,6 +1739,11 @@ extern gro_result_t	napi_gro_frags(struct napi_struct *napi);
 extern int		netdev_mp_port_prep(struct net_device *dev,
 					    struct mp_port *port);
 
+static inline bool dev_is_mpassthru(struct net_device *dev)
+{
+	return dev && dev->mp_port;
+}
+
 static inline void napi_free_frags(struct napi_struct *napi)
 {
 	kfree_skb(napi->skb);
-- 
1.7.3
[PATCH v14 08/17] Modify netdev_free_page() to release external buffer
From: Xin Xiaohui xiaohui@intel.com

Currently, it can get external buffers from the mp device.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/skbuff.h |    4 +++-
 net/core/skbuff.c      |   24 ++++++++++++++++++++++++
 2 files changed, 27 insertions(+), 1 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 696e690..8cfde3e 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1585,9 +1585,11 @@ static inline struct page *netdev_alloc_page(struct net_device *dev)
 	return __netdev_alloc_page(dev, GFP_ATOMIC);
 }
 
+extern void __netdev_free_page(struct net_device *dev, struct page *page);
+
 static inline void netdev_free_page(struct net_device *dev, struct page *page)
 {
-	__free_page(page);
+	__netdev_free_page(dev, page);
 }
 
 /**
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f39d372..02439e0 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -299,6 +299,30 @@ struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(__netdev_alloc_page);
 
+void netdev_free_ext_page(struct net_device *dev, struct page *page)
+{
+	struct skb_ext_page *ext_page = NULL;
+	if (dev_is_mpassthru(dev) && dev->mp_port->hash) {
+		ext_page = dev->mp_port->hash(dev, page);
+		if (ext_page)
+			ext_page->dtor(ext_page);
+		else
+			__free_page(page);
+	}
+}
+EXPORT_SYMBOL(netdev_free_ext_page);
+
+void __netdev_free_page(struct net_device *dev, struct page *page)
+{
+	if (dev_is_mpassthru(dev)) {
+		netdev_free_ext_page(dev, page);
+		return;
+	}
+
+	__free_page(page);
+}
+EXPORT_SYMBOL(__netdev_free_page);
+
 void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
 		     int size)
 {
-- 
1.7.3
[PATCH v14 09/17] Don't do skb recycle, if device use external buffer.
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 net/core/skbuff.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 02439e0..196aa99 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -558,6 +558,12 @@ bool skb_recycle_check(struct sk_buff *skb, int skb_size)
 	if (skb_shared(skb) || skb_cloned(skb))
 		return false;
 
+	/* if the device wants to do mediate passthru, the skb may
+	 * get an external buffer, so don't recycle
+	 */
+	if (dev_is_mpassthru(skb->dev))
+		return 0;
+
 	skb_release_head_state(skb);
 
 	shinfo = skb_shinfo(skb);
-- 
1.7.3
[PATCH v14 10/17] If device is in zero-copy mode first, bonding will fail.
From: Xin Xiaohui xiaohui@intel.com

If the device is in this zero-copy mode first, we cannot handle this, so fail it. This patch is for this. If bonding is created first, and one of the devices is later put into zero-copy mode, this will be handled by the mp device. It will first check if all the slaves have the zero-copy capability. If not, fail too. Otherwise make all the slaves enter zero-copy mode.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
---
 drivers/net/bonding/bond_main.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 3b16f62..dfb6a2c 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1428,6 +1428,10 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 			   bond_dev->name);
 	}
 
+	/* if the device is in zero-copy mode before bonding, fail it. */
+	if (dev_is_mpassthru(slave_dev))
+		return -EBUSY;
+
 	/* already enslaved */
 	if (slave_dev->flags & IFF_SLAVE) {
 		pr_debug("Error, Device was already enslaved\n");
-- 
1.7.3
[PATCH v14 13/17] Add mp(mediate passthru) device.
From: Xin Xiaohui xiaohui@intel.com The patch add mp(mediate passthru) device, which now based on vhost-net backend driver and provides proto_ops to send/receive guest buffers data from/to guest vitio-net driver. It also exports async functions which can be used by other drivers like macvtap to utilize zero-copy too. Signed-off-by: Xin Xiaohui xiaohui@intel.com Signed-off-by: Zhao Yu yzhao81...@gmail.com Reviewed-by: Jeff Dike jd...@linux.intel.com --- drivers/vhost/mpassthru.c | 1515 + 1 files changed, 1515 insertions(+), 0 deletions(-) create mode 100644 drivers/vhost/mpassthru.c diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c new file mode 100644 index 000..492430c --- /dev/null +++ b/drivers/vhost/mpassthru.c @@ -0,0 +1,1515 @@ +/* + * MPASSTHRU - Mediate passthrough device. + * Copyright (C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + */ + +#define DRV_NAMEmpassthru +#define DRV_DESCRIPTION Mediate passthru device driver +#define DRV_COPYRIGHT (C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G + +#include linux/compat.h +#include linux/module.h +#include linux/errno.h +#include linux/kernel.h +#include linux/major.h +#include linux/slab.h +#include linux/smp_lock.h +#include linux/poll.h +#include linux/fcntl.h +#include linux/init.h +#include linux/aio.h + +#include linux/skbuff.h +#include linux/netdevice.h +#include linux/etherdevice.h +#include linux/miscdevice.h +#include linux/ethtool.h +#include linux/rtnetlink.h +#include linux/if.h +#include linux/if_arp.h +#include linux/if_ether.h +#include linux/crc32.h +#include linux/nsproxy.h +#include linux/uaccess.h +#include linux/virtio_net.h +#include linux/mpassthru.h +#include net/net_namespace.h +#include net/netns/generic.h +#include net/rtnetlink.h +#include net/sock.h + +#include asm/system.h +#include ../net/bonding/bonding.h + +struct mp_struct { + struct mp_file *mfile; + struct net_device *dev; + struct page_pool*pool; + struct socket socket; + struct socket_wqwq; + struct mm_struct*mm; +}; + +struct mp_file { + atomic_t count; + struct mp_struct *mp; + struct net *net; +}; + +struct mp_sock { + struct sock sk; + struct mp_struct*mp; +}; + +/* The main function to allocate external buffers */ +static struct skb_ext_page *page_ctor(struct mp_port *port, + struct sk_buff *skb, + int npages) +{ + int i; + unsigned long flags; + struct page_pool *pool; + struct page_info *info = NULL; + + if (npages != 1) + BUG(); + pool = container_of(port, struct page_pool, port); + + spin_lock_irqsave(pool-read_lock, flags); + if (!list_empty(pool-readq)) { + info = list_first_entry(pool-readq, struct page_info, list); + list_del(info-list); + } + spin_unlock_irqrestore(pool-read_lock, flags); + if (!info) + return NULL; + + for (i = 0; i info-pnum; i++) + get_page(info-pages[i]); + info-skb = skb; + return info-ext_page; +} + +static struct 
page_info *mp_hash_lookup(struct page_pool *pool, + struct page *page); +static struct page_info *mp_hash_delete(struct page_pool *pool, + struct page_info *info); + +static struct skb_ext_page *mp_lookup(struct net_device *dev, + struct page *page) +{ + struct mp_struct *mp = + container_of(dev-mp_port-sock-sk, struct mp_sock, sk)-mp; + struct page_pool *pool = mp-pool; + struct page_info *info; + + info = mp_hash_lookup(pool, page); + if (!info) + return NULL; + return info-ext_page; +} + +struct page_pool *page_pool_create(struct net_device *dev, + struct socket *sock) +{ + struct page_pool *pool; + struct net_device *master; + struct slave *slave; + struct bonding *bond; + int i; + int rc; + + pool = kzalloc(sizeof(*pool), GFP_KERNEL); + if (!pool) + return NULL; + + /* How to deal with bonding device: +* check if all the slaves are capable of zero-copy. +* if not, fail. +
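The page_ctor() callback above pops a posted guest buffer off the pool's read queue under the pool lock and then takes a reference on each of its pages before handing it to the skb. A minimal userspace sketch of that free-list pattern follows; it is an illustration only, not the driver code: the names buf_info/buf_pool are hypothetical, the refcount array stands in for get_page(), and locking is omitted for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical analogue of struct page_info on the pool's readq. */
struct buf_info {
    struct buf_info *next;   /* singly linked free list */
    int refcnt[4];           /* mirrors per-page get_page() counts */
    int pnum;                /* number of pages held, <= 4 here */
};

struct buf_pool {
    struct buf_info *readq;  /* posted-but-unused guest buffers */
};

/* Pop the first posted buffer and pin its pages, as page_ctor() does.
 * Returns NULL when the guest has not posted any buffers. */
static struct buf_info *pool_ctor(struct buf_pool *pool)
{
    struct buf_info *info = pool->readq;
    int i;

    if (!info)
        return NULL;
    pool->readq = info->next;
    for (i = 0; i < info->pnum; i++)
        info->refcnt[i]++;   /* stands in for get_page(info->pages[i]) */
    return info;
}
```

The real code does the list manipulation under spin_lock_irqsave() because NIC rx runs in softirq context; the single-threaded sketch only shows the dequeue-and-pin logic.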
[PATCH v14 14/17] Add a Kconfig entry and Makefile entry for the mp device.
From: Xin Xiaohui xiaohui@intel.com Signed-off-by: Xin Xiaohui xiaohui@intel.com Reviewed-by: Jeff Dike jd...@linux.intel.com --- drivers/vhost/Kconfig | 10 ++ drivers/vhost/Makefile | 2 ++ 2 files changed, 12 insertions(+), 0 deletions(-) diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig index e4e2fd1..a6b8cbf 100644 --- a/drivers/vhost/Kconfig +++ b/drivers/vhost/Kconfig @@ -9,3 +9,13 @@ config VHOST_NET To compile this driver as a module, choose M here: the module will be called vhost_net. +config MEDIATE_PASSTHRU + tristate "mediate passthru network driver (EXPERIMENTAL)" + depends on VHOST_NET + ---help--- + zero-copy network I/O support; we call it mediate passthru to + distinguish it from hardware passthru. + + To compile this driver as a module, choose M here: the module will + be called mpassthru. + diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile index 72dd020..c18b9fc 100644 --- a/drivers/vhost/Makefile +++ b/drivers/vhost/Makefile @@ -1,2 +1,4 @@ obj-$(CONFIG_VHOST_NET) += vhost_net.o vhost_net-y := vhost.o net.o + +obj-$(CONFIG_MEDIATE_PASSTHRU) += mpassthru.o -- 1.7.3
[PATCH v14 17/17] An example of how to allocate user buffers based on the napi_gro_frags() interface.
From: Xin Xiaohui xiaohui@intel.com This example is made on ixgbe driver which using napi_gro_frags(). It can get buffers from guest side directly using netdev_alloc_page() and release guest buffers using netdev_free_page(). Signed-off-by: Xin Xiaohui xiaohui@intel.com --- drivers/net/ixgbe/ixgbe_main.c | 37 + 1 files changed, 33 insertions(+), 4 deletions(-) diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c index a4a5263..9f5598b 100644 --- a/drivers/net/ixgbe/ixgbe_main.c +++ b/drivers/net/ixgbe/ixgbe_main.c @@ -1032,7 +1032,14 @@ static inline void ixgbe_release_rx_desc(struct ixgbe_hw *hw, static bool is_rx_buffer_mapped_as_page(struct ixgbe_rx_buffer *bi, struct net_device *dev) { - return true; + return dev_is_mpassthru(dev); +} + +static u32 get_page_skb_offset(struct net_device *dev) +{ + if (!dev_is_mpassthru(dev)) + return 0; + return dev-mp_port-vnet_hlen; } /** @@ -1105,7 +1112,8 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter, adapter-alloc_rx_page_failed++; goto no_buffers; } - bi-page_skb_offset = 0; + bi-page_skb_offset = + get_page_skb_offset(adapter-netdev); bi-dma = dma_map_page(pdev-dev, bi-page_skb, bi-page_skb_offset, (PAGE_SIZE / 2), @@ -1242,8 +1250,10 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector, len = le16_to_cpu(rx_desc-wb.upper.length); } - if (is_no_buffer(rx_buffer_info)) + if (is_no_buffer(rx_buffer_info)) { + printk(no buffers\n); break; + } cleaned = true; if (!rx_buffer_info-mapped_as_page) { @@ -1299,6 +1309,11 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector, rx_buffer_info-page_skb, rx_buffer_info-page_skb_offset, len); + if (dev_is_mpassthru(netdev) + netdev-mp_port-hash) + skb_shinfo(skb)-destructor_arg = + netdev-mp_port-hash(netdev, + rx_buffer_info-page_skb); rx_buffer_info-page_skb = NULL; skb-len += len; skb-data_len += len; @@ -1316,7 +1331,8 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector, upper_len); if 
((rx_ring-rx_buf_len (PAGE_SIZE / 2)) || - (page_count(rx_buffer_info-page) != 1)) + (page_count(rx_buffer_info-page) != 1) || + dev_is_mpassthru(netdev)) rx_buffer_info-page = NULL; else get_page(rx_buffer_info-page); @@ -6529,6 +6545,16 @@ static void ixgbe_netpoll(struct net_device *netdev) } #endif +#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE) +static int ixgbe_ndo_mp_port_prep(struct net_device *dev, struct mp_port *port) +{ + port-hdr_len = 128; + port-data_len = 2048; + port-npages = 1; + return 0; +} +#endif + static const struct net_device_ops ixgbe_netdev_ops = { .ndo_open = ixgbe_open, .ndo_stop = ixgbe_close, @@ -6548,6 +6574,9 @@ static const struct net_device_ops ixgbe_netdev_ops = { .ndo_set_vf_vlan= ixgbe_ndo_set_vf_vlan, .ndo_set_vf_tx_rate = ixgbe_ndo_set_vf_bw, .ndo_get_vf_config = ixgbe_ndo_get_vf_config, +#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE) + .ndo_mp_port_prep = ixgbe_ndo_mp_port_prep, +#endif #ifdef CONFIG_NET_POLL_CONTROLLER .ndo_poll_controller= ixgbe_netpoll, #endif -- 1.7.3 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v14 16/17] An example of how to modify a NIC driver to use the napi_gro_frags() interface
From: Xin Xiaohui xiaohui@intel.com This example is made on ixgbe driver. It provides API is_rx_buffer_mapped_as_page() to indicate if the driver use napi_gro_frags() interface or not. The example allocates 2 pages for DMA for one ring descriptor using netdev_alloc_page(). When packets is coming, using napi_gro_frags() to allocate skb and to receive the packets. Signed-off-by: Xin Xiaohui xiaohui@intel.com --- drivers/net/ixgbe/ixgbe.h |3 + drivers/net/ixgbe/ixgbe_main.c | 163 +++- 2 files changed, 131 insertions(+), 35 deletions(-) diff --git a/drivers/net/ixgbe/ixgbe.h b/drivers/net/ixgbe/ixgbe.h index 9e15eb9..89367ca 100644 --- a/drivers/net/ixgbe/ixgbe.h +++ b/drivers/net/ixgbe/ixgbe.h @@ -131,6 +131,9 @@ struct ixgbe_rx_buffer { struct page *page; dma_addr_t page_dma; unsigned int page_offset; + u16 mapped_as_page; + struct page *page_skb; + unsigned int page_skb_offset; }; struct ixgbe_queue_stats { diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c index e32af43..a4a5263 100644 --- a/drivers/net/ixgbe/ixgbe_main.c +++ b/drivers/net/ixgbe/ixgbe_main.c @@ -1029,6 +1029,12 @@ static inline void ixgbe_release_rx_desc(struct ixgbe_hw *hw, IXGBE_WRITE_REG(hw, IXGBE_RDT(rx_ring-reg_idx), val); } +static bool is_rx_buffer_mapped_as_page(struct ixgbe_rx_buffer *bi, + struct net_device *dev) +{ + return true; +} + /** * ixgbe_alloc_rx_buffers - Replace used receive buffers; packet split * @adapter: address of board private structure @@ -1045,13 +1051,17 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter, i = rx_ring-next_to_use; bi = rx_ring-rx_buffer_info[i]; + while (cleaned_count--) { rx_desc = IXGBE_RX_DESC_ADV(*rx_ring, i); + bi-mapped_as_page = + is_rx_buffer_mapped_as_page(bi, adapter-netdev); + if (!bi-page_dma (rx_ring-flags IXGBE_RING_RX_PS_ENABLED)) { if (!bi-page) { - bi-page = alloc_page(GFP_ATOMIC); + bi-page = netdev_alloc_page(adapter-netdev); if (!bi-page) { adapter-alloc_rx_page_failed++; goto no_buffers; 
@@ -1068,7 +1078,7 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter, DMA_FROM_DEVICE); } - if (!bi-skb) { + if (!bi-mapped_as_page !bi-skb) { struct sk_buff *skb; /* netdev_alloc_skb reserves 32 bytes up front!! */ uint bufsz = rx_ring-rx_buf_len + SMP_CACHE_BYTES; @@ -1088,6 +1098,19 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter, rx_ring-rx_buf_len, DMA_FROM_DEVICE); } + + if (bi-mapped_as_page !bi-page_skb) { + bi-page_skb = netdev_alloc_page(adapter-netdev); + if (!bi-page_skb) { + adapter-alloc_rx_page_failed++; + goto no_buffers; + } + bi-page_skb_offset = 0; + bi-dma = dma_map_page(pdev-dev, bi-page_skb, + bi-page_skb_offset, + (PAGE_SIZE / 2), + PCI_DMA_FROMDEVICE); + } /* Refresh the desc even if buffer_addrs didn't change because * each write-back erases this info. */ if (rx_ring-flags IXGBE_RING_RX_PS_ENABLED) { @@ -1165,6 +1188,13 @@ struct ixgbe_rsc_cb { bool delay_unmap; }; +static bool is_no_buffer(struct ixgbe_rx_buffer *rx_buffer_info) +{ + return (!rx_buffer_info-skb || + !rx_buffer_info-page_skb) + !rx_buffer_info-page; +} + #define IXGBE_RSC_CB(skb) ((struct ixgbe_rsc_cb *)(skb)-cb) static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector, @@ -1174,6 +1204,7 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector, struct ixgbe_adapter *adapter = q_vector-adapter; struct net_device *netdev = adapter-netdev; struct pci_dev *pdev = adapter-pdev; + struct napi_struct *napi = q_vector-napi; union ixgbe_adv_rx_desc *rx_desc, *next_rxd; struct ixgbe_rx_buffer *rx_buffer_info, *next_buffer; struct sk_buff *skb; @@ -1211,32 +1242,68 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector, len = le16_to_cpu(rx_desc-wb.upper.length);
[PATCH v14 15/17] Provide multiple submits and asynchronous notifications.
From: Xin Xiaohui xiaohui@intel.com The vhost-net backend now only supports synchronous send/recv operations. The patch provides multiple submits and asynchronous notifications. This is needed for zero-copy case. Signed-off-by: Xin Xiaohui xiaohui@intel.com --- drivers/vhost/net.c | 355 + drivers/vhost/vhost.c | 78 +++ drivers/vhost/vhost.h | 15 ++- 3 files changed, 423 insertions(+), 25 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 7c80082..17c599a 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -24,6 +24,8 @@ #include linux/if_arp.h #include linux/if_tun.h #include linux/if_macvlan.h +#include linux/mpassthru.h +#include linux/aio.h #include net/sock.h @@ -32,6 +34,7 @@ /* Max number of bytes transferred before requeueing the job. * Using this limit prevents one virtqueue from starving others. */ #define VHOST_NET_WEIGHT 0x8 +static struct kmem_cache *notify_cache; enum { VHOST_NET_VQ_RX = 0, @@ -49,6 +52,7 @@ struct vhost_net { struct vhost_dev dev; struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX]; struct vhost_poll poll[VHOST_NET_VQ_MAX]; + struct kmem_cache *cache; /* Tells us whether we are polling a socket for TX. * We only do this when socket buffer fills up. * Protected by tx vq lock. 
*/ @@ -109,11 +113,184 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock) net-tx_poll_state = VHOST_NET_POLL_STARTED; } +struct kiocb *notify_dequeue(struct vhost_virtqueue *vq) +{ + struct kiocb *iocb = NULL; + unsigned long flags; + + spin_lock_irqsave(vq-notify_lock, flags); + if (!list_empty(vq-notifier)) { + iocb = list_first_entry(vq-notifier, + struct kiocb, ki_list); + list_del(iocb-ki_list); + } + spin_unlock_irqrestore(vq-notify_lock, flags); + return iocb; +} + +static void handle_iocb(struct kiocb *iocb) +{ + struct vhost_virtqueue *vq = iocb-private; + unsigned long flags; + + spin_lock_irqsave(vq-notify_lock, flags); + list_add_tail(iocb-ki_list, vq-notifier); + spin_unlock_irqrestore(vq-notify_lock, flags); +} + +static int is_async_vq(struct vhost_virtqueue *vq) +{ + return (vq-link_state == VHOST_VQ_LINK_ASYNC); +} + +static void handle_async_rx_events_notify(struct vhost_net *net, + struct vhost_virtqueue *vq, + struct socket *sock) +{ + struct kiocb *iocb = NULL; + struct vhost_log *vq_log = NULL; + int rx_total_len = 0; + unsigned int head, log, in, out; + int size; + + if (!is_async_vq(vq)) + return; + + if (sock-sk-sk_data_ready) + sock-sk-sk_data_ready(sock-sk, 0); + + vq_log = unlikely(vhost_has_feature(net-dev, VHOST_F_LOG_ALL)) ? + vq-log : NULL; + + while ((iocb = notify_dequeue(vq)) != NULL) { + if (!iocb-ki_left) { + vhost_add_used_and_signal(net-dev, vq, + iocb-ki_pos, iocb-ki_nbytes); + size = iocb-ki_nbytes; + head = iocb-ki_pos; + rx_total_len += iocb-ki_nbytes; + + if (iocb-ki_dtor) + iocb-ki_dtor(iocb); + kmem_cache_free(net-cache, iocb); + + /* when log is enabled, recomputing the log is needed, +* since these buffers are in async queue, may not get +* the log info before. 
+*/ + if (unlikely(vq_log)) { + if (!log) + __vhost_get_vq_desc(net-dev, vq, + vq-iov, + ARRAY_SIZE(vq-iov), + out, in, vq_log, + log, head); + vhost_log_write(vq, vq_log, log, size); + } + if (unlikely(rx_total_len = VHOST_NET_WEIGHT)) { + vhost_poll_queue(vq-poll); + break; + } + } else { + int i = 0; + int count = iocb-ki_left; + int hc = count; + while (count--) { + if (iocb) { + vq-heads[i].id = iocb-ki_pos; + vq-heads[i].len = iocb-ki_nbytes; + size = iocb-ki_nbytes; + head
[PATCH v14 12/17] Add header file for mp device.
From: Xin Xiaohui xiaohui@intel.com Signed-off-by: Xin Xiaohui xiaohui@intel.com Signed-off-by: Zhao Yu yzhao81...@gmail.com Reviewed-by: Jeff Dike jd...@linux.intel.com --- include/linux/mpassthru.h | 133 + 1 files changed, 133 insertions(+), 0 deletions(-) create mode 100644 include/linux/mpassthru.h diff --git a/include/linux/mpassthru.h b/include/linux/mpassthru.h new file mode 100644 index 000..1115f55 --- /dev/null +++ b/include/linux/mpassthru.h @@ -0,0 +1,133 @@ +#ifndef __MPASSTHRU_H +#define __MPASSTHRU_H + +#include linux/types.h +#include linux/if_ether.h +#include linux/ioctl.h + +/* ioctl defines */ +#define MPASSTHRU_BINDDEV _IOW('M', 213, int) +#define MPASSTHRU_UNBINDDEV_IO('M', 214) +#define MPASSTHRU_SET_MEM_LOCKED _IOW('M', 215, unsigned long) +#define MPASSTHRU_GET_MEM_LOCKED_NEED _IOR('M', 216, unsigned long) + +#define COPY_THRESHOLD (L1_CACHE_BYTES * 4) +#define COPY_HDR_LEN (L1_CACHE_BYTES 64 ? 64 : L1_CACHE_BYTES) + +#define DEFAULT_NEED ((8192*2*2)*4096) + +struct frag { + u16 offset; + u16 size; +}; + +#define HASH_BUCKETS(8192*2) +struct page_info { + struct list_headlist; + struct page_info*next; + struct page_info*prev; + struct page *pages[MAX_SKB_FRAGS]; + struct sk_buff *skb; + struct page_pool*pool; + + /* The pointer relayed to skb, to indicate +* it's a external allocated skb or kernel +*/ + struct skb_ext_pageext_page; + /* flag to indicate read or write */ +#define INFO_READ 0 +#define INFO_WRITE 1 + unsignedflags; + /* exact number of locked pages */ + unsignedpnum; + + /* The fields after that is for backend +* driver, now for vhost-net. 
+*/ + /* the kiocb structure related to */ + struct kiocb*iocb; + /* the ring descriptor index */ + unsigned intdesc_pos; + /* the iovec coming from backend, we only +* need few of them */ + struct iovechdr[2]; + struct ioveciov[2]; +}; + +struct page_pool { + /* the queue for rx side */ + struct list_headreadq; + /* the lock to protect readq */ + spinlock_t read_lock; + /* record the orignal rlimit */ + struct rlimit o_rlim; + /* userspace wants to locked */ + int locked_pages; + /* currently locked pages */ + int cur_pages; + /* the memory locked before */ + unsigned long orig_locked_vm; + /* the device according to */ + struct net_device *dev; + /* the mp_port according to dev */ + struct mp_port port; + /* the hash_table list to find each locked page */ + struct page_info**hash_table; +}; + +static struct kmem_cache *ext_page_info_cache; + +#ifdef __KERNEL__ +#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE) +struct socket *mp_get_socket(struct file *); +struct page_pool *page_pool_create(struct net_device *dev, + struct socket *sock); +int async_recvmsg(struct kiocb *iocb, struct page_pool *pool, + struct iovec *iov, int count, int flags); +int async_sendmsg(struct sock *sk, struct kiocb *iocb, + struct page_pool *pool, struct iovec *iov, + int count); +void async_data_ready(struct sock *sk, struct page_pool *pool); +void dev_change_state(struct net_device *dev); +void page_pool_destroy(struct mm_struct *mm, struct page_pool *pool); +#else +#include linux/err.h +#include linux/errno.h +struct file; +struct socket; +static inline struct socket *mp_get_socket(struct file *f) +{ + return ERR_PTR(-EINVAL); +} +static inline struct page_pool *page_pool_create(struct net_device *dev, + struct socket *sock) +{ + return ERR_PTR(-EINVAL); +} +static inline int async_recvmsg(struct kiocb *iocb, struct page_pool *pool, + struct iovec *iov, int count, int flags) +{ + return -EINVAL; +} +static inline int async_sendmsg(struct sock *sk, struct 
kiocb *iocb, + struct page_pool *pool, struct iovec *iov, + int count) +{ + return -EINVAL; +} +static inline void async_data_ready(struct sock *sk, struct page_pool *pool) +{ + return; +} +static inline void dev_change_state(struct net_device *dev) +{ + return; +} +static inline void page_pool_destroy(struct mm_struct *mm, +struct page_pool *pool) +{ + return; +} +#endif /* CONFIG_MEDIATE_PASSTHRU */ +#endif /* __KERNEL__ */ +#endif /* __MPASSTHRU_H */ -- 1.7.3 -- To unsubscribe from this list: send the line
[PATCH v14 11/17] Add a hook to intercept external buffers from NIC driver.
From: Xin Xiaohui xiaohui@intel.com The hook is called in __netif_receive_skb(). Signed-off-by: Xin Xiaohui xiaohui@intel.com Signed-off-by: Zhao Yu yzhao81...@gmail.com Reviewed-by: Jeff Dike jd...@linux.intel.com --- net/core/dev.c | 40 1 files changed, 40 insertions(+), 0 deletions(-) diff --git a/net/core/dev.c b/net/core/dev.c index 84fbb83..bdad1c8 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -2814,6 +2814,40 @@ int __skb_bond_should_drop(struct sk_buff *skb, struct net_device *master) } EXPORT_SYMBOL(__skb_bond_should_drop); +#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE) +/* Add a hook to intercept mediate passthru(zero-copy) packets, + * and insert it to the socket queue owned by mp_port specially. + */ +static inline struct sk_buff *handle_mpassthru(struct sk_buff *skb, + struct packet_type **pt_prev, + int *ret, + struct net_device *orig_dev) +{ + struct mp_port *mp_port = NULL; + struct sock *sk = NULL; + + if (!dev_is_mpassthru(skb-dev) !dev_is_mpassthru(orig_dev)) + return skb; + if (dev_is_mpassthru(skb-dev)) + mp_port = skb-dev-mp_port; + else if (orig_dev-master == skb-dev dev_is_mpassthru(orig_dev)) + mp_port = orig_dev-mp_port; + + if (*pt_prev) { + *ret = deliver_skb(skb, *pt_prev, orig_dev); + *pt_prev = NULL; + } + + sk = mp_port-sock-sk; + skb_queue_tail(sk-sk_receive_queue, skb); + sk-sk_state_change(sk); + + return NULL; +} +#else +#define handle_mpassthru(skb, pt_prev, ret, orig_dev) (skb) +#endif + static int __netif_receive_skb(struct sk_buff *skb) { struct packet_type *ptype, *pt_prev; @@ -2891,6 +2925,11 @@ static int __netif_receive_skb(struct sk_buff *skb) ncls: #endif + /* To intercept mediate passthru(zero-copy) packets here */ + skb = handle_mpassthru(skb, pt_prev, ret, orig_dev); + if (!skb) + goto out; + /* Handle special case of bridge or macvlan */ rx_handler = rcu_dereference(skb-dev-rx_handler); if (rx_handler) { @@ -2983,6 +3022,7 @@ err: EXPORT_SYMBOL(netdev_mp_port_prep); #endif + 
/** * netif_receive_skb - process receive buffer from network * @skb: buffer to process -- 1.7.3 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
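The hook added above follows a common interception pattern: an optional per-device handler gets first look at each packet, and if it consumes the packet (here, by queueing it on the mp socket and waking the reader) it returns NULL so normal protocol delivery is skipped. A compact userspace sketch of that control flow, with all names (fake_dev, hook, receive) hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for net_device/sk_buff, just enough to show the flow. */
struct fake_dev {
    int is_mp;     /* plays the role of dev_is_mpassthru() */
    int queued;    /* packets diverted to the zero-copy socket */
};

struct fake_skb { struct fake_dev *dev; };

/* Analogue of handle_mpassthru(): consume the packet for passthru
 * devices, pass it through unchanged for everyone else. */
static struct fake_skb *hook(struct fake_skb *skb)
{
    if (!skb->dev->is_mp)
        return skb;          /* not ours: continue the normal path */
    skb->dev->queued++;      /* mirrors skb_queue_tail() + sk_state_change() */
    return NULL;             /* consumed */
}

/* Analogue of __netif_receive_skb(): returns 1 when the packet
 * continues down the normal stack, 0 when the hook ate it. */
static int receive(struct fake_skb *skb)
{
    skb = hook(skb);
    if (!skb)
        return 0;
    return 1;                /* protocol-type delivery would run here */
}
```

The kernel version additionally flushes any pending pt_prev delivery before diverting the skb, which the sketch leaves out.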
[PATCH v14 07/17] Modify netdev_alloc_page() to get an external buffer
From: Xin Xiaohui xiaohui@intel.com With this, netdev_alloc_page() can get external buffers from the mp device. Signed-off-by: Xin Xiaohui xiaohui@intel.com Signed-off-by: Zhao Yu yzhao81...@gmail.com Reviewed-by: Jeff Dike jd...@linux.intel.com --- net/core/skbuff.c | 27 +++ 1 files changed, 27 insertions(+), 0 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 5e6d69c..f39d372 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -262,11 +262,38 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev, } EXPORT_SYMBOL(__netdev_alloc_skb); +struct page *netdev_alloc_ext_pages(struct net_device *dev, int npages) +{ + struct mp_port *port; + struct skb_ext_page *ext_page = NULL; + + port = dev->mp_port; + if (!port) + goto out; + ext_page = port->ctor(port, NULL, npages); + if (ext_page) + return ext_page->page; +out: + return NULL; + +} +EXPORT_SYMBOL(netdev_alloc_ext_pages); + +struct page *netdev_alloc_ext_page(struct net_device *dev) +{ + return netdev_alloc_ext_pages(dev, 1); + +} +EXPORT_SYMBOL(netdev_alloc_ext_page); + struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask) { int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1; struct page *page; + if (dev_is_mpassthru(dev)) + return netdev_alloc_ext_page(dev); + page = alloc_pages_node(node, gfp_mask, 0); return page; } -- 1.7.3
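The allocation change in this patch is a fallback pattern: if the device has a registered external-buffer constructor, use it; otherwise take the normal kernel page path. A hedged userspace sketch of that decision, where dev_port/net_dev/alloc_rx_page are invented names and two static ints stand in for real pages:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical analogue of mp_port with its page constructor. */
typedef void *(*ext_ctor_fn)(void);

struct dev_port { ext_ctor_fn ctor; };
struct net_dev  { struct dev_port *port; };  /* port == NULL: normal dev */

static int kernel_page;   /* stands in for alloc_pages_node() result */
static int guest_page;    /* stands in for a pinned guest buffer page */

static void *guest_ctor(void) { return &guest_page; }

/* Analogue of the patched __netdev_alloc_page(): prefer the external
 * constructor when the device is in passthru mode, else fall back. */
static void *alloc_rx_page(struct net_dev *dev)
{
    if (dev->port && dev->port->ctor) {
        void *p = dev->port->ctor();  /* external (guest) buffer */
        if (p)
            return p;
    }
    return &kernel_page;              /* normal kernel allocation path */
}
```

Note the real patch keeps the fallback only for non-passthru devices; the extra NULL check here also covers a constructor that has temporarily run out of posted guest buffers.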
[PATCH v14 06/17] Use callback to deal with skb_release_data() specially.
From: Xin Xiaohui xiaohui@intel.com If the buffer is external, use the callback to destruct it. Signed-off-by: Xin Xiaohui xiaohui@intel.com Signed-off-by: Zhao Yu yzhao81...@gmail.com Reviewed-by: Jeff Dike jd...@linux.intel.com --- net/core/skbuff.c | 8 1 files changed, 8 insertions(+), 0 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index c83b421..5e6d69c 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -210,6 +210,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, /* make sure we initialize shinfo sequentially */ shinfo = skb_shinfo(skb); + shinfo->destructor_arg = NULL; memset(shinfo, 0, offsetof(struct skb_shared_info, dataref)); atomic_set(&shinfo->dataref, 1); @@ -343,6 +344,13 @@ static void skb_release_data(struct sk_buff *skb) if (skb_has_frags(skb)) skb_drop_fraglist(skb); + if (skb->dev && dev_is_mpassthru(skb->dev)) { + struct skb_ext_page *ext_page = + skb_shinfo(skb)->destructor_arg; + if (ext_page && ext_page->dtor) + ext_page->dtor(ext_page); + } + kfree(skb->head); } } -- 1.7.3
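The release-path change above is a destructor-callback pattern: external buffers carry a dtor in destructor_arg, and skb_release_data() invokes it so the pages go back to their owner instead of being kfree'd. A small userspace sketch of that pattern; release_data() and the counters are illustrative inventions, not kernel API:

```c
#include <assert.h>
#include <stddef.h>

/* Analogue of struct skb_ext_page with its destructor hook. */
struct ext_page {
    void (*dtor)(struct ext_page *);
    int released;            /* how many times the owner got it back */
};

static void ext_dtor(struct ext_page *ep) { ep->released++; }

static int kfree_calls;      /* counts the normal kfree(skb->head) path */

/* Analogue of the patched skb_release_data(): external buffers are
 * handed back through their destructor, kernel buffers are freed. */
static void release_data(int is_external, struct ext_page *ep)
{
    if (is_external && ep && ep->dtor) {
        ep->dtor(ep);        /* return pages to the guest-side owner */
        return;
    }
    kfree_calls++;           /* ordinary kernel-allocated data */
}
```

This is also why patch 06/17 in v16 moves destructor_arg before dataref: __alloc_skb() then clears it for free as part of the existing memset, so a stale pointer can never reach the dtor call.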
[PATCH v14 00/17] Provide a zero-copy method on KVM virtio-net.
We provide a zero-copy method with which the driver side may get external buffers for DMA. Here "external" means the driver does not use kernel space to allocate skb buffers. Currently the external buffers come from the guest virtio-net driver. The idea is simple: pin the guest VM user space, and then let the host NIC driver have the chance to DMA to it directly. The patches are based on the vhost-net backend driver. We add a device which provides proto_ops such as sendmsg/recvmsg to vhost-net, to send/recv directly to/from the NIC driver. A KVM guest that uses the vhost-net backend may bind any ethX interface on the host side to get copyless data transfer through the guest virtio-net frontend. patch 01-11: net core and kernel changes. patch 12-14: new device as an interface to manipulate external buffers. patch 15: for vhost-net. patch 16: an example of modifying a NIC driver to use napi_gro_frags(). patch 17: an example of how to get guest buffers, based on a driver using napi_gro_frags(). The guest virtio-net driver submits multiple requests through the vhost-net backend driver to the kernel, and the requests are queued and then completed after the corresponding actions in h/w are done. For read, user-space buffers are dispensed to the NIC driver for rx when a page constructor API is invoked; that is, NICs can allocate user buffers from a page constructor. We add a hook in the netif_receive_skb() function to intercept the incoming packets and notify the zero-copy device. For write, the zero-copy device may allocate a new host skb, put the payload on skb_shinfo(skb)->frags, and copy the header to skb->data. The request remains pending until the skb is transmitted by h/w. We provide multiple submits and asynchronous notification to vhost-net too. Our goal is to improve the bandwidth and reduce the CPU usage. Exact performance data will be provided later.
What we have not done yet: performance tuning. what we have done in v1: polish the RCU usage; deal with write logging in asynchronous mode in vhost; add a notifier block for the mp device; rename page_ctor to mp_port in netdevice.h to make it look generic; add mp_dev_change_flags() for the mp device to change NIC state; add CONFIG_VHOST_MPASSTHRU to limit the usage when the module is not loaded; a small fix for a missing dev_put() on failure; use a dynamic minor instead of a static minor number; a __KERNEL__ protect for mp_get_sock(). what we have done in v2: remove most of the RCU usage, since the ctor pointer is only changed by the BIND/UNBIND ioctls, and during that time the NIC will be stopped to get a good cleanup (all outstanding requests are finished), so the ctor pointer cannot be raced into a wrong situation. Replace the struct vhost_notifier with struct kiocb; let the vhost-net backend alloc/free the kiocbs and transfer them via sendmsg/recvmsg. Use get_user_pages_fast() and set_page_dirty_lock() when reading. Add some comments for netdev_mp_port_prep() and handle_mpassthru(). what we have done in v3: the async write logging is rewritten; a drafted synchronous write function for qemu live migration; a limit on locked pages from get_user_pages_fast() to prevent DoS, using RLIMIT_MEMLOCK.
what we have done in v4: add an iocb completion callback from vhost-net to queue iocbs in the mp device; replace vq->receiver with mp_sock_data_ready(); remove stuff in the mp device which accessed structures from vhost-net; modify skb_reserve() to ignore host NIC driver reserved space; rebase to the latest vhost tree; split large patches into small pieces, especially for the net core part. what we have done in v5: address Arnd Bergmann's comments - remove the IFF_MPASSTHRU_EXCL flag in the mp device - add a CONFIG_COMPAT macro - remove the mp_release ops; move dev_is_mpassthru() to an inline func; fix a bug in memory relinquish; apply to the current git (2.6.34-rc6) tree. what we have done in v6: move create_iocb() out of page_dtor, which may happen in interrupt context - this removes the potential issue of a lock being taken in interrupt context; make the caches used by mp and vhost static, created/destroyed in the modules' init/exit functions - this lets multiple mp guests be created at the same time. what we have done in v7: some cleanup prepared to support PS mode. what we have done in v8: discard the modifications that pointed skb->data to the guest buffer directly; add code to modify the driver to support napi_gro_frags(), with Herbert's comments, to support PS mode; add mergeable buffer support in the mp device; add GSO/GRO support in the mp device; address comments from Eric Dumazet about cache line and rcu usage. what we have done in v9: v8
[PATCH v14 02/17] Add a new struct for the device to manipulate external buffers.
From: Xin Xiaohui xiaohui@intel.com Add a structure in structure net_device, the new field is named as mp_port. It's for mediate passthru (zero-copy). It contains the capability for the net device driver, a socket, and an external buffer creator, external means skb buffer belongs to the device may not be allocated from kernel space. Signed-off-by: Xin Xiaohui xiaohui@intel.com Signed-off-by: Zhao Yu yzhao81...@gmail.com Reviewed-by: Jeff Dike jd...@linux.intel.com --- include/linux/netdevice.h | 25 - 1 files changed, 24 insertions(+), 1 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 46c36ff..f6b1870 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -325,6 +325,28 @@ enum netdev_state_t { __LINK_STATE_DORMANT, }; +/*The structure for mediate passthru(zero-copy). */ +struct mp_port { + /* the header len */ + int hdr_len; + /* the max payload len for one descriptor */ + int data_len; + /* the pages for DMA in one time */ + int npages; + /* the socket bind to */ + struct socket *sock; + /* the header len for virtio-net */ + int vnet_hlen; + /* the external buffer page creator */ + struct skb_ext_page *(*ctor)(struct mp_port *, + struct sk_buff *, int); + /* the hash function attached to find according +* backend ring descriptor info for one external +* buffer page. +*/ + struct skb_ext_page *(*hash)(struct net_device *, + struct page *); +}; /* * This structure holds at boot time configured netdevice settings. They @@ -1045,7 +1067,8 @@ struct net_device { /* GARP */ struct garp_port*garp_port; - + /* mpassthru */ + struct mp_port *mp_port; /* class/net/name entry */ struct device dev; /* space for optional device, statistics, and wireless sysfs groups */ -- 1.7.3 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v14 03/17] Add a ndo_mp_port_prep pointer to net_device_ops.
From: Xin Xiaohui xiaohui@intel.com If the driver want to allocate external buffers, then it can export it's capability, as the skb buffer header length, the page length can be DMA, etc. The external buffers owner may utilize this. Signed-off-by: Xin Xiaohui xiaohui@intel.com Signed-off-by: Zhao Yu yzhao81...@gmail.com Reviewed-by: Jeff Dike jd...@linux.intel.com --- include/linux/netdevice.h | 10 ++ 1 files changed, 10 insertions(+), 0 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index f6b1870..575777f 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -723,6 +723,12 @@ struct netdev_rx_queue { * int (*ndo_set_vf_port)(struct net_device *dev, int vf, * struct nlattr *port[]); * int (*ndo_get_vf_port)(struct net_device *dev, int vf, struct sk_buff *skb); + * + * int (*ndo_mp_port_prep)(struct net_device *dev, struct mp_port *port); + * If the driver want to allocate external buffers, + * then it can export it's capability, as the skb + * buffer header length, the page length can be DMA, etc. + * The external buffers owner may utilize this. */ #define HAVE_NET_DEVICE_OPS struct net_device_ops { @@ -795,6 +801,10 @@ struct net_device_ops { int (*ndo_fcoe_get_wwn)(struct net_device *dev, u64 *wwn, int type); #endif +#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE) + int (*ndo_mp_port_prep)(struct net_device *dev, + struct mp_port *port); +#endif }; /* -- 1.7.3 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v14 04/17] Add a function for the external buffer owner to query capability.
From: Xin Xiaohui xiaohui@intel.com

The external buffer owner can use these functions to get the
capabilities of the underlying NIC driver.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/netdevice.h |  2 ++
 net/core/dev.c            | 41 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 575777f..8dcf6de 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1736,6 +1736,8 @@ extern gro_result_t	napi_frags_finish(struct napi_struct *napi,
 					  gro_result_t ret);
 extern struct sk_buff *	napi_frags_skb(struct napi_struct *napi);
 extern gro_result_t	napi_gro_frags(struct napi_struct *napi);
+extern int		netdev_mp_port_prep(struct net_device *dev,
+					struct mp_port *port);
 
 static inline void napi_free_frags(struct napi_struct *napi)
 {
diff --git a/net/core/dev.c b/net/core/dev.c
index 660dd41..84fbb83 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2942,6 +2942,47 @@ out:
 	return ret;
 }
 
+/* To support mediate passthru (zero-copy) with a NIC driver,
+ * we'd better query the NIC driver for the capabilities it can
+ * provide, especially for packet split mode; for now we only
+ * query for the header size, and the payload a descriptor
+ * may carry.
+ * Now, it's only called by the mpassthru device.
+ */
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+int netdev_mp_port_prep(struct net_device *dev,
+			struct mp_port *port)
+{
+	int rc;
+	int npages, data_len;
+	const struct net_device_ops *ops = dev->netdev_ops;
+
+	if (ops->ndo_mp_port_prep) {
+		rc = ops->ndo_mp_port_prep(dev, port);
+		if (rc)
+			return rc;
+	} else
+		return -EINVAL;
+
+	if (port->hdr_len <= 0)
+		goto err;
+
+	npages = port->npages;
+	data_len = port->data_len;
+	if (npages <= 0 || npages > MAX_SKB_FRAGS ||
+	    (data_len < PAGE_SIZE * (npages - 1) ||
+	     data_len > PAGE_SIZE * npages))
+		goto err;
+
+	return 0;
+err:
+	dev_warn(&dev->dev, "invalid page constructor parameters\n");
+
+	return -EINVAL;
+}
+EXPORT_SYMBOL(netdev_mp_port_prep);
+#endif
+
 /**
  *	netif_receive_skb - process receive buffer from network
  *	@skb: buffer to process
-- 
1.7.3
[PATCH v14 01/17] Add a new structure for skb buffers from external sources.
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/skbuff.h | 9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 77eb60d..696e690 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -211,6 +211,15 @@ struct skb_shared_info {
 	skb_frag_t	frags[MAX_SKB_FRAGS];
 };
 
+/* The structure is for an skb whose pages may point to
+ * an external buffer, which is not allocated from kernel space.
+ * It also contains a destructor for itself.
+ */
+struct skb_ext_page {
+	struct page	*page;
+	void		(*dtor)(struct skb_ext_page *);
+};
+
 /* We divide dataref into two halves.  The higher 16 bits hold references
  * to the payload part of skb->data.  The lower 16 bits hold references to
  * the entire skb->data.  A clone of a headerless skb holds the length of
-- 
1.7.3
[PATCH v13 07/16] Modify netdev_alloc_page() to get external buffer
From: Xin Xiaohui xiaohui@intel.com

Currently, it can get external buffers from the mp device.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 net/core/skbuff.c | 27 +++++++++++++++++++++++++++
 1 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5e6d69c..f39d372 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -262,11 +262,38 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
 }
 EXPORT_SYMBOL(__netdev_alloc_skb);
 
+struct page *netdev_alloc_ext_pages(struct net_device *dev, int npages)
+{
+	struct mp_port *port;
+	struct skb_ext_page *ext_page = NULL;
+
+	port = dev->mp_port;
+	if (!port)
+		goto out;
+	ext_page = port->ctor(port, NULL, npages);
+	if (ext_page)
+		return ext_page->page;
+out:
+	return NULL;
+}
+EXPORT_SYMBOL(netdev_alloc_ext_pages);
+
+struct page *netdev_alloc_ext_page(struct net_device *dev)
+{
+	return netdev_alloc_ext_pages(dev, 1);
+}
+EXPORT_SYMBOL(netdev_alloc_ext_page);
+
 struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
 {
 	int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
 	struct page *page;
 
+	if (dev_is_mpassthru(dev))
+		return netdev_alloc_ext_page(dev);
+
 	page = alloc_pages_node(node, gfp_mask, 0);
 	return page;
 }
-- 
1.7.3
[PATCH v13 11/16] Add header file for mp device.
From: Xin Xiaohui xiaohui@intel.com Signed-off-by: Xin Xiaohui xiaohui@intel.com Signed-off-by: Zhao Yu yzhao81...@gmail.com Reviewed-by: Jeff Dike jd...@linux.intel.com --- include/linux/mpassthru.h | 133 + 1 files changed, 133 insertions(+), 0 deletions(-) create mode 100644 include/linux/mpassthru.h diff --git a/include/linux/mpassthru.h b/include/linux/mpassthru.h new file mode 100644 index 000..1115f55 --- /dev/null +++ b/include/linux/mpassthru.h @@ -0,0 +1,133 @@ +#ifndef __MPASSTHRU_H +#define __MPASSTHRU_H + +#include linux/types.h +#include linux/if_ether.h +#include linux/ioctl.h + +/* ioctl defines */ +#define MPASSTHRU_BINDDEV _IOW('M', 213, int) +#define MPASSTHRU_UNBINDDEV_IO('M', 214) +#define MPASSTHRU_SET_MEM_LOCKED _IOW('M', 215, unsigned long) +#define MPASSTHRU_GET_MEM_LOCKED_NEED _IOR('M', 216, unsigned long) + +#define COPY_THRESHOLD (L1_CACHE_BYTES * 4) +#define COPY_HDR_LEN (L1_CACHE_BYTES 64 ? 64 : L1_CACHE_BYTES) + +#define DEFAULT_NEED ((8192*2*2)*4096) + +struct frag { + u16 offset; + u16 size; +}; + +#define HASH_BUCKETS(8192*2) +struct page_info { + struct list_headlist; + struct page_info*next; + struct page_info*prev; + struct page *pages[MAX_SKB_FRAGS]; + struct sk_buff *skb; + struct page_pool*pool; + + /* The pointer relayed to skb, to indicate +* it's a external allocated skb or kernel +*/ + struct skb_ext_pageext_page; + /* flag to indicate read or write */ +#define INFO_READ 0 +#define INFO_WRITE 1 + unsignedflags; + /* exact number of locked pages */ + unsignedpnum; + + /* The fields after that is for backend +* driver, now for vhost-net. 
+*/ + /* the kiocb structure related to */ + struct kiocb*iocb; + /* the ring descriptor index */ + unsigned intdesc_pos; + /* the iovec coming from backend, we only +* need few of them */ + struct iovechdr[2]; + struct ioveciov[2]; +}; + +struct page_pool { + /* the queue for rx side */ + struct list_headreadq; + /* the lock to protect readq */ + spinlock_t read_lock; + /* record the orignal rlimit */ + struct rlimit o_rlim; + /* userspace wants to locked */ + int locked_pages; + /* currently locked pages */ + int cur_pages; + /* the memory locked before */ + unsigned long orig_locked_vm; + /* the device according to */ + struct net_device *dev; + /* the mp_port according to dev */ + struct mp_port port; + /* the hash_table list to find each locked page */ + struct page_info**hash_table; +}; + +static struct kmem_cache *ext_page_info_cache; + +#ifdef __KERNEL__ +#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE) +struct socket *mp_get_socket(struct file *); +struct page_pool *page_pool_create(struct net_device *dev, + struct socket *sock); +int async_recvmsg(struct kiocb *iocb, struct page_pool *pool, + struct iovec *iov, int count, int flags); +int async_sendmsg(struct sock *sk, struct kiocb *iocb, + struct page_pool *pool, struct iovec *iov, + int count); +void async_data_ready(struct sock *sk, struct page_pool *pool); +void dev_change_state(struct net_device *dev); +void page_pool_destroy(struct mm_struct *mm, struct page_pool *pool); +#else +#include linux/err.h +#include linux/errno.h +struct file; +struct socket; +static inline struct socket *mp_get_socket(struct file *f) +{ + return ERR_PTR(-EINVAL); +} +static inline struct page_pool *page_pool_create(struct net_device *dev, + struct socket *sock) +{ + return ERR_PTR(-EINVAL); +} +static inline int async_recvmsg(struct kiocb *iocb, struct page_pool *pool, + struct iovec *iov, int count, int flags) +{ + return -EINVAL; +} +static inline int async_sendmsg(struct sock *sk, struct 
kiocb *iocb, + struct page_pool *pool, struct iovec *iov, + int count) +{ + return -EINVAL; +} +static inline void async_data_ready(struct sock *sk, struct page_pool *pool) +{ + return; +} +static inline void dev_change_state(struct net_device *dev) +{ + return; +} +static inline void page_pool_destroy(struct mm_struct *mm, +struct page_pool *pool) +{ + return; +} +#endif /* CONFIG_MEDIATE_PASSTHRU */ +#endif /* __KERNEL__ */ +#endif /* __MPASSTHRU_H */ -- 1.7.3 -- To unsubscribe from this list: send the line
[PATCH v13 13/16] Add a kconfig entry and make entry for mp device.
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 drivers/vhost/Kconfig  | 10 ++++++++++
 drivers/vhost/Makefile |  2 ++
 2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index e4e2fd1..a6b8cbf 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -9,3 +9,13 @@ config VHOST_NET
 	  To compile this driver as a module, choose M here: the module will
 	  be called vhost_net.
 
+config MEDIATE_PASSTHRU
+	tristate "mediate passthru network driver (EXPERIMENTAL)"
+	depends on VHOST_NET
+	---help---
+	  zerocopy network I/O support; we call it mediate passthru to
+	  distinguish it from hardware passthru.
+
+	  To compile this driver as a module, choose M here: the module will
+	  be called mpassthru.
+
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 72dd020..c18b9fc 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -1,2 +1,4 @@
 obj-$(CONFIG_VHOST_NET) += vhost_net.o
 vhost_net-y := vhost.o net.o
+
+obj-$(CONFIG_MEDIATE_PASSTHRU) += mpassthru.o
-- 
1.7.3
[PATCH v13 14/16] Provide multiple submits and asynchronous notifications.
From: Xin Xiaohui xiaohui@intel.com The vhost-net backend now only supports synchronous send/recv operations. The patch provides multiple submits and asynchronous notifications. This is needed for zero-copy case. Signed-off-by: Xin Xiaohui xiaohui@intel.com --- drivers/vhost/net.c | 355 + drivers/vhost/vhost.c | 78 +++ drivers/vhost/vhost.h | 15 ++- 3 files changed, 423 insertions(+), 25 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 7c80082..17c599a 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -24,6 +24,8 @@ #include linux/if_arp.h #include linux/if_tun.h #include linux/if_macvlan.h +#include linux/mpassthru.h +#include linux/aio.h #include net/sock.h @@ -32,6 +34,7 @@ /* Max number of bytes transferred before requeueing the job. * Using this limit prevents one virtqueue from starving others. */ #define VHOST_NET_WEIGHT 0x8 +static struct kmem_cache *notify_cache; enum { VHOST_NET_VQ_RX = 0, @@ -49,6 +52,7 @@ struct vhost_net { struct vhost_dev dev; struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX]; struct vhost_poll poll[VHOST_NET_VQ_MAX]; + struct kmem_cache *cache; /* Tells us whether we are polling a socket for TX. * We only do this when socket buffer fills up. * Protected by tx vq lock. 
*/ @@ -109,11 +113,184 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock) net-tx_poll_state = VHOST_NET_POLL_STARTED; } +struct kiocb *notify_dequeue(struct vhost_virtqueue *vq) +{ + struct kiocb *iocb = NULL; + unsigned long flags; + + spin_lock_irqsave(vq-notify_lock, flags); + if (!list_empty(vq-notifier)) { + iocb = list_first_entry(vq-notifier, + struct kiocb, ki_list); + list_del(iocb-ki_list); + } + spin_unlock_irqrestore(vq-notify_lock, flags); + return iocb; +} + +static void handle_iocb(struct kiocb *iocb) +{ + struct vhost_virtqueue *vq = iocb-private; + unsigned long flags; + + spin_lock_irqsave(vq-notify_lock, flags); + list_add_tail(iocb-ki_list, vq-notifier); + spin_unlock_irqrestore(vq-notify_lock, flags); +} + +static int is_async_vq(struct vhost_virtqueue *vq) +{ + return (vq-link_state == VHOST_VQ_LINK_ASYNC); +} + +static void handle_async_rx_events_notify(struct vhost_net *net, + struct vhost_virtqueue *vq, + struct socket *sock) +{ + struct kiocb *iocb = NULL; + struct vhost_log *vq_log = NULL; + int rx_total_len = 0; + unsigned int head, log, in, out; + int size; + + if (!is_async_vq(vq)) + return; + + if (sock-sk-sk_data_ready) + sock-sk-sk_data_ready(sock-sk, 0); + + vq_log = unlikely(vhost_has_feature(net-dev, VHOST_F_LOG_ALL)) ? + vq-log : NULL; + + while ((iocb = notify_dequeue(vq)) != NULL) { + if (!iocb-ki_left) { + vhost_add_used_and_signal(net-dev, vq, + iocb-ki_pos, iocb-ki_nbytes); + size = iocb-ki_nbytes; + head = iocb-ki_pos; + rx_total_len += iocb-ki_nbytes; + + if (iocb-ki_dtor) + iocb-ki_dtor(iocb); + kmem_cache_free(net-cache, iocb); + + /* when log is enabled, recomputing the log is needed, +* since these buffers are in async queue, may not get +* the log info before. 
+*/ + if (unlikely(vq_log)) { + if (!log) + __vhost_get_vq_desc(net-dev, vq, + vq-iov, + ARRAY_SIZE(vq-iov), + out, in, vq_log, + log, head); + vhost_log_write(vq, vq_log, log, size); + } + if (unlikely(rx_total_len = VHOST_NET_WEIGHT)) { + vhost_poll_queue(vq-poll); + break; + } + } else { + int i = 0; + int count = iocb-ki_left; + int hc = count; + while (count--) { + if (iocb) { + vq-heads[i].id = iocb-ki_pos; + vq-heads[i].len = iocb-ki_nbytes; + size = iocb-ki_nbytes; + head
[PATCH v13 16/16] An example of how to alloc user buffers based on the napi_gro_frags() interface.
From: Xin Xiaohui xiaohui@intel.com This example is made on ixgbe driver which using napi_gro_frags(). It can get buffers from guest side directly using netdev_alloc_page() and release guest buffers using netdev_free_page(). Signed-off-by: Xin Xiaohui xiaohui@intel.com --- drivers/net/ixgbe/ixgbe_main.c | 24 1 files changed, 20 insertions(+), 4 deletions(-) diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c index a4a5263..47663ac 100644 --- a/drivers/net/ixgbe/ixgbe_main.c +++ b/drivers/net/ixgbe/ixgbe_main.c @@ -1032,7 +1032,14 @@ static inline void ixgbe_release_rx_desc(struct ixgbe_hw *hw, static bool is_rx_buffer_mapped_as_page(struct ixgbe_rx_buffer *bi, struct net_device *dev) { - return true; + return dev_is_mpassthru(dev); +} + +static u32 get_page_skb_offset(struct net_device *dev) +{ + if (!dev_is_mpassthru(dev)) + return 0; + return dev-mp_port-vnet_hlen; } /** @@ -1105,7 +1112,8 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter, adapter-alloc_rx_page_failed++; goto no_buffers; } - bi-page_skb_offset = 0; + bi-page_skb_offset = + get_page_skb_offset(adapter-netdev); bi-dma = dma_map_page(pdev-dev, bi-page_skb, bi-page_skb_offset, (PAGE_SIZE / 2), @@ -1242,8 +1250,10 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector, len = le16_to_cpu(rx_desc-wb.upper.length); } - if (is_no_buffer(rx_buffer_info)) + if (is_no_buffer(rx_buffer_info)) { + printk(no buffers\n); break; + } cleaned = true; if (!rx_buffer_info-mapped_as_page) { @@ -1299,6 +1309,11 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector, rx_buffer_info-page_skb, rx_buffer_info-page_skb_offset, len); + if (dev_is_mpassthru(netdev) + netdev-mp_port-hash) + skb_shinfo(skb)-destructor_arg = + netdev-mp_port-hash(netdev, + rx_buffer_info-page_skb); rx_buffer_info-page_skb = NULL; skb-len += len; skb-data_len += len; @@ -1316,7 +1331,8 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector, upper_len); if 
((rx_ring-rx_buf_len (PAGE_SIZE / 2)) || - (page_count(rx_buffer_info-page) != 1)) + (page_count(rx_buffer_info-page) != 1) || + dev_is_mpassthru(netdev)) rx_buffer_info-page = NULL; else get_page(rx_buffer_info-page); -- 1.7.3 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v13 15/16] An example of how to modify a NIC driver to use the napi_gro_frags() interface
From: Xin Xiaohui xiaohui@intel.com This example is made on ixgbe driver. It provides API is_rx_buffer_mapped_as_page() to indicate if the driver use napi_gro_frags() interface or not. The example allocates 2 pages for DMA for one ring descriptor using netdev_alloc_page(). When packets is coming, using napi_gro_frags() to allocate skb and to receive the packets. Signed-off-by: Xin Xiaohui xiaohui@intel.com --- drivers/net/ixgbe/ixgbe.h |3 + drivers/net/ixgbe/ixgbe_main.c | 163 +++- 2 files changed, 131 insertions(+), 35 deletions(-) diff --git a/drivers/net/ixgbe/ixgbe.h b/drivers/net/ixgbe/ixgbe.h index 9e15eb9..89367ca 100644 --- a/drivers/net/ixgbe/ixgbe.h +++ b/drivers/net/ixgbe/ixgbe.h @@ -131,6 +131,9 @@ struct ixgbe_rx_buffer { struct page *page; dma_addr_t page_dma; unsigned int page_offset; + u16 mapped_as_page; + struct page *page_skb; + unsigned int page_skb_offset; }; struct ixgbe_queue_stats { diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c index e32af43..a4a5263 100644 --- a/drivers/net/ixgbe/ixgbe_main.c +++ b/drivers/net/ixgbe/ixgbe_main.c @@ -1029,6 +1029,12 @@ static inline void ixgbe_release_rx_desc(struct ixgbe_hw *hw, IXGBE_WRITE_REG(hw, IXGBE_RDT(rx_ring-reg_idx), val); } +static bool is_rx_buffer_mapped_as_page(struct ixgbe_rx_buffer *bi, + struct net_device *dev) +{ + return true; +} + /** * ixgbe_alloc_rx_buffers - Replace used receive buffers; packet split * @adapter: address of board private structure @@ -1045,13 +1051,17 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter, i = rx_ring-next_to_use; bi = rx_ring-rx_buffer_info[i]; + while (cleaned_count--) { rx_desc = IXGBE_RX_DESC_ADV(*rx_ring, i); + bi-mapped_as_page = + is_rx_buffer_mapped_as_page(bi, adapter-netdev); + if (!bi-page_dma (rx_ring-flags IXGBE_RING_RX_PS_ENABLED)) { if (!bi-page) { - bi-page = alloc_page(GFP_ATOMIC); + bi-page = netdev_alloc_page(adapter-netdev); if (!bi-page) { adapter-alloc_rx_page_failed++; goto no_buffers; 
@@ -1068,7 +1078,7 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter, DMA_FROM_DEVICE); } - if (!bi-skb) { + if (!bi-mapped_as_page !bi-skb) { struct sk_buff *skb; /* netdev_alloc_skb reserves 32 bytes up front!! */ uint bufsz = rx_ring-rx_buf_len + SMP_CACHE_BYTES; @@ -1088,6 +1098,19 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter, rx_ring-rx_buf_len, DMA_FROM_DEVICE); } + + if (bi-mapped_as_page !bi-page_skb) { + bi-page_skb = netdev_alloc_page(adapter-netdev); + if (!bi-page_skb) { + adapter-alloc_rx_page_failed++; + goto no_buffers; + } + bi-page_skb_offset = 0; + bi-dma = dma_map_page(pdev-dev, bi-page_skb, + bi-page_skb_offset, + (PAGE_SIZE / 2), + PCI_DMA_FROMDEVICE); + } /* Refresh the desc even if buffer_addrs didn't change because * each write-back erases this info. */ if (rx_ring-flags IXGBE_RING_RX_PS_ENABLED) { @@ -1165,6 +1188,13 @@ struct ixgbe_rsc_cb { bool delay_unmap; }; +static bool is_no_buffer(struct ixgbe_rx_buffer *rx_buffer_info) +{ + return (!rx_buffer_info-skb || + !rx_buffer_info-page_skb) + !rx_buffer_info-page; +} + #define IXGBE_RSC_CB(skb) ((struct ixgbe_rsc_cb *)(skb)-cb) static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector, @@ -1174,6 +1204,7 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector, struct ixgbe_adapter *adapter = q_vector-adapter; struct net_device *netdev = adapter-netdev; struct pci_dev *pdev = adapter-pdev; + struct napi_struct *napi = q_vector-napi; union ixgbe_adv_rx_desc *rx_desc, *next_rxd; struct ixgbe_rx_buffer *rx_buffer_info, *next_buffer; struct sk_buff *skb; @@ -1211,32 +1242,68 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector, len = le16_to_cpu(rx_desc-wb.upper.length);
[PATCH v13 12/16] Add mp(mediate passthru) device.
From: Xin Xiaohui xiaohui@intel.com The patch add mp(mediate passthru) device, which now based on vhost-net backend driver and provides proto_ops to send/receive guest buffers data from/to guest vitio-net driver. It also exports async functions which can be used by other drivers like macvtap to utilize zero-copy too. Signed-off-by: Xin Xiaohui xiaohui@intel.com Signed-off-by: Zhao Yu yzhao81...@gmail.com Reviewed-by: Jeff Dike jd...@linux.intel.com --- drivers/vhost/mpassthru.c | 1380 + 1 files changed, 1380 insertions(+), 0 deletions(-) create mode 100644 drivers/vhost/mpassthru.c diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c new file mode 100644 index 000..5389f3e --- /dev/null +++ b/drivers/vhost/mpassthru.c @@ -0,0 +1,1380 @@ +/* + * MPASSTHRU - Mediate passthrough device. + * Copyright (C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + */ + +#define DRV_NAMEmpassthru +#define DRV_DESCRIPTION Mediate passthru device driver +#define DRV_COPYRIGHT (C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G + +#include linux/compat.h +#include linux/module.h +#include linux/errno.h +#include linux/kernel.h +#include linux/major.h +#include linux/slab.h +#include linux/smp_lock.h +#include linux/poll.h +#include linux/fcntl.h +#include linux/init.h +#include linux/aio.h + +#include linux/skbuff.h +#include linux/netdevice.h +#include linux/etherdevice.h +#include linux/miscdevice.h +#include linux/ethtool.h +#include linux/rtnetlink.h +#include linux/if.h +#include linux/if_arp.h +#include linux/if_ether.h +#include linux/crc32.h +#include linux/nsproxy.h +#include linux/uaccess.h +#include linux/virtio_net.h +#include linux/mpassthru.h +#include net/net_namespace.h +#include net/netns/generic.h +#include net/rtnetlink.h +#include net/sock.h + +#include asm/system.h + +struct mp_struct { + struct mp_file *mfile; + struct net_device *dev; + struct page_pool*pool; + struct socket socket; + struct socket_wqwq; + struct mm_struct*mm; +}; + +struct mp_file { + atomic_t count; + struct mp_struct *mp; + struct net *net; +}; + +struct mp_sock { + struct sock sk; + struct mp_struct*mp; +}; + +/* The main function to allocate external buffers */ +static struct skb_ext_page *page_ctor(struct mp_port *port, + struct sk_buff *skb, + int npages) +{ + int i; + unsigned long flags; + struct page_pool *pool; + struct page_info *info = NULL; + + if (npages != 1) + BUG(); + pool = container_of(port, struct page_pool, port); + + spin_lock_irqsave(pool-read_lock, flags); + if (!list_empty(pool-readq)) { + info = list_first_entry(pool-readq, struct page_info, list); + list_del(info-list); + } + spin_unlock_irqrestore(pool-read_lock, flags); + if (!info) + return NULL; + + for (i = 0; i info-pnum; i++) + get_page(info-pages[i]); + info-skb = skb; + return info-ext_page; +} + +static struct page_info *mp_hash_lookup(struct page_pool 
*pool, + struct page *page); +static struct page_info *mp_hash_delete(struct page_pool *pool, + struct page_info *info); + +static struct skb_ext_page *mp_lookup(struct net_device *dev, + struct page *page) +{ + struct mp_struct *mp = + container_of(dev-mp_port-sock-sk, struct mp_sock, sk)-mp; + struct page_pool *pool = mp-pool; + struct page_info *info; + + info = mp_hash_lookup(pool, page); + if (!info) + return NULL; + return info-ext_page; +} + +struct page_pool *page_pool_create(struct net_device *dev, + struct socket *sock) +{ + struct page_pool *pool; + int rc; + + pool = kzalloc(sizeof(*pool), GFP_KERNEL); + if (!pool) + return NULL; + rc = netdev_mp_port_prep(dev, pool-port); + if (rc) + goto fail; + + INIT_LIST_HEAD(pool-readq); + spin_lock_init(pool-read_lock); + pool-hash_table = + kzalloc(sizeof(struct page_info *) * HASH_BUCKETS, GFP_KERNEL); +
[PATCH v13 10/16] Add a hook to intercept external buffers from NIC driver.
From: Xin Xiaohui xiaohui@intel.com

The hook is called in __netif_receive_skb().

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 net/core/dev.c | 37 +++++++++++++++++++++++++++++++++++++
 1 files changed, 37 insertions(+), 0 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index e48639d..235eaab 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2814,6 +2814,37 @@ int __skb_bond_should_drop(struct sk_buff *skb, struct net_device *master)
 }
 EXPORT_SYMBOL(__skb_bond_should_drop);
 
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+/* Add a hook to intercept mediate passthru (zero-copy) packets,
+ * and insert them into the socket queue owned by the mp_port specially.
+ */
+static inline struct sk_buff *handle_mpassthru(struct sk_buff *skb,
+					struct packet_type **pt_prev,
+					int *ret,
+					struct net_device *orig_dev)
+{
+	struct mp_port *mp_port = NULL;
+	struct sock *sk = NULL;
+
+	if (!dev_is_mpassthru(skb->dev))
+		return skb;
+	mp_port = skb->dev->mp_port;
+
+	if (*pt_prev) {
+		*ret = deliver_skb(skb, *pt_prev, orig_dev);
+		*pt_prev = NULL;
+	}
+
+	sk = mp_port->sock->sk;
+	skb_queue_tail(&sk->sk_receive_queue, skb);
+	sk->sk_state_change(sk);
+
+	return NULL;
+}
+#else
+#define handle_mpassthru(skb, pt_prev, ret, orig_dev) (skb)
+#endif
+
 static int __netif_receive_skb(struct sk_buff *skb)
 {
 	struct packet_type *ptype, *pt_prev;
@@ -2891,6 +2922,11 @@ static int __netif_receive_skb(struct sk_buff *skb)
 ncls:
 #endif
 
+	/* To intercept mediate passthru (zero-copy) packets here */
+	skb = handle_mpassthru(skb, &pt_prev, &ret, orig_dev);
+	if (!skb)
+		goto out;
+
 	/* Handle special case of bridge or macvlan */
 	rx_handler = rcu_dereference(skb->dev->rx_handler);
 	if (rx_handler) {
@@ -2991,6 +3027,7 @@ err:
 EXPORT_SYMBOL(netdev_mp_port_prep);
 #endif
 
+
 /**
  *	netif_receive_skb - process receive buffer from network
  *	@skb: buffer to process
-- 
1.7.3
[PATCH v13 08/16] Modify netdev_free_page() to release external buffer
From: Xin Xiaohui xiaohui@intel.com

Currently, it can release external buffers back to the mp device.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/skbuff.h |  4 +++-
 net/core/skbuff.c      | 24 ++++++++++++++++++++++++
 2 files changed, 27 insertions(+), 1 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 696e690..8cfde3e 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1585,9 +1585,11 @@ static inline struct page *netdev_alloc_page(struct net_device *dev)
 	return __netdev_alloc_page(dev, GFP_ATOMIC);
 }
 
+extern void __netdev_free_page(struct net_device *dev, struct page *page);
+
 static inline void netdev_free_page(struct net_device *dev, struct page *page)
 {
-	__free_page(page);
+	__netdev_free_page(dev, page);
 }
 
 /**
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f39d372..02439e0 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -299,6 +299,30 @@ struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(__netdev_alloc_page);
 
+void netdev_free_ext_page(struct net_device *dev, struct page *page)
+{
+	struct skb_ext_page *ext_page = NULL;
+	if (dev_is_mpassthru(dev) && dev->mp_port->hash) {
+		ext_page = dev->mp_port->hash(dev, page);
+		if (ext_page)
+			ext_page->dtor(ext_page);
+		else
+			__free_page(page);
+	}
+}
+EXPORT_SYMBOL(netdev_free_ext_page);
+
+void __netdev_free_page(struct net_device *dev, struct page *page)
+{
+	if (dev_is_mpassthru(dev)) {
+		netdev_free_ext_page(dev, page);
+		return;
+	}
+
+	__free_page(page);
+}
+EXPORT_SYMBOL(__netdev_free_page);
+
 void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
 		     int size)
 {
-- 
1.7.3
[PATCH v13 00/16] Provide a zero-copy method on KVM virtio-net.
We provide a zero-copy method with which the driver side may get external buffers for DMA. Here external means the driver doesn't use kernel space to allocate skb buffers. Currently the external buffers can come from the guest virtio-net driver.

The idea is simple: just pin the guest VM user space and then let the host NIC driver have the chance to DMA directly to it. The patches are based on the vhost-net backend driver. We add a device which provides proto_ops such as sendmsg/recvmsg to vhost-net to send/recv directly to/from the NIC driver. A KVM guest which uses the vhost-net backend may bind any ethX interface on the host side to get copyless data transfer through the guest virtio-net frontend.

patch 01-10:  net core and kernel changes.
patch 11-13:  new device as interface to manipulate external buffers.
patch 14:     for vhost-net.
patch 15:     An example of modifying a NIC driver to use napi_gro_frags().
patch 16:     An example of how to get guest buffers based on a driver which uses napi_gro_frags().

The guest virtio-net driver submits multiple requests through the vhost-net backend driver to the kernel, and the requests are queued and then completed after the corresponding actions in h/w are done.

For read, user space buffers are dispensed to the NIC driver for rx when a page constructor API is invoked; this means NICs can allocate user buffers from a page constructor. We add a hook in the netif_receive_skb() function to intercept the incoming packets and notify the zero-copy device.

For write, the zero-copy device may allocate a new host skb, put the payload on skb_shinfo(skb)->frags, and copy the header to skb->data. The request remains pending until the skb is transmitted by h/w.

We provide multiple submits and asynchronous notification to vhost-net too.

Our goal is to improve the bandwidth and reduce the CPU usage. Exact performance data will be provided later.
What we have not done yet:
	Performance tuning

What we have done in v1:
	polish the RCU usage
	deal with write logging in asynchronous mode in vhost
	add notifier block for mp device
	rename page_ctor to mp_port in netdevice.h to make it look generic
	add mp_dev_change_flags() for mp device to change NIC state
	add CONFIG_VHOST_MPASSTHRU to limit the usage when the module is not loaded
	a small fix for missing dev_put() on failure
	use a dynamic minor instead of a static minor number
	a __KERNEL__ protect for mp_get_sock()

What we have done in v2:
	remove most of the RCU usage, since the ctor pointer is only changed by the BIND/UNBIND ioctl, and during that time the NIC will be stopped to get a good cleanup (all outstanding requests are finished), so the ctor pointer cannot be raced into a wrong situation.
	Replace the struct vhost_notifier with struct kiocb. Let the vhost-net backend alloc/free the kiocbs and transfer them via sendmsg/recvmsg.
	use get_user_pages_fast() and set_page_dirty_lock() on read.
	Add some comments for netdev_mp_port_prep() and handle_mpassthru().

What we have done in v3:
	the async write logging is rewritten
	a drafted synchronous write function for qemu live migration
	a limit for locked pages from get_user_pages_fast() to prevent DoS, using RLIMIT_MEMLOCK

What we have done in v4:
	add iocb completion callback from vhost-net to queue iocbs in the mp device
	replace vq->receiver by mp_sock_data_ready()
	remove stuff in the mp device which accessed structures from vhost-net
	modify skb_reserve() to ignore host NIC driver reserved space
	rebase to the latest vhost tree
	split large patches into small pieces, especially for the net core part.

What we have done in v5:
	address Arnd Bergmann's comments
	-remove IFF_MPASSTHRU_EXCL flag in mp device
	-Add CONFIG_COMPAT macro
	-remove mp_release ops
	move dev_is_mpassthru() to an inline func
	fix a bug in memory relinquish
	Apply to current git (2.6.34-rc6) tree.
What we have done in v6: move create_iocb() out of page_dtor, which may run in interrupt context - this removes the potential issue of a lock being taken in interrupt context; make the caches used by mp and vhost static, created/destroyed in the modules' init/exit functions - this allows multiple mp guests to be created at the same time. What we have done in v7: some cleanup in preparation to support PS mode. What we have done in v8: discard the modifications that pointed skb->data to the guest buffer directly; add code to modify a driver to support napi_gro_frags(), following Herbert's comments, to support PS mode; add mergeable buffer support in the mp device; add GSO/GRO support in the mp device; address comments from Eric Dumazet about cache lines and rcu usage. What we have done in v9: v8
[PATCH v13 03/16] Add an ndo_mp_port_prep pointer to net_device_ops.
From: Xin Xiaohui xiaohui@intel.com

If the driver wants to allocate external buffers, it can export its capability: the skb buffer header length, the page length that can be DMAed, etc. The external buffer owner may utilize this.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/netdevice.h | 10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f6b1870..575777f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -723,6 +723,12 @@ struct netdev_rx_queue {
 * int (*ndo_set_vf_port)(struct net_device *dev, int vf,
 *			   struct nlattr *port[]);
 * int (*ndo_get_vf_port)(struct net_device *dev, int vf, struct sk_buff *skb);
+ *
+ * int (*ndo_mp_port_prep)(struct net_device *dev, struct mp_port *port);
+ *	If the driver wants to allocate external buffers,
+ *	it can export its capability, such as the skb
+ *	buffer header length, the page length that can be DMAed, etc.
+ *	The external buffer owner may utilize this.
 */
 #define HAVE_NET_DEVICE_OPS
 struct net_device_ops {
@@ -795,6 +801,10 @@ struct net_device_ops {
	int	(*ndo_fcoe_get_wwn)(struct net_device *dev,
				    u64 *wwn, int type);
 #endif
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+	int	(*ndo_mp_port_prep)(struct net_device *dev,
+				    struct mp_port *port);
+#endif
 };

 /*
--
1.7.3
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v13 02/16] Add a new struct for device to manipulate external buffer.
From: Xin Xiaohui xiaohui@intel.com

Add a field to struct net_device named mp_port. It's for mediate passthru (zero-copy). It contains the capability of the net device driver, a socket, and an external buffer creator; external means the skb buffers belonging to the device may not be allocated from kernel space.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/netdevice.h | 25 ++++++++++++++++++++++++-
 1 files changed, 24 insertions(+), 1 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 46c36ff..f6b1870 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -325,6 +325,28 @@ enum netdev_state_t {
	__LINK_STATE_DORMANT,
 };

+/* The structure for mediate passthru (zero-copy). */
+struct mp_port {
+	/* the header len */
+	int			hdr_len;
+	/* the max payload len for one descriptor */
+	int			data_len;
+	/* the pages for DMA in one time */
+	int			npages;
+	/* the socket bound to */
+	struct socket		*sock;
+	/* the header len for virtio-net */
+	int			vnet_hlen;
+	/* the external buffer page creator */
+	struct skb_ext_page	*(*ctor)(struct mp_port *,
+					 struct sk_buff *, int);
+	/* the hash function attached to find the according
+	 * backend ring descriptor info for one external
+	 * buffer page.
+	 */
+	struct skb_ext_page	*(*hash)(struct net_device *,
+					 struct page *);
+};

 /*
 * This structure holds at boot time configured netdevice settings. They
@@ -1045,7 +1067,8 @@ struct net_device {
	/* GARP */
	struct garp_port	*garp_port;
-
+	/* mpassthru */
+	struct mp_port		*mp_port;
	/* class/net/name entry */
	struct device		dev;
	/* space for optional device, statistics, and wireless sysfs groups */
--
1.7.3
[PATCH v13 06/16] Use callback to deal with skb_release_data() specially.
From: Xin Xiaohui xiaohui@intel.com

If the buffer is external, use the callback to destruct the buffers.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 net/core/skbuff.c | 8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index c83b421..5e6d69c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -210,6 +210,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,

	/* make sure we initialize shinfo sequentially */
	shinfo = skb_shinfo(skb);
+	shinfo->destructor_arg = NULL;
	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
	atomic_set(&shinfo->dataref, 1);

@@ -343,6 +344,13 @@ static void skb_release_data(struct sk_buff *skb)
		if (skb_has_frags(skb))
			skb_drop_fraglist(skb);

+		if (skb->dev && dev_is_mpassthru(skb->dev)) {
+			struct skb_ext_page *ext_page =
+				skb_shinfo(skb)->destructor_arg;
+			if (ext_page && ext_page->dtor)
+				ext_page->dtor(ext_page);
+		}
+
		kfree(skb->head);
	}
 }
--
1.7.3
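The release path added by this patch is small enough to mimic in user space. The sketch below is an illustration only: the struct and function names (`fake_skb`, `fake_skb_release_data`) are invented stand-ins for the kernel objects, and the destructor simply counts its invocations where the real mp device would relinquish the pinned guest page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the skb_ext_page introduced in patch 01. */
struct skb_ext_page {
	void *page;
	void (*dtor)(struct skb_ext_page *);
};

/* Stand-in for the few skb fields the hunk touches. */
struct fake_skb {
	bool is_mpassthru;		/* result of dev_is_mpassthru(skb->dev) */
	struct skb_ext_page *destructor_arg; /* skb_shinfo(skb)->destructor_arg */
	void *head;
};

static int dtor_calls;

static void ext_dtor(struct skb_ext_page *p)
{
	dtor_calls++;	/* the mp device would release the guest page here */
	free(p);
}

/* Mirrors the skb_release_data() hunk: invoke the external destructor
 * only when the device does mediate passthru and a destructor_arg is set. */
static void fake_skb_release_data(struct fake_skb *skb)
{
	if (skb->is_mpassthru) {
		struct skb_ext_page *ext_page = skb->destructor_arg;

		if (ext_page && ext_page->dtor)
			ext_page->dtor(ext_page);
	}
	free(skb->head);
}
```

A plain skb (non-mpassthru device, or NULL destructor_arg) takes only the `free(skb->head)` path, which is why clearing destructor_arg in `__alloc_skb()` matters.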
[PATCH v13 05/16] Add a function to indicate if the device uses an external buffer.
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/netdevice.h | 5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 8dcf6de..f91d9bb 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1739,6 +1739,11 @@ extern gro_result_t	napi_gro_frags(struct napi_struct *napi);
 extern int netdev_mp_port_prep(struct net_device *dev,
			       struct mp_port *port);

+static inline bool dev_is_mpassthru(struct net_device *dev)
+{
+	return dev && dev->mp_port;
+}
+
 static inline void napi_free_frags(struct napi_struct *napi)
 {
	kfree_skb(napi->skb);
--
1.7.3
[PATCH v13 04/16] Add a function to let the external buffer owner query capability.
From: Xin Xiaohui xiaohui@intel.com

The external buffer owner can use these functions to get the capability of the underlying NIC driver.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/netdevice.h | 2 ++
 net/core/dev.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 51 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 575777f..8dcf6de 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1736,6 +1736,8 @@ extern gro_result_t	napi_frags_finish(struct napi_struct *napi,
					  gro_result_t ret);
 extern struct sk_buff *napi_frags_skb(struct napi_struct *napi);
 extern gro_result_t	napi_gro_frags(struct napi_struct *napi);
+extern int netdev_mp_port_prep(struct net_device *dev,
+			       struct mp_port *port);

 static inline void napi_free_frags(struct napi_struct *napi)
 {
diff --git a/net/core/dev.c b/net/core/dev.c
index 660dd41..e48639d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2942,6 +2942,55 @@ out:
	return ret;
 }

+/* To support mediate passthru (zero-copy) with a NIC driver,
+ * we'd better query the NIC driver for the capability it can
+ * provide, especially for packet split mode. Now we only
+ * query for the header size and the payload a descriptor
+ * may carry. If a driver does not use the API to export,
+ * then we may try to use a default value; currently,
+ * we use the default value from an IGB driver. Now,
+ * it's only called by the mpassthru device.
+ */
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+int netdev_mp_port_prep(struct net_device *dev,
+			struct mp_port *port)
+{
+	int rc;
+	int npages, data_len;
+	const struct net_device_ops *ops = dev->netdev_ops;
+
+	if (ops->ndo_mp_port_prep) {
+		rc = ops->ndo_mp_port_prep(dev, port);
+		if (rc)
+			return rc;
+	} else {
+		/* If the NIC driver did not report this,
+		 * then we try to use a default value.
+		 */
+		port->hdr_len = 128;
+		port->data_len = 2048;
+		port->npages = 1;
+	}
+
+	if (port->hdr_len <= 0)
+		goto err;
+
+	npages = port->npages;
+	data_len = port->data_len;
+	if (npages <= 0 || npages > MAX_SKB_FRAGS ||
+	    (data_len < PAGE_SIZE * (npages - 1) ||
+	     data_len > PAGE_SIZE * npages))
+		goto err;
+
+	return 0;
+err:
+	dev_warn(&dev->dev, "invalid page constructor parameters\n");
+
+	return -EINVAL;
+}
+EXPORT_SYMBOL(netdev_mp_port_prep);
+#endif
+
 /**
 * netif_receive_skb - process receive buffer from network
 * @skb: buffer to process
--
1.7.3
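The parameter validation in netdev_mp_port_prep() is plain arithmetic: the per-descriptor payload must fit in `npages` pages and must actually need the last page. This user-space sketch isolates that check; `FAKE_PAGE_SIZE` and `FAKE_MAX_SKB_FRAGS` are assumed typical x86 values, not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define FAKE_PAGE_SIZE		4096	/* assumption: 4K pages */
#define FAKE_MAX_SKB_FRAGS	18	/* assumption: typical value with 4K pages */

/* Mirrors the accept/reject logic of netdev_mp_port_prep():
 * hdr_len must be positive, npages within [1, MAX_SKB_FRAGS],
 * and data_len must land inside the last of the npages pages. */
static bool mp_port_params_valid(int hdr_len, int npages, int data_len)
{
	if (hdr_len <= 0)
		return false;
	if (npages <= 0 || npages > FAKE_MAX_SKB_FRAGS ||
	    data_len < FAKE_PAGE_SIZE * (npages - 1) ||
	    data_len > FAKE_PAGE_SIZE * npages)
		return false;
	return true;
}
```

The defaults the core falls back to (128-byte header, 2048-byte payload, one page, the IGB-derived values) pass this check; a payload larger than the pages provided, or too small to need the last page, is rejected with -EINVAL in the kernel code.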
[PATCH v13 01/16] Add a new structure for skb buffers from external sources.
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/skbuff.h | 9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 77eb60d..696e690 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -211,6 +211,15 @@ struct skb_shared_info {
	skb_frag_t	frags[MAX_SKB_FRAGS];
 };

+/* The structure is for an skb whose pages may point to
+ * an external buffer, which is not allocated from kernel space.
+ * It also contains a destructor for itself.
+ */
+struct skb_ext_page {
+	struct page	*page;
+	void		(*dtor)(struct skb_ext_page *);
+};
+
 /* We divide dataref into two halves. The higher 16 bits hold references
 * to the payload part of skb->data. The lower 16 bits hold references to
 * the entire skb->data. A clone of a headerless skb holds the length of
--
1.7.3
[PATCH v12 00/17] Provide a zero-copy method on KVM virtio-net.
We provide a zero-copy method by which the driver side may get external buffers for DMA. Here external means the driver doesn't use kernel space to allocate skb buffers. Currently the external buffers come from the guest virtio-net driver. The idea is simple: pin the guest VM user space and then give the host NIC driver the chance to DMA to it directly. The patches are based on the vhost-net backend driver. We add a device which provides proto_ops such as sendmsg/recvmsg to vhost-net to send/recv directly to/from the NIC driver. A KVM guest that uses the vhost-net backend may bind any ethX interface on the host side to get copyless data transfer through the guest virtio-net frontend. patch 01-10: net core and kernel changes. patch 11-13: a new device as the interface to manipulate external buffers. patch 14: for vhost-net. patch 15: an example of modifying a NIC driver to use napi_gro_frags(). patch 16: an example of how to get guest buffers based on a driver using napi_gro_frags(). patch 17: a patch to address comments from Michael S. Tsirkin, adding 2 new ioctls to the mp device; we split it out here to make review easier. Needs revision. The guest virtio-net driver submits multiple requests through the vhost-net backend driver to the kernel, and the requests are queued and then completed after the corresponding actions in h/w are done. For read, user space buffers are dispensed to the NIC driver for rx when a page constructor API is invoked; that means NICs can allocate user buffers from a page constructor. We add a hook in the netif_receive_skb() function to intercept the incoming packets and notify the zero-copy device. For write, the zero-copy device may allocate a new host skb, put the payload on skb_shinfo(skb)->frags, and copy the header to skb->data. The request remains pending until the skb is transmitted by h/w. We provide multiple submits and asynchronous notification to vhost-net too. Our goal is to improve the bandwidth and reduce the CPU usage. Exact performance data will be provided later.
What we have not done yet: performance tuning. What we have done in v1: polish the RCU usage; deal with write logging in asynchronous mode in vhost; add a notifier block for the mp device; rename page_ctor to mp_port in netdevice.h to make it look generic; add mp_dev_change_flags() for the mp device to change NIC state; add CONFIG_VHOST_MPASSTHRU to limit the usage when the module is not loaded; a small fix for a missing dev_put on failure; use a dynamic minor instead of a static minor number; a __KERNEL__ guard for mp_get_sock(). What we have done in v2: remove most of the RCU usage, since the ctor pointer is only changed by the BIND/UNBIND ioctls, and during that time the NIC will be stopped to get a clean state (all outstanding requests are finished), so the ctor pointer cannot be raced into a wrong situation; replace struct vhost_notifier with struct kiocb, letting the vhost-net backend alloc/free the kiocbs and transfer them via sendmsg/recvmsg; use get_user_pages_fast() and set_page_dirty_lock() when reading; add some comments for netdev_mp_port_prep() and handle_mpassthru(). What we have done in v3: the async write logging is rewritten; a drafted synchronous write function for qemu live migration; a limit on locked pages from get_user_pages_fast() to prevent DoS, using RLIMIT_MEMLOCK. What we have done in v4: add an iocb completion callback from vhost-net to queue iocbs in the mp device; replace vq->receiver with mp_sock_data_ready(); remove stuff in the mp device that accessed structures from vhost-net; modify skb_reserve() to ignore host NIC driver reserved space; rebase to the latest vhost tree; split large patches into small pieces, especially for the net core part. What we have done in v5: address Arnd Bergmann's comments - remove the IFF_MPASSTHRU_EXCL flag in the mp device - add a CONFIG_COMPAT macro - remove the mp_release ops; make dev_is_mpassthru() an inline func; fix a bug in memory relinquish; apply to the current git (2.6.34-rc6) tree.
What we have done in v6: move create_iocb() out of page_dtor, which may run in interrupt context - this removes the potential issue of a lock being taken in interrupt context; make the caches used by mp and vhost static, created/destroyed in the modules' init/exit functions - this allows multiple mp guests to be created at the same time. What we have done in v7: some cleanup in preparation to support PS mode. What we have done in v8: discard the modifications that pointed skb->data to the guest buffer directly; add code to modify a driver to support napi_gro_frags(), following Herbert's comments. To support
[PATCH v12 01/17] Add a new structure for skb buffers from external sources.
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/skbuff.h | 9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 124f90c..74af06c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -203,6 +203,15 @@ struct skb_shared_info {
	void *		destructor_arg;
 };

+/* The structure is for an skb whose pages may point to
+ * an external buffer, which is not allocated from kernel space.
+ * It also contains a destructor for itself.
+ */
+struct skb_ext_page {
+	struct page	*page;
+	void		(*dtor)(struct skb_ext_page *);
+};
+
 /* We divide dataref into two halves. The higher 16 bits hold references
 * to the payload part of skb->data. The lower 16 bits hold references to
 * the entire skb->data. A clone of a headerless skb holds the length of
--
1.7.3
[PATCH v12 02/17] Add a new struct for device to manipulate external buffer.
From: Xin Xiaohui xiaohui@intel.com

Add a field to struct net_device named mp_port. It's for mediate passthru (zero-copy). It contains the capability of the net device driver, a socket, and an external buffer creator; external means the skb buffers belonging to the device may not be allocated from kernel space.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/netdevice.h | 25 ++++++++++++++++++++++++-
 1 files changed, 24 insertions(+), 1 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index fa8b476..9ef9bf1 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -530,6 +530,28 @@ struct netdev_queue {
	unsigned long		tx_dropped;
 } ____cacheline_aligned_in_smp;

+/* The structure for mediate passthru (zero-copy). */
+struct mp_port {
+	/* the header len */
+	int			hdr_len;
+	/* the max payload len for one descriptor */
+	int			data_len;
+	/* the pages for DMA in one time */
+	int			npages;
+	/* the socket bound to */
+	struct socket		*sock;
+	/* the header len for virtio-net */
+	int			vnet_hlen;
+	/* the external buffer page creator */
+	struct skb_ext_page	*(*ctor)(struct mp_port *,
+					 struct sk_buff *, int);
+	/* the hash function attached to find the according
+	 * backend ring descriptor info for one external
+	 * buffer page.
+	 */
+	struct skb_ext_page	*(*hash)(struct net_device *,
+					 struct page *);
+};

 /*
 * This structure defines the management hooks for network devices.
@@ -952,7 +974,8 @@ struct net_device {
	struct macvlan_port	*macvlan_port;
	/* GARP */
	struct garp_port	*garp_port;
-
+	/* mpassthru */
+	struct mp_port		*mp_port;
	/* class/net/name entry */
	struct device		dev;
	/* space for optional device, statistics, and wireless sysfs groups */
--
1.7.3
[PATCH v12 04/17] Add a function to let the external buffer owner query capability.
From: Xin Xiaohui xiaohui@intel.com

The external buffer owner can use these functions to get the capability of the underlying NIC driver.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/netdevice.h | 2 ++
 net/core/dev.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 51 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 24a31e7..27f5024 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1608,6 +1608,8 @@ extern gro_result_t	napi_frags_finish(struct napi_struct *napi,
					  gro_result_t ret);
 extern struct sk_buff *napi_frags_skb(struct napi_struct *napi);
 extern gro_result_t	napi_gro_frags(struct napi_struct *napi);
+extern int netdev_mp_port_prep(struct net_device *dev,
+			       struct mp_port *port);

 static inline void napi_free_frags(struct napi_struct *napi)
 {
diff --git a/net/core/dev.c b/net/core/dev.c
index 264137f..c11e32c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2468,6 +2468,55 @@ void netif_nit_deliver(struct sk_buff *skb)
	rcu_read_unlock();
 }

+/* To support mediate passthru (zero-copy) with a NIC driver,
+ * we'd better query the NIC driver for the capability it can
+ * provide, especially for packet split mode. Now we only
+ * query for the header size and the payload a descriptor
+ * may carry. If a driver does not use the API to export,
+ * then we may try to use a default value; currently,
+ * we use the default value from an IGB driver. Now,
+ * it's only called by the mpassthru device.
+ */
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+int netdev_mp_port_prep(struct net_device *dev,
+			struct mp_port *port)
+{
+	int rc;
+	int npages, data_len;
+	const struct net_device_ops *ops = dev->netdev_ops;
+
+	if (ops->ndo_mp_port_prep) {
+		rc = ops->ndo_mp_port_prep(dev, port);
+		if (rc)
+			return rc;
+	} else {
+		/* If the NIC driver did not report this,
+		 * then we try to use a default value.
+		 */
+		port->hdr_len = 128;
+		port->data_len = 2048;
+		port->npages = 1;
+	}
+
+	if (port->hdr_len <= 0)
+		goto err;
+
+	npages = port->npages;
+	data_len = port->data_len;
+	if (npages <= 0 || npages > MAX_SKB_FRAGS ||
+	    (data_len < PAGE_SIZE * (npages - 1) ||
+	     data_len > PAGE_SIZE * npages))
+		goto err;
+
+	return 0;
+err:
+	dev_warn(&dev->dev, "invalid page constructor parameters\n");
+
+	return -EINVAL;
+}
+EXPORT_SYMBOL(netdev_mp_port_prep);
+#endif
+
 /**
 * netif_receive_skb - process receive buffer from network
 * @skb: buffer to process
--
1.7.3
[PATCH v12 05/17] Add a function to indicate if the device uses an external buffer.
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/netdevice.h | 5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 27f5024..9ed4fa2 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1611,6 +1611,11 @@ extern gro_result_t	napi_gro_frags(struct napi_struct *napi);
 extern int netdev_mp_port_prep(struct net_device *dev,
			       struct mp_port *port);

+static inline bool dev_is_mpassthru(struct net_device *dev)
+{
+	return dev && dev->mp_port;
+}
+
 static inline void napi_free_frags(struct napi_struct *napi)
 {
	kfree_skb(napi->skb);
--
1.7.3
[PATCH v12 07/17] Modify netdev_alloc_page() to get external buffer
From: Xin Xiaohui xiaohui@intel.com

Currently, it can get external buffers from the mp device.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 net/core/skbuff.c | 27 +++++++++++++++++++++++++++
 1 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 117d82b..da02fa1 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -269,11 +269,38 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
 }
 EXPORT_SYMBOL(__netdev_alloc_skb);

+struct page *netdev_alloc_ext_pages(struct net_device *dev, int npages)
+{
+	struct mp_port *port;
+	struct skb_ext_page *ext_page = NULL;
+
+	port = dev->mp_port;
+	if (!port)
+		goto out;
+	ext_page = port->ctor(port, NULL, npages);
+	if (ext_page)
+		return ext_page->page;
+out:
+	return NULL;
+}
+EXPORT_SYMBOL(netdev_alloc_ext_pages);
+
+struct page *netdev_alloc_ext_page(struct net_device *dev)
+{
+	return netdev_alloc_ext_pages(dev, 1);
+}
+EXPORT_SYMBOL(netdev_alloc_ext_page);
+
 struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
 {
	int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
	struct page *page;

+	if (dev_is_mpassthru(dev))
+		return netdev_alloc_ext_page(dev);
+
	page = alloc_pages_node(node, gfp_mask, 0);
	return page;
 }
--
1.7.3
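The allocation fallback above is a two-way dispatch: devices with an mp_port get pages from the port's page constructor, everyone else from the normal page allocator. The user-space sketch below shows just that control flow; all `fake_*` names are invented stand-ins, and the "pages" are opaque tokens rather than real struct page objects.

```c
#include <assert.h>
#include <stddef.h>

struct fake_ext_page { void *page; };

struct fake_port {
	struct fake_ext_page *(*ctor)(struct fake_port *, void *, int);
};

struct fake_dev { struct fake_port *mp_port; };

static struct fake_ext_page guest_page = { (void *)0x1 };

/* Stand-in for the mp device's page constructor, which would hand
 * out a pinned guest page in the real driver. */
static struct fake_ext_page *fake_ctor(struct fake_port *p, void *skb, int n)
{
	(void)p; (void)skb; (void)n;
	return &guest_page;
}

static void *kernel_page = (void *)0x2;	/* stands in for alloc_pages_node() */

/* Mirrors the dispatch added to __netdev_alloc_page(). */
static void *fake_netdev_alloc_page(struct fake_dev *dev)
{
	if (dev->mp_port)	/* dev_is_mpassthru() path */
		return dev->mp_port->ctor(dev->mp_port, NULL, 1)->page;
	return kernel_page;	/* normal allocator path */
}
```

The real function also handles a NULL return from the constructor; the sketch omits that for brevity.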
[PATCH v12 09/17] Don't do skb recycle if the device uses external buffers.
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 net/core/skbuff.c | 6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 76c8301..1aede3a 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -565,6 +565,12 @@ int skb_recycle_check(struct sk_buff *skb, int skb_size)
	if (skb_shared(skb) || skb_cloned(skb))
		return 0;

+	/* if the device wants to do mediate passthru, the skb may
+	 * get an external buffer, so don't recycle
+	 */
+	if (dev_is_mpassthru(skb->dev))
+		return 0;
+
	skb_release_head_state(skb);
	shinfo = skb_shinfo(skb);
	atomic_set(&shinfo->dataref, 1);
--
1.7.3
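The added early exit extends the existing recycle checks with one more condition. A minimal user-space sketch of the decision (function name `can_recycle` is invented for illustration; the real skb_recycle_check() also verifies skb size and other state):

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the early-exit logic of the patched skb_recycle_check():
 * shared, cloned, or mpassthru-backed skbs must not be recycled,
 * because mpassthru frags may point at pinned guest memory. */
static bool can_recycle(bool shared, bool cloned, bool is_mpassthru)
{
	if (shared || cloned)
		return false;
	if (is_mpassthru)
		return false;
	return true;
}
```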
[PATCH v12 11/17] Add header file for mp device.
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/mpassthru.h | 26 ++++++++++++++++++++++++++
 1 files changed, 26 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/mpassthru.h

diff --git a/include/linux/mpassthru.h b/include/linux/mpassthru.h
new file mode 100644
index 000..c0973b6
--- /dev/null
+++ b/include/linux/mpassthru.h
@@ -0,0 +1,26 @@
+#ifndef __MPASSTHRU_H
+#define __MPASSTHRU_H
+
+#include <linux/types.h>
+#include <linux/if_ether.h>
+#include <linux/ioctl.h>
+
+/* ioctl defines */
+#define MPASSTHRU_BINDDEV	_IOW('M', 213, int)
+#define MPASSTHRU_UNBINDDEV	_IO('M', 214)
+
+#ifdef __KERNEL__
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+struct socket *mp_get_socket(struct file *);
+#else
+#include <linux/err.h>
+#include <linux/errno.h>
+struct file;
+struct socket;
+static inline struct socket *mp_get_socket(struct file *f)
+{
+	return ERR_PTR(-EINVAL);
+}
+#endif /* CONFIG_MEDIATE_PASSTHRU */
+#endif /* __KERNEL__ */
+#endif /* __MPASSTHRU_H */
--
1.7.3
[PATCH v12 12/17] Add mp (mediate passthru) device.
From: Xin Xiaohui xiaohui@intel.com

The patch adds the mp (mediate passthru) device, which is now based on the vhost-net backend driver and provides proto_ops to send/receive guest buffer data from/to the guest virtio-net driver.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 drivers/vhost/mpassthru.c | 1380 +
 1 files changed, 1380 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/mpassthru.c

diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
new file mode 100644
index 000..1a114d1
--- /dev/null
+++ b/drivers/vhost/mpassthru.c
@@ -0,0 +1,1380 @@
+/*
+ * MPASSTHRU - Mediate passthrough device.
+ * Copyright (C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#define DRV_NAME	"mpassthru"
+#define DRV_DESCRIPTION	"Mediate passthru device driver"
+#define DRV_COPYRIGHT	"(C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G"
+
+#include <linux/compat.h>
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/major.h>
+#include <linux/slab.h>
+#include <linux/smp_lock.h>
+#include <linux/poll.h>
+#include <linux/fcntl.h>
+#include <linux/init.h>
+#include <linux/aio.h>
+
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/miscdevice.h>
+#include <linux/ethtool.h>
+#include <linux/rtnetlink.h>
+#include <linux/if.h>
+#include <linux/if_arp.h>
+#include <linux/if_ether.h>
+#include <linux/crc32.h>
+#include <linux/nsproxy.h>
+#include <linux/uaccess.h>
+#include <linux/virtio_net.h>
+#include <linux/mpassthru.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+#include <net/rtnetlink.h>
+#include <net/sock.h>
+
+#include <asm/system.h>
+
+#define COPY_THRESHOLD	(L1_CACHE_BYTES * 4)
+#define COPY_HDR_LEN	(L1_CACHE_BYTES < 64 ? 64 : L1_CACHE_BYTES)
+
+struct frag {
+	u16	offset;
+	u16	size;
+};
+
+#define	HASH_BUCKETS	(8192*2)
+
+struct page_info {
+	struct list_head	list;
+	struct page_info	*next;
+	struct page_info	*prev;
+	struct page		*pages[MAX_SKB_FRAGS];
+	struct sk_buff		*skb;
+	struct page_pool	*pool;
+
+	/* The pointer relayed to the skb, to indicate
+	 * whether it's an externally allocated skb or a kernel one
+	 */
+	struct skb_ext_page	ext_page;
+	/* flag to indicate read or write */
+#define INFO_READ	0
+#define INFO_WRITE	1
+	unsigned		flags;
+	/* exact number of locked pages */
+	unsigned		pnum;
+
+	/* The fields after that are for the backend
+	 * driver, now for vhost-net.
+	 */
+	/* the kiocb structure related to */
+	struct kiocb		*iocb;
+	/* the ring descriptor index */
+	unsigned int		desc_pos;
+	/* the iovec coming from the backend, we only
+	 * need a few of them */
+	struct iovec		hdr[2];
+	struct iovec		iov[2];
+};
+
+static struct kmem_cache *ext_page_info_cache;
+
+struct page_pool {
+	/* the queue for the rx side */
+	struct list_head	readq;
+	/* the lock to protect readq */
+	spinlock_t		read_lock;
+	/* record the original rlimit */
+	struct rlimit		o_rlim;
+	/* record the locked pages */
+	int			lock_pages;
+	/* the device according to */
+	struct net_device	*dev;
+	/* the mp_port according to dev */
+	struct mp_port		port;
+	/* the hash_table list to find each locked page */
+	struct page_info	**hash_table;
+};
+
+struct mp_struct {
+	struct mp_file		*mfile;
+	struct net_device	*dev;
+	struct page_pool	*pool;
+	struct socket		socket;
+};
+
+struct mp_file {
+	atomic_t		count;
+	struct mp_struct	*mp;
+	struct net		*net;
+};
+
+struct mp_sock {
+	struct sock		sk;
+	struct mp_struct	*mp;
+};
+
+/* The main function to allocate external buffers */
+static struct skb_ext_page *page_ctor(struct mp_port *port,
+				      struct sk_buff *skb,
+				      int npages)
+{
+	int i;
+	unsigned long flags;
+	struct page_pool *pool;
+	struct page_info *info
[PATCH v12 13/17] Add a Kconfig entry and Makefile entry for the mp device.
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 drivers/vhost/Kconfig | 10 ++++++++++
 drivers/vhost/Makefile | 2 ++
 2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index e4e2fd1..a6b8cbf 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -9,3 +9,13 @@ config VHOST_NET
	  To compile this driver as a module, choose M here: the module will
	  be called vhost_net.

+config MEDIATE_PASSTHRU
+	tristate "mediate passthru network driver (EXPERIMENTAL)"
+	depends on VHOST_NET
+	---help---
+	  zerocopy network I/O support; we call it mediate passthru to
+	  distinguish it from hardware passthru.
+
+	  To compile this driver as a module, choose M here: the module will
+	  be called mpassthru.
+
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 72dd020..c18b9fc 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -1,2 +1,4 @@
 obj-$(CONFIG_VHOST_NET) += vhost_net.o
 vhost_net-y := vhost.o net.o
+
+obj-$(CONFIG_MEDIATE_PASSTHRU) += mpassthru.o
--
1.7.3
[PATCH v12 10/17] Add a hook to intercept external buffers from NIC driver.
From: Xin Xiaohui xiaohui@intel.com The hook is called in netif_receive_skb(). Signed-off-by: Xin Xiaohui xiaohui@intel.com Signed-off-by: Zhao Yu yzhao81...@gmail.com Reviewed-by: Jeff Dike jd...@linux.intel.com --- net/core/dev.c | 35 +++ 1 files changed, 35 insertions(+), 0 deletions(-) diff --git a/net/core/dev.c b/net/core/dev.c index c11e32c..83172b8 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -2517,6 +2517,37 @@ err: EXPORT_SYMBOL(netdev_mp_port_prep); #endif +#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE) +/* Add a hook to intercept mediate passthru (zero-copy) packets, + * and queue them on the socket owned by the mp_port. + */ +static inline struct sk_buff *handle_mpassthru(struct sk_buff *skb, + struct packet_type **pt_prev, + int *ret, + struct net_device *orig_dev) +{ + struct mp_port *mp_port = NULL; + struct sock *sk = NULL; + + if (!dev_is_mpassthru(skb->dev)) + return skb; + mp_port = skb->dev->mp_port; + + if (*pt_prev) { + *ret = deliver_skb(skb, *pt_prev, orig_dev); + *pt_prev = NULL; + } + + sk = mp_port->sock->sk; + skb_queue_tail(&sk->sk_receive_queue, skb); + sk->sk_state_change(sk); + + return NULL; +} +#else +#define handle_mpassthru(skb, pt_prev, ret, orig_dev) (skb) +#endif + /** * netif_receive_skb - process receive buffer from network * @skb: buffer to process @@ -2598,6 +2629,10 @@ int netif_receive_skb(struct sk_buff *skb) ncls: #endif + /* Intercept mediate passthru (zero-copy) packets here */ + skb = handle_mpassthru(skb, &pt_prev, &ret, orig_dev); + if (!skb) + goto out; skb = handle_bridge(skb, &pt_prev, &ret, orig_dev); if (!skb) goto out; -- 1.7.3 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v12 14/17] Provides multiple submits and asynchronous notifications.
From: Xin Xiaohui xiaohui@intel.com The vhost-net backend now only supports synchronous send/recv operations. The patch provides multiple submits and asynchronous notifications. This is needed for zero-copy case. Signed-off-by: Xin Xiaohui xiaohui@intel.com --- drivers/vhost/net.c | 341 + drivers/vhost/vhost.c | 79 drivers/vhost/vhost.h | 15 ++ 3 files changed, 407 insertions(+), 28 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index b38abc6..44f4b15 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -24,6 +24,8 @@ #include linux/if_arp.h #include linux/if_tun.h #include linux/if_macvlan.h +#include linux/mpassthru.h +#include linux/aio.h #include net/sock.h @@ -39,6 +41,8 @@ enum { VHOST_NET_VQ_MAX = 2, }; +static struct kmem_cache *notify_cache; + enum vhost_net_poll_state { VHOST_NET_POLL_DISABLED = 0, VHOST_NET_POLL_STARTED = 1, @@ -49,6 +53,7 @@ struct vhost_net { struct vhost_dev dev; struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX]; struct vhost_poll poll[VHOST_NET_VQ_MAX]; + struct kmem_cache *cache; /* Tells us whether we are polling a socket for TX. * We only do this when socket buffer fills up. * Protected by tx vq lock. 
*/ @@ -93,11 +98,183 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock) net-tx_poll_state = VHOST_NET_POLL_STARTED; } +struct kiocb *notify_dequeue(struct vhost_virtqueue *vq) +{ + struct kiocb *iocb = NULL; + unsigned long flags; + + spin_lock_irqsave(vq-notify_lock, flags); + if (!list_empty(vq-notifier)) { + iocb = list_first_entry(vq-notifier, + struct kiocb, ki_list); + list_del(iocb-ki_list); + } + spin_unlock_irqrestore(vq-notify_lock, flags); + return iocb; +} + +static void handle_iocb(struct kiocb *iocb) +{ + struct vhost_virtqueue *vq = iocb-private; + unsigned long flags; + + spin_lock_irqsave(vq-notify_lock, flags); + list_add_tail(iocb-ki_list, vq-notifier); + spin_unlock_irqrestore(vq-notify_lock, flags); +} + +static int is_async_vq(struct vhost_virtqueue *vq) +{ + return (vq-link_state == VHOST_VQ_LINK_ASYNC); +} + +static void handle_async_rx_events_notify(struct vhost_net *net, + struct vhost_virtqueue *vq, + struct socket *sock) +{ + struct kiocb *iocb = NULL; + struct vhost_log *vq_log = NULL; + int rx_total_len = 0; + unsigned int head, log, in, out; + int size; + + if (!is_async_vq(vq)) + return; + + if (sock-sk-sk_data_ready) + sock-sk-sk_data_ready(sock-sk, 0); + + vq_log = unlikely(vhost_has_feature(net-dev, VHOST_F_LOG_ALL)) ? + vq-log : NULL; + + while ((iocb = notify_dequeue(vq)) != NULL) { + if (!iocb-ki_left) { + vhost_add_used_and_signal(net-dev, vq, + iocb-ki_pos, iocb-ki_nbytes); + size = iocb-ki_nbytes; + head = iocb-ki_pos; + rx_total_len += iocb-ki_nbytes; + + if (iocb-ki_dtor) + iocb-ki_dtor(iocb); + kmem_cache_free(net-cache, iocb); + + /* when log is enabled, recomputing the log is needed, +* since these buffers are in async queue, may not get +* the log info before. 
+*/ + if (unlikely(vq_log)) { + if (!log) + __vhost_get_desc(net-dev, vq, vq-iov, + ARRAY_SIZE(vq-iov), + out, in, vq_log, + log, head); + vhost_log_write(vq, vq_log, log, size); + } + if (unlikely(rx_total_len = VHOST_NET_WEIGHT)) { + vhost_poll_queue(vq-poll); + break; + } + } else { + int i = 0; + int count = iocb-ki_left; + int hc = count; + while (count--) { + if (iocb) { + vq-heads[i].id = iocb-ki_pos; + vq-heads[i].len = iocb-ki_nbytes; + size = iocb-ki_nbytes; + head = iocb-ki_pos; + rx_total_len +=
[PATCH v12 16/17] An example of how to allocate user buffers based on the napi_gro_frags() interface.
From: Xin Xiaohui xiaohui@intel.com This example is made on the ixgbe driver, which uses napi_gro_frags(). It can get buffers from the guest side directly using netdev_alloc_page() and release guest buffers using netdev_free_page(). --- drivers/net/ixgbe/ixgbe_main.c | 25 + 1 files changed, 21 insertions(+), 4 deletions(-) diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c index 905d6d2..0977f2f 100644 --- a/drivers/net/ixgbe/ixgbe_main.c +++ b/drivers/net/ixgbe/ixgbe_main.c @@ -691,7 +691,14 @@ static inline void ixgbe_release_rx_desc(struct ixgbe_hw *hw, static bool is_rx_buffer_mapped_as_page(struct ixgbe_rx_buffer *bi, struct net_device *dev) { - return true; + return dev_is_mpassthru(dev); +} + +static u32 get_page_skb_offset(struct net_device *dev) +{ + if (!dev_is_mpassthru(dev)) + return 0; + return dev->mp_port->vnet_hlen; } /** @@ -764,7 +771,8 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter, adapter->alloc_rx_page_failed++; goto no_buffers; } - bi->page_skb_offset = 0; + bi->page_skb_offset = + get_page_skb_offset(adapter->netdev); bi->dma = pci_map_page(pdev, bi->page_skb, bi->page_skb_offset, (PAGE_SIZE / 2), @@ -899,8 +907,10 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector, len = le16_to_cpu(rx_desc->wb.upper.length); } - if (is_no_buffer(rx_buffer_info)) + if (is_no_buffer(rx_buffer_info)) { + printk("no buffers\n"); break; + } cleaned = true; @@ -959,6 +969,12 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector, rx_buffer_info->page_skb, rx_buffer_info->page_skb_offset, len); + if (dev_is_mpassthru(netdev) && netdev->mp_port->hash) + skb_shinfo(skb)->destructor_arg = + netdev->mp_port->hash(netdev, + rx_buffer_info->page_skb); + rx_buffer_info->page_skb = NULL; skb->len += len; skb->data_len += len; @@ -976,7 +992,8 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector, upper_len); if ((rx_ring->rx_buf_len > (PAGE_SIZE / 2)) || - (page_count(rx_buffer_info->page) != 1)) + (page_count(rx_buffer_info->page) != 1) || + dev_is_mpassthru(netdev)) rx_buffer_info->page = NULL; else get_page(rx_buffer_info->page); -- 1.7.3 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v12 15/17] An example of how to modify a NIC driver to use the napi_gro_frags() interface
From: Xin Xiaohui xiaohui@intel.com This example is made on ixgbe driver. It provides API is_rx_buffer_mapped_as_page() to indicate if the driver use napi_gro_frags() interface or not. The example allocates 2 pages for DMA for one ring descriptor using netdev_alloc_page(). When packets is coming, using napi_gro_frags() to allocate skb and to receive the packets. --- drivers/net/ixgbe/ixgbe.h |3 + drivers/net/ixgbe/ixgbe_main.c | 151 2 files changed, 125 insertions(+), 29 deletions(-) diff --git a/drivers/net/ixgbe/ixgbe.h b/drivers/net/ixgbe/ixgbe.h index 79c35ae..fceffc5 100644 --- a/drivers/net/ixgbe/ixgbe.h +++ b/drivers/net/ixgbe/ixgbe.h @@ -131,6 +131,9 @@ struct ixgbe_rx_buffer { struct page *page; dma_addr_t page_dma; unsigned int page_offset; + u16 mapped_as_page; + struct page *page_skb; + unsigned int page_skb_offset; }; struct ixgbe_queue_stats { diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c index 6c00ee4..905d6d2 100644 --- a/drivers/net/ixgbe/ixgbe_main.c +++ b/drivers/net/ixgbe/ixgbe_main.c @@ -688,6 +688,12 @@ static inline void ixgbe_release_rx_desc(struct ixgbe_hw *hw, IXGBE_WRITE_REG(hw, IXGBE_RDT(rx_ring-reg_idx), val); } +static bool is_rx_buffer_mapped_as_page(struct ixgbe_rx_buffer *bi, + struct net_device *dev) +{ + return true; +} + /** * ixgbe_alloc_rx_buffers - Replace used receive buffers; packet split * @adapter: address of board private structure @@ -704,13 +710,17 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter, i = rx_ring-next_to_use; bi = rx_ring-rx_buffer_info[i]; + while (cleaned_count--) { rx_desc = IXGBE_RX_DESC_ADV(*rx_ring, i); + bi-mapped_as_page = + is_rx_buffer_mapped_as_page(bi, adapter-netdev); + if (!bi-page_dma (rx_ring-flags IXGBE_RING_RX_PS_ENABLED)) { if (!bi-page) { - bi-page = alloc_page(GFP_ATOMIC); + bi-page = netdev_alloc_page(adapter-netdev); if (!bi-page) { adapter-alloc_rx_page_failed++; goto no_buffers; @@ -727,7 +737,7 @@ static void 
ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter, PCI_DMA_FROMDEVICE); } - if (!bi-skb) { + if (!bi-mapped_as_page !bi-skb) { struct sk_buff *skb; /* netdev_alloc_skb reserves 32 bytes up front!! */ uint bufsz = rx_ring-rx_buf_len + SMP_CACHE_BYTES; @@ -747,6 +757,19 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter, rx_ring-rx_buf_len, PCI_DMA_FROMDEVICE); } + + if (bi-mapped_as_page !bi-page_skb) { + bi-page_skb = netdev_alloc_page(adapter-netdev); + if (!bi-page_skb) { + adapter-alloc_rx_page_failed++; + goto no_buffers; + } + bi-page_skb_offset = 0; + bi-dma = pci_map_page(pdev, bi-page_skb, + bi-page_skb_offset, + (PAGE_SIZE / 2), + PCI_DMA_FROMDEVICE); + } /* Refresh the desc even if buffer_addrs didn't change because * each write-back erases this info. */ if (rx_ring-flags IXGBE_RING_RX_PS_ENABLED) { @@ -823,6 +846,13 @@ struct ixgbe_rsc_cb { dma_addr_t dma; }; +static bool is_no_buffer(struct ixgbe_rx_buffer *rx_buffer_info) +{ + return (!rx_buffer_info-skb || + !rx_buffer_info-page_skb) + !rx_buffer_info-page; +} + #define IXGBE_RSC_CB(skb) ((struct ixgbe_rsc_cb *)(skb)-cb) static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector, @@ -832,6 +862,7 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector, struct ixgbe_adapter *adapter = q_vector-adapter; struct net_device *netdev = adapter-netdev; struct pci_dev *pdev = adapter-pdev; + struct napi_struct *napi = q_vector-napi; union ixgbe_adv_rx_desc *rx_desc, *next_rxd; struct ixgbe_rx_buffer *rx_buffer_info, *next_buffer; struct sk_buff *skb; @@ -868,29 +899,71 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector, len = le16_to_cpu(rx_desc-wb.upper.length); } + if
[PATCH v12 17/17] Add two new ioctls for mp device.
From: Xin Xiaohui xiaohui@intel.com The patch add two ioctls for mp device. One is for userspace to query how much memory locked to make mp device run smoothly. Another one is for userspace to set how much meory locked it really wants. --- drivers/vhost/mpassthru.c | 109 +++-- include/linux/mpassthru.h |2 + 2 files changed, 58 insertions(+), 53 deletions(-) diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c index 1a114d1..41aa59e 100644 --- a/drivers/vhost/mpassthru.c +++ b/drivers/vhost/mpassthru.c @@ -54,6 +54,8 @@ #define COPY_THRESHOLD (L1_CACHE_BYTES * 4) #define COPY_HDR_LEN (L1_CACHE_BYTES 64 ? 64 : L1_CACHE_BYTES) +#define DEFAULT_NEED ((8192*2*2)*4096) + struct frag { u16 offset; u16 size; @@ -102,8 +104,10 @@ struct page_pool { spinlock_t read_lock; /* record the orignal rlimit */ struct rlimit o_rlim; - /* record the locked pages */ - int lock_pages; + /* userspace wants to locked */ + int locked_pages; + /* currently locked pages */ + int cur_pages; /* the device according to */ struct net_device *dev; /* the mp_port according to dev */ @@ -117,6 +121,7 @@ struct mp_struct { struct net_device *dev; struct page_pool*pool; struct socket socket; + struct task_struct *user; }; struct mp_file { @@ -207,8 +212,8 @@ static int page_pool_attach(struct mp_struct *mp) pool-port.ctor = page_ctor; pool-port.sock = mp-socket; pool-port.hash = mp_lookup; - pool-lock_pages = 0; - + pool-locked_pages = 0; + pool-cur_pages = 0; /* locked by mp_mutex */ dev-mp_port = pool-port; mp-pool = pool; @@ -236,37 +241,6 @@ struct page_info *info_dequeue(struct page_pool *pool) return info; } -static int set_memlock_rlimit(struct page_pool *pool, int resource, - unsigned long cur, unsigned long max) -{ - struct rlimit new_rlim, *old_rlim; - int retval; - - if (resource != RLIMIT_MEMLOCK) - return -EINVAL; - new_rlim.rlim_cur = cur; - new_rlim.rlim_max = max; - - old_rlim = current-signal-rlim + resource; - - /* remember the old rlimit value when backend enabled */ 
- pool-o_rlim.rlim_cur = old_rlim-rlim_cur; - pool-o_rlim.rlim_max = old_rlim-rlim_max; - - if ((new_rlim.rlim_max old_rlim-rlim_max) - !capable(CAP_SYS_RESOURCE)) - return -EPERM; - - retval = security_task_setrlimit(resource, new_rlim); - if (retval) - return retval; - - task_lock(current-group_leader); - *old_rlim = new_rlim; - task_unlock(current-group_leader); - return 0; -} - static void mp_ki_dtor(struct kiocb *iocb) { struct page_info *info = (struct page_info *)(iocb-private); @@ -286,7 +260,7 @@ static void mp_ki_dtor(struct kiocb *iocb) } } /* Decrement the number of locked pages */ - info-pool-lock_pages -= info-pnum; + info-pool-cur_pages -= info-pnum; kmem_cache_free(ext_page_info_cache, info); return; @@ -319,6 +293,7 @@ static int page_pool_detach(struct mp_struct *mp) { struct page_pool *pool; struct page_info *info; + struct task_struct *tsk = mp-user; int i; /* locked by mp_mutex */ @@ -334,9 +309,9 @@ static int page_pool_detach(struct mp_struct *mp) kmem_cache_free(ext_page_info_cache, info); } - set_memlock_rlimit(pool, RLIMIT_MEMLOCK, - pool-o_rlim.rlim_cur, - pool-o_rlim.rlim_max); + down_write(tsk-mm-mmap_sem); + tsk-mm-locked_vm -= pool-locked_pages; + up_write(tsk-mm-mmap_sem); /* locked by mp_mutex */ pool-dev-mp_port = NULL; @@ -534,14 +509,11 @@ static struct page_info *alloc_page_info(struct page_pool *pool, int rc; int i, j, n = 0; int len; - unsigned long base, lock_limit; + unsigned long base; struct page_info *info = NULL; - lock_limit = current-signal-rlim[RLIMIT_MEMLOCK].rlim_cur; - lock_limit = PAGE_SHIFT; - - if (pool-lock_pages + count lock_limit npages) { - printk(KERN_INFO exceed the locked memory rlimit.); + if (pool-cur_pages + count pool-locked_pages) { + printk(KERN_INFO Exceed memory lock rlimt.); return NULL; } @@ -603,7 +575,7 @@ static struct page_info *alloc_page_info(struct page_pool *pool, mp_hash_insert(pool, info-pages[i], info); } /* increment the number of locked pages */ - pool-lock_pages += j; +
[PATCH v12 03/17] Add a ndo_mp_port_prep pointer to net_device_ops.
From: Xin Xiaohui xiaohui@intel.com If the driver wants to allocate external buffers, it can export its capabilities, such as the skb buffer header length, the page length that can be DMA-mapped, etc. The external buffers' owner may utilize this. Signed-off-by: Xin Xiaohui xiaohui@intel.com Signed-off-by: Zhao Yu yzhao81...@gmail.com Reviewed-by: Jeff Dike jd...@linux.intel.com --- include/linux/netdevice.h | 10 ++ 1 files changed, 10 insertions(+), 0 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 9ef9bf1..24a31e7 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -651,6 +651,12 @@ struct mp_port { * int (*ndo_set_vf_tx_rate)(struct net_device *dev, int vf, int rate); * int (*ndo_get_vf_config)(struct net_device *dev, * int vf, struct ifla_vf_info *ivf); + * + * int (*ndo_mp_port_prep)(struct net_device *dev, struct mp_port *port); + * If the driver wants to allocate external buffers, + * it can export its capabilities, such as the skb + * buffer header length, the page length that can be DMA-mapped, etc. + * The external buffers' owner may utilize this. */ #define HAVE_NET_DEVICE_OPS struct net_device_ops { @@ -713,6 +719,10 @@ struct net_device_ops { int (*ndo_fcoe_get_wwn)(struct net_device *dev, u64 *wwn, int type); #endif +#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE) + int (*ndo_mp_port_prep)(struct net_device *dev, + struct mp_port *port); +#endif }; /* -- 1.7.3 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v12 06/17] Use a callback to handle skb_release_data() specially.
From: Xin Xiaohui xiaohui@intel.com If the buffer is external, use the callback to destruct the buffers. Signed-off-by: Xin Xiaohui xiaohui@intel.com Signed-off-by: Zhao Yu yzhao81...@gmail.com Reviewed-by: Jeff Dike jd...@linux.intel.com --- include/linux/skbuff.h |3 ++- net/core/skbuff.c |8 2 files changed, 10 insertions(+), 1 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 74af06c..ab29675 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -197,10 +197,11 @@ struct skb_shared_info { union skb_shared_tx tx_flags; struct sk_buff *frag_list; struct skb_shared_hwtstamps hwtstamps; - skb_frag_t frags[MAX_SKB_FRAGS]; /* Intermediate layers must ensure that destructor_arg * remains valid until skb destructor */ void * destructor_arg; + + skb_frag_t frags[MAX_SKB_FRAGS]; }; /* The structure is for a skb which pages may point to diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 93c4e06..117d82b 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -217,6 +217,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, shinfo->gso_type = 0; shinfo->ip6_frag_id = 0; shinfo->tx_flags.flags = 0; + shinfo->destructor_arg = NULL; skb_frag_list_init(skb); memset(&shinfo->hwtstamps, 0, sizeof(shinfo->hwtstamps)); @@ -350,6 +351,13 @@ static void skb_release_data(struct sk_buff *skb) if (skb_has_frags(skb)) skb_drop_fraglist(skb); + if (skb->dev && dev_is_mpassthru(skb->dev)) { + struct skb_ext_page *ext_page = + skb_shinfo(skb)->destructor_arg; + if (ext_page && ext_page->dtor) + ext_page->dtor(ext_page); + } + kfree(skb->head); } } -- 1.7.3 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v11 17/17] Add two new ioctls for mp device.
From: Xin Xiaohui xiaohui@intel.com Michael, So here, current might be different from mp->user: many processes might share an fd. The result will be that you will subtract locked_vm from A but add it to B. The right thing to do IMO is to store mm on SET_MEM_LOCKED. Also be careful about multiple callers etc. + locked = limit + current->mm->locked_vm; + lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; + + if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) { + up_write(&current->mm->mmap_sem); + mp_put(mfile); + return -ENOMEM; + } + current->mm->locked_vm = locked; + up_write(&current->mm->mmap_sem); + + mutex_lock(&mp_mutex); + mp->ctor->locked_pages = limit; What if a process calls SET_MEM_LOCKED multiple times (or many processes do)? How about the patch that follows to fix this? What if it is called when some pages are already locked? Though some pages are already locked when the ioctl is called, I think it's not so critical: we can still set the wanted limit in ctor->locked_pages and check against the new limit with ctor->cur_pages. Maybe a few more pages are locked than the limit then, but not too many, and the rlimit is still useful after that. Or is there something I have missed here?
--- drivers/vhost/mpassthru.c | 34 -- include/linux/mpassthru.h |4 ++-- 2 files changed, 22 insertions(+), 16 deletions(-) diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c index fc2a073..0965804 100644 --- a/drivers/vhost/mpassthru.c +++ b/drivers/vhost/mpassthru.c @@ -101,6 +101,7 @@ struct page_ctor { /* record the locked pages */ int locked_pages; int cur_pages; + unsigned long orig_locked_vm; struct net_device *dev; struct mpassthru_port port; struct page_info**hash_table; @@ -111,7 +112,7 @@ struct mp_struct { struct net_device *dev; struct page_ctor*ctor; struct socket socket; - struct task_struct *user; + struct mm_struct*mm; }; struct mp_file { @@ -222,7 +223,7 @@ static int page_ctor_attach(struct mp_struct *mp) ctor-port.hash = mp_lookup; ctor-locked_pages = 0; ctor-cur_pages = 0; - + ctor-orig_locked_vm = 0; /* locked by mp_mutex */ dev-mp_port = ctor-port; mp-ctor = ctor; @@ -316,7 +317,6 @@ static int page_ctor_detach(struct mp_struct *mp) { struct page_ctor *ctor; struct page_info *info; - struct task_struct *tsk = mp-user; int i; /* locked by mp_mutex */ @@ -335,9 +335,9 @@ static int page_ctor_detach(struct mp_struct *mp) relinquish_resource(ctor); - down_write(tsk-mm-mmap_sem); - tsk-mm-locked_vm -= ctor-locked_pages; - up_write(tsk-mm-mmap_sem); + down_write(mp-mm-mmap_sem); + mp-mm-locked_vm = ctor-orig_locked_vm; + up_write(mp-mm-mmap_sem); /* locked by mp_mutex */ ctor-dev-mp_port = NULL; @@ -1104,7 +1104,7 @@ static long mp_chr_ioctl(struct file *file, unsigned int cmd, goto err_dev_put; } mp-dev = dev; - mp-user = current; + mp-mm = get_task_mm(current); ret = -ENOMEM; sk = sk_alloc(mfile-net, AF_UNSPEC, GFP_KERNEL, mp_proto); @@ -1154,21 +1154,27 @@ err_dev_put: mp = mp_get(mfile); if (!mp) return -ENODEV; - + mutex_lock(mp_mutex); + if (mp-mm != current-mm) { + mutex_unlock(mp_mutex); + return -EPERM; + } limit = PAGE_ALIGN(limit) PAGE_SHIFT; - down_write(current-mm-mmap_sem); - locked = limit + current-mm-locked_vm; + 
down_write(mp-mm-mmap_sem); + if (!mp-ctor-locked_pages) + mp-ctor-orig_locked_vm = mp-mm-locked_vm; + locked = limit + mp-ctor-orig_locked_vm; lock_limit = rlimit(RLIMIT_MEMLOCK) PAGE_SHIFT; if ((locked lock_limit) !capable(CAP_IPC_LOCK)) { - up_write(current-mm-mmap_sem); + up_write(mp-mm-mmap_sem); + mutex_unlock(mp_mutex); mp_put(mfile); return -ENOMEM; } - current-mm-locked_vm = locked; - up_write(current-mm-mmap_sem); + mp-mm-locked_vm = locked; + up_write(mp-mm-mmap_sem); - mutex_lock(mp_mutex); mp-ctor-locked_pages = limit; mutex_unlock(mp_mutex); -- 1.7.3 -- To unsubscribe from this list: send
[PATCH v11 05/17] Add a function to indicate if a device uses external buffers.
From: Xin Xiaohui xiaohui@intel.com Signed-off-by: Xin Xiaohui xiaohui@intel.com Signed-off-by: Zhao Yu yzhao81...@gmail.com Reviewed-by: Jeff Dike jd...@linux.intel.com --- include/linux/netdevice.h |5 + 1 files changed, 5 insertions(+), 0 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 5f192de..23d6ec0 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1602,6 +1602,11 @@ extern gro_result_t napi_gro_frags(struct napi_struct *napi); extern int netdev_mp_port_prep(struct net_device *dev, struct mpassthru_port *port); +static inline bool dev_is_mpassthru(struct net_device *dev) +{ + return dev && dev->mp_port; +} + static inline void napi_free_frags(struct napi_struct *napi) { kfree_skb(napi->skb); -- 1.7.3 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v11 13/17] Add mp (mediate passthru) device.
From: Xin Xiaohui xiaohui@intel.com The patch adds the mp (mediate passthru) device, which is now based on the vhost-net backend driver and provides proto_ops to send/receive guest buffer data from/to the guest virtio-net driver. Signed-off-by: Xin Xiaohui xiaohui@intel.com Signed-off-by: Zhao Yu yzhao81...@gmail.com Reviewed-by: Jeff Dike jd...@linux.intel.com --- drivers/vhost/mpassthru.c | 1407 + 1 files changed, 1407 insertions(+), 0 deletions(-) create mode 100644 drivers/vhost/mpassthru.c diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c new file mode 100644 index 000..d86d94c --- /dev/null +++ b/drivers/vhost/mpassthru.c @@ -0,0 +1,1407 @@ +/* + * MPASSTHRU - Mediate passthrough device. + * Copyright (C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details.
+ * + */ + +#define DRV_NAMEmpassthru +#define DRV_DESCRIPTION Mediate passthru device driver +#define DRV_COPYRIGHT (C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G + +#include linux/compat.h +#include linux/module.h +#include linux/errno.h +#include linux/kernel.h +#include linux/major.h +#include linux/slab.h +#include linux/smp_lock.h +#include linux/poll.h +#include linux/fcntl.h +#include linux/init.h +#include linux/aio.h + +#include linux/skbuff.h +#include linux/netdevice.h +#include linux/etherdevice.h +#include linux/miscdevice.h +#include linux/ethtool.h +#include linux/rtnetlink.h +#include linux/if.h +#include linux/if_arp.h +#include linux/if_ether.h +#include linux/crc32.h +#include linux/nsproxy.h +#include linux/uaccess.h +#include linux/virtio_net.h +#include linux/mpassthru.h +#include net/net_namespace.h +#include net/netns/generic.h +#include net/rtnetlink.h +#include net/sock.h + +#include asm/system.h + +/* Uncomment to enable debugging */ +/* #define MPASSTHRU_DEBUG 1 */ + +#ifdef MPASSTHRU_DEBUG +static int debug; + +#define DBG if (mp-debug) printk +#define DBG1 if (debug == 2) printk +#else +#define DBG(a...) +#define DBG1(a...) +#endif + +#define COPY_THRESHOLD (L1_CACHE_BYTES * 4) +#define COPY_HDR_LEN (L1_CACHE_BYTES 64 ? 64 : L1_CACHE_BYTES) + +struct frag { + u16 offset; + u16 size; +}; + +#defineHASH_BUCKETS(8192*2) + +struct page_info { + struct list_headlist; + struct page_info*next; + struct page_info*prev; + struct page *pages[MAX_SKB_FRAGS]; + struct sk_buff *skb; + struct page_ctor*ctor; + + /* The pointer relayed to skb, to indicate +* it's a external allocated skb or kernel +*/ + struct skb_ext_pageext_page; + +#define INFO_READ 0 +#define INFO_WRITE 1 + unsignedflags; + unsignedpnum; + + /* The fields after that is for backend +* driver, now for vhost-net. 
+*/ + + struct kiocb*iocb; + unsigned intdesc_pos; + struct iovechdr[2]; + struct ioveciov[MAX_SKB_FRAGS]; +}; + +static struct kmem_cache *ext_page_info_cache; + +struct page_ctor { + struct list_headreadq; + int wq_len; + int rq_len; + spinlock_t read_lock; + /* record the locked pages */ + int lock_pages; + struct rlimit o_rlim; + struct net_device *dev; + struct mpassthru_port port; + struct page_info**hash_table; +}; + +struct mp_struct { + struct mp_file *mfile; + struct net_device *dev; + struct page_ctor*ctor; + struct socket socket; + +#ifdef MPASSTHRU_DEBUG + int debug; +#endif +}; + +struct mp_file { + atomic_t count; + struct mp_struct *mp; + struct net *net; +}; + +struct mp_sock { + struct sock sk; + struct mp_struct*mp; +}; + +static int mp_dev_change_flags(struct net_device *dev, unsigned flags) +{ + int ret = 0; + + rtnl_lock(); + ret = dev_change_flags(dev, flags); + rtnl_unlock(); + + if (ret 0) + printk(KERN_ERR failed to change dev state of %s, dev-name); + + return ret; +} + +/* The main function to allocate external buffers */ +static struct skb_ext_page *page_ctor(struct mpassthru_port *port, +
[PATCH v11 12/17] Add a Kconfig entry and Makefile entry for the mp device.
From: Xin Xiaohui xiaohui@intel.com Signed-off-by: Xin Xiaohui xiaohui@intel.com Reviewed-by: Jeff Dike jd...@linux.intel.com --- drivers/vhost/Kconfig | 10 ++ drivers/vhost/Makefile |2 ++ 2 files changed, 12 insertions(+), 0 deletions(-) diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig index e4e2fd1..a6b8cbf 100644 --- a/drivers/vhost/Kconfig +++ b/drivers/vhost/Kconfig @@ -9,3 +9,13 @@ config VHOST_NET To compile this driver as a module, choose M here: the module will be called vhost_net. +config MEDIATE_PASSTHRU + tristate "mediate passthru network driver (EXPERIMENTAL)" + depends on VHOST_NET + ---help--- + Zero-copy network I/O support; we call it mediate passthru to + distinguish it from hardware passthru. + + To compile this driver as a module, choose M here: the module will + be called mpassthru. + diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile index 72dd020..c18b9fc 100644 --- a/drivers/vhost/Makefile +++ b/drivers/vhost/Makefile @@ -1,2 +1,4 @@ obj-$(CONFIG_VHOST_NET) += vhost_net.o vhost_net-y := vhost.o net.o + +obj-$(CONFIG_MEDIATE_PASSTHRU) += mpassthru.o -- 1.7.3 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v11 14/17] Provides multiple submits and asynchronous notifications.
From: Xin Xiaohui xiaohui@intel.com

The vhost-net backend currently supports only synchronous send/recv
operations. This patch provides multiple submits and asynchronous
notification; this is needed for the zero-copy case.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
---
 drivers/vhost/net.c   |  341 +
 drivers/vhost/vhost.c |   79
 drivers/vhost/vhost.h |   15 ++
 3 files changed, 407 insertions(+), 28 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index b38abc6..44f4b15 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -24,6 +24,8 @@
 #include <linux/if_arp.h>
 #include <linux/if_tun.h>
 #include <linux/if_macvlan.h>
+#include <linux/mpassthru.h>
+#include <linux/aio.h>

 #include <net/sock.h>

@@ -39,6 +41,8 @@ enum {
 	VHOST_NET_VQ_MAX = 2,
 };

+static struct kmem_cache *notify_cache;
+
 enum vhost_net_poll_state {
 	VHOST_NET_POLL_DISABLED = 0,
 	VHOST_NET_POLL_STARTED = 1,
@@ -49,6 +53,7 @@ struct vhost_net {
 	struct vhost_dev dev;
 	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
 	struct vhost_poll poll[VHOST_NET_VQ_MAX];
+	struct kmem_cache *cache;
 	/* Tells us whether we are polling a socket for TX.
 	 * We only do this when socket buffer fills up.
 	 * Protected by tx vq lock. */
@@ -93,11 +98,183 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
 	net->tx_poll_state = VHOST_NET_POLL_STARTED;
 }

+struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
+{
+	struct kiocb *iocb = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vq->notify_lock, flags);
+	if (!list_empty(&vq->notifier)) {
+		iocb = list_first_entry(&vq->notifier,
+				struct kiocb, ki_list);
+		list_del(&iocb->ki_list);
+	}
+	spin_unlock_irqrestore(&vq->notify_lock, flags);
+	return iocb;
+}
+
+static void handle_iocb(struct kiocb *iocb)
+{
+	struct vhost_virtqueue *vq = iocb->private;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vq->notify_lock, flags);
+	list_add_tail(&iocb->ki_list, &vq->notifier);
+	spin_unlock_irqrestore(&vq->notify_lock, flags);
+}
+
+static int is_async_vq(struct vhost_virtqueue *vq)
+{
+	return (vq->link_state == VHOST_VQ_LINK_ASYNC);
+}
+
+static void handle_async_rx_events_notify(struct vhost_net *net,
+					  struct vhost_virtqueue *vq,
+					  struct socket *sock)
+{
+	struct kiocb *iocb = NULL;
+	struct vhost_log *vq_log = NULL;
+	int rx_total_len = 0;
+	unsigned int head, log, in, out;
+	int size;
+
+	if (!is_async_vq(vq))
+		return;
+
+	if (sock->sk->sk_data_ready)
+		sock->sk->sk_data_ready(sock->sk, 0);
+
+	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
+		vq->log : NULL;
+
+	while ((iocb = notify_dequeue(vq)) != NULL) {
+		if (!iocb->ki_left) {
+			vhost_add_used_and_signal(&net->dev, vq,
+					iocb->ki_pos, iocb->ki_nbytes);
+			size = iocb->ki_nbytes;
+			head = iocb->ki_pos;
+			rx_total_len += iocb->ki_nbytes;
+
+			if (iocb->ki_dtor)
+				iocb->ki_dtor(iocb);
+			kmem_cache_free(net->cache, iocb);
+
+			/* when log is enabled, recomputing the log is needed,
+			 * since these buffers are in async queue, may not get
+			 * the log info before.
+			 */
+			if (unlikely(vq_log)) {
+				if (!log)
+					__vhost_get_desc(&net->dev, vq, vq->iov,
+							ARRAY_SIZE(vq->iov),
+							&out, &in, vq_log,
+							&log, head);
+				vhost_log_write(vq, vq_log, log, size);
+			}
+			if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
+				vhost_poll_queue(&vq->poll);
+				break;
+			}
+		} else {
+			int i = 0;
+			int count = iocb->ki_left;
+			int hc = count;
+			while (count--) {
+				if (iocb) {
+					vq->heads[i].id = iocb->ki_pos;
+					vq->heads[i].len = iocb->ki_nbytes;
+					size = iocb->ki_nbytes;
+					head = iocb->ki_pos;
+					rx_total_len +=
[PATCH v11 17/17] Add two new ioctls for mp device.
From: Xin Xiaohui xiaohui@intel.com

The patch adds two ioctls for the mp device. One lets userspace query
how much memory must be locked to make the mp device run smoothly; the
other lets userspace set how much memory it really wants locked.
---
 drivers/vhost/mpassthru.c |  103 +++--
 include/linux/mpassthru.h |    2 +
 2 files changed, 54 insertions(+), 51 deletions(-)

diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
index d86d94c..e3a0199 100644
--- a/drivers/vhost/mpassthru.c
+++ b/drivers/vhost/mpassthru.c
@@ -67,6 +67,8 @@ static int debug;
 #define COPY_THRESHOLD (L1_CACHE_BYTES * 4)
 #define COPY_HDR_LEN   (L1_CACHE_BYTES < 64 ? 64 : L1_CACHE_BYTES)

+#define DEFAULT_NEED   ((8192*2*2)*4096)
+
 struct frag {
 	u16 offset;
 	u16 size;
@@ -110,7 +112,8 @@ struct page_ctor {
 	int rq_len;
 	spinlock_t read_lock;
 	/* record the locked pages */
-	int lock_pages;
+	int locked_pages;
+	int cur_pages;
 	struct rlimit o_rlim;
 	struct net_device *dev;
 	struct mpassthru_port port;
@@ -122,6 +125,7 @@ struct mp_struct {
 	struct net_device *dev;
 	struct page_ctor *ctor;
 	struct socket socket;
+	struct task_struct *user;

 #ifdef MPASSTHRU_DEBUG
 	int debug;
@@ -231,7 +235,8 @@ static int page_ctor_attach(struct mp_struct *mp)
 	ctor->port.ctor = page_ctor;
 	ctor->port.sock = &mp->socket;
 	ctor->port.hash = mp_lookup;
-	ctor->lock_pages = 0;
+	ctor->locked_pages = 0;
+	ctor->cur_pages = 0;

 	/* locked by mp_mutex */
 	dev->mp_port = &ctor->port;
@@ -264,37 +269,6 @@ struct page_info *info_dequeue(struct page_ctor *ctor)
 	return info;
 }

-static int set_memlock_rlimit(struct page_ctor *ctor, int resource,
-			      unsigned long cur, unsigned long max)
-{
-	struct rlimit new_rlim, *old_rlim;
-	int retval;
-
-	if (resource != RLIMIT_MEMLOCK)
-		return -EINVAL;
-	new_rlim.rlim_cur = cur;
-	new_rlim.rlim_max = max;
-
-	old_rlim = current->signal->rlim + resource;
-
-	/* remember the old rlimit value when backend enabled */
-	ctor->o_rlim.rlim_cur = old_rlim->rlim_cur;
-	ctor->o_rlim.rlim_max = old_rlim->rlim_max;
-
-	if ((new_rlim.rlim_max > old_rlim->rlim_max) &&
-	    !capable(CAP_SYS_RESOURCE))
-		return -EPERM;
-
-	retval = security_task_setrlimit(resource, &new_rlim);
-	if (retval)
-		return retval;
-
-	task_lock(current->group_leader);
-	*old_rlim = new_rlim;
-	task_unlock(current->group_leader);
-	return 0;
-}
-
 static void relinquish_resource(struct page_ctor *ctor)
 {
 	if (!(ctor->dev->flags & IFF_UP) &&
@@ -323,7 +297,7 @@ static void mp_ki_dtor(struct kiocb *iocb)
 	} else
 		info->ctor->wq_len--;
 	/* Decrement the number of locked pages */
-	info->ctor->lock_pages -= info->pnum;
+	info->ctor->cur_pages -= info->pnum;
 	kmem_cache_free(ext_page_info_cache, info);
 	relinquish_resource(info->ctor);
@@ -357,6 +331,7 @@ static int page_ctor_detach(struct mp_struct *mp)
 {
 	struct page_ctor *ctor;
 	struct page_info *info;
+	struct task_struct *tsk = mp->user;
 	int i;

 	/* locked by mp_mutex */
@@ -375,9 +350,9 @@ static int page_ctor_detach(struct mp_struct *mp)

 	relinquish_resource(ctor);

-	set_memlock_rlimit(ctor, RLIMIT_MEMLOCK,
-			   ctor->o_rlim.rlim_cur,
-			   ctor->o_rlim.rlim_max);
+	down_write(&tsk->mm->mmap_sem);
+	tsk->mm->locked_vm -= ctor->locked_pages;
+	up_write(&tsk->mm->mmap_sem);

 	/* locked by mp_mutex */
 	ctor->dev->mp_port = NULL;
@@ -514,7 +489,6 @@ static struct page_info *mp_hash_delete(struct page_ctor *ctor,
 {
 	key_mp_t key = mp_hash(info->pages[0], HASH_BUCKETS);
 	struct page_info *tmp = NULL;
-	int i;

 	tmp = ctor->hash_table[key];
 	while (tmp) {
@@ -565,14 +539,11 @@ static struct page_info *alloc_page_info(struct page_ctor *ctor,
 	int rc;
 	int i, j, n = 0;
 	int len;
-	unsigned long base, lock_limit;
+	unsigned long base;
 	struct page_info *info = NULL;

-	lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
-	lock_limit >>= PAGE_SHIFT;
-
-	if (ctor->lock_pages + count > lock_limit && npages) {
-		printk(KERN_INFO "exceed the locked memory rlimit.");
+	if (ctor->cur_pages + count > ctor->locked_pages) {
+		printk(KERN_INFO "Exceed memory lock rlimit.");
 		return NULL;
 	}

@@ -634,7 +605,7 @@ static struct page_info *alloc_page_info(struct page_ctor *ctor,
[PATCH v11 15/17] An example of how to modify a NIC driver to use the napi_gro_frags() interface
From: Xin Xiaohui xiaohui@intel.com

This example is based on the ixgbe driver. It provides the API
is_rx_buffer_mapped_as_page() to indicate whether the driver uses the
napi_gro_frags() interface or not. The example allocates 2 pages for
DMA per ring descriptor using netdev_alloc_page(). When packets come
in, napi_gro_frags() is used to allocate the skb and receive the
packets.
---
 drivers/net/ixgbe/ixgbe.h      |    3 +
 drivers/net/ixgbe/ixgbe_main.c |  151
 2 files changed, 125 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe.h b/drivers/net/ixgbe/ixgbe.h
index 79c35ae..fceffc5 100644
--- a/drivers/net/ixgbe/ixgbe.h
+++ b/drivers/net/ixgbe/ixgbe.h
@@ -131,6 +131,9 @@ struct ixgbe_rx_buffer {
 	struct page *page;
 	dma_addr_t page_dma;
 	unsigned int page_offset;
+	u16 mapped_as_page;
+	struct page *page_skb;
+	unsigned int page_skb_offset;
 };

 struct ixgbe_queue_stats {
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index 6c00ee4..905d6d2 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -688,6 +688,12 @@ static inline void ixgbe_release_rx_desc(struct ixgbe_hw *hw,
 	IXGBE_WRITE_REG(hw, IXGBE_RDT(rx_ring->reg_idx), val);
 }

+static bool is_rx_buffer_mapped_as_page(struct ixgbe_rx_buffer *bi,
+					struct net_device *dev)
+{
+	return true;
+}
+
 /**
  * ixgbe_alloc_rx_buffers - Replace used receive buffers; packet split
  * @adapter: address of board private structure
@@ -704,13 +710,17 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter,
 	i = rx_ring->next_to_use;
 	bi = &rx_ring->rx_buffer_info[i];
+
 	while (cleaned_count--) {
 		rx_desc = IXGBE_RX_DESC_ADV(*rx_ring, i);

+		bi->mapped_as_page =
+			is_rx_buffer_mapped_as_page(bi, adapter->netdev);
+
 		if (!bi->page_dma &&
 		    (rx_ring->flags & IXGBE_RING_RX_PS_ENABLED)) {
 			if (!bi->page) {
-				bi->page = alloc_page(GFP_ATOMIC);
+				bi->page = netdev_alloc_page(adapter->netdev);
 				if (!bi->page) {
 					adapter->alloc_rx_page_failed++;
 					goto no_buffers;
@@ -727,7 +737,7 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter,
 					    PCI_DMA_FROMDEVICE);
 		}

-		if (!bi->skb) {
+		if (!bi->mapped_as_page && !bi->skb) {
 			struct sk_buff *skb;
 			/* netdev_alloc_skb reserves 32 bytes up front!! */
 			uint bufsz = rx_ring->rx_buf_len + SMP_CACHE_BYTES;
@@ -747,6 +757,19 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter,
 					     rx_ring->rx_buf_len,
 					     PCI_DMA_FROMDEVICE);
 		}
+
+		if (bi->mapped_as_page && !bi->page_skb) {
+			bi->page_skb = netdev_alloc_page(adapter->netdev);
+			if (!bi->page_skb) {
+				adapter->alloc_rx_page_failed++;
+				goto no_buffers;
+			}
+			bi->page_skb_offset = 0;
+			bi->dma = pci_map_page(pdev, bi->page_skb,
+					       bi->page_skb_offset,
+					       (PAGE_SIZE / 2),
+					       PCI_DMA_FROMDEVICE);
+		}
 		/* Refresh the desc even if buffer_addrs didn't change because
 		 * each write-back erases this info. */
 		if (rx_ring->flags & IXGBE_RING_RX_PS_ENABLED) {
@@ -823,6 +846,13 @@ struct ixgbe_rsc_cb {
 	dma_addr_t dma;
 };

+static bool is_no_buffer(struct ixgbe_rx_buffer *rx_buffer_info)
+{
+	return (!rx_buffer_info->skb ||
+		!rx_buffer_info->page_skb) &&
+		!rx_buffer_info->page;
+}
+
 #define IXGBE_RSC_CB(skb) ((struct ixgbe_rsc_cb *)(skb)->cb)

 static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
@@ -832,6 +862,7 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 	struct ixgbe_adapter *adapter = q_vector->adapter;
 	struct net_device *netdev = adapter->netdev;
 	struct pci_dev *pdev = adapter->pdev;
+	struct napi_struct *napi = &q_vector->napi;
 	union ixgbe_adv_rx_desc *rx_desc, *next_rxd;
 	struct ixgbe_rx_buffer *rx_buffer_info, *next_buffer;
 	struct sk_buff *skb;
@@ -868,29 +899,71 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 			len = le16_to_cpu(rx_desc->wb.upper.length);
 		}

+		if
[PATCH v11 00/17] Provide a zero-copy method on KVM virtio-net.
We provide a zero-copy method by which the driver side may get external
buffers to DMA into. Here "external" means the driver doesn't use kernel
space to allocate skb buffers. Currently the external buffers come from
the guest virtio-net driver.

The idea is simple: pin the guest VM user space and then let the host
NIC driver have the chance to DMA to it directly. The patches are based
on the vhost-net backend driver. We add a device which provides
proto_ops (sendmsg/recvmsg) to vhost-net, to send/recv directly to/from
the NIC driver. A KVM guest using the vhost-net backend may bind any
ethX interface on the host side to get copyless data transfer through
the guest virtio-net frontend.

patch 01-10: net core and kernel changes.
patch 11-13: new device as an interface to manipulate external buffers.
patch 14:    for vhost-net.
patch 15:    an example of modifying a NIC driver to use napi_gro_frags().
patch 16:    an example of how to get guest buffers, based on a driver
             using napi_gro_frags().
patch 17:    a patch to address comments from Michael S. Tsirkin, adding
             2 new ioctls to the mp device. We split it out here to make
             review easier.

The guest virtio-net driver submits multiple requests through the
vhost-net backend driver to the kernel, and the requests are queued and
then completed after the corresponding actions in hardware are done.

For read, user-space buffers are dispensed to the NIC driver for rx when
a page constructor API is invoked; that is, NICs can allocate user
buffers from a page constructor. We add a hook in the
netif_receive_skb() function to intercept the incoming packets and
notify the zero-copy device.

For write, the zero-copy device may allocate a new host skb, put the
payload on skb_shinfo(skb)->frags, and copy the header to skb->data.
The request remains pending until the skb is transmitted by hardware.

We provide multiple submits and asynchronous notification to vhost-net
too.

Our goal is to improve the bandwidth and reduce the CPU usage. Exact
performance data will be provided later.

What we have not done yet:
	Performance tuning

what we have done in v1:
	polish the RCU usage
	deal with write logging in asynchronous mode in vhost
	add notifier block for mp device
	rename page_ctor to mp_port in netdevice.h to make it look generic
	add mp_dev_change_flags() for mp device to change NIC state
	add CONFIG_VHOST_MPASSTHRU to limit the usage when the module is not loaded
	a small fix for a missing dev_put on failure
	use a dynamic minor instead of a static minor number
	a __KERNEL__ protect for mp_get_sock()

what we have done in v2:
	remove most of the RCU usage, since the ctor pointer is only changed
	by the BIND/UNBIND ioctls, and during that time the NIC will be
	stopped to get a good cleanup (all outstanding requests are
	finished), so the ctor pointer cannot be raced into a wrong
	situation.
	Replace struct vhost_notifier with struct kiocb. Let the vhost-net
	backend alloc/free the kiocbs and transfer them via sendmsg/recvmsg.
	use get_user_pages_fast() and set_page_dirty_lock() when reading.
	Add some comments for netdev_mp_port_prep() and handle_mpassthru().

what we have done in v3:
	the async write logging is rewritten
	a drafted synchronous write function for qemu live migration
	a limit on locked pages from get_user_pages_fast() to prevent DoS,
	using RLIMIT_MEMLOCK

what we have done in v4:
	add an iocb completion callback from vhost-net to queue iocbs in the
	mp device
	replace vq->receiver by mp_sock_data_ready()
	remove stuff in the mp device which accessed structures from
	vhost-net
	modify skb_reserve() to ignore host NIC driver reserved space
	rebase to the latest vhost tree
	split large patches into small pieces, especially for the net core
	part.

what we have done in v5:
	address Arnd Bergmann's comments
		-remove the IFF_MPASSTHRU_EXCL flag in the mp device
		-Add the CONFIG_COMPAT macro
		-remove the mp_release ops
	move dev_is_mpassthru() to an inline function
	fix a bug in memory relinquish
	Apply to the current git (2.6.34-rc6) tree.

what we have done in v6:
	move create_iocb() out of page_dtor, which may happen in interrupt
	context
		-This removes the potential issue of a lock being taken in
		 interrupt context
	make the caches used by mp and vhost static, created/destroyed in
	the modules' init/exit functions.
		-This lets multiple mp guests be created at the same time.

what we have done in v7:
	some cleanup prepared to support PS mode

what we have done in v8:
	discard the modifications that pointed skb->data to the guest buffer
	directly.
	Add code to modify the driver to support napi_gro_frags(), following
	Herbert's comments.
	To support PS mode.
	Add mergeable
[PATCH v11 01/17] Add a new structure for skb buffer from external.
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/skbuff.h |    9 +
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 124f90c..74af06c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -203,6 +203,15 @@ struct skb_shared_info {
 	void *		destructor_arg;
 };

+/* The structure is for an skb whose pages may point to
+ * an external buffer, which is not allocated from kernel space.
+ * It also contains a destructor for itself.
+ */
+struct skb_ext_page {
+	struct page *page;
+	void	(*dtor)(struct skb_ext_page *);
+};
+
 /* We divide dataref into two halves. The higher 16 bits hold references
  * to the payload part of skb->data. The lower 16 bits hold references to
  * the entire skb->data. A clone of a headerless skb holds the length of
--
1.7.3
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v11 11/17] Add header file for mp device.
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/mpassthru.h |   25 +
 1 files changed, 25 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/mpassthru.h

diff --git a/include/linux/mpassthru.h b/include/linux/mpassthru.h
new file mode 100644
index 000..ba8f320
--- /dev/null
+++ b/include/linux/mpassthru.h
@@ -0,0 +1,25 @@
+#ifndef __MPASSTHRU_H
+#define __MPASSTHRU_H
+
+#include <linux/types.h>
+#include <linux/if_ether.h>
+
+/* ioctl defines */
+#define MPASSTHRU_BINDDEV	_IOW('M', 213, int)
+#define MPASSTHRU_UNBINDDEV	_IO('M', 214)
+
+#ifdef __KERNEL__
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+struct socket *mp_get_socket(struct file *);
+#else
+#include <linux/err.h>
+#include <linux/errno.h>
+struct file;
+struct socket;
+static inline struct socket *mp_get_socket(struct file *f)
+{
+	return ERR_PTR(-EINVAL);
+}
+#endif /* CONFIG_MEDIATE_PASSTHRU */
+#endif /* __KERNEL__ */
+#endif /* __MPASSTHRU_H */
--
1.7.3
[PATCH v11 10/17] Add a hook to intercept external buffers from NIC driver.
From: Xin Xiaohui xiaohui@intel.com

The hook is called in netif_receive_skb().

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 net/core/dev.c |   35 +++
 1 files changed, 35 insertions(+), 0 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 636f11b..4b379b1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2517,6 +2517,37 @@ err:
 EXPORT_SYMBOL(netdev_mp_port_prep);
 #endif

+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+/* Add a hook to intercept mediate passthru (zero-copy) packets,
+ * and insert them into the socket queue owned by the mp_port specially.
+ */
+static inline struct sk_buff *handle_mpassthru(struct sk_buff *skb,
+					       struct packet_type **pt_prev,
+					       int *ret,
+					       struct net_device *orig_dev)
+{
+	struct mpassthru_port *mp_port = NULL;
+	struct sock *sk = NULL;
+
+	if (!dev_is_mpassthru(skb->dev))
+		return skb;
+	mp_port = skb->dev->mp_port;
+
+	if (*pt_prev) {
+		*ret = deliver_skb(skb, *pt_prev, orig_dev);
+		*pt_prev = NULL;
+	}
+
+	sk = mp_port->sock->sk;
+	skb_queue_tail(&sk->sk_receive_queue, skb);
+	sk->sk_state_change(sk);
+
+	return NULL;
+}
+#else
+#define handle_mpassthru(skb, pt_prev, ret, orig_dev) (skb)
+#endif
+
 /**
  *	netif_receive_skb - process receive buffer from network
  *	@skb: buffer to process
@@ -2598,6 +2629,10 @@ int netif_receive_skb(struct sk_buff *skb)
 ncls:
 #endif

+	/* To intercept mediate passthru (zero-copy) packets here */
+	skb = handle_mpassthru(skb, &pt_prev, &ret, orig_dev);
+	if (!skb)
+		goto out;
 	skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
 	if (!skb)
 		goto out;
--
1.7.3