[PATCH net-next] virtio-net: fix build error when CONFIG_AVERAGE is not enabled
Commit ab7db91705e9 ("virtio-net: auto-tune mergeable rx buffer size for
improved performance") introduced a virtio-net dependency on EWMA.
The inclusion of EWMA is controlled by CONFIG_AVERAGE. Fix build error
when CONFIG_AVERAGE is not enabled by adding select AVERAGE to
virtio-net's Kconfig entry.

Build failure reported using config make ARCH=s390 defconfig.

Signed-off-by: Michael Dalton mwdal...@google.com
---
 drivers/net/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index b45b240..f342278 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -236,6 +236,7 @@ config VETH
 config VIRTIO_NET
 	tristate "Virtio network driver"
 	depends on VIRTIO
+	select AVERAGE
 	---help---
 	  This is the virtual network driver for virtio.  It can be used with
 	  lguest or QEMU based VMMs (like KVM or Xen).  Say Y or M.
-- 
1.8.5.2

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH net-next v3 1/5] net: allow > 0 order atomic page alloc in skb_page_frag_refill
skb_page_frag_refill currently permits only order-0 page allocs
unless GFP_WAIT is used. Change skb_page_frag_refill to attempt
higher-order page allocations whether or not GFP_WAIT is used. If
memory cannot be allocated, the allocator will fall back to
successively smaller page allocs (down to order-0 page allocs).

This change brings skb_page_frag_refill in line with the existing
page allocation strategy employed by netdev_alloc_frag, which attempts
higher-order page allocations whether or not GFP_WAIT is set, falling
back to successively lower-order page allocations on failure. Part
of migration of virtio-net to per-receive queue page frag allocators.

Acked-by: Michael S. Tsirkin m...@redhat.com
Acked-by: Eric Dumazet eduma...@google.com
Signed-off-by: Michael Dalton mwdal...@google.com
---
 net/core/sock.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 85ad6f0..b3f7ee3 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1836,9 +1836,7 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio)
 		put_page(pfrag->page);
 	}

-	/* We restrict high order allocations to users that can afford to wait */
-	order = (prio & __GFP_WAIT) ? SKB_FRAG_PAGE_ORDER : 0;
-
+	order = SKB_FRAG_PAGE_ORDER;
 	do {
 		gfp_t gfp = prio;

-- 
1.8.5.2
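
The fallback strategy described in this patch can be sketched in user-space C. This is only an illustration, not the kernel code: `MAX_FRAG_ORDER`, `try_alloc_order` and `frag_refill_order` are hypothetical names, and the page allocator is simulated by a threshold giving the largest order that currently succeeds.

```c
#include <assert.h>

#define MAX_FRAG_ORDER 3	/* hypothetical stand-in for SKB_FRAG_PAGE_ORDER */

/* Simulated allocator: succeeds only for orders <= largest_free_order. */
static int try_alloc_order(int order, int largest_free_order)
{
	assert(order >= 0);
	return order <= largest_free_order;
}

/*
 * Start at the highest order regardless of whether the caller may sleep,
 * then fall back to successively smaller orders.
 * Returns the order that succeeded, or -1 if even order 0 failed.
 */
static int frag_refill_order(int largest_free_order)
{
	int order;

	for (order = MAX_FRAG_ORDER; order >= 0; order--)
		if (try_alloc_order(order, largest_free_order))
			return order;
	return -1;
}
```

With plenty of free memory the sketch returns `MAX_FRAG_ORDER`; under fragmentation it degrades step by step, which is the behaviour the commit message attributes to netdev_alloc_frag.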
[PATCH net-next v3 3/5] virtio-net: auto-tune mergeable rx buffer size for improved performance
Commit 2613af0ed18a ("virtio_net: migrate mergeable rx buffers to page frag
allocators") changed the mergeable receive buffer size from PAGE_SIZE to
MTU-size, introducing a single-stream regression for benchmarks with large
average packet size. There is no single optimal buffer size for all
workloads. For workloads with packet size <= MTU bytes, MTU + virtio-net
header-sized buffers are preferred as larger buffers reduce the TCP window
due to SKB truesize. However, single-stream workloads with large average
packet sizes have higher throughput if larger (e.g., PAGE_SIZE) buffers
are used.

This commit auto-tunes the mergeable receiver buffer packet size by
choosing the packet buffer size based on an EWMA of the recent packet
sizes for the receive queue. Packet buffer sizes range from MTU_SIZE +
virtio-net header len to PAGE_SIZE. This improves throughput for
large packet workloads, as any workload with average packet size >=
PAGE_SIZE will use PAGE_SIZE buffers.

These optimizations interact positively with recent commit
ba275241030c ("virtio-net: coalesce rx frags when possible during rx"),
which coalesces adjacent RX SKB fragments in virtio_net. The coalescing
optimizations benefit buffers of any size.

Benchmarks taken from an average of 5 netperf 30-second TCP_STREAM runs
between two QEMU VMs on a single physical machine. Each VM has two VCPUs
with all offloads & vhost enabled. All VMs and vhost threads run in a
single 4 CPU cgroup cpuset, using cgroups to ensure that other processes
in the system will not be scheduled on the benchmark CPUs. Trunk includes
SKB rx frag coalescing.

net-next w/ virtio_net before 2613af0ed18a (PAGE_SIZE bufs): 14642.85Gb/s
net-next (MTU-size bufs):                                     13170.01Gb/s
net-next + auto-tune:                                         14555.94Gb/s

Jason Wang also reported a throughput increase on mlx4 from 22Gb/s
using MTU-sized buffers to about 26Gb/s using auto-tuning.

Signed-off-by: Michael Dalton mwdal...@google.com
---
v2->v3: Remove per-receive queue metadata ring. Encode packet buffer
        base address and truesize into an unsigned long by requiring a
        minimum packet size alignment of 256. Permit attempts to fill
        an already-full RX ring (reverting the change in v2).
v1->v2: Add per-receive queue metadata ring to track precise truesize for
        mergeable receive buffers. Remove all truesize approximation. Never
        try to fill a full RX ring (required for metadata ring in v2).

 drivers/net/virtio_net.c | 99
 1 file changed, 74 insertions(+), 25 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 36cbf06..3e82311 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -26,6 +26,7 @@
 #include <linux/if_vlan.h>
 #include <linux/slab.h>
 #include <linux/cpu.h>
+#include <linux/average.h>

 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
@@ -36,11 +37,18 @@ module_param(gso, bool, 0444);

 /* FIXME: MTU in config. */
 #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
-#define MERGE_BUFFER_LEN (ALIGN(GOOD_PACKET_LEN + \
-				sizeof(struct virtio_net_hdr_mrg_rxbuf), \
-				L1_CACHE_BYTES))
 #define GOOD_COPY_LEN	128

+/* Weight used for the RX packet size EWMA. The average packet size is used to
+ * determine the packet buffer size when refilling RX rings. As the entire RX
+ * ring may be refilled at once, the weight is chosen so that the EWMA will be
+ * insensitive to short-term, transient changes in packet size.
+ */
+#define RECEIVE_AVG_WEIGHT 64
+
+/* Minimum alignment for mergeable packet buffers. */
+#define MERGEABLE_BUFFER_ALIGN max(L1_CACHE_BYTES, 256)
+
 #define VIRTNET_DRIVER_VERSION "1.0.0"

 struct virtnet_stats {
@@ -78,6 +86,9 @@ struct receive_queue {
 	/* Chain pages by the private ptr. */
 	struct page *pages;

+	/* Average packet length for mergeable receive buffers. */
+	struct ewma mrg_avg_pkt_len;
+
 	/* Page frag for packet buffer allocation. */
 	struct page_frag alloc_frag;

@@ -219,6 +230,23 @@ static void skb_xmit_done(struct virtqueue *vq)
 	netif_wake_subqueue(vi->dev, vq2txq(vq));
 }

+static unsigned int mergeable_ctx_to_buf_truesize(unsigned long mrg_ctx)
+{
+	unsigned int truesize = mrg_ctx & (MERGEABLE_BUFFER_ALIGN - 1);
+	return truesize * MERGEABLE_BUFFER_ALIGN;
+}
+
+static void *mergeable_ctx_to_buf_address(unsigned long mrg_ctx)
+{
+	return (void *)(mrg_ctx & -MERGEABLE_BUFFER_ALIGN);
+
+}
+
+static unsigned long mergeable_buf_to_ctx(void *buf, unsigned int truesize)
+{
+	return (unsigned long)buf | (truesize / MERGEABLE_BUFFER_ALIGN);
+}
+
 /* Called from bottom half context */
 static struct sk_buff *page_to_skb(struct receive_queue *rq,
 				   struct page *page, unsigned int offset,
@@ -327,31 +355,33 @@ err:
 static struct sk_buff *receive_mergeable(struct net_device *dev
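
The context-word encoding introduced in v3 (buffer base address and truesize packed into one unsigned long, relying on 256-byte alignment) can be tried out in user space. The sketch below is an assumption-laden illustration: `BUF_ALIGN`, `buf_to_ctx`, `ctx_to_truesize` and `ctx_to_addr` are hypothetical names that mirror the helpers in the diff, and the alignment is fixed at 256 (the case where L1_CACHE_BYTES does not exceed 256).

```c
#include <assert.h>
#include <stdint.h>

#define BUF_ALIGN 256u	/* assumed value of MERGEABLE_BUFFER_ALIGN */

/*
 * Pack a BUF_ALIGN-aligned address and a truesize (a multiple of
 * BUF_ALIGN) into one word: the low bits of an aligned address are
 * zero, so they can carry truesize / BUF_ALIGN.
 */
static uintptr_t buf_to_ctx(uintptr_t addr, unsigned int truesize)
{
	assert(addr % BUF_ALIGN == 0);
	assert(truesize % BUF_ALIGN == 0);
	assert(truesize / BUF_ALIGN < BUF_ALIGN);	/* must fit in the low bits */
	return addr | (truesize / BUF_ALIGN);
}

/* Recover the truesize from the low bits. */
static unsigned int ctx_to_truesize(uintptr_t ctx)
{
	return (unsigned int)(ctx & (BUF_ALIGN - 1)) * BUF_ALIGN;
}

/* Recover the buffer address by masking the low bits off. */
static uintptr_t ctx_to_addr(uintptr_t ctx)
{
	return ctx & ~(uintptr_t)(BUF_ALIGN - 1);
}
```

For example, an address of 0x10000 with a truesize of 4096 encodes to 0x10010, since 4096 / 256 = 16 is stored in the low byte. The trade-off is that every buffer must be padded to a multiple of 256 bytes, in exchange for dropping the per-queue metadata ring used in v2.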
[PATCH net-next v3 4/5] net-sysfs: add support for device-specific rx queue sysfs attributes
Extend existing support for netdevice receive queue sysfs attributes to
permit a device-specific attribute group. Initial use case for this
support will be to allow the virtio-net device to export per-receive
queue mergeable receive buffer size.

Signed-off-by: Michael Dalton mwdal...@google.com
---
 include/linux/netdevice.h | 40
 net/core/dev.c            | 12 ++--
 net/core/net-sysfs.c      | 33 -
 3 files changed, 58 insertions(+), 27 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5c88ab1..71b8bc4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -668,15 +668,28 @@ extern struct rps_sock_flow_table __rcu *rps_sock_flow_table;
 bool rps_may_expire_flow(struct net_device *dev, u16 rxq_index, u32 flow_id,
 			 u16 filter_id);
 #endif
+#endif /* CONFIG_RPS */

 /* This structure contains an instance of an RX queue. */
 struct netdev_rx_queue {
+#ifdef CONFIG_RPS
 	struct rps_map __rcu		*rps_map;
 	struct rps_dev_flow_table __rcu	*rps_flow_table;
+#endif
 	struct kobject			kobj;
 	struct net_device		*dev;
 } ____cacheline_aligned_in_smp;
-#endif /* CONFIG_RPS */
+
+/*
+ * RX queue sysfs structures and functions.
+ */
+struct rx_queue_attribute {
+	struct attribute attr;
+	ssize_t (*show)(struct netdev_rx_queue *queue,
+	    struct rx_queue_attribute *attr, char *buf);
+	ssize_t (*store)(struct netdev_rx_queue *queue,
+	    struct rx_queue_attribute *attr, const char *buf, size_t len);
+};

 #ifdef CONFIG_XPS
 /*
@@ -1313,7 +1326,7 @@ struct net_device {
 						   unicast) */


-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 	struct netdev_rx_queue	*_rx;

 	/* Number of RX queues allocated at register_netdev() time */
@@ -1424,6 +1437,8 @@ struct net_device {
 	struct device		dev;
 	/* space for optional device, statistics, and wireless sysfs groups */
 	const struct attribute_group *sysfs_groups[4];
+	/* space for optional per-rx queue attributes */
+	const struct attribute_group *sysfs_rx_queue_group;

 	/* rtnetlink link ops */
 	const struct rtnl_link_ops *rtnl_link_ops;
@@ -2374,7 +2389,7 @@ static inline bool netif_is_multiqueue(const struct net_device *dev)

 int netif_set_real_num_tx_queues(struct net_device *dev, unsigned int txq);

-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 int netif_set_real_num_rx_queues(struct net_device *dev, unsigned int rxq);
 #else
 static inline int netif_set_real_num_rx_queues(struct net_device *dev,
@@ -2393,7 +2408,7 @@ static inline int netif_copy_real_num_queues(struct net_device *to_dev,
 					   from_dev->real_num_tx_queues);
 	if (err)
 		return err;
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 	return netif_set_real_num_rx_queues(to_dev,
 					    from_dev->real_num_rx_queues);
 #else
@@ -2401,6 +2416,23 @@ static inline int netif_copy_real_num_queues(struct net_device *to_dev,
 #endif
 }

+#ifdef CONFIG_SYSFS
+static inline unsigned int get_netdev_rx_queue_index(
+		struct netdev_rx_queue *queue)
+{
+	struct net_device *dev = queue->dev;
+	int i;
+
+	for (i = 0; i < dev->num_rx_queues; i++)
+		if (queue == &dev->_rx[i])
+			break;
+
+	BUG_ON(i >= dev->num_rx_queues);
+
+	return i;
+}
+#endif
+
 #define DEFAULT_MAX_NUM_RSS_QUEUES	(8)
 int netif_get_num_default_rss_queues(void);

diff --git a/net/core/dev.c b/net/core/dev.c
index 20c834e..4be7931 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2080,7 +2080,7 @@ int netif_set_real_num_tx_queues(struct net_device *dev, unsigned int txq)
 }
 EXPORT_SYMBOL(netif_set_real_num_tx_queues);

-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 /**
  *	netif_set_real_num_rx_queues - set actual number of RX queues used
  *	@dev: Network device
@@ -5727,7 +5727,7 @@ void netif_stacked_transfer_operstate(const struct net_device *rootdev,
 }
 EXPORT_SYMBOL(netif_stacked_transfer_operstate);

-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 static int netif_alloc_rx_queues(struct net_device *dev)
 {
 	unsigned int i, count = dev->num_rx_queues;
@@ -6272,7 +6272,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 		return NULL;
 	}

-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 	if (rxqs < 1) {
 		pr_err("alloc_netdev: Unable to allocate device with zero RX queues\n");
 		return NULL;
@@ -6328,7 +6328,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	if (netif_alloc_netdev_queues(dev))
 		goto free_all;

-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 	dev->num_rx_queues = rxqs;
 	dev
[PATCH net-next v3 5/5] virtio-net: initial rx sysfs support, export mergeable rx buffer size
Add initial support for per-rx queue sysfs attributes to virtio-net. If
mergeable packet buffers are enabled, adds a read-only mergeable packet
buffer size sysfs attribute for each RX queue.

Signed-off-by: Michael Dalton mwdal...@google.com
---
 drivers/net/virtio_net.c | 66 +---
 1 file changed, 62 insertions(+), 4 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 3e82311..f315cbb 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -27,6 +27,7 @@
 #include <linux/slab.h>
 #include <linux/cpu.h>
 #include <linux/average.h>
+#include <linux/seqlock.h>

 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
@@ -89,6 +90,12 @@ struct receive_queue {
 	/* Average packet length for mergeable receive buffers. */
 	struct ewma mrg_avg_pkt_len;

+	/* Sequence counter to allow sysfs readers to safely access stats.
+	 * Assumes a single virtio-net writer, which is enforced by virtio-net
+	 * and NAPI.
+	 */
+	seqcount_t sysfs_seq;
+
 	/* Page frag for packet buffer allocation. */
 	struct page_frag alloc_frag;

@@ -416,7 +423,9 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 		}
 	}

+	write_seqcount_begin(&rq->sysfs_seq);
 	ewma_add(&rq->mrg_avg_pkt_len, head_skb->len);
+	write_seqcount_end(&rq->sysfs_seq);
 	return head_skb;

 err_skb:
@@ -604,18 +613,29 @@ static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
 	return err;
 }

-static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
+static unsigned int get_mergeable_buf_len(struct ewma *avg_pkt_len)
 {
 	const size_t hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+	unsigned int len;
+
+	len = hdr_len + clamp_t(unsigned int, ewma_read(avg_pkt_len),
+			GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
+	return ALIGN(len, MERGEABLE_BUFFER_ALIGN);
+}
+
+static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
+{
 	struct page_frag *alloc_frag = &rq->alloc_frag;
 	char *buf;
 	unsigned long ctx;
 	int err;
 	unsigned int len, hole;

-	len = hdr_len + clamp_t(unsigned int, ewma_read(&rq->mrg_avg_pkt_len),
-				GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
-	len = ALIGN(len, MERGEABLE_BUFFER_ALIGN);
+	/* avg_pkt_len is written only in NAPI rx softirq context. We may
+	 * read avg_pkt_len without using the sysfs_seq seqcount, as this code
+	 * is called only in NAPI rx softirq context or when NAPI is disabled.
+	 */
+	len = get_mergeable_buf_len(&rq->mrg_avg_pkt_len);
 	if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp)))
 		return -ENOMEM;

@@ -1557,6 +1577,7 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
 			       napi_weight);

 		sg_init_table(vi->rq[i].sg, ARRAY_SIZE(vi->rq[i].sg));
+		seqcount_init(&vi->rq[i].sysfs_seq);
 		ewma_init(&vi->rq[i].mrg_avg_pkt_len, 1, RECEIVE_AVG_WEIGHT);
 		sg_init_table(vi->sq[i].sg, ARRAY_SIZE(vi->sq[i].sg));
 	}
@@ -1594,6 +1615,39 @@ err:
 	return ret;
 }

+#ifdef CONFIG_SYSFS
+static ssize_t mergeable_rx_buffer_size_show(struct netdev_rx_queue *queue,
+		struct rx_queue_attribute *attribute, char *buf)
+{
+	struct virtnet_info *vi = netdev_priv(queue->dev);
+	unsigned int queue_index = get_netdev_rx_queue_index(queue);
+	struct receive_queue *rq;
+	struct ewma avg;
+	unsigned int start;
+
+	BUG_ON(queue_index >= vi->max_queue_pairs);
+	rq = &vi->rq[queue_index];
+	do {
+		start = read_seqcount_begin(&rq->sysfs_seq);
+		avg = rq->mrg_avg_pkt_len;
+	} while (read_seqcount_retry(&rq->sysfs_seq, start));
+	return sprintf(buf, "%u\n", get_mergeable_buf_len(&avg));
+}
+
+static struct rx_queue_attribute mergeable_rx_buffer_size_attribute =
+	__ATTR_RO(mergeable_rx_buffer_size);
+
+static struct attribute *virtio_net_mrg_rx_attrs[] = {
+	&mergeable_rx_buffer_size_attribute.attr,
+	NULL
+};
+
+static const struct attribute_group virtio_net_mrg_rx_group = {
+	.name = "virtio_net",
+	.attrs = virtio_net_mrg_rx_attrs
+};
+#endif
+
 static int virtnet_probe(struct virtio_device *vdev)
 {
 	int i, err;
@@ -1708,6 +1762,10 @@ static int virtnet_probe(struct virtio_device *vdev)
 	if (err)
 		goto free_stats;

+#ifdef CONFIG_SYSFS
+	if (vi->mergeable_rx_bufs)
+		dev->sysfs_rx_queue_group = &virtio_net_mrg_rx_group;
+#endif
 	netif_set_real_num_tx_queues(dev, vi->curr_queue_pairs);
 	netif_set_real_num_rx_queues(dev, vi->curr_queue_pairs);

-- 
1.8.5.2
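
The single-writer sequence counter pattern used by this patch can be approximated in user space with C11 atomics. This is a rough analogue, not the kernel's seqcount_t API: `seq_stats`, `seq_stats_write` and `seq_stats_read` are hypothetical names, the two protected fields are invented for illustration, and default (sequentially consistent) atomics are used for simplicity rather than the kernel's barriers.

```c
#include <assert.h>
#include <stdatomic.h>

/* Two fields that must be observed together, guarded by a sequence counter. */
struct seq_stats {
	atomic_uint seq;	/* even: stable, odd: update in progress */
	atomic_ulong internal;	/* hypothetical averaged value */
	atomic_ulong samples;	/* hypothetical update count */
};

struct stats_snapshot {
	unsigned long internal;
	unsigned long samples;
};

static void seq_stats_init(struct seq_stats *s)
{
	atomic_init(&s->seq, 0);
	atomic_init(&s->internal, 0);
	atomic_init(&s->samples, 0);
}

/* Single writer only: bump to odd, update both fields, bump back to even. */
static void seq_stats_write(struct seq_stats *s, unsigned long internal)
{
	assert((atomic_load(&s->seq) & 1) == 0);	/* no update already in flight */
	atomic_fetch_add(&s->seq, 1);
	atomic_store(&s->internal, internal);
	atomic_fetch_add(&s->samples, 1);
	atomic_fetch_add(&s->seq, 1);
}

/* Reader: retry until the counter is even and unchanged across the copy. */
static struct stats_snapshot seq_stats_read(struct seq_stats *s)
{
	struct stats_snapshot snap;
	unsigned int start;

	do {
		start = atomic_load(&s->seq);
		snap.internal = atomic_load(&s->internal);
		snap.samples = atomic_load(&s->samples);
	} while ((start & 1) || atomic_load(&s->seq) != start);
	return snap;
}
```

Readers never block the writer, which is why the pattern suits a statistic updated in the receive path and read only occasionally through sysfs. The sketch depends on there being exactly one writer, the same assumption the patch comment states for virtio-net and NAPI.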
Re: [PATCH net-next v3 5/5] virtio-net: initial rx sysfs support, export mergeable rx buffer size
Sorry, just realized - I think disabling NAPI is necessary but not
sufficient. There is also the issue that refill_work() could be
scheduled. If refill_work() executes, it will re-enable NAPI. We'd need
to cancel the vi->refill delayed work to prevent this AFAICT, and also
ensure that no other function re-schedules vi->refill or re-enables NAPI
(virtnet_open/close, virtnet_set_queues, and virtnet_freeze/restore).

How is the following sequence of operations:

rtnl_lock();
cancel_delayed_work_sync(&vi->refill);
napi_disable(&rq->napi);
read rq->mrg_avg_pkt_len
virtnet_enable_napi();
rtnl_unlock();

Additionally, if we disable NAPI when reading this file, perhaps
the permissions should be changed to 400 so that an unprivileged
user cannot temporarily disable network RX processing by reading these
sysfs files. Does that sound reasonable?

Best,

Mike
Re: [PATCH net-next v3 4/5] net-sysfs: add support for device-specific rx queue sysfs attributes
On Jan 16, 2014 at 10:57 AM, Ben Hutchings bhutchi...@solarflare.com wrote:
> Why write a loop when you can do:
>
>	i = queue - dev->_rx;

Good point, the loop approach was done in get_netdev_queue_index -- I agree
your fix is faster and simpler. I'll fix in next patchset. Thanks!

Best,

Mike
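
For readers comparing the two approaches discussed here, a small user-space sketch follows. The struct and function names (`rx_queue`, `device`, `queue_index_loop`, `queue_index_sub`) are hypothetical stand-ins for the kernel types, chosen only to show that pointer subtraction yields the same index as the scan when the queue pointer lies inside the array.

```c
#include <assert.h>
#include <stddef.h>

struct rx_queue { int dummy; };

struct device {
	struct rx_queue *rx;	/* array of num_rx queues */
	unsigned int num_rx;
};

/* Loop version: O(n) scan comparing element addresses. */
static unsigned int queue_index_loop(const struct device *dev,
				     const struct rx_queue *q)
{
	unsigned int i;

	for (i = 0; i < dev->num_rx; i++)
		if (q == &dev->rx[i])
			break;
	assert(i < dev->num_rx);
	return i;
}

/* Pointer-difference version: O(1), valid only if q points into dev->rx. */
static unsigned int queue_index_sub(const struct device *dev,
				    const struct rx_queue *q)
{
	ptrdiff_t index = q - dev->rx;

	assert(index >= 0 && (size_t)index < dev->num_rx);
	return (unsigned int)index;
}
```

Both functions agree for any element of the array; the subtraction form simply lets the compiler divide the byte offset by the element size instead of walking the array.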
[PATCH net-next v4 2/6] virtio-net: use per-receive queue page frag alloc for mergeable bufs
The virtio-net driver currently uses netdev_alloc_frag() for GFP_ATOMIC
mergeable rx buffer allocations. This commit migrates virtio-net to use
per-receive queue page frags for GFP_ATOMIC allocation. This change unifies
mergeable rx buffer memory allocation, which now will use skb_refill_frag()
for both atomic and GFP-WAIT buffer allocations.

To address fragmentation concerns, if after buffer allocation there
is too little space left in the page frag to allocate a subsequent
buffer, the remaining space is added to the current allocated buffer
so that the remaining space can be used to store packet data.

Signed-off-by: Michael Dalton mwdal...@google.com
---
v1->v2: Use GFP_COLD for RX buffer allocations (as in netdev_alloc_frag()).
        Remove per-netdev GFP_KERNEL page_frag allocator.

 drivers/net/virtio_net.c | 69
 1 file changed, 35 insertions(+), 34 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 7b17240..36cbf06 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -78,6 +78,9 @@ struct receive_queue {
 	/* Chain pages by the private ptr. */
 	struct page *pages;

+	/* Page frag for packet buffer allocation. */
+	struct page_frag alloc_frag;
+
 	/* RX: fragments + linear part + virtio header */
 	struct scatterlist sg[MAX_SKB_FRAGS + 2];

@@ -126,11 +129,6 @@ struct virtnet_info {
 	/* Lock for config space updates */
 	struct mutex config_lock;

-	/* Page_frag for GFP_KERNEL packet buffer allocation when we run
-	 * low on memory.
-	 */
-	struct page_frag alloc_frag;
-
 	/* Does the affinity hint is set for virtqueues? */
 	bool affinity_hint_set;

@@ -336,8 +334,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 	int num_buf = hdr->mhdr.num_buffers;
 	struct page *page = virt_to_head_page(buf);
 	int offset = buf - page_address(page);
-	struct sk_buff *head_skb = page_to_skb(rq, page, offset, len,
-					       MERGE_BUFFER_LEN);
+	unsigned int truesize = max_t(unsigned int, len, MERGE_BUFFER_LEN);
+	struct sk_buff *head_skb = page_to_skb(rq, page, offset, len, truesize);
 	struct sk_buff *curr_skb = head_skb;

 	if (unlikely(!curr_skb))
@@ -353,11 +351,6 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 			dev->stats.rx_length_errors++;
 			goto err_buf;
 		}
-		if (unlikely(len > MERGE_BUFFER_LEN)) {
-			pr_debug("%s: rx error: merge buffer too long\n",
-				 dev->name);
-			len = MERGE_BUFFER_LEN;
-		}

 		page = virt_to_head_page(buf);
 		--rq->num;
@@ -376,19 +369,20 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 			head_skb->truesize += nskb->truesize;
 			num_skb_frags = 0;
 		}
+		truesize = max_t(unsigned int, len, MERGE_BUFFER_LEN);
 		if (curr_skb != head_skb) {
 			head_skb->data_len += len;
 			head_skb->len += len;
-			head_skb->truesize += MERGE_BUFFER_LEN;
+			head_skb->truesize += truesize;
 		}
 		offset = buf - page_address(page);
 		if (skb_can_coalesce(curr_skb, num_skb_frags, page, offset)) {
 			put_page(page);
 			skb_coalesce_rx_frag(curr_skb, num_skb_frags - 1,
-					     len, MERGE_BUFFER_LEN);
+					     len, truesize);
 		} else {
 			skb_add_rx_frag(curr_skb, num_skb_frags, page,
-					offset, len, MERGE_BUFFER_LEN);
+					offset, len, truesize);
 		}
 	}

@@ -578,25 +572,24 @@ static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)

 static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
 {
-	struct virtnet_info *vi = rq->vq->vdev->priv;
-	char *buf = NULL;
+	struct page_frag *alloc_frag = &rq->alloc_frag;
+	char *buf;
 	int err;
+	unsigned int len, hole;

-	if (gfp & __GFP_WAIT) {
-		if (skb_page_frag_refill(MERGE_BUFFER_LEN, &vi->alloc_frag,
-					 gfp)) {
-			buf = (char *)page_address(vi->alloc_frag.page) +
-			      vi->alloc_frag.offset;
-			get_page(vi->alloc_frag.page);
-			vi->alloc_frag.offset += MERGE_BUFFER_LEN;
-		}
-	} else {
-		buf = netdev_alloc_frag(MERGE_BUFFER_LEN);
-	}
-	if (!buf)
+	if (unlikely
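
The diff above is cut off before the hole-handling logic appears, so the following user-space sketch is based only on the description in the commit message. `struct frag` and `frag_carve` are hypothetical names; the point is the rule that leftover space too small for another buffer gets folded into the current one.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for a page fragment: total size and next free offset. */
struct frag {
	size_t size;
	size_t offset;
};

/*
 * Carve a buffer of buf_len bytes out of the fragment. If the space left
 * afterwards could not hold another buf_len-sized buffer, fold that hole
 * into this buffer so it can still hold packet data.
 * Returns the length actually handed out.
 */
static size_t frag_carve(struct frag *f, size_t buf_len)
{
	size_t len = buf_len;
	size_t hole;

	assert(f->offset + buf_len <= f->size);
	f->offset += buf_len;
	hole = f->size - f->offset;
	if (hole < buf_len) {
		len += hole;
		f->offset += hole;
	}
	return len;
}
```

With a 4096-byte fragment and 1536-byte buffers, the first call hands out 1536 bytes and the second hands out 2560, absorbing the 1024-byte tail that would otherwise be wasted.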
[PATCH net-next v4 3/6] virtio-net: auto-tune mergeable rx buffer size for improved performance
Commit 2613af0ed18a ("virtio_net: migrate mergeable rx buffers to page frag
allocators") changed the mergeable receive buffer size from PAGE_SIZE to
MTU-size, introducing a single-stream regression for benchmarks with large
average packet size. There is no single optimal buffer size for all
workloads. For workloads with packet size <= MTU bytes, MTU + virtio-net
header-sized buffers are preferred as larger buffers reduce the TCP window
due to SKB truesize. However, single-stream workloads with large average
packet sizes have higher throughput if larger (e.g., PAGE_SIZE) buffers
are used.

This commit auto-tunes the mergeable receiver buffer packet size by
choosing the packet buffer size based on an EWMA of the recent packet
sizes for the receive queue. Packet buffer sizes range from MTU_SIZE +
virtio-net header len to PAGE_SIZE. This improves throughput for
large packet workloads, as any workload with average packet size >=
PAGE_SIZE will use PAGE_SIZE buffers.

These optimizations interact positively with recent commit
ba275241030c ("virtio-net: coalesce rx frags when possible during rx"),
which coalesces adjacent RX SKB fragments in virtio_net. The coalescing
optimizations benefit buffers of any size.

Benchmarks taken from an average of 5 netperf 30-second TCP_STREAM runs
between two QEMU VMs on a single physical machine. Each VM has two VCPUs
with all offloads & vhost enabled. All VMs and vhost threads run in a
single 4 CPU cgroup cpuset, using cgroups to ensure that other processes
in the system will not be scheduled on the benchmark CPUs. Trunk includes
SKB rx frag coalescing.

net-next w/ virtio_net before 2613af0ed18a (PAGE_SIZE bufs): 14642.85Gb/s
net-next (MTU-size bufs):                                     13170.01Gb/s
net-next + auto-tune:                                         14555.94Gb/s

Jason Wang also reported a throughput increase on mlx4 from 22Gb/s
using MTU-sized buffers to about 26Gb/s using auto-tuning.

Signed-off-by: Michael Dalton mwdal...@google.com
---
v2->v3: Remove per-receive queue metadata ring. Encode packet buffer
        base address and truesize into an unsigned long by requiring a
        minimum packet size alignment of 256. Permit attempts to fill
        an already-full RX ring (reverting the change in v2).
v1->v2: Add per-receive queue metadata ring to track precise truesize for
        mergeable receive buffers. Remove all truesize approximation. Never
        try to fill a full RX ring (required for metadata ring in v2).

 drivers/net/virtio_net.c | 99
 1 file changed, 74 insertions(+), 25 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 36cbf06..3e82311 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -26,6 +26,7 @@
 #include <linux/if_vlan.h>
 #include <linux/slab.h>
 #include <linux/cpu.h>
+#include <linux/average.h>

 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
@@ -36,11 +37,18 @@ module_param(gso, bool, 0444);

 /* FIXME: MTU in config. */
 #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
-#define MERGE_BUFFER_LEN (ALIGN(GOOD_PACKET_LEN + \
-				sizeof(struct virtio_net_hdr_mrg_rxbuf), \
-				L1_CACHE_BYTES))
 #define GOOD_COPY_LEN	128

+/* Weight used for the RX packet size EWMA. The average packet size is used to
+ * determine the packet buffer size when refilling RX rings. As the entire RX
+ * ring may be refilled at once, the weight is chosen so that the EWMA will be
+ * insensitive to short-term, transient changes in packet size.
+ */
+#define RECEIVE_AVG_WEIGHT 64
+
+/* Minimum alignment for mergeable packet buffers. */
+#define MERGEABLE_BUFFER_ALIGN max(L1_CACHE_BYTES, 256)
+
 #define VIRTNET_DRIVER_VERSION "1.0.0"

 struct virtnet_stats {
@@ -78,6 +86,9 @@ struct receive_queue {
 	/* Chain pages by the private ptr. */
 	struct page *pages;

+	/* Average packet length for mergeable receive buffers. */
+	struct ewma mrg_avg_pkt_len;
+
 	/* Page frag for packet buffer allocation. */
 	struct page_frag alloc_frag;

@@ -219,6 +230,23 @@ static void skb_xmit_done(struct virtqueue *vq)
 	netif_wake_subqueue(vi->dev, vq2txq(vq));
 }

+static unsigned int mergeable_ctx_to_buf_truesize(unsigned long mrg_ctx)
+{
+	unsigned int truesize = mrg_ctx & (MERGEABLE_BUFFER_ALIGN - 1);
+	return truesize * MERGEABLE_BUFFER_ALIGN;
+}
+
+static void *mergeable_ctx_to_buf_address(unsigned long mrg_ctx)
+{
+	return (void *)(mrg_ctx & -MERGEABLE_BUFFER_ALIGN);
+
+}
+
+static unsigned long mergeable_buf_to_ctx(void *buf, unsigned int truesize)
+{
+	return (unsigned long)buf | (truesize / MERGEABLE_BUFFER_ALIGN);
+}
+
 /* Called from bottom half context */
 static struct sk_buff *page_to_skb(struct receive_queue *rq,
 				   struct page *page, unsigned int offset,
@@ -327,31 +355,33 @@ err:
 static struct sk_buff *receive_mergeable(struct net_device *dev
[PATCH net-next v4 1/6] net: allow > 0 order atomic page alloc in skb_page_frag_refill
skb_page_frag_refill currently permits only order-0 page allocs
unless GFP_WAIT is used. Change skb_page_frag_refill to attempt
higher-order page allocations whether or not GFP_WAIT is used. If
memory cannot be allocated, the allocator will fall back to
successively smaller page allocs (down to order-0 page allocs).

This change brings skb_page_frag_refill in line with the existing
page allocation strategy employed by netdev_alloc_frag, which attempts
higher-order page allocations whether or not GFP_WAIT is set, falling
back to successively lower-order page allocations on failure. Part
of migration of virtio-net to per-receive queue page frag allocators.

Acked-by: Michael S. Tsirkin m...@redhat.com
Acked-by: Eric Dumazet eduma...@google.com
Signed-off-by: Michael Dalton mwdal...@google.com
---
 net/core/sock.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 85ad6f0..b3f7ee3 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1836,9 +1836,7 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio)
 		put_page(pfrag->page);
 	}

-	/* We restrict high order allocations to users that can afford to wait */
-	order = (prio & __GFP_WAIT) ? SKB_FRAG_PAGE_ORDER : 0;
-
+	order = SKB_FRAG_PAGE_ORDER;
 	do {
 		gfp_t gfp = prio;

-- 
1.8.5.2
[PATCH net-next v4 5/6] lib: Ensure EWMA does not store wrong intermediate values
To ensure ewma_read() without a lock returns a valid but possibly
out of date average, modify ewma_add() by using ACCESS_ONCE to prevent
intermediate wrong values from being written to avg->internal.

Suggested-by: Eric Dumazet eric.duma...@gmail.com
Signed-off-by: Michael Dalton mwdal...@google.com
---
 lib/average.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/lib/average.c b/lib/average.c
index 99a67e6..114d1be 100644
--- a/lib/average.c
+++ b/lib/average.c
@@ -53,8 +53,10 @@ EXPORT_SYMBOL(ewma_init);
  */
 struct ewma *ewma_add(struct ewma *avg, unsigned long val)
 {
-	avg->internal = avg->internal  ?
-		(((avg->internal << avg->weight) - avg->internal) +
+	unsigned long internal = ACCESS_ONCE(avg->internal);
+
+	ACCESS_ONCE(avg->internal) = internal ?
+		(((internal << avg->weight) - internal) +
 			(val << avg->factor)) >> avg->weight :
 		(val << avg->factor);
 	return avg;
-- 
1.8.5.2
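
The "one load, one store" shape of the patched ewma_add() can be reproduced in user space to check the arithmetic. This sketch makes several assumptions: `ewma_sketch` and its helpers are hypothetical names, the init function takes the log2 of the factor and weight directly (the kernel's ewma_init() takes the plain values), and the volatile casts only model the intent of ACCESS_ONCE; they are not a portable replacement for C11 atomics.

```c
#include <assert.h>

struct ewma_sketch {
	unsigned long internal;	/* scaled running average */
	unsigned long factor;	/* log2 of the scaling factor */
	unsigned long weight;	/* log2 of the averaging weight */
};

/* Force exactly one load or one store of an unsigned long. */
#define LOAD_ONCE_UL(x)		(*(volatile unsigned long *)&(x))
#define STORE_ONCE_UL(x, v)	(*(volatile unsigned long *)&(x) = (v))

static void ewma_sketch_init(struct ewma_sketch *avg,
			     unsigned long factor_log2,
			     unsigned long weight_log2)
{
	assert(weight_log2 < sizeof(unsigned long) * 8);
	avg->internal = 0;
	avg->factor = factor_log2;
	avg->weight = weight_log2;
}

/*
 * Compute the new average from a private copy and publish it with a
 * single store, so a lockless reader sees either the old or the new
 * value and never a partially computed one.
 */
static void ewma_sketch_add(struct ewma_sketch *avg, unsigned long val)
{
	unsigned long internal = LOAD_ONCE_UL(avg->internal);
	unsigned long next = internal ?
		(((internal << avg->weight) - internal) +
			(val << avg->factor)) >> avg->weight :
		(val << avg->factor);

	STORE_ONCE_UL(avg->internal, next);
}

static unsigned long ewma_sketch_read(struct ewma_sketch *avg)
{
	return LOAD_ONCE_UL(avg->internal) >> avg->factor;
}
```

With factor 1 and weight 64 (log2 values 0 and 6, matching the virtio-net call in this series), feeding 1500 twice keeps the average at 1500, and a following 4096-byte sample moves it only to 1540, which shows how slowly the chosen weight reacts to a single outlier.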
[PATCH net-next v4 4/6] net-sysfs: add support for device-specific rx queue sysfs attributes
Extend existing support for netdevice receive queue sysfs attributes to
permit a device-specific attribute group. Initial use case for this
support will be to allow the virtio-net device to export per-receive
queue mergeable receive buffer size.

Signed-off-by: Michael Dalton mwdal...@google.com
---
v3->v4: Simplify by removing loop in get_netdev_rx_queue_index.

 include/linux/netdevice.h | 35 +++
 net/core/dev.c            | 12 ++--
 net/core/net-sysfs.c      | 33 -
 3 files changed, 53 insertions(+), 27 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5c88ab1..38929bc 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -668,15 +668,28 @@ extern struct rps_sock_flow_table __rcu *rps_sock_flow_table;
 bool rps_may_expire_flow(struct net_device *dev, u16 rxq_index, u32 flow_id,
 			 u16 filter_id);
 #endif
+#endif /* CONFIG_RPS */

 /* This structure contains an instance of an RX queue. */
 struct netdev_rx_queue {
+#ifdef CONFIG_RPS
 	struct rps_map __rcu		*rps_map;
 	struct rps_dev_flow_table __rcu	*rps_flow_table;
+#endif
 	struct kobject			kobj;
 	struct net_device		*dev;
 } ____cacheline_aligned_in_smp;
-#endif /* CONFIG_RPS */
+
+/*
+ * RX queue sysfs structures and functions.
+ */
+struct rx_queue_attribute {
+	struct attribute attr;
+	ssize_t (*show)(struct netdev_rx_queue *queue,
+	    struct rx_queue_attribute *attr, char *buf);
+	ssize_t (*store)(struct netdev_rx_queue *queue,
+	    struct rx_queue_attribute *attr, const char *buf, size_t len);
+};

 #ifdef CONFIG_XPS
 /*
@@ -1313,7 +1326,7 @@ struct net_device {
 						   unicast) */


-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 	struct netdev_rx_queue	*_rx;

 	/* Number of RX queues allocated at register_netdev() time */
@@ -1424,6 +1437,8 @@ struct net_device {
 	struct device		dev;
 	/* space for optional device, statistics, and wireless sysfs groups */
 	const struct attribute_group *sysfs_groups[4];
+	/* space for optional per-rx queue attributes */
+	const struct attribute_group *sysfs_rx_queue_group;

 	/* rtnetlink link ops */
 	const struct rtnl_link_ops *rtnl_link_ops;
@@ -2374,7 +2389,7 @@ static inline bool netif_is_multiqueue(const struct net_device *dev)

 int netif_set_real_num_tx_queues(struct net_device *dev, unsigned int txq);

-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 int netif_set_real_num_rx_queues(struct net_device *dev, unsigned int rxq);
 #else
 static inline int netif_set_real_num_rx_queues(struct net_device *dev,
@@ -2393,7 +2408,7 @@ static inline int netif_copy_real_num_queues(struct net_device *to_dev,
 					   from_dev->real_num_tx_queues);
 	if (err)
 		return err;
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 	return netif_set_real_num_rx_queues(to_dev,
 					    from_dev->real_num_rx_queues);
 #else
@@ -2401,6 +2416,18 @@ static inline int netif_copy_real_num_queues(struct net_device *to_dev,
 #endif
 }

+#ifdef CONFIG_SYSFS
+static inline unsigned int get_netdev_rx_queue_index(
+		struct netdev_rx_queue *queue)
+{
+	struct net_device *dev = queue->dev;
+	int index = queue - dev->_rx;
+
+	BUG_ON(index >= dev->num_rx_queues);
+	return index;
+}
+#endif
+
 #define DEFAULT_MAX_NUM_RSS_QUEUES	(8)
 int netif_get_num_default_rss_queues(void);

diff --git a/net/core/dev.c b/net/core/dev.c
index 20c834e..4be7931 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2080,7 +2080,7 @@ int netif_set_real_num_tx_queues(struct net_device *dev, unsigned int txq)
 }
 EXPORT_SYMBOL(netif_set_real_num_tx_queues);

-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 /**
  *	netif_set_real_num_rx_queues - set actual number of RX queues used
  *	@dev: Network device
@@ -5727,7 +5727,7 @@ void netif_stacked_transfer_operstate(const struct net_device *rootdev,
 }
 EXPORT_SYMBOL(netif_stacked_transfer_operstate);

-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 static int netif_alloc_rx_queues(struct net_device *dev)
 {
 	unsigned int i, count = dev->num_rx_queues;
@@ -6272,7 +6272,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 		return NULL;
 	}

-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 	if (rxqs < 1) {
 		pr_err("alloc_netdev: Unable to allocate device with zero RX queues\n");
 		return NULL;
@@ -6328,7 +6328,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	if (netif_alloc_netdev_queues(dev))
 		goto free_all;

-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 	dev->num_rx_queues = rxqs;
 	dev->real_num_rx_queues = rxqs
[PATCH net-next v4 6/6] virtio-net: initial rx sysfs support, export mergeable rx buffer size
Add initial support for per-rx queue sysfs attributes to virtio-net. If
mergeable packet buffers are enabled, adds a read-only mergeable packet
buffer size sysfs attribute for each RX queue.

Suggested-by: Michael S. Tsirkin <m...@redhat.com>
Signed-off-by: Michael Dalton <mwdal...@google.com>
---
v3->v4: Remove seqcount due to EWMA changes in patch 5.
        Add missing Suggested-By.

 drivers/net/virtio_net.c | 46
 1 file changed, 42 insertions(+), 4 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 3e82311..968eacd 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -604,18 +604,25 @@ static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
 	return err;
 }
 
-static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
+static unsigned int get_mergeable_buf_len(struct ewma *avg_pkt_len)
 {
 	const size_t hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+	unsigned int len;
+
+	len = hdr_len + clamp_t(unsigned int, ewma_read(avg_pkt_len),
+			GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
+	return ALIGN(len, MERGEABLE_BUFFER_ALIGN);
+}
+
+static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
+{
 	struct page_frag *alloc_frag = &rq->alloc_frag;
 	char *buf;
 	unsigned long ctx;
 	int err;
 	unsigned int len, hole;
 
-	len = hdr_len + clamp_t(unsigned int, ewma_read(&rq->mrg_avg_pkt_len),
-				GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
-	len = ALIGN(len, MERGEABLE_BUFFER_ALIGN);
+	len = get_mergeable_buf_len(&rq->mrg_avg_pkt_len);
 	if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp)))
 		return -ENOMEM;
 
@@ -1594,6 +1601,33 @@ err:
 	return ret;
 }
 
+#ifdef CONFIG_SYSFS
+static ssize_t mergeable_rx_buffer_size_show(struct netdev_rx_queue *queue,
+		struct rx_queue_attribute *attribute, char *buf)
+{
+	struct virtnet_info *vi = netdev_priv(queue->dev);
+	unsigned int queue_index = get_netdev_rx_queue_index(queue);
+	struct ewma *avg;
+
+	BUG_ON(queue_index >= vi->max_queue_pairs);
+	avg = &vi->rq[queue_index].mrg_avg_pkt_len;
+	return sprintf(buf, "%u\n", get_mergeable_buf_len(avg));
+}
+
+static struct rx_queue_attribute mergeable_rx_buffer_size_attribute =
+	__ATTR_RO(mergeable_rx_buffer_size);
+
+static struct attribute *virtio_net_mrg_rx_attrs[] = {
+	&mergeable_rx_buffer_size_attribute.attr,
+	NULL
+};
+
+static const struct attribute_group virtio_net_mrg_rx_group = {
+	.name = "virtio_net",
+	.attrs = virtio_net_mrg_rx_attrs
+};
+#endif
+
 static int virtnet_probe(struct virtio_device *vdev)
 {
 	int i, err;
@@ -1708,6 +1742,10 @@ static int virtnet_probe(struct virtio_device *vdev)
 	if (err)
 		goto free_stats;
 
+#ifdef CONFIG_SYSFS
+	if (vi->mergeable_rx_bufs)
+		dev->sysfs_rx_queue_group = &virtio_net_mrg_rx_group;
+#endif
 	netif_set_real_num_tx_queues(dev, vi->curr_queue_pairs);
 	netif_set_real_num_rx_queues(dev, vi->curr_queue_pairs);
 
-- 
1.8.5.2
Re: [PATCH net-next v4 1/6] net: allow > 0 order atomic page alloc in skb_page_frag_refill
On Thu, Jan 16, 2014 at 3:30 PM, David Miller <da...@davemloft.net> wrote:
> Actually, I reverted, please resubmit this series with the following
> build warning corrected:

Thanks David, I will send out another patchset shortly with the warning
resolved and a header e-mail (and one other sysfs group fix that I just
found in the same file). Sorry I didn't include a header e-mail initially.

Best,

Mike
[PATCH net-next v5 2/6] virtio-net: use per-receive queue page frag alloc for mergeable bufs
The virtio-net driver currently uses netdev_alloc_frag() for GFP_ATOMIC
mergeable rx buffer allocations. This commit migrates virtio-net to use
per-receive queue page frags for GFP_ATOMIC allocation. This change unifies
mergeable rx buffer memory allocation, which now will use skb_refill_frag()
for both atomic and GFP-WAIT buffer allocations.

To address fragmentation concerns, if after buffer allocation there
is too little space left in the page frag to allocate a subsequent
buffer, the remaining space is added to the current allocated buffer
so that the remaining space can be used to store packet data.

Acked-by: Michael S. Tsirkin <m...@redhat.com>
Signed-off-by: Michael Dalton <mwdal...@google.com>
---
v1->v2: Use GFP_COLD for RX buffer allocations (as in netdev_alloc_frag()).
        Remove per-netdev GFP_KERNEL page_frag allocator.

 drivers/net/virtio_net.c | 69
 1 file changed, 35 insertions(+), 34 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 7b17240..36cbf06 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -78,6 +78,9 @@ struct receive_queue {
 	/* Chain pages by the private ptr. */
 	struct page *pages;
 
+	/* Page frag for packet buffer allocation. */
+	struct page_frag alloc_frag;
+
 	/* RX: fragments + linear part + virtio header */
 	struct scatterlist sg[MAX_SKB_FRAGS + 2];
 
@@ -126,11 +129,6 @@ struct virtnet_info {
 	/* Lock for config space updates */
 	struct mutex config_lock;
 
-	/* Page_frag for GFP_KERNEL packet buffer allocation when we run
-	 * low on memory.
-	 */
-	struct page_frag alloc_frag;
-
 	/* Does the affinity hint is set for virtqueues? */
 	bool affinity_hint_set;
 
@@ -336,8 +334,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 	int num_buf = hdr->mhdr.num_buffers;
 	struct page *page = virt_to_head_page(buf);
 	int offset = buf - page_address(page);
-	struct sk_buff *head_skb = page_to_skb(rq, page, offset, len,
-					       MERGE_BUFFER_LEN);
+	unsigned int truesize = max_t(unsigned int, len, MERGE_BUFFER_LEN);
+	struct sk_buff *head_skb = page_to_skb(rq, page, offset, len, truesize);
 	struct sk_buff *curr_skb = head_skb;
 
 	if (unlikely(!curr_skb))
@@ -353,11 +351,6 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 			dev->stats.rx_length_errors++;
 			goto err_buf;
 		}
-		if (unlikely(len > MERGE_BUFFER_LEN)) {
-			pr_debug("%s: rx error: merge buffer too long\n",
-				 dev->name);
-			len = MERGE_BUFFER_LEN;
-		}
 
 		page = virt_to_head_page(buf);
 		--rq->num;
@@ -376,19 +369,20 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 			head_skb->truesize += nskb->truesize;
 			num_skb_frags = 0;
 		}
+		truesize = max_t(unsigned int, len, MERGE_BUFFER_LEN);
 		if (curr_skb != head_skb) {
 			head_skb->data_len += len;
 			head_skb->len += len;
-			head_skb->truesize += MERGE_BUFFER_LEN;
+			head_skb->truesize += truesize;
 		}
 		offset = buf - page_address(page);
 		if (skb_can_coalesce(curr_skb, num_skb_frags, page, offset)) {
 			put_page(page);
 			skb_coalesce_rx_frag(curr_skb, num_skb_frags - 1,
-					     len, MERGE_BUFFER_LEN);
+					     len, truesize);
 		} else {
 			skb_add_rx_frag(curr_skb, num_skb_frags, page,
-					offset, len, MERGE_BUFFER_LEN);
+					offset, len, truesize);
 		}
 	}
 
@@ -578,25 +572,24 @@ static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
 
 static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
 {
-	struct virtnet_info *vi = rq->vq->vdev->priv;
-	char *buf = NULL;
+	struct page_frag *alloc_frag = &rq->alloc_frag;
+	char *buf;
 	int err;
+	unsigned int len, hole;
 
-	if (gfp & __GFP_WAIT) {
-		if (skb_page_frag_refill(MERGE_BUFFER_LEN, &vi->alloc_frag,
-					 gfp)) {
-			buf = (char *)page_address(vi->alloc_frag.page) +
-			      vi->alloc_frag.offset;
-			get_page(vi->alloc_frag.page);
-			vi->alloc_frag.offset += MERGE_BUFFER_LEN;
-		}
-	} else {
-		buf = netdev_alloc_frag(MERGE_BUFFER_LEN
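
The fragmentation rule described in the commit message above (fold the leftover space into the current buffer when it is too small for another one) can be sketched in user space as follows. This is only an illustration under stated assumptions: struct sketch_frag and sketch_carve() are hypothetical names, the frag is modelled as a plain offset/size pair, and the refill step is left to the caller.

```c
/* Hypothetical stand-in for a page frag: 'size' bytes total,
 * of which the first 'offset' bytes are already handed out. */
struct sketch_frag {
	unsigned int offset;
	unsigned int size;
};

/*
 * Carve a buffer of at least 'len' bytes out of the frag.
 * Returns the length actually assigned to the buffer, or 0 if the frag
 * cannot hold 'len' bytes (a real caller would refill the frag first).
 * The buffer's starting offset is stored in *buf_offset.
 */
static unsigned int sketch_carve(struct sketch_frag *frag, unsigned int len,
				 unsigned int *buf_offset)
{
	unsigned int hole;

	if (frag->size - frag->offset < len)
		return 0;

	*buf_offset = frag->offset;
	frag->offset += len;

	/* If the space left over cannot fit another buffer of this size,
	 * give it to the current buffer instead of wasting it. */
	hole = frag->size - frag->offset;
	if (hole < len) {
		len += hole;
		frag->offset += hole;
	}
	return len;
}
```

With a 4096-byte frag and 1536-byte requests, the first buffer gets 1536 bytes, the second absorbs the 1024-byte remainder and gets 2560 bytes, and the third request fails until the frag is refilled.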
[PATCH net-next v5 3/6] virtio-net: auto-tune mergeable rx buffer size for improved performance
Commit 2613af0ed18a ("virtio_net: migrate mergeable rx buffers to page frag
allocators") changed the mergeable receive buffer size from PAGE_SIZE to
MTU-size, introducing a single-stream regression for benchmarks with large
average packet size. There is no single optimal buffer size for all
workloads. For workloads with packet size <= MTU bytes, MTU + virtio-net
header-sized buffers are preferred as larger buffers reduce the TCP window
due to SKB truesize. However, single-stream workloads with large average
packet sizes have higher throughput if larger (e.g., PAGE_SIZE) buffers
are used.

This commit auto-tunes the mergeable receiver buffer packet size by
choosing the packet buffer size based on an EWMA of the recent packet
sizes for the receive queue. Packet buffer sizes range from MTU_SIZE +
virtio-net header len to PAGE_SIZE. This improves throughput for
large packet workloads, as any workload with average packet size >=
PAGE_SIZE will use PAGE_SIZE buffers.

These optimizations interact positively with recent commit
ba275241030c ("virtio-net: coalesce rx frags when possible during rx"),
which coalesces adjacent RX SKB fragments in virtio_net. The coalescing
optimizations benefit buffers of any size.

Benchmarks taken from an average of 5 netperf 30-second TCP_STREAM runs
between two QEMU VMs on a single physical machine. Each VM has two VCPUs
with all offloads & vhost enabled. All VMs and vhost threads run in a
single 4 CPU cgroup cpuset, using cgroups to ensure that other processes
in the system will not be scheduled on the benchmark CPUs. Trunk includes
SKB rx frag coalescing.

net-next w/ virtio_net before 2613af0ed18a (PAGE_SIZE bufs): 14642.85Gb/s
net-next (MTU-size bufs):  13170.01Gb/s
net-next + auto-tune: 14555.94Gb/s

Jason Wang also reported a throughput increase on mlx4 from 22Gb/s
using MTU-sized buffers to about 26Gb/s using auto-tuning.

Acked-by: Michael S. Tsirkin <m...@redhat.com>
Signed-off-by: Michael Dalton <mwdal...@google.com>
---
v2->v3: Remove per-receive queue metadata ring. Encode packet buffer
        base address and truesize into an unsigned long by requiring a
        minimum packet size alignment of 256. Permit attempts to fill
        an already-full RX ring (reverting the change in v2).
v1->v2: Add per-receive queue metadata ring to track precise truesize for
        mergeable receive buffers. Remove all truesize approximation. Never
        try to fill a full RX ring (required for metadata ring in v2).

 drivers/net/virtio_net.c | 99
 1 file changed, 74 insertions(+), 25 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 36cbf06..3e82311 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -26,6 +26,7 @@
 #include <linux/if_vlan.h>
 #include <linux/slab.h>
 #include <linux/cpu.h>
+#include <linux/average.h>
 
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
@@ -36,11 +37,18 @@ module_param(gso, bool, 0444);
 
 /* FIXME: MTU in config. */
 #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
-#define MERGE_BUFFER_LEN (ALIGN(GOOD_PACKET_LEN + \
-                                sizeof(struct virtio_net_hdr_mrg_rxbuf), \
-                                L1_CACHE_BYTES))
 #define GOOD_COPY_LEN	128
 
+/* Weight used for the RX packet size EWMA. The average packet size is used to
+ * determine the packet buffer size when refilling RX rings. As the entire RX
+ * ring may be refilled at once, the weight is chosen so that the EWMA will be
+ * insensitive to short-term, transient changes in packet size.
+ */
+#define RECEIVE_AVG_WEIGHT 64
+
+/* Minimum alignment for mergeable packet buffers. */
+#define MERGEABLE_BUFFER_ALIGN max(L1_CACHE_BYTES, 256)
+
 #define VIRTNET_DRIVER_VERSION "1.0.0"
 
 struct virtnet_stats {
@@ -78,6 +86,9 @@ struct receive_queue {
 	/* Chain pages by the private ptr. */
 	struct page *pages;
 
+	/* Average packet length for mergeable receive buffers. */
+	struct ewma mrg_avg_pkt_len;
+
 	/* Page frag for packet buffer allocation. */
 	struct page_frag alloc_frag;
 
@@ -219,6 +230,23 @@ static void skb_xmit_done(struct virtqueue *vq)
 	netif_wake_subqueue(vi->dev, vq2txq(vq));
 }
 
+static unsigned int mergeable_ctx_to_buf_truesize(unsigned long mrg_ctx)
+{
+	unsigned int truesize = mrg_ctx & (MERGEABLE_BUFFER_ALIGN - 1);
+	return truesize * MERGEABLE_BUFFER_ALIGN;
+}
+
+static void *mergeable_ctx_to_buf_address(unsigned long mrg_ctx)
+{
+	return (void *)(mrg_ctx & -MERGEABLE_BUFFER_ALIGN);
+
+}
+
+static unsigned long mergeable_buf_to_ctx(void *buf, unsigned int truesize)
+{
+	return (unsigned long)buf | (truesize / MERGEABLE_BUFFER_ALIGN);
+}
+
 /* Called from bottom half context */
 static struct sk_buff *page_to_skb(struct receive_queue *rq,
 				   struct page *page, unsigned int offset,
@@ -327,31 +355,33 @@ err:
 
 static struct sk_buff
[PATCH net-next v5 1/6] net: allow > 0 order atomic page alloc in skb_page_frag_refill
skb_page_frag_refill currently permits only order-0 page allocs
unless GFP_WAIT is used. Change skb_page_frag_refill to attempt
higher-order page allocations whether or not GFP_WAIT is used. If
memory cannot be allocated, the allocator will fall back to
successively smaller page allocs (down to order-0 page allocs).

This change brings skb_page_frag_refill in line with the existing
page allocation strategy employed by netdev_alloc_frag, which attempts
higher-order page allocations whether or not GFP_WAIT is set, falling
back to successively lower-order page allocations on failure. Part
of migration of virtio-net to per-receive queue page frag allocators.

Acked-by: Michael S. Tsirkin <m...@redhat.com>
Acked-by: Eric Dumazet <eduma...@google.com>
Signed-off-by: Michael Dalton <mwdal...@google.com>
---
 net/core/sock.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 85ad6f0..b3f7ee3 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1836,9 +1836,7 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio)
 		put_page(pfrag->page);
 	}
 
-	/* We restrict high order allocations to users that can afford to wait */
-	order = (prio & __GFP_WAIT) ? SKB_FRAG_PAGE_ORDER : 0;
-
+	order = SKB_FRAG_PAGE_ORDER;
 	do {
 		gfp_t gfp = prio;
 
-- 
1.8.5.2
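
The fallback strategy described above (start at the highest order and retry with successively smaller orders down to order 0) can be sketched in user space like this. It is an illustration only: SKETCH_MAX_ORDER is an assumed stand-in for SKB_FRAG_PAGE_ORDER, and sketch_alloc_fn / sketch_demo_alloc are hypothetical hooks that replace the kernel's page allocator.

```c
#include <stddef.h>

#define SKETCH_MAX_ORDER 3	/* assumed stand-in for SKB_FRAG_PAGE_ORDER */

/* Hypothetical allocator hook: returns non-NULL if an allocation of
 * 2^order pages succeeds. */
typedef void *(*sketch_alloc_fn)(unsigned int order, void *opaque);

/* Try the largest order first, then fall back one order at a time.
 * On success, stores the order that worked in *order_out. */
static void *sketch_alloc_with_fallback(sketch_alloc_fn alloc, void *opaque,
					unsigned int *order_out)
{
	int order = SKETCH_MAX_ORDER;

	do {
		void *p = alloc((unsigned int)order, opaque);

		if (p) {
			*order_out = (unsigned int)order;
			return p;
		}
	} while (--order >= 0);

	return NULL;
}

/* Demo hook: pretends only orders <= *max_ok can be satisfied
 * (a negative *max_ok makes every attempt fail). */
static void *sketch_demo_alloc(unsigned int order, void *opaque)
{
	static char dummy;
	const int *max_ok = opaque;

	return (int)order <= *max_ok ? (void *)&dummy : NULL;
}
```

The loop shape mirrors the do/while visible in the diff context; the point of the patch is only that the starting order no longer depends on whether the caller can wait.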
[PATCH net-next v5 0/6] virtio-net: mergeable rx buffer size auto-tuning
The virtio-net device currently uses aligned MTU-sized mergeable receive
packet buffers. Network throughput for workloads with large average
packet size can be improved by posting larger receive packet buffers.
However, due to SKB truesize effects, posting large (e.g., PAGE_SIZE)
buffers reduces the throughput of workloads that do not benefit from GRO
and have no large inbound packets.

This patchset introduces virtio-net mergeable buffer size auto-tuning,
with buffer sizes ranging from aligned MTU-size to PAGE_SIZE. Packet
buffer size is chosen based on a per-receive queue EWMA of incoming
packet size.

To unify mergeable receive buffer memory allocation and improve
SKB frag coalescing, all mergeable buffer memory allocation is
migrated to per-receive queue page frag allocators.

The per-receive queue mergeable packet buffer size is exported via
sysfs, and the network device sysfs layer has been extended to add
support for device-specific per-receive queue sysfs attribute groups.

Michael Dalton (6):
  net: allow > 0 order atomic page alloc in skb_page_frag_refill
  virtio-net: use per-receive queue page frag alloc for mergeable bufs
  virtio-net: auto-tune mergeable rx buffer size for improved performance
  net-sysfs: add support for device-specific rx queue sysfs attributes
  lib: Ensure EWMA does not store wrong intermediate values
  virtio-net: initial rx sysfs support, export mergeable rx buffer size

 drivers/net/virtio_net.c  | 196
 include/linux/netdevice.h |  35
 lib/average.c             |   6
 net/core/dev.c            |  12
 net/core/net-sysfs.c      |  50
 net/core/sock.c           |   4
 6 files changed, 213 insertions(+), 90 deletions(-)

-- 
1.8.5.2
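
As a rough illustration of how a buffer size "ranging from aligned MTU-size to PAGE_SIZE" can be derived from the average packet length, here is a user-space sketch of the clamp-and-align rule used by get_mergeable_buf_len() in patch 6/6. All SKETCH_* constants and sketch_* helpers are assumptions made for this example: a 4096-byte page, a 12-byte mergeable rx header, and 256-byte alignment.

```c
/* Illustrative constants; the driver takes these from kernel headers. */
#define SKETCH_PAGE_SIZE       4096u
#define SKETCH_GOOD_PACKET_LEN 1518u	/* ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN */
#define SKETCH_HDR_LEN         12u	/* assumed mergeable rx header size */
#define SKETCH_BUFFER_ALIGN    256u	/* must be a power of two */

static unsigned int sketch_clamp(unsigned int v, unsigned int lo,
				 unsigned int hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/* Round len up to the next multiple of align (a power of two). */
static unsigned int sketch_align(unsigned int len, unsigned int align)
{
	return (len + align - 1) & ~(align - 1);
}

/* Buffer length for a given average packet length: header plus the
 * average clamped to [MTU-sized packet, page minus header], rounded
 * up to the buffer alignment. */
static unsigned int sketch_mergeable_buf_len(unsigned int avg_pkt_len)
{
	unsigned int len = SKETCH_HDR_LEN +
		sketch_clamp(avg_pkt_len, SKETCH_GOOD_PACKET_LEN,
			     SKETCH_PAGE_SIZE - SKETCH_HDR_LEN);

	return sketch_align(len, SKETCH_BUFFER_ALIGN);
}
```

Under these assumptions a queue that mostly sees small packets posts 1536-byte buffers, an average of 3000 bytes yields 3072-byte buffers, and anything at or above a page saturates at 4096 bytes.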
[PATCH net-next v5 4/6] net-sysfs: add support for device-specific rx queue sysfs attributes
Extend existing support for netdevice receive queue sysfs attributes to
permit a device-specific attribute group. Initial use case for this
support will be to allow the virtio-net device to export per-receive
queue mergeable receive buffer size.

Signed-off-by: Michael Dalton <mwdal...@google.com>
---
v4->v5: Handle sysfs_create_group failure. Call sysfs_remove_group when
        removing a RX queue kobj if a device-specific group exists.
v3->v4: Simplify by removing loop in get_netdev_rx_queue_index.

 include/linux/netdevice.h | 35
 net/core/dev.c            | 12
 net/core/net-sysfs.c      | 50
 3 files changed, 66 insertions(+), 31 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5c88ab1..38929bc 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -668,15 +668,28 @@ extern struct rps_sock_flow_table __rcu *rps_sock_flow_table;
 bool rps_may_expire_flow(struct net_device *dev, u16 rxq_index, u32 flow_id,
 			 u16 filter_id);
 #endif
+#endif /* CONFIG_RPS */
 
 /* This structure contains an instance of an RX queue. */
 struct netdev_rx_queue {
+#ifdef CONFIG_RPS
 	struct rps_map __rcu		*rps_map;
 	struct rps_dev_flow_table __rcu	*rps_flow_table;
+#endif
 	struct kobject			kobj;
 	struct net_device		*dev;
 } ____cacheline_aligned_in_smp;
-#endif /* CONFIG_RPS */
+
+/*
+ * RX queue sysfs structures and functions.
+ */
+struct rx_queue_attribute {
+	struct attribute attr;
+	ssize_t (*show)(struct netdev_rx_queue *queue,
+	    struct rx_queue_attribute *attr, char *buf);
+	ssize_t (*store)(struct netdev_rx_queue *queue,
+	    struct rx_queue_attribute *attr, const char *buf, size_t len);
+};
 
 #ifdef CONFIG_XPS
 /*
@@ -1313,7 +1326,7 @@ struct net_device {
 						   unicast) */
 
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 	struct netdev_rx_queue	*_rx;
 
 	/* Number of RX queues allocated at register_netdev() time */
@@ -1424,6 +1437,8 @@ struct net_device {
 	struct device		dev;
 	/* space for optional device, statistics, and wireless sysfs groups */
 	const struct attribute_group *sysfs_groups[4];
+	/* space for optional per-rx queue attributes */
+	const struct attribute_group *sysfs_rx_queue_group;
 
 	/* rtnetlink link ops */
 	const struct rtnl_link_ops *rtnl_link_ops;
@@ -2374,7 +2389,7 @@ static inline bool netif_is_multiqueue(const struct net_device *dev)
 
 int netif_set_real_num_tx_queues(struct net_device *dev, unsigned int txq);
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 int netif_set_real_num_rx_queues(struct net_device *dev, unsigned int rxq);
 #else
 static inline int netif_set_real_num_rx_queues(struct net_device *dev,
@@ -2393,7 +2408,7 @@ static inline int netif_copy_real_num_queues(struct net_device *to_dev,
 					   from_dev->real_num_tx_queues);
 	if (err)
 		return err;
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 	return netif_set_real_num_rx_queues(to_dev,
 					    from_dev->real_num_rx_queues);
 #else
@@ -2401,6 +2416,18 @@ static inline int netif_copy_real_num_queues(struct net_device *to_dev,
 #endif
 }
 
+#ifdef CONFIG_SYSFS
+static inline unsigned int get_netdev_rx_queue_index(
+		struct netdev_rx_queue *queue)
+{
+	struct net_device *dev = queue->dev;
+	int index = queue - dev->_rx;
+
+	BUG_ON(index >= dev->num_rx_queues);
+	return index;
+}
+#endif
+
 #define DEFAULT_MAX_NUM_RSS_QUEUES	(8)
 int netif_get_num_default_rss_queues(void);
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 20c834e..4be7931 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2080,7 +2080,7 @@ int netif_set_real_num_tx_queues(struct net_device *dev, unsigned int txq)
 }
 EXPORT_SYMBOL(netif_set_real_num_tx_queues);
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 /**
  *	netif_set_real_num_rx_queues - set actual number of RX queues used
  *	@dev: Network device
@@ -5727,7 +5727,7 @@ void netif_stacked_transfer_operstate(const struct net_device *rootdev,
 }
 EXPORT_SYMBOL(netif_stacked_transfer_operstate);
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 static int netif_alloc_rx_queues(struct net_device *dev)
 {
 	unsigned int i, count = dev->num_rx_queues;
@@ -6272,7 +6272,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 		return NULL;
 	}
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 	if (rxqs < 1) {
 		pr_err("alloc_netdev: Unable to allocate device with zero RX queues\n");
 		return NULL;
@@ -6328,7 +6328,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	if (netif_alloc_netdev_queues(dev
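
The loop-free index lookup mentioned in the v3->v4 note relies on the RX queues living in one contiguous array, so a queue's index is simply its pointer minus the array base. A user-space sketch of that idea follows; struct sketch_dev, struct sketch_rx_queue, and sketch_rx_queue_index() are hypothetical names, and the range check returns -1 where the kernel helper uses BUG_ON.

```c
#include <stddef.h>

struct sketch_dev;

/* Hypothetical stand-in for a per-queue object that points back at
 * its owning device. */
struct sketch_rx_queue {
	struct sketch_dev *dev;
};

/* Hypothetical stand-in for the device: 'rx' is the base of a
 * contiguous array of 'num_rx_queues' queues. */
struct sketch_dev {
	struct sketch_rx_queue *rx;
	unsigned int num_rx_queues;
};

/* Index of 'queue' within its device's queue array, computed by
 * pointer subtraction; -1 if the result is out of range. */
static int sketch_rx_queue_index(const struct sketch_rx_queue *queue)
{
	const struct sketch_dev *dev = queue->dev;
	ptrdiff_t index = queue - dev->rx;

	if (index < 0 || (size_t)index >= dev->num_rx_queues)
		return -1;
	return (int)index;
}
```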
[PATCH net-next v5 5/6] lib: Ensure EWMA does not store wrong intermediate values
To ensure ewma_read() without a lock returns a valid but possibly
out of date average, modify ewma_add() by using ACCESS_ONCE to prevent
intermediate wrong values from being written to avg->internal.

Suggested-by: Eric Dumazet <eric.duma...@gmail.com>
Acked-by: Michael S. Tsirkin <m...@redhat.com>
Acked-by: Eric Dumazet <eduma...@google.com>
Signed-off-by: Michael Dalton <mwdal...@google.com>
---
 lib/average.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/lib/average.c b/lib/average.c
index 99a67e6..114d1be 100644
--- a/lib/average.c
+++ b/lib/average.c
@@ -53,8 +53,10 @@ EXPORT_SYMBOL(ewma_init);
  */
 struct ewma *ewma_add(struct ewma *avg, unsigned long val)
 {
-	avg->internal = avg->internal  ?
-		(((avg->internal << avg->weight) - avg->internal) +
+	unsigned long internal = ACCESS_ONCE(avg->internal);
+
+	ACCESS_ONCE(avg->internal) = internal ?
+		(((internal << avg->weight) - internal) +
 			(val << avg->factor)) >> avg->weight :
 		(val << avg->factor);
 	return avg;
-- 
1.8.5.2
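
For readers unfamiliar with the fixed-point update being patched here, a minimal user-space sketch is given below. It is not the kernel implementation: struct sketch_ewma and the sketch_* functions are hypothetical, factor and weight are passed directly as log2 shift counts, and a volatile access stands in for ACCESS_ONCE so that the stored value is read once and written once per update.

```c
/* User-space stand-in for the kernel's struct ewma. Both 'factor' and
 * 'weight' are stored as log2 values and applied as shifts. */
struct sketch_ewma {
	unsigned long internal;	/* average scaled by 2^factor */
	unsigned long factor;	/* log2 of the scaling factor */
	unsigned long weight;	/* log2 of the weight */
};

static void sketch_ewma_init(struct sketch_ewma *avg,
			     unsigned long factor_log2,
			     unsigned long weight_log2)
{
	avg->internal = 0;
	avg->factor = factor_log2;
	avg->weight = weight_log2;
}

static void sketch_ewma_add(struct sketch_ewma *avg, unsigned long val)
{
	/* Read the shared value exactly once into a local... */
	unsigned long internal = *(volatile unsigned long *)&avg->internal;
	unsigned long next = internal ?
		(((internal << avg->weight) - internal) +
			(val << avg->factor)) >> avg->weight :
		(val << avg->factor);

	/* ...and publish the finished result with a single store, so a
	 * lockless reader never observes a half-computed value. */
	*(volatile unsigned long *)&avg->internal = next;
}

static unsigned long sketch_ewma_read(const struct sketch_ewma *avg)
{
	return *(const volatile unsigned long *)&avg->internal >> avg->factor;
}
```

With a weight of 64 (log2 = 6) and no scaling, a first sample of 1000 followed by a sample of 2000 moves the average only to 1015, which is the slow response to transient changes that the RECEIVE_AVG_WEIGHT comment in patch 3/6 asks for.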
[PATCH net-next v5 6/6] virtio-net: initial rx sysfs support, export mergeable rx buffer size
Add initial support for per-rx queue sysfs attributes to virtio-net. If
mergeable packet buffers are enabled, adds a read-only mergeable packet
buffer size sysfs attribute for each RX queue.

Suggested-by: Michael S. Tsirkin <m...@redhat.com>
Acked-by: Michael S. Tsirkin <m...@redhat.com>
Signed-off-by: Michael Dalton <mwdal...@google.com>
---
v3->v4: Remove seqcount due to EWMA changes in patch 5.
        Add missing Suggested-By.

 drivers/net/virtio_net.c | 46
 1 file changed, 42 insertions(+), 4 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 3e82311..968eacd 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -604,18 +604,25 @@ static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
 	return err;
 }
 
-static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
+static unsigned int get_mergeable_buf_len(struct ewma *avg_pkt_len)
 {
 	const size_t hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+	unsigned int len;
+
+	len = hdr_len + clamp_t(unsigned int, ewma_read(avg_pkt_len),
+			GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
+	return ALIGN(len, MERGEABLE_BUFFER_ALIGN);
+}
+
+static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
+{
 	struct page_frag *alloc_frag = &rq->alloc_frag;
 	char *buf;
 	unsigned long ctx;
 	int err;
 	unsigned int len, hole;
 
-	len = hdr_len + clamp_t(unsigned int, ewma_read(&rq->mrg_avg_pkt_len),
-				GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
-	len = ALIGN(len, MERGEABLE_BUFFER_ALIGN);
+	len = get_mergeable_buf_len(&rq->mrg_avg_pkt_len);
 	if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp)))
 		return -ENOMEM;
 
@@ -1594,6 +1601,33 @@ err:
 	return ret;
 }
 
+#ifdef CONFIG_SYSFS
+static ssize_t mergeable_rx_buffer_size_show(struct netdev_rx_queue *queue,
+		struct rx_queue_attribute *attribute, char *buf)
+{
+	struct virtnet_info *vi = netdev_priv(queue->dev);
+	unsigned int queue_index = get_netdev_rx_queue_index(queue);
+	struct ewma *avg;
+
+	BUG_ON(queue_index >= vi->max_queue_pairs);
+	avg = &vi->rq[queue_index].mrg_avg_pkt_len;
+	return sprintf(buf, "%u\n", get_mergeable_buf_len(avg));
+}
+
+static struct rx_queue_attribute mergeable_rx_buffer_size_attribute =
+	__ATTR_RO(mergeable_rx_buffer_size);
+
+static struct attribute *virtio_net_mrg_rx_attrs[] = {
+	&mergeable_rx_buffer_size_attribute.attr,
+	NULL
+};
+
+static const struct attribute_group virtio_net_mrg_rx_group = {
+	.name = "virtio_net",
+	.attrs = virtio_net_mrg_rx_attrs
+};
+#endif
+
 static int virtnet_probe(struct virtio_device *vdev)
 {
 	int i, err;
@@ -1708,6 +1742,10 @@ static int virtnet_probe(struct virtio_device *vdev)
 	if (err)
 		goto free_stats;
 
+#ifdef CONFIG_SYSFS
+	if (vi->mergeable_rx_bufs)
+		dev->sysfs_rx_queue_group = &virtio_net_mrg_rx_group;
+#endif
 	netif_set_real_num_tx_queues(dev, vi->curr_queue_pairs);
 	netif_set_real_num_rx_queues(dev, vi->curr_queue_pairs);
 
-- 
1.8.5.2
[PATCH net-next v6 0/6] virtio-net: mergeable rx buffer size auto-tuning
The virtio-net device currently uses aligned MTU-sized mergeable receive
packet buffers. Network throughput for workloads with large average
packet size can be improved by posting larger receive packet buffers.
However, due to SKB truesize effects, posting large (e.g., PAGE_SIZE)
buffers reduces the throughput of workloads that do not benefit from GRO
and have no large inbound packets.

This patchset introduces virtio-net mergeable buffer size auto-tuning,
with buffer sizes ranging from aligned MTU-size to PAGE_SIZE. Packet
buffer size is chosen based on a per-receive queue EWMA of incoming
packet size.

To unify mergeable receive buffer memory allocation and improve
SKB frag coalescing, all mergeable buffer memory allocation is
migrated to per-receive queue page frag allocators.

The per-receive queue mergeable packet buffer size is exported via
sysfs, and the network device sysfs layer has been extended to add
support for device-specific per-receive queue sysfs attribute groups.

Michael Dalton (6):
  net: allow > 0 order atomic page alloc in skb_page_frag_refill
  virtio-net: use per-receive queue page frag alloc for mergeable bufs
  virtio-net: auto-tune mergeable rx buffer size for improved performance
  net-sysfs: add support for device-specific rx queue sysfs attributes
  lib: Ensure EWMA does not store wrong intermediate values
  virtio-net: initial rx sysfs support, export mergeable rx buffer size

 drivers/net/virtio_net.c  | 197
 include/linux/netdevice.h |  35
 lib/average.c             |   6
 net/core/dev.c            |  12
 net/core/net-sysfs.c      |  50
 net/core/sock.c           |   4
 6 files changed, 214 insertions(+), 90 deletions(-)

-- 
1.8.5.2
[PATCH net-next v6 1/6] net: allow > 0 order atomic page alloc in skb_page_frag_refill
skb_page_frag_refill currently permits only order-0 page allocs
unless GFP_WAIT is used. Change skb_page_frag_refill to attempt
higher-order page allocations whether or not GFP_WAIT is used. If
memory cannot be allocated, the allocator will fall back to
successively smaller page allocs (down to order-0 page allocs).

This change brings skb_page_frag_refill in line with the existing
page allocation strategy employed by netdev_alloc_frag, which attempts
higher-order page allocations whether or not GFP_WAIT is set, falling
back to successively lower-order page allocations on failure. Part
of migration of virtio-net to per-receive queue page frag allocators.

Acked-by: Michael S. Tsirkin <m...@redhat.com>
Acked-by: Eric Dumazet <eduma...@google.com>
Signed-off-by: Michael Dalton <mwdal...@google.com>
---
 net/core/sock.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 85ad6f0..b3f7ee3 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1836,9 +1836,7 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio)
 		put_page(pfrag->page);
 	}
 
-	/* We restrict high order allocations to users that can afford to wait */
-	order = (prio & __GFP_WAIT) ? SKB_FRAG_PAGE_ORDER : 0;
-
+	order = SKB_FRAG_PAGE_ORDER;
 	do {
 		gfp_t gfp = prio;
 
-- 
1.8.5.2
[PATCH net-next v6 3/6] virtio-net: auto-tune mergeable rx buffer size for improved performance
Commit 2613af0ed18a (virtio_net: migrate mergeable rx buffers to page frag allocators) changed the mergeable receive buffer size from PAGE_SIZE to MTU-size, introducing a single-stream regression for benchmarks with large average packet size. There is no single optimal buffer size for all workloads. For workloads with packet size = MTU bytes, MTU + virtio-net header-sized buffers are preferred as larger buffers reduce the TCP window due to SKB truesize. However, single-stream workloads with large average packet sizes have higher throughput if larger (e.g., PAGE_SIZE) buffers are used. This commit auto-tunes the mergeable receiver buffer packet size by choosing the packet buffer size based on an EWMA of the recent packet sizes for the receive queue. Packet buffer sizes range from MTU_SIZE + virtio-net header len to PAGE_SIZE. This improves throughput for large packet workloads, as any workload with average packet size = PAGE_SIZE will use PAGE_SIZE buffers. These optimizations interact positively with recent commit ba275241030c (virtio-net: coalesce rx frags when possible during rx), which coalesces adjacent RX SKB fragments in virtio_net. The coalescing optimizations benefit buffers of any size. Benchmarks taken from an average of 5 netperf 30-second TCP_STREAM runs between two QEMU VMs on a single physical machine. Each VM has two VCPUs with all offloads vhost enabled. All VMs and vhost threads run in a single 4 CPU cgroup cpuset, using cgroups to ensure that other processes in the system will not be scheduled on the benchmark CPUs. Trunk includes SKB rx frag coalescing. net-next w/ virtio_net before 2613af0ed18a (PAGE_SIZE bufs): 14642.85Gb/s net-next (MTU-size bufs): 13170.01Gb/s net-next + auto-tune: 14555.94Gb/s Jason Wang also reported a throughput increase on mlx4 from 22Gb/s using MTU-sized buffers to about 26Gb/s using auto-tuning. Signed-off-by: Michael Dalton mwdal...@google.com --- v5-v6: Fix merge conflict. 
Subtract 1 before encoding the scaled truesize for a mergeable buffer ctx to support 64KB PAGE_SIZE. v2-v3: Remove per-receive queue metadata ring. Encode packet buffer base address and truesize into an unsigned long by requiring a minimum packet size alignment of 256. Permit attempts to fill an already-full RX ring (reverting the change in v2). v1-v2: Add per-receive queue metadata ring to track precise truesize for mergeable receive buffers. Remove all truesize approximation. Never try to fill a full RX ring (required for metadata ring in v2). drivers/net/virtio_net.c | 100 +++ 1 file changed, 75 insertions(+), 25 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 5ee71dc..dacd43b 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -26,6 +26,7 @@ #include linux/if_vlan.h #include linux/slab.h #include linux/cpu.h +#include linux/average.h static int napi_weight = NAPI_POLL_WEIGHT; module_param(napi_weight, int, 0444); @@ -36,11 +37,18 @@ module_param(gso, bool, 0444); /* FIXME: MTU in config. */ #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN) -#define MERGE_BUFFER_LEN (ALIGN(GOOD_PACKET_LEN + \ -sizeof(struct virtio_net_hdr_mrg_rxbuf), \ -L1_CACHE_BYTES)) #define GOOD_COPY_LEN 128 +/* Weight used for the RX packet size EWMA. The average packet size is used to + * determine the packet buffer size when refilling RX rings. As the entire RX + * ring may be refilled at once, the weight is chosen so that the EWMA will be + * insensitive to short-term, transient changes in packet size. + */ +#define RECEIVE_AVG_WEIGHT 64 + +/* Minimum alignment for mergeable packet buffers. */ +#define MERGEABLE_BUFFER_ALIGN max(L1_CACHE_BYTES, 256) + #define VIRTNET_DRIVER_VERSION 1.0.0 struct virtnet_stats { @@ -75,6 +83,9 @@ struct receive_queue { /* Chain pages by the private ptr. */ struct page *pages; + /* Average packet length for mergeable receive buffers. 
*/ + struct ewma mrg_avg_pkt_len; + /* Page frag for packet buffer allocation. */ struct page_frag alloc_frag; @@ -216,6 +227,24 @@ static void skb_xmit_done(struct virtqueue *vq) netif_wake_subqueue(vi-dev, vq2txq(vq)); } +static unsigned int mergeable_ctx_to_buf_truesize(unsigned long mrg_ctx) +{ + unsigned int truesize = mrg_ctx (MERGEABLE_BUFFER_ALIGN - 1); + return (truesize + 1) * MERGEABLE_BUFFER_ALIGN; +} + +static void *mergeable_ctx_to_buf_address(unsigned long mrg_ctx) +{ + return (void *)(mrg_ctx -MERGEABLE_BUFFER_ALIGN); + +} + +static unsigned long mergeable_buf_to_ctx(void *buf, unsigned int truesize) +{ + unsigned int size = truesize / MERGEABLE_BUFFER_ALIGN; + return (unsigned long)buf | (size - 1); +} + /* Called from bottom half context */ static struct sk_buff *page_to_skb(struct
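The context helpers in this patch pack a buffer address and its truesize into one unsigned long. A minimal user-space sketch of that encoding follows, assuming a fixed alignment of 256 (the patch uses max(L1_CACHE_BYTES, 256)); the sketch_ names and the use of uintptr_t are illustrative assumptions, not the driver's code.

```c
#include <stdint.h>

/* Assumed alignment; the patch defines it as max(L1_CACHE_BYTES, 256). */
#define SKETCH_BUFFER_ALIGN 256u

/*
 * Pack an aligned buffer address and its truesize into one word.
 * truesize must be a non-zero multiple of the alignment. Storing
 * (truesize / align) - 1 lets a 64KB buffer fit in the low 8 bits.
 */
static uintptr_t sketch_buf_to_ctx(uintptr_t addr, unsigned int truesize)
{
	unsigned int size = truesize / SKETCH_BUFFER_ALIGN;

	return addr | (size - 1);
}

/* Recover the truesize from the low bits of the context word. */
static unsigned int sketch_ctx_to_truesize(uintptr_t ctx)
{
	unsigned int scaled = ctx & (SKETCH_BUFFER_ALIGN - 1);

	return (scaled + 1) * SKETCH_BUFFER_ALIGN;
}

/* Recover the buffer address by masking off the low bits. */
static uintptr_t sketch_ctx_to_addr(uintptr_t ctx)
{
	return ctx & ~(uintptr_t)(SKETCH_BUFFER_ALIGN - 1);
}
```

The subtract-1 step is what the v5-v6 changelog refers to: without it, a 64KB truesize would scale to 256, which does not fit in 8 bits.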
[PATCH net-next v6 4/6] net-sysfs: add support for device-specific rx queue sysfs attributes
Extend existing support for netdevice receive queue sysfs attributes to permit a device-specific attribute group. Initial use case for this support will be to allow the virtio-net device to export per-receive queue mergeable receive buffer size. Signed-off-by: Michael Dalton mwdal...@google.com --- v4-v5: Handle sysfs_create_group failure. Call sysfs_remove_group when removing a RX queue kobj if a device-specific group exists. v3-v4: Simplify by removing loop in get_netdev_rx_queue_index. include/linux/netdevice.h | 35 + net/core/dev.c| 12 ++-- net/core/net-sysfs.c | 50 +++ 3 files changed, 66 insertions(+), 31 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index d7668b88..e985231 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -668,15 +668,28 @@ extern struct rps_sock_flow_table __rcu *rps_sock_flow_table; bool rps_may_expire_flow(struct net_device *dev, u16 rxq_index, u32 flow_id, u16 filter_id); #endif +#endif /* CONFIG_RPS */ /* This structure contains an instance of an RX queue. */ struct netdev_rx_queue { +#ifdef CONFIG_RPS struct rps_map __rcu*rps_map; struct rps_dev_flow_table __rcu *rps_flow_table; +#endif struct kobject kobj; struct net_device *dev; } cacheline_aligned_in_smp; -#endif /* CONFIG_RPS */ + +/* + * RX queue sysfs structures and functions. 
+ */ +struct rx_queue_attribute { + struct attribute attr; + ssize_t (*show)(struct netdev_rx_queue *queue, + struct rx_queue_attribute *attr, char *buf); + ssize_t (*store)(struct netdev_rx_queue *queue, + struct rx_queue_attribute *attr, const char *buf, size_t len); +}; #ifdef CONFIG_XPS /* @@ -1313,7 +1326,7 @@ struct net_device { unicast) */ -#ifdef CONFIG_RPS +#ifdef CONFIG_SYSFS struct netdev_rx_queue *_rx; /* Number of RX queues allocated at register_netdev() time */ @@ -1424,6 +1437,8 @@ struct net_device { struct device dev; /* space for optional device, statistics, and wireless sysfs groups */ const struct attribute_group *sysfs_groups[4]; + /* space for optional per-rx queue attributes */ + const struct attribute_group *sysfs_rx_queue_group; /* rtnetlink link ops */ const struct rtnl_link_ops *rtnl_link_ops; @@ -2375,7 +2390,7 @@ static inline bool netif_is_multiqueue(const struct net_device *dev) int netif_set_real_num_tx_queues(struct net_device *dev, unsigned int txq); -#ifdef CONFIG_RPS +#ifdef CONFIG_SYSFS int netif_set_real_num_rx_queues(struct net_device *dev, unsigned int rxq); #else static inline int netif_set_real_num_rx_queues(struct net_device *dev, @@ -2394,7 +2409,7 @@ static inline int netif_copy_real_num_queues(struct net_device *to_dev, from_dev-real_num_tx_queues); if (err) return err; -#ifdef CONFIG_RPS +#ifdef CONFIG_SYSFS return netif_set_real_num_rx_queues(to_dev, from_dev-real_num_rx_queues); #else @@ -2402,6 +2417,18 @@ static inline int netif_copy_real_num_queues(struct net_device *to_dev, #endif } +#ifdef CONFIG_SYSFS +static inline unsigned int get_netdev_rx_queue_index( + struct netdev_rx_queue *queue) +{ + struct net_device *dev = queue-dev; + int index = queue - dev-_rx; + + BUG_ON(index = dev-num_rx_queues); + return index; +} +#endif + #define DEFAULT_MAX_NUM_RSS_QUEUES (8) int netif_get_num_default_rss_queues(void); diff --git a/net/core/dev.c b/net/core/dev.c index f87bedd..288df62 100644 --- a/net/core/dev.c +++ 
b/net/core/dev.c @@ -2083,7 +2083,7 @@ int netif_set_real_num_tx_queues(struct net_device *dev, unsigned int txq) } EXPORT_SYMBOL(netif_set_real_num_tx_queues); -#ifdef CONFIG_RPS +#ifdef CONFIG_SYSFS /** * netif_set_real_num_rx_queues - set actual number of RX queues used * @dev: Network device @@ -5764,7 +5764,7 @@ void netif_stacked_transfer_operstate(const struct net_device *rootdev, } EXPORT_SYMBOL(netif_stacked_transfer_operstate); -#ifdef CONFIG_RPS +#ifdef CONFIG_SYSFS static int netif_alloc_rx_queues(struct net_device *dev) { unsigned int i, count = dev-num_rx_queues; @@ -6309,7 +6309,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name, return NULL; } -#ifdef CONFIG_RPS +#ifdef CONFIG_SYSFS if (rxqs 1) { pr_err(alloc_netdev: Unable to allocate device with zero RX queues\n); return NULL; @@ -6365,7 +6365,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name, if (netif_alloc_netdev_queues(dev
[PATCH net-next v6 5/6] lib: Ensure EWMA does not store wrong intermediate values
To ensure ewma_read() without a lock returns a valid but possibly out of date average, modify ewma_add() by using ACCESS_ONCE to prevent intermediate wrong values from being written to avg->internal.

Suggested-by: Eric Dumazet eric.duma...@gmail.com
Acked-by: Michael S. Tsirkin m...@redhat.com
Acked-by: Eric Dumazet eduma...@google.com
Signed-off-by: Michael Dalton mwdal...@google.com
---
 lib/average.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/lib/average.c b/lib/average.c
index 99a67e6..114d1be 100644
--- a/lib/average.c
+++ b/lib/average.c
@@ -53,8 +53,10 @@ EXPORT_SYMBOL(ewma_init);
  */
 struct ewma *ewma_add(struct ewma *avg, unsigned long val)
 {
-	avg->internal = avg->internal ?
-		(((avg->internal << avg->weight) - avg->internal) +
+	unsigned long internal = ACCESS_ONCE(avg->internal);
+
+	ACCESS_ONCE(avg->internal) = internal ?
+		(((internal << avg->weight) - internal) +
 			(val << avg->factor)) >> avg->weight :
 		(val << avg->factor);
 	return avg;
-- 
1.8.5.2
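The fixed-point update in ewma_add() can be sketched in user space as below. This is a single-threaded illustration under stated assumptions: the sketch_ names are hypothetical, the shift amounts are passed directly instead of being derived from power-of-two factor and weight values, and the kernel's ACCESS_ONCE is not reproduced. The point it keeps from the patch is that the old value is read once into a local and the new value is stored once.

```c
/* Hypothetical user-space mirror of the EWMA state. */
struct sketch_ewma {
	unsigned long internal;      /* scaled running average */
	unsigned int factor_shift;   /* log2 of the scaling factor */
	unsigned int weight_shift;   /* log2 of the averaging weight */
};

static void sketch_ewma_init(struct sketch_ewma *avg,
			     unsigned int factor_shift,
			     unsigned int weight_shift)
{
	avg->internal = 0;
	avg->factor_shift = factor_shift;
	avg->weight_shift = weight_shift;
}

/*
 * Read the old value once, compute the new average in locals, and
 * write the result back with a single store, so a reader never sees
 * a partially computed value.
 */
static void sketch_ewma_add(struct sketch_ewma *avg, unsigned long val)
{
	unsigned long internal = avg->internal;

	avg->internal = internal ?
		(((internal << avg->weight_shift) - internal) +
			(val << avg->factor_shift)) >> avg->weight_shift :
		(val << avg->factor_shift);
}

/* Undo the scaling to obtain the average in the caller's units. */
static unsigned long sketch_ewma_read(const struct sketch_ewma *avg)
{
	return avg->internal >> avg->factor_shift;
}
```

With a weight shift of 1, each new sample contributes half of the result, so feeding 100 and then 200 yields an average of 150.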
[PATCH net-next v6 6/6] virtio-net: initial rx sysfs support, export mergeable rx buffer size
Add initial support for per-rx queue sysfs attributes to virtio-net. If mergeable packet buffers are enabled, adds a read-only mergeable packet buffer size sysfs attribute for each RX queue.

Suggested-by: Michael S. Tsirkin m...@redhat.com
Acked-by: Michael S. Tsirkin m...@redhat.com
Signed-off-by: Michael Dalton mwdal...@google.com
---
v3->v4: Remove seqcount due to EWMA changes in patch 5.
        Add missing Suggested-By.

 drivers/net/virtio_net.c | 46 ++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 42 insertions(+), 4 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index dacd43b..d75f8ed 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -600,18 +600,25 @@ static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
 	return err;
 }
 
-static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
+static unsigned int get_mergeable_buf_len(struct ewma *avg_pkt_len)
 {
 	const size_t hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+	unsigned int len;
+
+	len = hdr_len + clamp_t(unsigned int, ewma_read(avg_pkt_len),
+			GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
+	return ALIGN(len, MERGEABLE_BUFFER_ALIGN);
+}
+
+static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
+{
 	struct page_frag *alloc_frag = &rq->alloc_frag;
 	char *buf;
 	unsigned long ctx;
 	int err;
 	unsigned int len, hole;
 
-	len = hdr_len + clamp_t(unsigned int, ewma_read(&rq->mrg_avg_pkt_len),
-				GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
-	len = ALIGN(len, MERGEABLE_BUFFER_ALIGN);
+	len = get_mergeable_buf_len(&rq->mrg_avg_pkt_len);
 	if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp)))
 		return -ENOMEM;
 
@@ -1584,6 +1591,33 @@ err:
 	return ret;
 }
 
+#ifdef CONFIG_SYSFS
+static ssize_t mergeable_rx_buffer_size_show(struct netdev_rx_queue *queue,
+		struct rx_queue_attribute *attribute, char *buf)
+{
+	struct virtnet_info *vi = netdev_priv(queue->dev);
+	unsigned int queue_index = get_netdev_rx_queue_index(queue);
+	struct ewma *avg;
+
+	BUG_ON(queue_index >= vi->max_queue_pairs);
+	avg = &vi->rq[queue_index].mrg_avg_pkt_len;
+	return sprintf(buf, "%u\n", get_mergeable_buf_len(avg));
+}
+
+static struct rx_queue_attribute mergeable_rx_buffer_size_attribute =
+	__ATTR_RO(mergeable_rx_buffer_size);
+
+static struct attribute *virtio_net_mrg_rx_attrs[] = {
+	&mergeable_rx_buffer_size_attribute.attr,
+	NULL
+};
+
+static const struct attribute_group virtio_net_mrg_rx_group = {
+	.name = "virtio_net",
+	.attrs = virtio_net_mrg_rx_attrs
+};
+#endif
+
 static int virtnet_probe(struct virtio_device *vdev)
 {
 	int i, err;
@@ -1698,6 +1732,10 @@ static int virtnet_probe(struct virtio_device *vdev)
 	if (err)
 		goto free_stats;
 
+#ifdef CONFIG_SYSFS
+	if (vi->mergeable_rx_bufs)
+		dev->sysfs_rx_queue_group = &virtio_net_mrg_rx_group;
+#endif
 	netif_set_real_num_tx_queues(dev, vi->curr_queue_pairs);
 	netif_set_real_num_rx_queues(dev, vi->curr_queue_pairs);
 
-- 
1.8.5.2
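The buffer length exported through sysfs is the clamped and aligned EWMA computed by get_mergeable_buf_len(). A rough user-space sketch of that arithmetic follows. The constants are assumptions chosen for illustration (a 12-byte mergeable header, GOOD_PACKET_LEN of 1518, 4KB pages, 256-byte alignment), and the sketch_ helpers are hypothetical stand-ins for the kernel's clamp_t and ALIGN.

```c
/* Assumed values for illustration only. */
#define SKETCH_HDR_LEN          12u    /* mergeable rx buffer header size */
#define SKETCH_GOOD_PACKET_LEN  1518u  /* ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN */
#define SKETCH_PAGE_SIZE        4096u
#define SKETCH_BUFFER_ALIGN     256u

/* Limit val to the inclusive range [lo, hi]. */
static unsigned int sketch_clamp(unsigned int val, unsigned int lo,
				 unsigned int hi)
{
	if (val < lo)
		return lo;
	if (val > hi)
		return hi;
	return val;
}

/* Round len up to a multiple of align (align must be a power of two). */
static unsigned int sketch_align_up(unsigned int len, unsigned int align)
{
	return (len + align - 1) & ~(align - 1);
}

/* Header plus clamped average packet length, rounded up to the alignment. */
static unsigned int sketch_mergeable_buf_len(unsigned int avg_pkt_len)
{
	unsigned int len;

	len = SKETCH_HDR_LEN + sketch_clamp(avg_pkt_len,
					    SKETCH_GOOD_PACKET_LEN,
					    SKETCH_PAGE_SIZE - SKETCH_HDR_LEN);
	return sketch_align_up(len, SKETCH_BUFFER_ALIGN);
}
```

Under these assumptions an idle queue reports 1536 and a queue whose average packet length is at or above the page size reports 4096.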
Re: [PATCH net-next v2 4/4] virtio-net: initial debugfs support, export mergeable rx buffer size
I'd like to confirm the preferred sysfs path structure for mergeable receive buffers. Is 'mergeable_rx_buffer_size' the right attribute name to use or is there a strong preference for a different name?

I believe the current approach proposed for the next patchset is to use a per-netdev attribute group which we will add to the receive queue kobj (struct netdev_rx_queue). That leaves us with at least two options:

(1) Name the attribute group something, e.g., 'virtio-net', in which case all virtio-net attributes for eth0 queue N will be of the form:
    /sys/class/net/eth0/queues/rx-N/virtio-net/<attribute name>

(2) Do not name the attribute group (leave the name NULL), in which case AFAICT virtio-net and device-independent attributes would be mixed without any indication. For example, all virtio-net attributes for netdev eth0 queue N would be of the form:
    /sys/class/net/eth0/queues/rx-N/<attribute name>

FWIW, the bonding netdev has a similar sysfs issue and uses a per-netdev attribute group (stored in the 'sysfs_groups' field of struct net_device). In the case of bonding, the attribute group is named, so device-independent netdev attributes are found in /sys/class/net/eth0/<attribute name> while bonding attributes are placed in /sys/class/net/eth0/bonding/<attribute name>.

So it seems like there is some precedent for using an attribute group name corresponding to the driver name. Does using an attribute group name of 'virtio-net' sound good or would an empty or different attribute group name be preferred?

Best,

Mike
Re: [PATCH net-next v2 4/4] virtio-net: initial debugfs support, export mergeable rx buffer size
On Mon, Jan 13, 2014 at 7:38 AM, Ben Hutchings bhutchi...@solarflare.com wrote:
> I don't think RPS should own this structure. It's just that there are
> currently no per-RX-queue attributes other than those defined by RPS.

Agreed, there is useful attribute-independent functionality already built around netdev_rx_queue - e.g., dynamically resizing the rx queue kobjs as the number of RX queues enabled for the netdev is changed. While the current attributes happen to be used only by RPS, AFAICT it seems RPS should not own netdev_rx_queue but rather should own the RPS-specific fields themselves within netdev_rx_queue.

If there are no objections, it seems like I could modify netdev_rx_queue and related functionality so that their existence does not depend on CONFIG_RPS, and instead just have CONFIG_RPS control whether or not the RPS-specific attributes/fields are present.

Best,

Mike
Re: [PATCH net-next v2 4/4] virtio-net: initial debugfs support, export mergeable rx buffer size
Sorry, I missed an important piece of information: it appears that netdev_queue (the TX equivalent of netdev_rx_queue) has already been decoupled from CONFIG_XPS because of an attribute, queue_trans_timeout, that does not depend on XPS functionality. So something roughly equivalent has already happened on the TX side.

Best,

Mike
Re: [PATCH net-next v2 4/4] virtio-net: initial debugfs support, export mergeable rx buffer size
Hi Michael, On Sun, Jan 12, 2014 at 9:09 AM, Michael S. Tsirkin m...@redhat.com wrote: Can't we add struct attribute * to netdevice, and pass that in when creating the kobj? I like that idea, I think that will work and should be better than the alternatives. The actual kobjs for RX queues (struct netdev_rx_queue) are allocated and deallocated by calls to net_rx_queue_update_kobjects, which resizes RX queue kobjects when the netdev RX queues are resized. Is this what you had in mind: (1) Add a pointer to an attribute group to struct net_device, used for per-netdev rx queue attributes and initialized before the call to register_netdevice(). (2) Declare an attribute group containing the mergeable_rx_buffer_size attribute in virtio-net, and initialize the per-netdevice group pointer to the address of this group in virtnet_probe before register_netdevice (3) In net-sysfs, modify net_rx_queue_update_kobjects (or rx_queue_add_kobject) to call sysfs_create_group on the per-netdev attribute group (if non-NULL), adding the attributes in the group to the RX queue kobject. That should allow us to have per-RX queue attributes that are device-specific. I'm not a sysfs expert, but it seems that rx_queue_ktype and rx_queue_sysfs_ops presume that all rx queue sysfs operations are performed on attributes of type rx_queue_attribute. That type will need to be moved from net-sysfs.c to a header file like netdevice.h so that the type can be used in virtio-net when we declare the mergeable_rx_buffer_size attribute. The last issue is how the rx_queue_attribute 'show' function implementation for mergeable_rx_buffer_size will access the appropriate per-receive queue EWMA data. The arguments to the show function will be the netdev_rx_queue and the attribute itself. We can get to the struct net_device from the netdev_rx_queue. If we extended netdev_rx_queue to indicate the queue_index or to store a void *priv_data pointer, that would be sufficient to allow us to resolve this issue. 
Please let me know if the above sounds good or if you see a better way to accomplish this goal. Thanks!

Best,

Mike
Re: [PATCH net-next v2 4/4] virtio-net: initial debugfs support, export mergeable rx buffer size
Hi Jason, Michael Sorry for the delay in response. Jason, I agree this patch ended up being larger than expected. The major implementation parts are: (1) Setup directory structure (driver/per-netdev/rx-queue directories) (2) Network device renames (optional, so debugfs dir has the right name) (3) Support resizing the # of RX queues (optional - we could just export max_queue_pairs files and not delete files if an RX queue is disabled) (4) Reference counting - used in case someone opens a debugfs file and then removes the virtio-net device. (5) The actual mergeable rx buffer file implementation itself. For now I have added a seqcount for memory safety, but if a read-only race condition is acceptable we could elide the seqcount. FWIW, the seqcount write in receive_mergeable() should, on modern x86, translate to two non-atomic adds and two compiler barriers, so overhead is not expected to be meaningful. We can move to sysfs and this would simplify or eliminate much of the above, including most of (1) - (4). I believe our choices for what to do for the next patchset include: (a) Use debugfs as is currently done, removing any optional features listed above that are deemed unnecessary. (b) Add a per-netdev sysfs attribute group to net_device-sysfs_groups. Each attribute would display the mergeable packet buffer size for a given RX queue, and there would be max_queue_pairs attributes in total. This is already supported by net/core/net-sysfs.c:netdev_register_kobject(), but means that we would have a static set of per-RX queue files for all RX queues supported by the netdev, rather than dynamically displaying only the files corresponding to enabled RX queues (e.g., when # of RX queues is changed by ethtool -L device). For an example of this approach, see drivers/net/bonding/bond_sysfs.c. (c) Modify struct netdev_rx_queue to add virtio-net EWMA fields directly, and modify net-sysfs.c to manage the new fields. 
Unlike (b), this approach supports the RX queue resizing in (3) but means putting virtio-net info in netdev_rx_queue, which currently has only device-independent fields.

My preference would be (b): try using sysfs and adding a device-specific attribute group to the virtio-net netdevice (stored in the existing 'sysfs_groups' field and supported by net-sysfs). This would avoid adding virtio-net specific information to net-sysfs. What would you prefer (or is there a better way than the approaches above)? Thanks!

Best,

Mike
Re: [PATCH net-next v2 4/4] virtio-net: initial debugfs support, export mergeable rx buffer size
Also, one other note: if we use sysfs, the directory structure will be different depending on our chosen sysfs strategy.

If we augment netdev_rx_queue, the new attributes will be found in the standard 'rx-N' netdev subdirectory, e.g.,

    /sys/class/net/eth0/queues/rx-0/mergeable_rx_buffer_size

Whereas if we use per-netdev attributes, our attributes would be in /sys/class/net/eth0/<group name>/<attribute name>, which may be less intuitive as AFAICT we'd have to indicate both the queue # and type of value being reported using the attribute name. E.g.,

    /sys/class/net/eth0/virtio-net/rx-0_mergeable_buffer_size

That's somewhat less elegant. I don't see an easy way to add new attributes to the 'rx-N' subdirectories without directly modifying struct netdev_rx_queue, so I think this is another tradeoff between the two sysfs approaches.

Best,

Mike
Re: [PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance
Hi Michael,

Here's a quick sketch of some code that enforces a minimum buffer alignment of only 64, and has a maximum theoretical buffer size of aligned GOOD_PACKET_LEN + (BUF_ALIGN - 1) * BUF_ALIGN, which is at least 1536 + 63 * 64 = 5568. On x86, we already use a 64 byte alignment, and this code supports all current buffer sizes, from 1536 to PAGE_SIZE.

    #if L1_CACHE_BYTES < 64
    #define MERGEABLE_BUFFER_ALIGN 64
    #define MERGEABLE_BUFFER_SHIFT 6
    #else
    #define MERGEABLE_BUFFER_ALIGN L1_CACHE_BYTES
    #define MERGEABLE_BUFFER_SHIFT L1_CACHE_SHIFT
    #endif
    #define MERGEABLE_BUFFER_MIN ALIGN(GOOD_PACKET_LEN + \
            sizeof(struct virtio_net_hdr_mrg_rxbuf), \
            MERGEABLE_BUFFER_ALIGN)
    #define MERGEABLE_BUFFER_MAX min(MERGEABLE_BUFFER_MIN + \
            (MERGEABLE_BUFFER_ALIGN - 1) * MERGEABLE_BUFFER_ALIGN, \
            PAGE_SIZE)

    /* Extract buffer length from a mergeable buffer context. */
    static u16 get_mergeable_buf_ctx_len(void *ctx)
    {
            u16 len = (uintptr_t)ctx & (MERGEABLE_BUFFER_ALIGN - 1);

            return MERGEABLE_BUFFER_MIN + (len << MERGEABLE_BUFFER_SHIFT);
    }

    /* Extract buffer base address from a mergeable buffer context. */
    static void *get_mergeable_buf_ctx_base(void *ctx)
    {
            return (void *)((uintptr_t)ctx & -MERGEABLE_BUFFER_ALIGN);
    }

    /* Convert a base address and length to a mergeable buffer context. */
    static void *to_mergeable_buf_ctx(void *base, u16 len)
    {
            len -= MERGEABLE_BUFFER_MIN;
            return (void *)((uintptr_t)base | (len >> MERGEABLE_BUFFER_SHIFT));
    }

    /* Compute the packet buffer length for a receive queue. */
    static u16 get_mergeable_buffer_len(struct receive_queue *rq)
    {
            /* clamp_t takes the value first, then the lower and upper bounds. */
            u16 len = clamp_t(u16, ewma_read(&rq->avg_pkt_len),
                              MERGEABLE_BUFFER_MIN, MERGEABLE_BUFFER_MAX);

            return ALIGN(len, MERGEABLE_BUFFER_ALIGN);
    }

Best,

Mike
Re: [PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance
If the prior code snippet looks good to you, I'll use something like that as a baseline for a v3 patchset. I don't think we need a stricter alignment than 64 to express values in the range (1536 ... 4096), as the code snippet shows, which is great for x86 4KB pages.

On other architectures that have larger page sizes > 4KB with <= 64b cachelines, we may want to increase the alignment so that the max buffer size will be >= PAGE_SIZE (max size allowed by skb_page_frag_refill).

If we use a minimum alignment of 128, our maximum theoretical packet buffer length is 1536 + 127 * 128 = 17792. With 256 byte alignment, we can express a maximum packet buffer size > 65536.

Given the above, I think we want to select the min buffer alignment based on the PAGE_SIZE:

    <= 4KB PAGE_SIZE:  64b min alignment
    <= 16KB PAGE_SIZE: 128b min alignment
    > 16KB PAGE_SIZE:  256b min alignment

So the prior code snippet would be relatively unchanged, except that references to the previous minimum alignment of 64 would be replaced by a #define'd constant derived from PAGE_SIZE as shown above. This would guarantee that we use the minimum alignment necessary to ensure that virtio-net can post a max size (PAGE_SIZE) buffer, and for x86 this means we won't increase the alignment beyond the x86's current L1_CACHE_BYTES value (64).

Also, sorry I haven't had a chance to respond yet to the debugfs feedback, I will get to that soon (just wanted to do a further deep dive on some of the sysfs/debugfs tradeoffs).

Best,

Mike
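The arithmetic behind the alignment choice can be checked with a small sketch. It assumes the encoding from the earlier snippet in this thread (maximum length = aligned minimum + (align - 1) * align), a minimum unaligned length of 1518 + 12 bytes, and hypothetical sketch_ function names.

```c
/* Round len up to a multiple of align (align must be a power of two). */
static unsigned int sketch_align_up(unsigned int len, unsigned int align)
{
	return (len + align - 1) & ~(align - 1);
}

/*
 * Largest buffer length expressible when the low log2(align) bits of
 * the context hold (len - min_len) / align.
 */
static unsigned int sketch_max_buf_len(unsigned int align)
{
	/* Assumed GOOD_PACKET_LEN (1518) plus a 12-byte mergeable header. */
	unsigned int min_len = sketch_align_up(1518 + 12, align);

	return min_len + (align - 1) * align;
}

/* Smallest alignment whose maximum length still covers a full page. */
static unsigned int sketch_min_align_for_page(unsigned long page_size)
{
	if (page_size <= 4096)
		return 64;
	if (page_size <= 16384)
		return 128;
	return 256;
}
```

For 4KB, 16KB and 64KB pages this selects 64, 128 and 256 respectively, and in each case the maximum expressible length (5568, 17792, 66816) is at least the page size.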
Re: [PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance
Hi Michael,

Your improvements (code changes, more consistent naming, and use of 256-byte alignment only) all sound good to me. I will get started on a v3 patchset in conformance with your recommendations after sorting out what we want to do with the debugfs/sysfs issues.

I will follow up soon on the thread for patch 4/4 so we can close on what changes are needed for debugfs/sysfs. Thanks!

Best,

Mike
Re: [PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance
Hi Jason,

On Tue, Jan 7, 2014 at 10:23 PM, Jason Wang jasow...@redhat.com wrote:
> What's the reason that this extra space is not accounted for truesize?

The initial rationale was that this extra space is due to internal fragmentation in the page frag allocator, but I agree with you -- this code should be changed and the extra space accounted for. Any internal fragmentation leading to a larger last packet allocated from the page should be reflected in the SKB truesize of the last packet.

I will do a followup patchset that accounts correctly for the extra space, which will also allow me to remove the two max statements you indicated. Thanks for finding this issue.

> + if (err < 0) {
> +         put_page(virt_to_head_page(ctx->buf));
> +         return err;
> Should we also roll back the frag offset added above to avoid leaking frags?

I believe the put_page here is sufficient for correctness. When we allocate a buffer using skb_page_frag_refill, we use get_page/put_page to allocate/free respectively. For example, if the virtqueue_add_inbuf succeeded, we would eventually call put_page either in virtio-net (e.g., page_to_skb for packets <= GOOD_COPY_LEN bytes) or later in __skb_frag_unref and other functions called during dev_kfree_skb.

However, an offset rollback does allow the space to be reused by the next allocation, which could be a good optimization. I can do the offset rollback (with a put_page) in the next patchset. What do you think?

> + /* Do not attempt to add a buffer if the RX ring is full. */
> + if (unlikely(!rq->vq->num_free))
> +         return true;
> I haven't figured out why this is needed. It seems safe for
> virtqueue_add_inbuf() just fail in add_recv_xx()?

I think this is safe with one caveat -- we can't modify rq->mrg_buf_ctx until we know the ring isn't full (otherwise, we clobber an in-use entry). It is safe to modify rq->mrg_buf_ctx after we know that virtqueue_add_inbuf has succeeded.

I can remove the rq->vq->num_free check from try_fill_recv, and then modify virtqueue_add_inbuf to use a local mergeable_receive_buf_ctx. Once virtqueue_add_inbuf succeeds, the contents of the local variable can be copied to rq->mrg_buf_ctx[rq->mrg_buf_ctx_head].

Best,

Mike
Re: [PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance
Hi Eric, Michael,

On Wed, Jan 8, 2014 at 11:16 AM, Michael S. Tsirkin m...@redhat.com wrote:
> Why should we select a frame at random and make it's truesize bigger?
> All frames are to blame for the extra space.
> Just ignoring it seems more symmetrical.

Sounds good, based on Eric's feedback and Michael's feedback above, I will leave the 'extra space' handling as-is in the followup patchset and will not track the extra space in ctx->truesize. AFAICT, the two max() statements will need to remain (as buffer length may exceed ctx->truesize). Thanks for the feedback.

> If you intend to repost anyway (for the below wrinkle) then
> you can do it right here just as well I guess.
> Seems a bit prettier.

Will do.

> You don't have to fill in ctx before calling add_inbuf, do you?
> Just fill it afterwards.

Agreed, ctx does not need to be filled until after add_inbuf.

Best,

Mike
Re: [PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance
Hi Michael,

On Wed, Jan 8, 2014 at 5:42 PM, Michael S. Tsirkin m...@redhat.com wrote:
> Sorry that I didn't notice early, but there seems to be a bug here.
> See below.

Yes, that is definitely a bug. Virtio spec permits OOO completions, but current code assumes in-order completion. Thanks for catching this.

> Don't need full int really, it's up to 4K/cache line size,
> 1 byte would be enough, maximum 2 ...
> So if all we want is extra 1-2 bytes per buffer, we don't really
> need this extra level of indirection I think.
> We can just allocate them before the header together with an skb.

I'm not sure if I'm parsing the above correctly, but do you mean using a few bytes at the beginning of the packet buffer to store truesize? I think that will break Jason's virtio-net RX frag coalescing code. To coalesce consecutive RX packet buffers, our packet buffers must be physically adjacent, and any extra bytes before the start of the buffer would break that.

We could allocate an SKB per packet buffer, but if we have multi-buffer packets often (e.g., netperf benefiting from GSO/GRO), we would be allocating 1 SKB per packet buffer instead of 1 SKB per MAX_SKB_FRAGS buffers. How do you feel about any of the below alternatives:

(1) Modify the existing mrg_buf_ctx to chain together free entries
We can use the 'buf' pointer in mergeable_receive_buf_ctx to chain together free entries so that we can support OOO completions. This would be similar to how virtio-queue manages free sg entries.

(2) Combine the buffer pointer and truesize into a single void* value
Your point about there only being a byte needed to encode truesize is spot on, and I think we could leverage this to eliminate the out-of-band metadata ring entirely. If we were willing to change the packet buffer alignment from L1_CACHE_BYTES to 256 (or max(256, L1_CACHE_BYTES)), we could encode the truesize in the least significant 8 bits of the buffer address (encoded as truesize >> 8 as we know all sizes are a multiple of 256). This would allow packet buffers up to 64KB in length.

Is there another approach you would prefer to any of these? If the cleanliness issues and larger alignment aren't too bad, I think (2) sounds promising and would allow us to eliminate the metadata ring entirely while still permitting RX frag coalescing.

Best,

Mike
Re: [PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance
Sorry, forgot to mention - if we want to explore combining the buffer address and truesize into a single void *, we could also exploit the fact that our size ranges from aligned GOOD_PACKET_LEN to PAGE_SIZE, and potentially encode fewer values for truesize (and require a smaller alignment than 256). The prior e-mail's discussion of 256 byte alignment with 256 values is just one potential design point.

Best,

Mike
[PATCH net-next v2 2/4] virtio-net: use per-receive queue page frag alloc for mergeable bufs
The virtio-net driver currently uses netdev_alloc_frag() for GFP_ATOMIC
mergeable rx buffer allocations. This commit migrates virtio-net to use
per-receive queue page frags for GFP_ATOMIC allocation. This change unifies
mergeable rx buffer memory allocation, which now will use
skb_page_frag_refill() for both atomic and GFP-WAIT buffer allocations.

To address fragmentation concerns, if after buffer allocation there is too
little space left in the page frag to allocate a subsequent buffer, the
remaining space is added to the current allocated buffer so that the
remaining space can be used to store packet data.

Signed-off-by: Michael Dalton <mwdal...@google.com>
---
v2: Use GFP_COLD for RX buffer allocations (as in netdev_alloc_frag()).
    Remove per-netdev GFP_KERNEL page_frag allocator.

 drivers/net/virtio_net.c | 69
 1 file changed, 35 insertions(+), 34 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index c51a988..526dfd8 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -78,6 +78,9 @@ struct receive_queue {
 	/* Chain pages by the private ptr. */
 	struct page *pages;
 
+	/* Page frag for packet buffer allocation. */
+	struct page_frag alloc_frag;
+
 	/* RX: fragments + linear part + virtio header */
 	struct scatterlist sg[MAX_SKB_FRAGS + 2];
 
@@ -126,11 +129,6 @@ struct virtnet_info {
 	/* Lock for config space updates */
 	struct mutex config_lock;
 
-	/* Page_frag for GFP_KERNEL packet buffer allocation when we run
-	 * low on memory.
-	 */
-	struct page_frag alloc_frag;
-
 	/* Does the affinity hint is set for virtqueues? */
 	bool affinity_hint_set;
 
@@ -336,8 +334,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 	int num_buf = hdr->mhdr.num_buffers;
 	struct page *page = virt_to_head_page(buf);
 	int offset = buf - page_address(page);
-	struct sk_buff *head_skb = page_to_skb(rq, page, offset, len,
-					       MERGE_BUFFER_LEN);
+	unsigned int truesize = max_t(unsigned int, len, MERGE_BUFFER_LEN);
+	struct sk_buff *head_skb = page_to_skb(rq, page, offset, len, truesize);
 	struct sk_buff *curr_skb = head_skb;
 
 	if (unlikely(!curr_skb))
@@ -353,11 +351,6 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 			dev->stats.rx_length_errors++;
 			goto err_buf;
 		}
-		if (unlikely(len > MERGE_BUFFER_LEN)) {
-			pr_debug("%s: rx error: merge buffer too long\n",
-				 dev->name);
-			len = MERGE_BUFFER_LEN;
-		}
 
 		page = virt_to_head_page(buf);
 		--rq->num;
@@ -376,19 +369,20 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 			head_skb->truesize += nskb->truesize;
 			num_skb_frags = 0;
 		}
+		truesize = max_t(unsigned int, len, MERGE_BUFFER_LEN);
 		if (curr_skb != head_skb) {
 			head_skb->data_len += len;
 			head_skb->len += len;
-			head_skb->truesize += MERGE_BUFFER_LEN;
+			head_skb->truesize += truesize;
 		}
 		offset = buf - page_address(page);
 		if (skb_can_coalesce(curr_skb, num_skb_frags, page, offset)) {
 			put_page(page);
 			skb_coalesce_rx_frag(curr_skb, num_skb_frags - 1,
-					     len, MERGE_BUFFER_LEN);
+					     len, truesize);
 		} else {
 			skb_add_rx_frag(curr_skb, num_skb_frags, page,
-					offset, len, MERGE_BUFFER_LEN);
+					offset, len, truesize);
 		}
 	}
 
@@ -578,25 +572,24 @@ static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
 
 static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
 {
-	struct virtnet_info *vi = rq->vq->vdev->priv;
-	char *buf = NULL;
+	struct page_frag *alloc_frag = &rq->alloc_frag;
+	char *buf;
 	int err;
+	unsigned int len, hole;
 
-	if (gfp & __GFP_WAIT) {
-		if (skb_page_frag_refill(MERGE_BUFFER_LEN, &vi->alloc_frag,
-					 gfp)) {
-			buf = (char *)page_address(vi->alloc_frag.page) +
-			      vi->alloc_frag.offset;
-			get_page(vi->alloc_frag.page);
-			vi->alloc_frag.offset += MERGE_BUFFER_LEN;
-		}
-	} else {
-		buf = netdev_alloc_frag(MERGE_BUFFER_LEN);
-	}
-	if (!buf)
+	if (unlikely
[PATCH net-next v2 1/4] net: allow 0 order atomic page alloc in skb_page_frag_refill
skb_page_frag_refill currently permits only order-0 page allocs unless
GFP_WAIT is used. Change skb_page_frag_refill to attempt higher-order page
allocations whether or not GFP_WAIT is used. If memory cannot be allocated,
the allocator will fall back to successively smaller page allocs (down to
order-0 page allocs).

This change brings skb_page_frag_refill in line with the existing page
allocation strategy employed by netdev_alloc_frag, which attempts
higher-order page allocations whether or not GFP_WAIT is set, falling back
to successively lower-order page allocations on failure. Part of migration
of virtio-net to per-receive queue page frag allocators.

Acked-by: Michael S. Tsirkin <m...@redhat.com>
Acked-by: Eric Dumazet <eduma...@google.com>
Signed-off-by: Michael Dalton <mwdal...@google.com>
---
 net/core/sock.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 5393b4b..a0d522a 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1865,9 +1865,7 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio)
 		put_page(pfrag->page);
 	}
 
-	/* We restrict high order allocations to users that can afford to wait */
-	order = (prio & __GFP_WAIT) ? SKB_FRAG_PAGE_ORDER : 0;
-
+	order = SKB_FRAG_PAGE_ORDER;
 	do {
 		gfp_t gfp = prio;
 
-- 
1.8.5.1
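
The fallback strategy described in this commit message can be sketched in userspace C. This is only an illustration of "start at a high order, step down to order 0 on failure"; `fake_allocator`, `fake_alloc_pages` and `alloc_with_fallback` are hypothetical names that stand in for the real page allocator and are not part of the kernel.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the page allocator: an allocation succeeds
 * only when the requested order is at or below max_available_order. */
struct fake_allocator {
	int max_available_order;
};

static bool fake_alloc_pages(const struct fake_allocator *a, int order)
{
	return order <= a->max_available_order;
}

/* Try start_order first, then successively smaller orders down to 0.
 * Returns the order that succeeded, or -1 if even order 0 failed. */
static int alloc_with_fallback(const struct fake_allocator *a, int start_order)
{
	int order;

	for (order = start_order; order >= 0; order--) {
		if (fake_alloc_pages(a, order))
			return order;
	}
	return -1;
}
```

With this patch the starting order no longer depends on whether the caller may sleep; only the fallback loop decides how large the resulting allocation is.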
[PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance
Commit 2613af0ed18a (virtio_net: migrate mergeable rx buffers to page frag
allocators) changed the mergeable receive buffer size from PAGE_SIZE to
MTU-size, introducing a single-stream regression for benchmarks with large
average packet size. There is no single optimal buffer size for all
workloads. For workloads with packet size <= MTU bytes, MTU + virtio-net
header-sized buffers are preferred as larger buffers reduce the TCP window
due to SKB truesize. However, single-stream workloads with large average
packet sizes have higher throughput if larger (e.g., PAGE_SIZE) buffers
are used.

This commit auto-tunes the mergeable receiver buffer packet size by
choosing the packet buffer size based on an EWMA of the recent packet
sizes for the receive queue. Packet buffer sizes range from MTU_SIZE +
virtio-net header len to PAGE_SIZE. This improves throughput for large
packet workloads, as any workload with average packet size >= PAGE_SIZE
will use PAGE_SIZE buffers.

These optimizations interact positively with recent commit ba275241030c
(virtio-net: coalesce rx frags when possible during rx), which coalesces
adjacent RX SKB fragments in virtio_net. The coalescing optimizations
benefit buffers of any size.

Benchmarks taken from an average of 5 netperf 30-second TCP_STREAM runs
between two QEMU VMs on a single physical machine. Each VM has two VCPUs
with all offloads & vhost enabled. All VMs and vhost threads run in a
single 4 CPU cgroup cpuset, using cgroups to ensure that other processes
in the system will not be scheduled on the benchmark CPUs. Trunk includes
SKB rx frag coalescing.

net-next w/ virtio_net before 2613af0ed18a (PAGE_SIZE bufs): 14642.85Gb/s
net-next (MTU-size bufs): 13170.01Gb/s
net-next + auto-tune: 14555.94Gb/s

Jason Wang also reported a throughput increase on mlx4 from 22Gb/s using
MTU-sized buffers to about 26Gb/s using auto-tuning.

Signed-off-by: Michael Dalton <mwdal...@google.com>
---
v2: Add per-receive queue metadata ring to track precise truesize for
    mergeable receive buffers. Remove all truesize approximation. Never
    try to fill a full RX ring (required for metadata ring in v2).

 drivers/net/virtio_net.c | 145 ++-
 1 file changed, 107 insertions(+), 38 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 526dfd8..f6e1ee0 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -26,6 +26,7 @@
 #include <linux/if_vlan.h>
 #include <linux/slab.h>
 #include <linux/cpu.h>
+#include <linux/average.h>
 
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
@@ -36,11 +37,15 @@ module_param(gso, bool, 0444);
 
 /* FIXME: MTU in config. */
 #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
-#define MERGE_BUFFER_LEN (ALIGN(GOOD_PACKET_LEN + \
-                                sizeof(struct virtio_net_hdr_mrg_rxbuf), \
-                                L1_CACHE_BYTES))
 #define GOOD_COPY_LEN	128
 
+/* Weight used for the RX packet size EWMA. The average packet size is used to
+ * determine the packet buffer size when refilling RX rings. As the entire RX
+ * ring may be refilled at once, the weight is chosen so that the EWMA will be
+ * insensitive to short-term, transient changes in packet size.
+ */
+#define RECEIVE_AVG_WEIGHT 64
+
 #define VIRTNET_DRIVER_VERSION "1.0.0"
 
 struct virtnet_stats {
@@ -65,11 +70,30 @@ struct send_queue {
 	char name[40];
 };
 
+/* Per-packet buffer context for mergeable receive buffers. */
+struct mergeable_receive_buf_ctx {
+	/* Packet buffer base address. */
+	void *buf;
+
+	/* Original size of the packet buffer for use in SKB truesize. Does not
+	 * include any padding space used to avoid internal fragmentation.
+	 */
+	unsigned int truesize;
+};
+
 /* Internal representation of a receive virtqueue */
 struct receive_queue {
 	/* Virtqueue associated with this receive_queue */
 	struct virtqueue *vq;
 
+	/* Circular buffer of mergeable rxbuf contexts. */
+	struct mergeable_receive_buf_ctx *mrg_buf_ctx;
+
+	/* Number of elements & head index of mrg_buf_ctx. Size must be
+	 * equal to the associated virtqueue's vring size.
+	 */
+	unsigned int mrg_buf_ctx_size, mrg_buf_ctx_head;
+
 	struct napi_struct napi;
 
 	/* Number of input buffers, and max we've ever had. */
@@ -78,6 +102,9 @@ struct receive_queue {
 	/* Chain pages by the private ptr. */
 	struct page *pages;
 
+	/* Average packet length for mergeable receive buffers. */
+	struct ewma mrg_avg_pkt_len;
+
 	/* Page frag for packet buffer allocation. */
 	struct page_frag alloc_frag;
 
@@ -327,32 +354,32 @@ err:
 
 static struct sk_buff *receive_mergeable(struct net_device *dev,
 					 struct receive_queue *rq,
-					 void *buf,
+					 struct
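
The sizing rule from the commit message above (clamp the average packet length between GOOD_PACKET_LEN and PAGE_SIZE minus the header, add the header, align to a cache line) can be tried out in userspace. The sketch below is an approximation under stated assumptions: a 4096-byte page, a 64-byte cache line, a 12-byte mergeable header, and a simple integer moving average (`pkt_len_avg`, `avg_add`, `avg_read` are hypothetical helpers, not the kernel's `struct ewma` API, whose internal arithmetic differs).

```c
/* Assumed values for illustration only. */
#define SKETCH_PAGE_SIZE       4096UL
#define SKETCH_L1_CACHE_BYTES  64UL
#define SKETCH_GOOD_PACKET_LEN 1518UL /* 14 (ETH_HLEN) + 4 (VLAN_HLEN) + 1500 */
#define SKETCH_MRG_HDR_LEN     12UL   /* assumed size of the mergeable rx header */
#define SKETCH_AVG_WEIGHT      64UL   /* mirrors RECEIVE_AVG_WEIGHT */

/* Hypothetical moving average: stores average * weight to keep precision. */
struct pkt_len_avg {
	unsigned long scaled;
};

static void avg_add(struct pkt_len_avg *a, unsigned long val)
{
	if (a->scaled == 0)
		a->scaled = val * SKETCH_AVG_WEIGHT; /* first sample seeds the average */
	else
		a->scaled = a->scaled - a->scaled / SKETCH_AVG_WEIGHT + val;
}

static unsigned long avg_read(const struct pkt_len_avg *a)
{
	return a->scaled / SKETCH_AVG_WEIGHT;
}

/* Buffer length: header + clamped average, rounded up to a cache line. */
static unsigned long mergeable_buf_len(const struct pkt_len_avg *a)
{
	unsigned long avg = avg_read(a);
	unsigned long lo = SKETCH_GOOD_PACKET_LEN;
	unsigned long hi = SKETCH_PAGE_SIZE - SKETCH_MRG_HDR_LEN;
	unsigned long len;

	if (avg < lo)
		avg = lo;
	if (avg > hi)
		avg = hi;
	len = SKETCH_MRG_HDR_LEN + avg;
	return (len + SKETCH_L1_CACHE_BYTES - 1) & ~(SKETCH_L1_CACHE_BYTES - 1);
}
```

Under these assumptions a queue that has only seen MTU-sized packets gets 1536-byte buffers, and a single 64K packet arriving after a 1500-byte one moves the average only to about 2500 bytes, which is the damping effect the weight of 64 is meant to provide.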
[PATCH net-next v2 4/4] virtio-net: initial debugfs support, export mergeable rx buffer size
Add initial support for debugfs to virtio-net. Each virtio-net network
device will have a directory under /virtio-net in debugfs. The per-network
device directory will contain one sub-directory per active, enabled
receive queue. If mergeable receive buffers are enabled, each receive
queue directory will contain a read-only file that returns the current
packet buffer size for the receive queue.

Signed-off-by: Michael Dalton <mwdal...@google.com>
---
 drivers/net/virtio_net.c | 314 ---
 1 file changed, 296 insertions(+), 18 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index f6e1ee0..5da18d6 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -27,6 +27,9 @@
 #include <linux/slab.h>
 #include <linux/cpu.h>
 #include <linux/average.h>
+#include <linux/seqlock.h>
+#include <linux/kref.h>
+#include <linux/debugfs.h>
 
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
@@ -35,6 +38,9 @@ static bool csum = true, gso = true;
 module_param(csum, bool, 0444);
 module_param(gso, bool, 0444);
 
+/* Debugfs root directory for all virtio-net devices. */
+static struct dentry *virtnet_debugfs_root;
+
 /* FIXME: MTU in config. */
 #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
 #define GOOD_COPY_LEN	128
@@ -102,9 +108,6 @@ struct receive_queue {
 	/* Chain pages by the private ptr. */
 	struct page *pages;
 
-	/* Average packet length for mergeable receive buffers. */
-	struct ewma mrg_avg_pkt_len;
-
 	/* Page frag for packet buffer allocation. */
 	struct page_frag alloc_frag;
 
@@ -115,6 +118,28 @@ struct receive_queue {
 	char name[40];
 };
 
+/* Per-receive queue statistics exported via debugfs. */
+struct receive_queue_stats {
+	/* Average packet length of receive queue (for mergeable rx buffers). */
+	struct ewma avg_pkt_len;
+
+	/* Per-receive queue stats debugfs directory. */
+	struct dentry *dbg;
+
+	/* Reference count for the receive queue statistics, needed because
+	 * an open debugfs file may outlive the receive queue and netdevice.
+	 * Open files will remain in-use until all outstanding file descriptors
+	 * are closed, even after the underlying file is unlinked.
+	 */
+	struct kref refcount;
+
+	/* Sequence counter to allow debugfs readers to safely access stats.
+	 * Assumes a single virtio-net writer, which is enforced by virtio-net
+	 * and NAPI.
+	 */
+	seqcount_t dbg_seq;
+};
+
 struct virtnet_info {
 	struct virtio_device *vdev;
 	struct virtqueue *cvq;
@@ -147,6 +172,15 @@ struct virtnet_info {
 	/* Active statistics */
 	struct virtnet_stats __percpu *stats;
 
+	/* Per-receive queue statstics exported via debugfs. Stored in
+	 * virtnet_info to survive freeze/restore -- a task may have a per-rq
+	 * debugfs file open at the time of freeze.
+	 */
+	struct receive_queue_stats **rq_stats;
+
+	/* Per-netdevice debugfs directory. */
+	struct dentry *dbg_dev_root;
+
 	/* Work struct for refilling if we run low on memory. */
 	struct delayed_work refill;
 
@@ -358,6 +392,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 					 unsigned int len)
 {
 	struct skb_vnet_hdr *hdr = ctx->buf;
+	struct virtnet_info *vi = netdev_priv(dev);
+	struct receive_queue_stats *rq_stats = vi->rq_stats[vq2rxq(rq->vq)];
 	int num_buf = hdr->mhdr.num_buffers;
 	struct page *page = virt_to_head_page(ctx->buf);
 	int offset = ctx->buf - page_address(page);
@@ -413,7 +449,9 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 		}
 	}
 
-	ewma_add(&rq->mrg_avg_pkt_len, head_skb->len);
+	write_seqcount_begin(&rq_stats->dbg_seq);
+	ewma_add(&rq_stats->avg_pkt_len, head_skb->len);
+	write_seqcount_end(&rq_stats->dbg_seq);
 	return head_skb;
 
 err_skb:
@@ -600,18 +638,30 @@ static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
 	return err;
 }
 
+static unsigned int get_mergeable_buf_len(struct ewma *avg_pkt_len)
+{
+	const size_t hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+	unsigned int len;
+
+	len = hdr_len + clamp_t(unsigned int, ewma_read(avg_pkt_len),
+			GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
+	return ALIGN(len, L1_CACHE_BYTES);
+}
+
 static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
 {
 	const unsigned int ring_size = rq->mrg_buf_ctx_size;
-	const size_t hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
 	struct page_frag *alloc_frag = &rq->alloc_frag;
+	struct virtnet_info *vi = rq->vq->vdev->priv;
 	struct mergeable_receive_buf_ctx *ctx;
 	int err;
 	unsigned int len, hole;
 
-	len = hdr_len + clamp_t(unsigned
Re: [PATCH net-next 3/3] net: auto-tune mergeable rx buffer size for improved performance
I'm working on a followup patchset to address current feedback. I think it
will be cleaner to do a debugfs implementation for per-receive queue packet
buffer size exporting, so I'm trying that out.

On Thu, Dec 26, 2013 at 7:04 PM, Jason Wang <jasow...@redhat.com> wrote:
> We can make this more accurate by using extra data structure to track
> the real buf size and using it as token.

I agree -- we can do precise buffer total len tracking. Something like

struct mergeable_packet_buffer_ctx {
	void *buf;
	unsigned int total_len;
};

Each receive queue could have a pointer to an array of N buffer contexts,
where N is queue size (kzalloc'd in init_vqs or similar). That would allow
us to allocate all of our buffer context data at startup.

Would this be preferred to the current approach or is there another
approach you would prefer? All other things being equal, having precise
length tracking is advantageous, so I'm inclined to try this out and see
how it goes.

I think this is a big design point - for example, if we have an extra
buffer context structure, then per-receive queue frag allocators are not
required for auto-tuning and we can reduce the number of patches in this
patchset.

I'm happy to implement either way. Thanks!

Best,

Mike
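
A userspace sketch of the proposal above: one array of N buffer contexts per receive queue, allocated once up front and used as a circular buffer. The struct is the one from the message; `ctx_ring`, `ctx_ring_init`, `ctx_ring_record` and `ctx_ring_free` are hypothetical names, and `calloc` stands in for the kzalloc mentioned in the text.

```c
#include <stdlib.h>

struct mergeable_packet_buffer_ctx {
	void *buf;
	unsigned int total_len;
};

/* Hypothetical per-receive-queue ring of buffer contexts. */
struct ctx_ring {
	struct mergeable_packet_buffer_ctx *ctx;
	unsigned int size; /* N, the queue size */
	unsigned int head; /* next slot to fill */
};

/* Allocate all N contexts at startup. Returns 0 on success, -1 on error. */
static int ctx_ring_init(struct ctx_ring *r, unsigned int queue_size)
{
	r->ctx = NULL;
	r->size = 0;
	r->head = 0;
	if (queue_size == 0)
		return -1;
	r->ctx = calloc(queue_size, sizeof(*r->ctx));
	if (!r->ctx)
		return -1;
	r->size = queue_size;
	return 0;
}

/* Record the exact length of a buffer as it is posted to the queue. */
static struct mergeable_packet_buffer_ctx *
ctx_ring_record(struct ctx_ring *r, void *buf, unsigned int total_len)
{
	struct mergeable_packet_buffer_ctx *c = &r->ctx[r->head];

	c->buf = buf;
	c->total_len = total_len;
	r->head = (r->head + 1) % r->size;
	return c;
}

static void ctx_ring_free(struct ctx_ring *r)
{
	free(r->ctx);
	r->ctx = NULL;
	r->size = 0;
	r->head = 0;
}
```

Because a slot is reused after N posts, a scheme like this only works if the queue never holds more than N outstanding buffers.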
Re: [PATCH net-next 3/3] net: auto-tune mergeable rx buffer size for improved performance
On Mon, Dec 23, 2013 at 4:51 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> OK so a high level benchmark shows it's worth it, but how well does the
> logic work? I think we should make the buffer size accessible in sysfs
> or debugfs, and look at it, otherwise we don't really know.

Exporting the size sounds good to me, it is definitely an important metric
and would give more visibility to the admin.

Do you have a preference for implementation strategy? I was thinking just
add a DEVICE_ATTR to create a read-only sysfs file,
'mergeable_rx_buffer_size', and return a space-separated list of the
current buffer size (computed from the average packet size) for each
receive queue. -EINVAL or a similar error could be returned if the netdev
was not configured for mergeable rx buffers.

> I don't get the real motivation for this.
>
> We have skbs A,B,C sharing a page, with chunk D being unused.
> This randomly charges chunk D to an skb that ended up last
> in the page. Correct? Why does this make sense?

The intent of this code is to adjust the SKB true size for the packet. We
should completely use each packet buffer except for the last buffer. For
all buffers except the last buffer, it should be the case that 'len'
(bytes received) = buffer size. For the last buffer, this code adjusts the
truesize by comparing the approximated buffer size with the bytes received
into the buffer, and adding the difference to the SKB truesize if the
buffer size is greater than the number of bytes received.

We approximate the buffer size by using the last packet buffer size from
that same page, which as you have correctly noted may be a buffer that
belongs to a different packet on the same virtio-net device. This buffer
size should be very close to the actual buffer size because our EWMA
estimator uses a high weight (so the packet buffer size changes very
slowly) and there are only a handful of packets on a page (even order-3).

> Why head_skb only? Why not full buffer size that comes from host?
> This is simply len.

Sorry, I believe this code fragment should be clearer. Basically, we have
a corner case in that for packets with size <= GOOD_COPY_LEN, there are no
frags because page_to_skb() already unref'd the page and the entire packet
contents are copied to skb->data. In this case, the SKB truesize is
already accurate and should not be updated (and it would be unsafe to
access page->private as page is already unref'd).

I'll look at the above code again and cleanup (please let me know if you
have a preference) and/or add a comment to clarify.

Best,

Mike
Re: [PATCH net-next 2/3] virtio-net: use per-receive queue page frag alloc for mergeable bufs
On Mon, Dec 23, 2013 at 11:37 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> So there isn't a conflict with respect to locking.
>
> Is it problematic to use same page_frag with both GFP_ATOMIC and with
> GFP_KERNEL? If yes why?

I believe it is safe to use the same page_frag and I will send out a
followup patchset using just the per-receive page_frags. For future
consideration, Eric noted that disabling NAPI before GFP_KERNEL allocs can
potentially inhibit virtio-net network processing for some time (e.g.,
during a blocking memory allocation or preemption).

Best,

Mike
Re: [PATCH stable 1/2] virtio_net: fix error handling for mergeable buffers
Hi Michael, quick question below:

On Wed, Dec 25, 2013 at 6:56 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
>  	if (i >= MAX_SKB_FRAGS) {
>  		pr_debug("%s: packet too long\n", skb->dev->name);
>  		skb->dev->stats.rx_length_errors++;
> -		return -EINVAL;
> +		return NULL;
>  	}

Should this error handling path free the SKB before returning NULL? It
seems like if we just return NULL we may leak memory.

Best,

Mike
Re: [PATCH stable 1/2] virtio_net: fix error handling for mergeable buffers
Acked-by: Michael Dalton mwdal...@google.com
Re: [PATCH stable] virtio_net: don't leak memory or block when too many frags
Acked-by: Michael Dalton mwdal...@google.com
[PATCH net-next 2/3] virtio-net: use per-receive queue page frag alloc for mergeable bufs
The virtio-net driver currently uses netdev_alloc_frag() for GFP_ATOMIC
mergeable rx buffer allocations. This commit migrates virtio-net to use
per-receive queue page frags for GFP_ATOMIC allocation. This change unifies
mergeable rx buffer memory allocation, which now will use
skb_page_frag_refill() for both atomic and GFP-WAIT buffer allocations.

To address fragmentation concerns, if after buffer allocation there is too
little space left in the page frag to allocate a subsequent buffer, the
remaining space is added to the current allocated buffer so that the
remaining space can be used to store packet data.

Signed-off-by: Michael Dalton <mwdal...@google.com>
---
 drivers/net/virtio_net.c | 69 ++--
 1 file changed, 38 insertions(+), 31 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index c51a988..d38d130 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -78,6 +78,9 @@ struct receive_queue {
 	/* Chain pages by the private ptr. */
 	struct page *pages;
 
+	/* Page frag for GFP_ATOMIC packet buffer allocation. */
+	struct page_frag atomic_frag;
+
 	/* RX: fragments + linear part + virtio header */
 	struct scatterlist sg[MAX_SKB_FRAGS + 2];
 
@@ -127,9 +130,9 @@ struct virtnet_info {
 	struct mutex config_lock;
 
 	/* Page_frag for GFP_KERNEL packet buffer allocation when we run
-	 * low on memory.
+	 * low on memory. May sleep.
 	 */
-	struct page_frag alloc_frag;
+	struct page_frag sleep_frag;
 
 	/* Does the affinity hint is set for virtqueues? */
 	bool affinity_hint_set;
@@ -336,8 +339,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 	int num_buf = hdr->mhdr.num_buffers;
 	struct page *page = virt_to_head_page(buf);
 	int offset = buf - page_address(page);
-	struct sk_buff *head_skb = page_to_skb(rq, page, offset, len,
-					       MERGE_BUFFER_LEN);
+	int truesize = max_t(int, len, MERGE_BUFFER_LEN);
+	struct sk_buff *head_skb = page_to_skb(rq, page, offset, len, truesize);
 	struct sk_buff *curr_skb = head_skb;
 
 	if (unlikely(!curr_skb))
@@ -353,11 +356,6 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 			dev->stats.rx_length_errors++;
 			goto err_buf;
 		}
-		if (unlikely(len > MERGE_BUFFER_LEN)) {
-			pr_debug("%s: rx error: merge buffer too long\n",
-				 dev->name);
-			len = MERGE_BUFFER_LEN;
-		}
 
 		page = virt_to_head_page(buf);
 		--rq->num;
@@ -376,19 +374,20 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 			head_skb->truesize += nskb->truesize;
 			num_skb_frags = 0;
 		}
+		truesize = max_t(int, len, MERGE_BUFFER_LEN);
 		if (curr_skb != head_skb) {
 			head_skb->data_len += len;
 			head_skb->len += len;
-			head_skb->truesize += MERGE_BUFFER_LEN;
+			head_skb->truesize += truesize;
 		}
 		offset = buf - page_address(page);
 		if (skb_can_coalesce(curr_skb, num_skb_frags, page, offset)) {
 			put_page(page);
 			skb_coalesce_rx_frag(curr_skb, num_skb_frags - 1,
-					     len, MERGE_BUFFER_LEN);
+					     len, truesize);
 		} else {
 			skb_add_rx_frag(curr_skb, num_skb_frags, page,
-					offset, len, MERGE_BUFFER_LEN);
+					offset, len, truesize);
 		}
 	}
 
@@ -579,24 +578,24 @@ static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
 static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
 {
 	struct virtnet_info *vi = rq->vq->vdev->priv;
-	char *buf = NULL;
-	int err;
+	struct page_frag *alloc_frag;
+	char *buf;
+	int err, len, hole;
 
-	if (gfp & __GFP_WAIT) {
-		if (skb_page_frag_refill(MERGE_BUFFER_LEN, &vi->alloc_frag,
-					 gfp)) {
-			buf = (char *)page_address(vi->alloc_frag.page) +
-			      vi->alloc_frag.offset;
-			get_page(vi->alloc_frag.page);
-			vi->alloc_frag.offset += MERGE_BUFFER_LEN;
-		}
-	} else {
-		buf = netdev_alloc_frag(MERGE_BUFFER_LEN);
-	}
-	if (!buf)
+	alloc_frag = (gfp & __GFP_WAIT) ? &vi->sleep_frag : &rq->atomic_frag;
+	if (unlikely(!skb_page_frag_refill(MERGE_BUFFER_LEN, alloc_frag, gfp)))
 		return
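
The fragmentation handling described in the commit message above (fold a too-small tail of the page frag into the buffer just allocated) can be modelled with plain byte accounting. This is a sketch only: `frag_model` and `carve_buffer` are hypothetical names, and the excerpt above is cut off before the patch's own version of this logic.

```c
/* Simplified model of a page frag: only the byte accounting. */
struct frag_model {
	unsigned int offset; /* bytes already handed out */
	unsigned int size;   /* total bytes in the frag */
};

/* Carve one receive buffer of buf_len bytes out of the frag. The caller
 * must have ensured (as a refill would) that at least buf_len bytes
 * remain. Returns the length actually assigned to the buffer. */
static unsigned int carve_buffer(struct frag_model *f, unsigned int buf_len)
{
	unsigned int len = buf_len;
	unsigned int hole;

	f->offset += len;
	hole = f->size - f->offset;
	if (hole < buf_len) {
		/* Too small for another buffer: give it to this one. */
		len += hole;
		f->offset += hole;
	}
	return len;
}
```

For example, a 32768-byte frag carved into 1536-byte buffers yields 20 buffers of 1536 bytes and a 21st of 2048 bytes, leaving no unusable tail.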
[PATCH net-next 1/3] net: allow 0 order atomic page alloc in skb_page_frag_refill
skb_page_frag_refill currently permits only order-0 page allocs unless
GFP_WAIT is used. Change skb_page_frag_refill to attempt higher-order page
allocations whether or not GFP_WAIT is used. If memory cannot be allocated,
the allocator will fall back to successively smaller page allocs (down to
order-0 page allocs).

This change brings skb_page_frag_refill in line with the existing page
allocation strategy employed by netdev_alloc_frag, which attempts
higher-order page allocations whether or not GFP_WAIT is set, falling back
to successively lower-order page allocations on failure. Part of migration
of virtio-net to per-receive queue page frag allocators.

Signed-off-by: Michael Dalton <mwdal...@google.com>
---
 net/core/sock.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index ab20ed9..7383d23 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1865,9 +1865,7 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio)
 		put_page(pfrag->page);
 	}
 
-	/* We restrict high order allocations to users that can afford to wait */
-	order = (prio & __GFP_WAIT) ? SKB_FRAG_PAGE_ORDER : 0;
-
+	order = SKB_FRAG_PAGE_ORDER;
 	do {
 		gfp_t gfp = prio;
 
-- 
1.8.5.1
[PATCH net-next 3/3] net: auto-tune mergeable rx buffer size for improved performance
Commit 2613af0ed18a (virtio_net: migrate mergeable rx buffers to page frag
allocators) changed the mergeable receive buffer size from PAGE_SIZE to
MTU-size, introducing a single-stream regression for benchmarks with large
average packet size. There is no single optimal buffer size for all
workloads. For workloads with packet size <= MTU bytes, MTU + virtio-net
header-sized buffers are preferred as larger buffers reduce the TCP window
due to SKB truesize. However, single-stream workloads with large average
packet sizes have higher throughput if larger (e.g., PAGE_SIZE) buffers
are used.

This commit auto-tunes the mergeable receiver buffer packet size by
choosing the packet buffer size based on an EWMA of the recent packet
sizes for the receive queue. Packet buffer sizes range from MTU_SIZE +
virtio-net header len to PAGE_SIZE. This improves throughput for large
packet workloads, as any workload with average packet size >= PAGE_SIZE
will use PAGE_SIZE buffers.

These optimizations interact positively with recent commit ba275241030c
(virtio-net: coalesce rx frags when possible during rx), which coalesces
adjacent RX SKB fragments in virtio_net. The coalescing optimizations
benefit buffers of any size.

Benchmarks taken from an average of 5 netperf 30-second TCP_STREAM runs
between two QEMU VMs on a single physical machine. Each VM has two VCPUs
with all offloads & vhost enabled. All VMs and vhost threads run in a
single 4 CPU cgroup cpuset, using cgroups to ensure that other processes
in the system will not be scheduled on the benchmark CPUs. Trunk includes
SKB rx frag coalescing.

net-next w/ virtio_net before 2613af0ed18a (PAGE_SIZE bufs): 14642.85Gb/s
net-next (MTU-size bufs): 13170.01Gb/s
net-next + auto-tune: 14555.94Gb/s

Signed-off-by: Michael Dalton <mwdal...@google.com>
---
 drivers/net/virtio_net.c | 63 +++-
 1 file changed, 46 insertions(+), 17 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index d38d130..904af37 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -26,6 +26,7 @@
 #include <linux/if_vlan.h>
 #include <linux/slab.h>
 #include <linux/cpu.h>
+#include <linux/average.h>
 
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
@@ -36,11 +37,15 @@ module_param(gso, bool, 0444);
 
 /* FIXME: MTU in config. */
 #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
-#define MERGE_BUFFER_LEN (ALIGN(GOOD_PACKET_LEN + \
-                                sizeof(struct virtio_net_hdr_mrg_rxbuf), \
-                                L1_CACHE_BYTES))
 #define GOOD_COPY_LEN	128
 
+/* Weight used for the RX packet size EWMA. The average packet size is used to
+ * determine the packet buffer size when refilling RX rings. As the entire RX
+ * ring may be refilled at once, the weight is chosen so that the EWMA will be
+ * insensitive to short-term, transient changes in packet size.
+ */
+#define RECEIVE_AVG_WEIGHT 64
+
 #define VIRTNET_DRIVER_VERSION "1.0.0"
 
 struct virtnet_stats {
@@ -78,6 +83,9 @@ struct receive_queue {
 	/* Chain pages by the private ptr. */
 	struct page *pages;
 
+	/* Average packet length for mergeable receive buffers. */
+	struct ewma mrg_avg_pkt_len;
+
 	/* Page frag for GFP_ATOMIC packet buffer allocation. */
 	struct page_frag atomic_frag;
 
@@ -339,13 +347,11 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 	int num_buf = hdr->mhdr.num_buffers;
 	struct page *page = virt_to_head_page(buf);
 	int offset = buf - page_address(page);
-	int truesize = max_t(int, len, MERGE_BUFFER_LEN);
-	struct sk_buff *head_skb = page_to_skb(rq, page, offset, len, truesize);
+	struct sk_buff *head_skb = page_to_skb(rq, page, offset, len, len);
 	struct sk_buff *curr_skb = head_skb;
 
 	if (unlikely(!curr_skb))
 		goto err_skb;
-
 	while (--num_buf) {
 		int num_skb_frags;
 
@@ -374,23 +380,40 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 			head_skb->truesize += nskb->truesize;
 			num_skb_frags = 0;
 		}
-		truesize = max_t(int, len, MERGE_BUFFER_LEN);
 		if (curr_skb != head_skb) {
 			head_skb->data_len += len;
 			head_skb->len += len;
-			head_skb->truesize += truesize;
+			head_skb->truesize += len;
 		}
 		offset = buf - page_address(page);
 		if (skb_can_coalesce(curr_skb, num_skb_frags, page, offset)) {
 			put_page(page);
 			skb_coalesce_rx_frag(curr_skb, num_skb_frags - 1,
-					     len, truesize);
+					     len, len);
 		} else {
 			skb_add_rx_frag(curr_skb, num_skb_frags, page
Re: [PATCH v2] virtio-net: free bufs correctly on invalid packet length
Hi David,

This patch fixes a bug introduced by 2613af0ed18a (virtio_net: migrate
mergeable rx buffers to page frag allocators). The bug is present in both
net-next and net.

Thanks

Best,

Mike

On Fri, Dec 6, 2013 at 1:32 PM, David Miller <da...@davemloft.net> wrote:
> From: Michael Dalton <mwdal...@google.com>
> Date: Thu, 5 Dec 2013 13:14:05 -0800
>
>> When a packet with invalid length arrives, ensure that the packet is
>> freed correctly if mergeable packet buffers and big packets
>> (GUEST_TSO4) are both enabled.
>>
>> Signed-off-by: Michael Dalton <mwdal...@google.com>
>
> Applied, is this needed for -stable?
Re: [PATCH 1/2] virtio-net: determine type of bufs correctly
Thanks Andrey, great catch. I believe this issue occurs in one more place,
when packets are dropped if they are too short. I will send out a patch
momentarily to fix that additional case.

Best,

Mike
[PATCH] virtio-net: free bufs correctly on invalid packet length
When a packet with invalid length arrives, ensure that the packet is freed
correctly if mergeable packet buffers and big packets (GUEST_TSO4) are both
enabled.
---
 drivers/net/virtio_net.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 916241d..6a4665c 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -426,10 +426,10 @@ static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
 	if (unlikely(len < sizeof(struct virtio_net_hdr) + ETH_HLEN)) {
 		pr_debug("%s: short packet %i\n", dev->name, len);
 		dev->stats.rx_length_errors++;
-		if (vi->big_packets)
-			give_pages(rq, buf);
-		else if (vi->mergeable_rx_bufs)
+		if (vi->mergeable_rx_bufs)
 			put_page(virt_to_head_page(buf));
+		else if (vi->big_packets)
+			give_pages(rq, buf);
 		else
 			dev_kfree_skb(buf);
 		return;
-- 
1.8.5.1
Re: [PATCH] virtio-net: free bufs correctly on invalid packet length
Thanks Sergei,

Yes this is a similar bugfix, the patch I saw from Andrey fixed this issue
in free_unused_bufs. The problem also occurs when dropping a packet that
is too short. Apologies for forgetting to sign off on the patch, I will
re-send.

Best,

Mike
[PATCH v2] virtio-net: free bufs correctly on invalid packet length
When a packet with invalid length arrives, ensure that the packet is freed
correctly if mergeable packet buffers and big packets (GUEST_TSO4) are both
enabled.

Signed-off-by: Michael Dalton <mwdal...@google.com>
---
 drivers/net/virtio_net.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 916241d..6a4665c 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -426,10 +426,10 @@ static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
 	if (unlikely(len < sizeof(struct virtio_net_hdr) + ETH_HLEN)) {
 		pr_debug("%s: short packet %i\n", dev->name, len);
 		dev->stats.rx_length_errors++;
-		if (vi->big_packets)
-			give_pages(rq, buf);
-		else if (vi->mergeable_rx_bufs)
+		if (vi->mergeable_rx_bufs)
 			put_page(virt_to_head_page(buf));
+		else if (vi->big_packets)
+			give_pages(rq, buf);
 		else
 			dev_kfree_skb(buf);
 		return;
-- 
1.8.5.1
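
Why the order of the two checks matters can be shown with a small standalone model. It assumes, as the patch implies, that when both features are negotiated the receive buffers are mergeable page frags; `buf_kind` and `rx_buf_kind` are hypothetical names used only for this illustration.

```c
#include <stdbool.h>

/* Hypothetical classification of how a receive buffer must be released. */
enum buf_kind {
	BUF_SKB,            /* released with dev_kfree_skb() */
	BUF_BIG_PAGES,      /* released with give_pages() */
	BUF_MERGEABLE_FRAG  /* released with put_page() on the head page */
};

/* Mergeable buffers take precedence: with both flags set, the buffer is
 * a page frag, so testing big_packets first would pick the wrong path. */
static enum buf_kind rx_buf_kind(bool mergeable_rx_bufs, bool big_packets)
{
	if (mergeable_rx_bufs)
		return BUF_MERGEABLE_FRAG;
	if (big_packets)
		return BUF_BIG_PAGES;
	return BUF_SKB;
}
```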
Re: [PATCH v2] virtio-net: free bufs correctly on invalid packet length
Hi,

A quick note on this patch: I have confirmed that without this patch a
kernel crash occurs if we force a 'packet too short' error sufficiently
many times. This patch eliminates the kernel crash.

Since this crash would be triggered by a hypervisor bug, I made a small
change not reflected in the above patch to make the crash easier to
reproduce for testing purposes. I treated 1 out of every 128 packets with
len < MERGE_BUFFER_LEN as 'too short'. With this change in place, just
running netperf will cause the sender to crash very quickly (the receiver
will transmit pure data ACKs that meet the drop criteria).

If anyone would like to reproduce the crash using the above setup, I added
an unsigned int num_packets field to struct receive_queue and changed the
if condition for the packet too short check in receive_buf() from:

	if (unlikely(len < sizeof(struct virtio_net_hdr) + ETH_HLEN)) {

to:

	if (unlikely((len < sizeof(struct virtio_net_hdr) + ETH_HLEN) ||
		     (len < MERGE_BUFFER_LEN &&
		      ((++rq->num_packets & 127) == 0)))) {

Best,

Mike
Re: [PATCH net] virtio-net: fix page refcnt leaking when fail to allocate frag skb
Great catch Jason. I agree this now raises the larger issue of how to
handle a memory alloc failure in the middle of receive. As Eric mentioned,
we can drop the packet and free the remaining (num_buf) frags.

Michael, perhaps I'm missing something, but why would you prefer
pre-allocating buffers in this case? If the guest kernel is OOM'ing,
dropping packets should provide backpressure.

Also, we could just as easily fail the initial skb alloc in page_to_skb,
and I think that case also needs to be handled now in the same fashion as
a memory allocation failure in receive_mergeable.

Best,

Mike
Re: [PATCH net] virtio-net: fix page refcnt leaking when fail to allocate frag skb
Hi,

After further reflection I think we're looking at two related issues:

(a) a memory leak that Jason has identified that occurs when a memory allocation fails in receive_mergeable. Jason's commit solves this issue.

(b) virtio-net does not dequeue all buffers for a packet in the case that an error occurs on receive and mergeable receive buffers are enabled.

For (a), this bug is new and due to changes in 2613af0ed18a, and the net impact is a memory leak of the physical page.

However, I believe (b) has always been possible in some form because if page_to_skb() returns NULL (e.g., due to SKB allocation failure), receive_mergeable is never called. AFAICT this is also the behavior prior to 2613af0ed18a.

The net impact of (b) would be that virtio-net would interpret a packet buffer that is in the middle of a mergeable packet as the start of a new packet, which is definitely also a bug (and the buffer contents could contain bytes that resembled a valid virtio-net header).

A solution for (b) will require handling both the page_to_skb memory allocation failures and the memory allocation failures in receive_mergeable introduced by 2613af0ed18a.

Best,

Mike
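To make issue (b) concrete, the sketch below models the receive ring as a plain FIFO of buffer ids and shows the kind of draining step the message argues for: after an error on a multi-buffer packet, the remaining num_buffers - 1 buffers are discarded so that the next dequeue starts at a packet boundary. Every name here (`sketch_ring`, `sketch_get_buf`, `sketch_drain_packet`) is hypothetical; the real driver would dequeue from the virtqueue and release each page.

```c
#include <stddef.h>

/* Hypothetical stand-in for a receive ring: a FIFO of buffer ids. */
struct sketch_ring {
	const int *bufs; /* buffer ids in arrival order */
	size_t head;     /* index of the next buffer to dequeue */
	size_t count;    /* total number of buffers in bufs */
};

/* Dequeue the next buffer id, or return -1 if the ring is empty. */
static int sketch_get_buf(struct sketch_ring *r)
{
	if (r->head == r->count)
		return -1;
	return r->bufs[r->head++];
}

/*
 * Called after the first buffer of a packet was consumed and an error
 * occurred. Discards the other num_buffers - 1 buffers of that packet
 * and returns how many were discarded.
 */
static size_t sketch_drain_packet(struct sketch_ring *r,
				  unsigned int num_buffers)
{
	size_t dropped = 0;

	while (num_buffers > 1) {
		--num_buffers;
		if (sketch_get_buf(r) < 0)
			break; /* ring ran dry: nothing more to discard */
		dropped++; /* a real driver would release the page here */
	}
	return dropped;
}
```

Without the drain step, the next dequeue would return a buffer from the middle of the failed packet and its contents would be parsed as a header.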
Re: [PATCH net-next 4/4] virtio-net: auto-tune mergeable rx buffer size for improved performance
Hi,

Apologies for the delay, I wanted to get answers together for all of the open questions raised on this thread. The first patch in this patchset is already merged, so after the merge window re-opens I'll send out new patchsets covering the remaining 3 patches.

After reflecting on feedback from this thread, I think it makes sense to separate out the per-receive queue page frag allocator patches from the autotuning patch when the merge window re-opens. The per-receive queue page frag allocator patches help deal with fragmentation (MERGE_BUFFER_LEN does not evenly divide PAGE_SIZE), and provide benefits whether or not auto-tuning is present. Auto-tuning can then be evaluated separately.

On Wed, 2013-11-13 at 15:10 +0800, Jason Wang wrote:
> There's one concern with EWMA. How well does it handle multiple streams
> each with different packet size? E.g there may be two flows, one with
> 256 bytes each packet another is 64K. Looks like it can result we
> allocate PAGE_SIZE buffer for 256 (which is bad since the
> payload/truesize is low) bytes or 1500+ for 64K buffer (which is ok
> since we can do coalescing).

If multiple streams of very different packet sizes are arriving on the same receive queue, no single buffer size is ideal (e.g., large buffers will cause small packets to take up too much memory, but small buffers may reduce throughput somewhat for large packets). We don't know a priori which packet will be delivered to a given receive queue packet buffer, so any size we choose will not be optimal for all cases if we have significant variance in packet sizes.

> Do you have perf numbers that just without this patch? We need to know
> how much EWMA help exactly.

Great point, I should have included that in my initial benchmarking. I ran a benchmark in the same environment as my initial results, this time with the first 3 patches in this patchset applied but without the autotuning patch. The average performance over 5 runs of 30-second netperf was 13760.85Gb/s.

> Is there a chance that est_buffer_len was smaller than or equal with len?

Yes, that is possible if the average packet length decreases.

> Not sure this is accurate, since buflen may change and several frags may
> share a single page. So the est_buffer_len we get in receive_mergeable()
> may not be the correct value.

I agree it may not be 100% accurate but we can choose a weight that will cause the average packet size to change slowly. Even with an order 3 page there will not be too many packet buffers allocated from a single page.

On Wed, 2013-11-13 at 17:42 +0800, Michael S. Tsirkin wrote:
> I'm not sure it's useful - no one is likely to tune it in practice.
> But how about a comment explaining how was the number chosen?

That makes sense, I agree a comment is needed. The weight determines how quickly we react to a change in packet size. As we attempt to fill all free ring entries on refill (in try_fill_recv), I chose a large weight so that a short burst of traffic with a different average packet size will not substantially shift the packet buffer size for the entire ring the next time try_fill_recv is called. I'll add a comment that compares 64 to nearby values (32, 16).

Best,

Mike
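The effect of the weight can be illustrated with a small userspace model of an exponentially weighted moving average. This is a sketch under stated assumptions: the `sketch_` helpers and the fixed-point scale are mine and only approximate the behaviour of the kernel's EWMA helper, and the burst length and packet sizes in the usage note are made-up inputs.

```c
#include <stdint.h>

#define SKETCH_FACTOR 1024u /* assumed fixed-point scale, limits rounding */

/* Hypothetical EWMA state: scaled average plus the weight in use. */
struct sketch_ewma {
	uint64_t internal;   /* average multiplied by SKETCH_FACTOR */
	unsigned int weight; /* larger weight => slower reaction */
};

/* Fold one sample in: new = (old * (weight - 1) + sample) / weight. */
static void sketch_ewma_add(struct sketch_ewma *e, uint32_t val)
{
	uint64_t scaled = (uint64_t)val * SKETCH_FACTOR;

	e->internal = e->internal
		? (e->internal * (e->weight - 1) + scaled) / e->weight
		: scaled; /* first sample seeds the average */
}

/* Read the current average in the same units as the samples. */
static uint32_t sketch_ewma_read(const struct sketch_ewma *e)
{
	return (uint32_t)(e->internal / SKETCH_FACTOR);
}

/* Start at `start`, feed `burst` samples of `val`, return the average. */
static uint32_t sketch_after_burst(unsigned int weight, uint32_t start,
				   uint32_t val, unsigned int burst)
{
	struct sketch_ewma e = { (uint64_t)start * SKETCH_FACTOR, weight };

	while (burst--)
		sketch_ewma_add(&e, val);
	return sketch_ewma_read(&e);
}
```

Calling `sketch_after_burst(w, 1500, 4096, 8)` for w = 64, 32 and 16 shows the trend described above: the larger the weight, the less a short burst of 4096-byte packets moves an average that started at 1500.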
[PATCH] virtio-net: mergeable buffer size should include virtio-net header
Commit 2613af0ed18a ("virtio_net: migrate mergeable rx buffers to page frag allocators") changed the mergeable receive buffer size from PAGE_SIZE to MTU-size. However, the merge buffer size does not take into account the size of the virtio-net header. Consequently, packets that are MTU-size will take two buffers instead of one (to store the virtio-net header), substantially decreasing the throughput of MTU-size traffic due to TCP window / SKB truesize effects.

This commit changes the mergeable buffer size to include the virtio-net header. The buffer size is cacheline-aligned because skb_page_frag_refill will not automatically align the requested size.

Benchmarks taken from an average of 5 netperf 30-second TCP_STREAM runs between two QEMU VMs on a single physical machine. Each VM has two VCPUs and vhost enabled. All VMs and vhost threads run in a single 4 CPU cgroup cpuset, using cgroups to ensure that other processes in the system will not be scheduled on the benchmark CPUs. Transmit offloads and mergeable receive buffers are enabled, but guest_tso4 / guest_csum are explicitly disabled to force MTU-sized packets on the receiver.

net-next trunk before 2613af0ed18a (PAGE_SIZE buf): 3861.08Gb/s
net-next trunk (MTU 1500 - packet uses two buf due to size bug): 4076.62Gb/s
net-next trunk (MTU 1480 - packet fits in one buf): 6301.34Gb/s
net-next trunk w/ size fix (MTU 1500 - packet fits in one buf): 6445.44Gb/s

Suggested-by: Eric Northup digitale...@google.com
Signed-off-by: Michael Dalton mwdal...@google.com
---
 drivers/net/virtio_net.c | 30 ++++++++++++++++--------------
 1 file changed, 16 insertions(+), 14 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 01f4eb5..69fb225 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -36,7 +36,10 @@ module_param(csum, bool, 0444);
 module_param(gso, bool, 0444);

 /* FIXME: MTU in config. */
-#define MAX_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
+#define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
+#define MERGE_BUFFER_LEN (ALIGN(GOOD_PACKET_LEN + \
+				sizeof(struct virtio_net_hdr_mrg_rxbuf), \
+				L1_CACHE_BYTES))
 #define GOOD_COPY_LEN	128

 #define VIRTNET_DRIVER_VERSION "1.0.0"
@@ -314,10 +317,10 @@ static int receive_mergeable(struct receive_queue *rq, struct sk_buff *head_skb)
 			head_skb->dev->stats.rx_length_errors++;
 			return -EINVAL;
 		}
-		if (unlikely(len > MAX_PACKET_LEN)) {
+		if (unlikely(len > MERGE_BUFFER_LEN)) {
 			pr_debug("%s: rx error: merge buffer too long\n",
 				 head_skb->dev->name);
-			len = MAX_PACKET_LEN;
+			len = MERGE_BUFFER_LEN;
 		}
 		if (unlikely(num_skb_frags == MAX_SKB_FRAGS)) {
 			struct sk_buff *nskb = alloc_skb(0, GFP_ATOMIC);
@@ -336,18 +339,17 @@ static int receive_mergeable(struct receive_queue *rq, struct sk_buff *head_skb)
 		if (curr_skb != head_skb) {
 			head_skb->data_len += len;
 			head_skb->len += len;
-			head_skb->truesize += MAX_PACKET_LEN;
+			head_skb->truesize += MERGE_BUFFER_LEN;
 		}
 		page = virt_to_head_page(buf);
 		offset = buf - (char *)page_address(page);
 		if (skb_can_coalesce(curr_skb, num_skb_frags, page, offset)) {
 			put_page(page);
 			skb_coalesce_rx_frag(curr_skb, num_skb_frags - 1,
-					     len, MAX_PACKET_LEN);
+					     len, MERGE_BUFFER_LEN);
 		} else {
 			skb_add_rx_frag(curr_skb, num_skb_frags, page,
-					offset, len,
-					MAX_PACKET_LEN);
+					offset, len, MERGE_BUFFER_LEN);
 		}
 		--rq->num;
 	}
@@ -383,7 +385,7 @@ static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
 		struct page *page = virt_to_head_page(buf);
 		skb = page_to_skb(rq, page,
 				  (char *)buf - (char *)page_address(page),
-				  len, MAX_PACKET_LEN);
+				  len, MERGE_BUFFER_LEN);
 		if (unlikely(!skb)) {
 			dev->stats.rx_dropped++;
 			put_page(page);
@@ -471,11 +473,11 @@ static int add_recvbuf_small(struct receive_queue *rq, gfp_t gfp)
 	struct skb_vnet_hdr *hdr;
 	int err;

-	skb = __netdev_alloc_skb_ip_align(vi->dev, MAX_PACKET_LEN, gfp);
+	skb = __netdev_alloc_skb_ip_align(vi->dev, GOOD_PACKET_LEN, gfp);
 	if (unlikely(!skb
[PATCH net-next 3/4] virtio-net: use per-receive queue page frag alloc for mergeable bufs
The virtio-net driver currently uses netdev_alloc_frag() for GFP_ATOMIC mergeable rx buffer allocations. This commit migrates virtio-net to use per-receive queue page frags for GFP_ATOMIC allocation. This change unifies mergeable rx buffer memory allocation, which now will use skb_page_frag_refill() for both atomic and GFP_WAIT buffer allocations.

To address fragmentation concerns, if after buffer allocation there is too little space left in the page frag to allocate a subsequent buffer, the remaining space is added to the current allocated buffer so that the remaining space can be used to store packet data.

Signed-off-by: Michael Dalton mwdal...@google.com
---
 drivers/net/virtio_net.c | 70 +++-
 1 file changed, 39 insertions(+), 31 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 69fb225..0c93054 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -79,6 +79,9 @@ struct receive_queue {
 	/* Chain pages by the private ptr. */
 	struct page *pages;

+	/* Page frag for GFP_ATOMIC packet buffer allocation. */
+	struct page_frag atomic_frag;
+
 	/* RX: fragments + linear part + virtio header */
 	struct scatterlist sg[MAX_SKB_FRAGS + 2];

@@ -128,9 +131,9 @@ struct virtnet_info {
 	struct mutex config_lock;

 	/* Page_frag for GFP_KERNEL packet buffer allocation when we run
-	 * low on memory.
+	 * low on memory. May sleep.
 	 */
-	struct page_frag alloc_frag;
+	struct page_frag sleep_frag;

 	/* Does the affinity hint is set for virtqueues? */
 	bool affinity_hint_set;
@@ -305,7 +308,7 @@ static int receive_mergeable(struct receive_queue *rq, struct sk_buff *head_skb)
 	struct sk_buff *curr_skb = head_skb;
 	char *buf;
 	struct page *page;
-	int num_buf, len, offset;
+	int num_buf, len, offset, truesize;

 	num_buf = hdr->mhdr.num_buffers;
 	while (--num_buf) {
@@ -317,11 +320,7 @@ static int receive_mergeable(struct receive_queue *rq, struct sk_buff *head_skb)
 			head_skb->dev->stats.rx_length_errors++;
 			return -EINVAL;
 		}
-		if (unlikely(len > MERGE_BUFFER_LEN)) {
-			pr_debug("%s: rx error: merge buffer too long\n",
-				 head_skb->dev->name);
-			len = MERGE_BUFFER_LEN;
-		}
+		truesize = max_t(int, len, MERGE_BUFFER_LEN);
 		if (unlikely(num_skb_frags == MAX_SKB_FRAGS)) {
 			struct sk_buff *nskb = alloc_skb(0, GFP_ATOMIC);
 			if (unlikely(!nskb)) {
@@ -339,17 +338,17 @@ static int receive_mergeable(struct receive_queue *rq, struct sk_buff *head_skb)
 		if (curr_skb != head_skb) {
 			head_skb->data_len += len;
 			head_skb->len += len;
-			head_skb->truesize += MERGE_BUFFER_LEN;
+			head_skb->truesize += truesize;
 		}
 		page = virt_to_head_page(buf);
 		offset = buf - (char *)page_address(page);
 		if (skb_can_coalesce(curr_skb, num_skb_frags, page, offset)) {
 			put_page(page);
 			skb_coalesce_rx_frag(curr_skb, num_skb_frags - 1,
-					     len, MERGE_BUFFER_LEN);
+					     len, truesize);
 		} else {
 			skb_add_rx_frag(curr_skb, num_skb_frags, page,
-					offset, len, MERGE_BUFFER_LEN);
+					offset, len, truesize);
 		}
 		--rq->num;
 	}
@@ -383,9 +382,10 @@ static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
 		skb_trim(skb, len);
 	} else if (vi->mergeable_rx_bufs) {
 		struct page *page = virt_to_head_page(buf);
+		int truesize = max_t(int, len, MERGE_BUFFER_LEN);
 		skb = page_to_skb(rq, page,
 				  (char *)buf - (char *)page_address(page),
-				  len, MERGE_BUFFER_LEN);
+				  len, truesize);
 		if (unlikely(!skb)) {
 			dev->stats.rx_dropped++;
 			put_page(page);
@@ -540,24 +540,24 @@ static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
 static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
 {
 	struct virtnet_info *vi = rq->vq->vdev->priv;
-	char *buf = NULL;
-	int err;
+	struct page_frag *alloc_frag;
+	char *buf;
+	int err, len, hole;

-	if (gfp & __GFP_WAIT) {
-		if (skb_page_frag_refill(MERGE_BUFFER_LEN, &vi->alloc_frag,
-					 gfp
[PATCH net-next 2/4] net: allow > 0 order atomic page alloc in skb_page_frag_refill
skb_page_frag_refill currently permits only order-0 page allocs unless GFP_WAIT is used. Change skb_page_frag_refill to attempt higher-order page allocations whether or not GFP_WAIT is used. If memory cannot be allocated, the allocator will fall back to successively smaller page allocs (down to order-0 page allocs).

This change brings skb_page_frag_refill in line with the existing page allocation strategy employed by netdev_alloc_frag, which attempts higher-order page allocations whether or not GFP_WAIT is set, falling back to successively lower-order page allocations on failure. Part of migration of virtio-net to per-receive queue page frag allocators.

Signed-off-by: Michael Dalton mwdal...@google.com
---
 net/core/sock.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index ab20ed9..7383d23 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1865,9 +1865,7 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio)
 		put_page(pfrag->page);
 	}

-	/* We restrict high order allocations to users that can afford to wait */
-	order = (prio & __GFP_WAIT) ? SKB_FRAG_PAGE_ORDER : 0;
-
+	order = SKB_FRAG_PAGE_ORDER;
 	do {
 		gfp_t gfp = prio;

--
1.8.4.1
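The fallback strategy described in this commit message can be modelled in userspace as below. This is only a sketch: `sketch_refill`, the allocator callback type and `sketch_alloc_limited` are hypothetical names, and `SKETCH_MAX_ORDER` is an assumed stand-in for SKB_FRAG_PAGE_ORDER. The real function also adjusts GFP flags per attempt, which is not modelled here.

```c
#include <stddef.h>

#define SKETCH_MAX_ORDER 3 /* assumed stand-in for SKB_FRAG_PAGE_ORDER */

/* Hypothetical allocator hook: non-NULL on success for this order. */
typedef void *(*sketch_alloc_fn)(unsigned int order, void *ctx);

/*
 * Try the highest order first and fall back to successively smaller
 * orders, down to order 0. On success, stores the order obtained.
 */
static void *sketch_refill(sketch_alloc_fn alloc, void *ctx,
			   unsigned int *order_out)
{
	int order = SKETCH_MAX_ORDER;

	do {
		void *p = alloc((unsigned int)order, ctx);

		if (p) {
			*order_out = (unsigned int)order;
			return p;
		}
	} while (--order >= 0);

	return NULL; /* even an order-0 allocation failed */
}

/*
 * Example allocator for experiments: ctx points to an int holding the
 * largest order that currently "succeeds" (negative means none).
 */
static void *sketch_alloc_limited(unsigned int order, void *ctx)
{
	static char dummy; /* placeholder object standing in for a page */
	int max_ok = *(int *)ctx;

	return (int)order <= max_ok ? (void *)&dummy : NULL;
}
```

For instance, with the limit set to 1 the loop fails at orders 3 and 2 and returns an order-1 result, which mirrors the "successively smaller page allocs" behaviour in the description.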
[PATCH net-next 4/4] virtio-net: auto-tune mergeable rx buffer size for improved performance
Commit 2613af0ed18a ("virtio_net: migrate mergeable rx buffers to page frag allocators") changed the mergeable receive buffer size from PAGE_SIZE to MTU-size, introducing a single-stream regression for benchmarks with large average packet size. There is no single optimal buffer size for all workloads. For workloads with packet size <= MTU bytes, MTU + virtio-net header-sized buffers are preferred as larger buffers reduce the TCP window due to SKB truesize. However, single-stream workloads with large average packet sizes have higher throughput if larger (e.g., PAGE_SIZE) buffers are used.

This commit auto-tunes the mergeable receive buffer packet size by choosing the packet buffer size based on an EWMA of the recent packet sizes for the receive queue. Packet buffer sizes range from MTU_SIZE + virtio-net header len to PAGE_SIZE. This improves throughput for large packet workloads, as any workload with average packet size >= PAGE_SIZE will use PAGE_SIZE buffers.

These optimizations interact positively with recent commit ba275241030c ("virtio-net: coalesce rx frags when possible during rx"), which coalesces adjacent RX SKB fragments in virtio_net. The coalescing optimizations benefit buffers of any size.

Benchmarks taken from an average of 5 netperf 30-second TCP_STREAM runs between two QEMU VMs on a single physical machine. Each VM has two VCPUs with all offloads and vhost enabled. All VMs and vhost threads run in a single 4 CPU cgroup cpuset, using cgroups to ensure that other processes in the system will not be scheduled on the benchmark CPUs. Trunk includes SKB rx frag coalescing.

net-next trunk w/ virtio_net before 2613af0ed18a (PAGE_SIZE bufs): 14642.85Gb/s
net-next trunk (MTU-size bufs): 13170.01Gb/s
net-next trunk + auto-tune: 14555.94Gb/s

Signed-off-by: Michael Dalton mwdal...@google.com
---
 drivers/net/virtio_net.c | 73 +++-
 1 file changed, 53 insertions(+), 20 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 0c93054..b1086e0 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -27,6 +27,7 @@
 #include <linux/if_vlan.h>
 #include <linux/slab.h>
 #include <linux/cpu.h>
+#include <linux/average.h>

 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
@@ -37,10 +38,8 @@ module_param(gso, bool, 0444);

 /* FIXME: MTU in config. */
 #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
-#define MERGE_BUFFER_LEN (ALIGN(GOOD_PACKET_LEN + \
-				sizeof(struct virtio_net_hdr_mrg_rxbuf), \
-				L1_CACHE_BYTES))
 #define GOOD_COPY_LEN	128
+#define RECEIVE_AVG_WEIGHT 64

 #define VIRTNET_DRIVER_VERSION "1.0.0"

@@ -79,6 +78,9 @@ struct receive_queue {
 	/* Chain pages by the private ptr. */
 	struct page *pages;

+	/* Average packet length for mergeable receive buffers. */
+	struct ewma mrg_avg_pkt_len;
+
 	/* Page frag for GFP_ATOMIC packet buffer allocation. */
 	struct page_frag atomic_frag;

@@ -302,14 +304,17 @@ static struct sk_buff *page_to_skb(struct receive_queue *rq,
 	return skb;
 }

-static int receive_mergeable(struct receive_queue *rq, struct sk_buff *head_skb)
+static int receive_mergeable(struct receive_queue *rq, struct sk_buff *head_skb,
+			     struct page *head_page)
 {
 	struct skb_vnet_hdr *hdr = skb_vnet_hdr(head_skb);
 	struct sk_buff *curr_skb = head_skb;
+	struct page *page = head_page;
 	char *buf;
-	struct page *page;
-	int num_buf, len, offset, truesize;
+	int num_buf, len, offset;
+	u32 est_buffer_len;

+	len = head_skb->len;
 	num_buf = hdr->mhdr.num_buffers;
 	while (--num_buf) {
 		int num_skb_frags = skb_shinfo(curr_skb)->nr_frags;
@@ -320,7 +325,6 @@ static int receive_mergeable(struct receive_queue *rq, struct sk_buff *head_skb)
 			head_skb->dev->stats.rx_length_errors++;
 			return -EINVAL;
 		}
-		truesize = max_t(int, len, MERGE_BUFFER_LEN);
 		if (unlikely(num_skb_frags == MAX_SKB_FRAGS)) {
 			struct sk_buff *nskb = alloc_skb(0, GFP_ATOMIC);
 			if (unlikely(!nskb)) {
@@ -338,20 +342,38 @@ static int receive_mergeable(struct receive_queue *rq, struct sk_buff *head_skb)
 		if (curr_skb != head_skb) {
 			head_skb->data_len += len;
 			head_skb->len += len;
-			head_skb->truesize += truesize;
+			head_skb->truesize += len;
 		}
 		page = virt_to_head_page(buf);
 		offset = buf - (char *)page_address(page);
 		if (skb_can_coalesce(curr_skb, num_skb_frags, page, offset)) {
 			put_page(page);
 			skb_coalesce_rx_frag(curr_skb, num_skb_frags - 1
Re: [PATCH net-next] virtio_net: migrate mergeable rx buffers to page frag allocators
Agreed Eric, the buffer size should be increased so that we can accommodate an MTU-sized packet + mergeable virtio-net header in a single buffer. I will send a patch shortly that cleans up the #define headers as Rusty indicated and increases the buffer size slightly, by the size of the virtio-net header, per Eric.

Jason, I'll follow up with you directly - I'd like to know your exact workload (single-stream or multi-stream netperf?), VM configuration, etc., and also see if the nit that Eric has pointed out affects your results. It is also worth noting that we may want to tune the queue sizes for your benchmarks, e.g., by reducing buffer size from 4KB to MTU-sized but keeping queue length constant, we're implicitly decreasing the number of bytes stored in the VirtioQueue for the VirtioNet device, so increasing the queue size may help.

Best,

Mike
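As a worked example of the sizing discussed here, the sketch below computes a buffer length that holds an MTU-sized frame plus the mergeable virtio-net header, rounded up to a cache line. The constants are assumptions chosen to match common values (14-byte Ethernet header, 4-byte VLAN tag, 1500-byte payload, 12-byte mergeable header, 64-byte cache line); the `SKETCH_`/`sketch_` names are illustrative, not the driver's.

```c
#include <stddef.h>

/* Assumed sizes; the driver takes these from kernel headers. */
#define SKETCH_ETH_HLEN     14u   /* Ethernet header */
#define SKETCH_VLAN_HLEN    4u    /* VLAN tag */
#define SKETCH_ETH_DATA_LEN 1500u /* default MTU payload */
#define SKETCH_MRG_HDR_LEN  12u   /* assumed mergeable virtio-net header */

/* Round x up to a multiple of a; a must be a power of two. */
static size_t sketch_align(size_t x, size_t a)
{
	return (x + (a - 1)) & ~(a - 1);
}

/* Largest frame the buffer must hold, without the virtio-net header. */
static size_t sketch_good_packet_len(void)
{
	return SKETCH_ETH_HLEN + SKETCH_VLAN_HLEN + SKETCH_ETH_DATA_LEN;
}

/* Buffer length that fits frame + header, cache-line aligned. */
static size_t sketch_merge_buffer_len(size_t cache_line)
{
	return sketch_align(sketch_good_packet_len() + SKETCH_MRG_HDR_LEN,
			    cache_line);
}
```

Under these assumptions, 1518 + 12 = 1530 bytes rounds up to 1536 with a 64-byte cache line, so an MTU-sized packet and its header fit in one buffer.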
[PATCH net-next] virtio_net: migrate mergeable rx buffers to page frag allocators
The virtio_net driver's mergeable receive buffer allocator uses 4KB packet buffers. For MTU-sized traffic, SKB truesize is 4KB but only ~1500 bytes of the buffer is used to store packet data, reducing the effective TCP window size substantially. This patch addresses the performance concerns with mergeable receive buffers by allocating MTU-sized packet buffers using page frag allocators. If more than MAX_SKB_FRAGS buffers are needed, the SKB frag_list is used.

Signed-off-by: Michael Dalton mwdal...@google.com
---
 drivers/net/virtio_net.c | 164 ++-
 1 file changed, 106 insertions(+), 58 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 9fbdfcd..113ee93 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -124,6 +124,11 @@ struct virtnet_info {
 	/* Lock for config space updates */
 	struct mutex config_lock;

+	/* Page_frag for GFP_KERNEL packet buffer allocation when we run
+	 * low on memory.
+	 */
+	struct page_frag alloc_frag;
+
 	/* Does the affinity hint is set for virtqueues? */
 	bool affinity_hint_set;

@@ -217,33 +222,18 @@ static void skb_xmit_done(struct virtqueue *vq)
 	netif_wake_subqueue(vi->dev, vq2txq(vq));
 }

-static void set_skb_frag(struct sk_buff *skb, struct page *page,
-			 unsigned int offset, unsigned int *len)
-{
-	int size = min((unsigned)PAGE_SIZE - offset, *len);
-	int i = skb_shinfo(skb)->nr_frags;
-
-	__skb_fill_page_desc(skb, i, page, offset, size);
-
-	skb->data_len += size;
-	skb->len += size;
-	skb->truesize += PAGE_SIZE;
-	skb_shinfo(skb)->nr_frags++;
-	skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
-	*len -= size;
-}
-
 /* Called from bottom half context */
 static struct sk_buff *page_to_skb(struct receive_queue *rq,
-				   struct page *page, unsigned int len)
+				   struct page *page, unsigned int offset,
+				   unsigned int len, unsigned int truesize)
 {
 	struct virtnet_info *vi = rq->vq->vdev->priv;
 	struct sk_buff *skb;
 	struct skb_vnet_hdr *hdr;
-	unsigned int copy, hdr_len, offset;
+	unsigned int copy, hdr_len, hdr_padded_len;
 	char *p;

-	p = page_address(page);
+	p = page_address(page) + offset;

 	/* copy small packet so we can reuse these pages for small data */
 	skb = netdev_alloc_skb_ip_align(vi->dev, GOOD_COPY_LEN);
@@ -254,16 +244,17 @@ static struct sk_buff *page_to_skb(struct receive_queue *rq,

 	if (vi->mergeable_rx_bufs) {
 		hdr_len = sizeof hdr->mhdr;
-		offset = hdr_len;
+		hdr_padded_len = sizeof hdr->mhdr;
 	} else {
 		hdr_len = sizeof hdr->hdr;
-		offset = sizeof(struct padded_vnet_hdr);
+		hdr_padded_len = sizeof(struct padded_vnet_hdr);
 	}

 	memcpy(hdr, p, hdr_len);

 	len -= hdr_len;
-	p += offset;
+	offset += hdr_padded_len;
+	p += hdr_padded_len;

 	copy = len;
 	if (copy > skb_tailroom(skb))
@@ -273,6 +264,14 @@ static struct sk_buff *page_to_skb(struct receive_queue *rq,
 	len -= copy;
 	offset += copy;

+	if (vi->mergeable_rx_bufs) {
+		if (len)
+			skb_add_rx_frag(skb, 0, page, offset, len, truesize);
+		else
+			put_page(page);
+		return skb;
+	}
+
 	/*
 	 * Verify that we can indeed put this data into a skb.
 	 * This is here to handle cases when the device erroneously
@@ -284,9 +283,12 @@ static struct sk_buff *page_to_skb(struct receive_queue *rq,
 		dev_kfree_skb(skb);
 		return NULL;
 	}
-
+	BUG_ON(offset >= PAGE_SIZE);
 	while (len) {
-		set_skb_frag(skb, page, offset, &len);
+		unsigned int frag_size = min((unsigned)PAGE_SIZE - offset, len);
+		skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, page, offset,
+				frag_size, truesize);
+		len -= frag_size;
 		page = (struct page *)page->private;
 		offset = 0;
 	}
@@ -297,33 +299,52 @@ static struct sk_buff *page_to_skb(struct receive_queue *rq,
 	return skb;
 }

-static int receive_mergeable(struct receive_queue *rq, struct sk_buff *skb)
+static int receive_mergeable(struct receive_queue *rq, struct sk_buff *head_skb)
 {
-	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
+	struct skb_vnet_hdr *hdr = skb_vnet_hdr(head_skb);
+	struct sk_buff *curr_skb = head_skb;
+	char *buf;
 	struct page *page;
-	int num_buf, i, len;
+	int num_buf, len;

 	num_buf = hdr->mhdr.num_buffers;
 	while (--num_buf) {
-		i = skb_shinfo(skb)->nr_frags