[PATCH net-next] virtio-net: fix build error when CONFIG_AVERAGE is not enabled

2014-01-17 Thread Michael Dalton
Commit ab7db91705e9 ("virtio-net: auto-tune mergeable rx buffer size for
improved performance") introduced a virtio-net dependency on EWMA.
The inclusion of EWMA is controlled by CONFIG_AVERAGE. Fix build error
when CONFIG_AVERAGE is not enabled by adding "select AVERAGE" to
virtio-net's Kconfig entry.

Build failure reported using the config from "make ARCH=s390 defconfig".

Signed-off-by: Michael Dalton mwdal...@google.com
---
 drivers/net/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index b45b240..f342278 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -236,6 +236,7 @@ config VETH
 config VIRTIO_NET
	tristate "Virtio network driver"
depends on VIRTIO
+   select AVERAGE
---help---
  This is the virtual network driver for virtio.  It can be used with
  lguest or QEMU based VMMs (like KVM or Xen).  Say Y or M.
-- 
1.8.5.2



[PATCH net-next v3 1/5] net: allow > 0 order atomic page alloc in skb_page_frag_refill

2014-01-16 Thread Michael Dalton
skb_page_frag_refill currently permits only order-0 page allocs
unless GFP_WAIT is used. Change skb_page_frag_refill to attempt
higher-order page allocations whether or not GFP_WAIT is used. If
memory cannot be allocated, the allocator will fall back to
successively smaller page allocs (down to order-0 page allocs).

This change brings skb_page_frag_refill in line with the existing
page allocation strategy employed by netdev_alloc_frag, which attempts
higher-order page allocations whether or not GFP_WAIT is set, falling
back to successively lower-order page allocations on failure. Part
of migration of virtio-net to per-receive queue page frag allocators.
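
For illustration, here is a standalone userspace sketch of the fallback
strategy described above; the names, the sizes, and malloc() standing in
for the page allocator are illustrative rather than the net/core/sock.c code:

/*
 * Sketch: try a large, high-order-sized allocation first, then halve the
 * request on failure until a single page-sized chunk is reached.
 */
#include <stdio.h>
#include <stdlib.h>

#define PAGE_SIZE_SKETCH	4096
#define FRAG_PAGE_ORDER_SKETCH	3	/* try a 32KB chunk first, like order-3 */

static void *frag_refill_sketch(size_t *chunk_size)
{
	int order;

	for (order = FRAG_PAGE_ORDER_SKETCH; order >= 0; order--) {
		size_t sz = (size_t)PAGE_SIZE_SKETCH << order;
		void *p = malloc(sz);	/* stand-in for alloc_pages() */

		if (p) {
			*chunk_size = sz;
			return p;
		}
		/* allocation failed: fall back to the next smaller order */
	}
	return NULL;
}

int main(void)
{
	size_t sz;
	void *p = frag_refill_sketch(&sz);

	if (p) {
		printf("got a %zu-byte chunk\n", sz);
		free(p);
	}
	return 0;
}

In the kernel the successfully allocated compound page then serves as a pool
that is carved into many packet buffers, which is what makes the higher
orders worthwhile.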

Acked-by: Michael S. Tsirkin m...@redhat.com
Acked-by: Eric Dumazet eduma...@google.com
Signed-off-by: Michael Dalton mwdal...@google.com
---
 net/core/sock.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 85ad6f0..b3f7ee3 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1836,9 +1836,7 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio)
 		put_page(pfrag->page);
 	}
 
-	/* We restrict high order allocations to users that can afford to wait */
-	order = (prio & __GFP_WAIT) ? SKB_FRAG_PAGE_ORDER : 0;
-
+	order = SKB_FRAG_PAGE_ORDER;
do {
gfp_t gfp = prio;
 
-- 
1.8.5.2



[PATCH net-next v3 3/5] virtio-net: auto-tune mergeable rx buffer size for improved performance

2014-01-16 Thread Michael Dalton
Commit 2613af0ed18a (virtio_net: migrate mergeable rx buffers to page frag
allocators) changed the mergeable receive buffer size from PAGE_SIZE to
MTU-size, introducing a single-stream regression for benchmarks with large
average packet size. There is no single optimal buffer size for all
workloads.  For workloads with packet size <= MTU bytes, MTU + virtio-net
header-sized buffers are preferred as larger buffers reduce the TCP window
due to SKB truesize. However, single-stream workloads with large average
packet sizes have higher throughput if larger (e.g., PAGE_SIZE) buffers
are used.

This commit auto-tunes the mergeable receiver buffer packet size by
choosing the packet buffer size based on an EWMA of the recent packet
sizes for the receive queue. Packet buffer sizes range from MTU_SIZE +
virtio-net header len to PAGE_SIZE. This improves throughput for
large packet workloads, as any workload with average packet size >=
PAGE_SIZE will use PAGE_SIZE buffers.
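
As a rough, self-contained illustration of that sizing rule (the constants
mirror the patch below, the EWMA update is a simplified stand-in for
lib/average.c, and the 12-byte mergeable header size is an assumption):

/*
 * Sketch: clamp the running average of packet sizes between the MTU-sized
 * minimum and PAGE_SIZE, add the mergeable virtio-net header, and align.
 */
#include <stdio.h>

#define PAGE_SZ		4096
#define HDR_LEN		12	/* sizeof(struct virtio_net_hdr_mrg_rxbuf), assumed */
#define GOOD_PKT_LEN	1514	/* ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN */
#define BUF_ALIGN	256	/* MERGEABLE_BUFFER_ALIGN on common configs */
#define AVG_WEIGHT	64	/* RECEIVE_AVG_WEIGHT */

static unsigned long avg_pkt_len = GOOD_PKT_LEN;

static void ewma_update(unsigned long pkt_len)
{
	/* simplified EWMA: new = old + (sample - old) / weight */
	avg_pkt_len += ((long)pkt_len - (long)avg_pkt_len) / AVG_WEIGHT;
}

static unsigned int buf_len_from_avg(void)
{
	unsigned long len = avg_pkt_len;

	if (len < GOOD_PKT_LEN)
		len = GOOD_PKT_LEN;
	if (len > PAGE_SZ - HDR_LEN)
		len = PAGE_SZ - HDR_LEN;
	len += HDR_LEN;
	return (len + BUF_ALIGN - 1) & ~((unsigned long)BUF_ALIGN - 1);
}

int main(void)
{
	int i;

	for (i = 0; i < 1000; i++)
		ewma_update(60);	/* stream of small packets */
	printf("small-packet buffer len: %u\n", buf_len_from_avg());

	for (i = 0; i < 1000; i++)
		ewma_update(60000);	/* stream of large (GRO-sized) packets */
	printf("large-packet buffer len: %u\n", buf_len_from_avg());
	return 0;
}

With small packets the chosen size settles at the aligned MTU-sized minimum;
once the average climbs past PAGE_SIZE the full page is used, which is the
behavior the benchmarks below rely on.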

These optimizations interact positively with recent commit
ba275241030c (virtio-net: coalesce rx frags when possible during rx),
which coalesces adjacent RX SKB fragments in virtio_net. The coalescing
optimizations benefit buffers of any size.

Benchmarks taken from an average of 5 netperf 30-second TCP_STREAM runs
between two QEMU VMs on a single physical machine. Each VM has two VCPUs
with all offloads & vhost enabled. All VMs and vhost threads run in a
single 4 CPU cgroup cpuset, using cgroups to ensure that other processes
in the system will not be scheduled on the benchmark CPUs. Trunk includes
SKB rx frag coalescing.

net-next w/ virtio_net before 2613af0ed18a (PAGE_SIZE bufs): 14642.85Gb/s
net-next (MTU-size bufs):  13170.01Gb/s
net-next + auto-tune: 14555.94Gb/s

Jason Wang also reported a throughput increase on mlx4 from 22Gb/s
using MTU-sized buffers to about 26Gb/s using auto-tuning.

Signed-off-by: Michael Dalton mwdal...@google.com
---
v2->v3: Remove per-receive queue metadata ring. Encode packet buffer
base address and truesize into an unsigned long by requiring a
minimum packet size alignment of 256 (see the encoding sketch below). Permit
attempts to fill an already-full RX ring (reverting the change in v2).
v1->v2: Add per-receive queue metadata ring to track precise truesize for
mergeable receive buffers. Remove all truesize approximation. Never
try to fill a full RX ring (required for metadata ring in v2).
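
A small userspace sketch of the encoding mentioned in the v2->v3 note
(illustrative only: it packs a 256-byte-aligned buffer address and
truesize/256 into a single word, mirroring what mergeable_buf_to_ctx()
does in the diff below):

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define BUF_ALIGN 256UL		/* MERGEABLE_BUFFER_ALIGN */

static unsigned long buf_to_ctx(void *buf, unsigned int truesize)
{
	/* low bits carry truesize/256, high bits carry the aligned address */
	return (unsigned long)buf | (truesize / BUF_ALIGN);
}

static void *ctx_to_addr(unsigned long ctx)
{
	return (void *)(ctx & ~(BUF_ALIGN - 1));
}

static unsigned int ctx_to_truesize(unsigned long ctx)
{
	return (ctx & (BUF_ALIGN - 1)) * BUF_ALIGN;
}

int main(void)
{
	/* aligned_alloc() stands in for the 256-byte-aligned page frag */
	void *buf = aligned_alloc(BUF_ALIGN, 1536);
	unsigned long ctx = buf_to_ctx(buf, 1536);

	assert(ctx_to_addr(ctx) == buf);
	assert(ctx_to_truesize(ctx) == 1536);
	printf("ctx=%#lx addr=%p truesize=%u\n", ctx, ctx_to_addr(ctx),
	       ctx_to_truesize(ctx));
	free(buf);
	return 0;
}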
 drivers/net/virtio_net.c | 99 
 1 file changed, 74 insertions(+), 25 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 36cbf06..3e82311 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -26,6 +26,7 @@
 #include <linux/if_vlan.h>
 #include <linux/slab.h>
 #include <linux/cpu.h>
+#include <linux/average.h>
 
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
@@ -36,11 +37,18 @@ module_param(gso, bool, 0444);
 
 /* FIXME: MTU in config. */
 #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
-#define MERGE_BUFFER_LEN (ALIGN(GOOD_PACKET_LEN + \
-sizeof(struct virtio_net_hdr_mrg_rxbuf), \
-L1_CACHE_BYTES))
 #define GOOD_COPY_LEN  128
 
+/* Weight used for the RX packet size EWMA. The average packet size is used to
+ * determine the packet buffer size when refilling RX rings. As the entire RX
+ * ring may be refilled at once, the weight is chosen so that the EWMA will be
+ * insensitive to short-term, transient changes in packet size.
+ */
+#define RECEIVE_AVG_WEIGHT 64
+
+/* Minimum alignment for mergeable packet buffers. */
+#define MERGEABLE_BUFFER_ALIGN max(L1_CACHE_BYTES, 256)
+
 #define VIRTNET_DRIVER_VERSION "1.0.0"
 
 struct virtnet_stats {
@@ -78,6 +86,9 @@ struct receive_queue {
/* Chain pages by the private ptr. */
struct page *pages;
 
+   /* Average packet length for mergeable receive buffers. */
+   struct ewma mrg_avg_pkt_len;
+
/* Page frag for packet buffer allocation. */
struct page_frag alloc_frag;
 
@@ -219,6 +230,23 @@ static void skb_xmit_done(struct virtqueue *vq)
	netif_wake_subqueue(vi->dev, vq2txq(vq));
 }
 
+static unsigned int mergeable_ctx_to_buf_truesize(unsigned long mrg_ctx)
+{
+	unsigned int truesize = mrg_ctx & (MERGEABLE_BUFFER_ALIGN - 1);
+   return truesize * MERGEABLE_BUFFER_ALIGN;
+}
+
+static void *mergeable_ctx_to_buf_address(unsigned long mrg_ctx)
+{
+	return (void *)(mrg_ctx & -MERGEABLE_BUFFER_ALIGN);
+
+}
+
+static unsigned long mergeable_buf_to_ctx(void *buf, unsigned int truesize)
+{
+   return (unsigned long)buf | (truesize / MERGEABLE_BUFFER_ALIGN);
+}
+
 /* Called from bottom half context */
 static struct sk_buff *page_to_skb(struct receive_queue *rq,
   struct page *page, unsigned int offset,
@@ -327,31 +355,33 @@ err:
 
 static struct sk_buff *receive_mergeable(struct net_device *dev

[PATCH net-next v3 4/5] net-sysfs: add support for device-specific rx queue sysfs attributes

2014-01-16 Thread Michael Dalton
Extend existing support for netdevice receive queue sysfs attributes to
permit a device-specific attribute group. Initial use case for this
support will be to allow the virtio-net device to export per-receive
queue mergeable receive buffer size.
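
The rx queue registration path can then create such a group next to the
default per-queue attributes. Below is a simplified, hypothetical
kernel-side sketch (the _sketch name is not in the patch, and the real
rx_queue_add_kobject() also wires up the queue kset and handles more
error paths):

static int rx_queue_add_kobject_sketch(struct net_device *dev, int index)
{
	struct netdev_rx_queue *queue = dev->_rx + index;
	struct kobject *kobj = &queue->kobj;
	int error;

	error = kobject_init_and_add(kobj, &rx_queue_ktype, NULL,
				     "rx-%d", index);
	if (error)
		return error;

	if (dev->sysfs_rx_queue_group) {
		/* device-specific attributes, e.g. virtio-net's group */
		error = sysfs_create_group(kobj, dev->sysfs_rx_queue_group);
		if (error) {
			kobject_put(kobj);
			return error;
		}
	}

	kobject_uevent(kobj, KOBJ_ADD);
	return 0;
}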

Signed-off-by: Michael Dalton mwdal...@google.com
---
 include/linux/netdevice.h | 40 
 net/core/dev.c| 12 ++--
 net/core/net-sysfs.c  | 33 -
 3 files changed, 58 insertions(+), 27 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5c88ab1..71b8bc4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -668,15 +668,28 @@ extern struct rps_sock_flow_table __rcu 
*rps_sock_flow_table;
 bool rps_may_expire_flow(struct net_device *dev, u16 rxq_index, u32 flow_id,
 u16 filter_id);
 #endif
+#endif /* CONFIG_RPS */
 
 /* This structure contains an instance of an RX queue. */
 struct netdev_rx_queue {
+#ifdef CONFIG_RPS
struct rps_map __rcu*rps_map;
struct rps_dev_flow_table __rcu *rps_flow_table;
+#endif
struct kobject  kobj;
struct net_device   *dev;
 } cacheline_aligned_in_smp;
-#endif /* CONFIG_RPS */
+
+/*
+ * RX queue sysfs structures and functions.
+ */
+struct rx_queue_attribute {
+   struct attribute attr;
+   ssize_t (*show)(struct netdev_rx_queue *queue,
+   struct rx_queue_attribute *attr, char *buf);
+   ssize_t (*store)(struct netdev_rx_queue *queue,
+   struct rx_queue_attribute *attr, const char *buf, size_t len);
+};
 
 #ifdef CONFIG_XPS
 /*
@@ -1313,7 +1326,7 @@ struct net_device {
   unicast) */
 
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
struct netdev_rx_queue  *_rx;
 
/* Number of RX queues allocated at register_netdev() time */
@@ -1424,6 +1437,8 @@ struct net_device {
struct device   dev;
/* space for optional device, statistics, and wireless sysfs groups */
const struct attribute_group *sysfs_groups[4];
+   /* space for optional per-rx queue attributes */
+   const struct attribute_group *sysfs_rx_queue_group;
 
/* rtnetlink link ops */
const struct rtnl_link_ops *rtnl_link_ops;
@@ -2374,7 +2389,7 @@ static inline bool netif_is_multiqueue(const struct 
net_device *dev)
 
 int netif_set_real_num_tx_queues(struct net_device *dev, unsigned int txq);
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 int netif_set_real_num_rx_queues(struct net_device *dev, unsigned int rxq);
 #else
 static inline int netif_set_real_num_rx_queues(struct net_device *dev,
@@ -2393,7 +2408,7 @@ static inline int netif_copy_real_num_queues(struct 
net_device *to_dev,
					  from_dev->real_num_tx_queues);
if (err)
return err;
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
return netif_set_real_num_rx_queues(to_dev,
					    from_dev->real_num_rx_queues);
 #else
@@ -2401,6 +2416,23 @@ static inline int netif_copy_real_num_queues(struct 
net_device *to_dev,
 #endif
 }
 
+#ifdef CONFIG_SYSFS
+static inline unsigned int get_netdev_rx_queue_index(
+   struct netdev_rx_queue *queue)
+{
+	struct net_device *dev = queue->dev;
+	int i;
+
+	for (i = 0; i < dev->num_rx_queues; i++)
+		if (queue == dev->_rx[i])
+			break;
+
+	BUG_ON(i >= dev->num_rx_queues);
+
+   return i;
+}
+#endif
+
 #define DEFAULT_MAX_NUM_RSS_QUEUES (8)
 int netif_get_num_default_rss_queues(void);
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 20c834e..4be7931 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2080,7 +2080,7 @@ int netif_set_real_num_tx_queues(struct net_device *dev, 
unsigned int txq)
 }
 EXPORT_SYMBOL(netif_set_real_num_tx_queues);
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 /**
  * netif_set_real_num_rx_queues - set actual number of RX queues used
  * @dev: Network device
@@ -5727,7 +5727,7 @@ void netif_stacked_transfer_operstate(const struct 
net_device *rootdev,
 }
 EXPORT_SYMBOL(netif_stacked_transfer_operstate);
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 static int netif_alloc_rx_queues(struct net_device *dev)
 {
	unsigned int i, count = dev->num_rx_queues;
@@ -6272,7 +6272,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, 
const char *name,
return NULL;
}
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
	if (rxqs < 1) {
		pr_err("alloc_netdev: Unable to allocate device with zero RX queues\n");
return NULL;
@@ -6328,7 +6328,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, 
const char *name,
if (netif_alloc_netdev_queues(dev))
goto free_all;
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
	dev->num_rx_queues = rxqs;
dev

[PATCH net-next v3 5/5] virtio-net: initial rx sysfs support, export mergeable rx buffer size

2014-01-16 Thread Michael Dalton
Add initial support for per-rx queue sysfs attributes to virtio-net. If
mergeable packet buffers are enabled, adds a read-only mergeable packet
buffer size sysfs attribute for each RX queue.
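
Once this is in place, the value is visible under the standard per-queue
sysfs directory. A small userspace sketch of reading it (the device name
"eth0", queue 0, and the /sys mount point are assumptions; the "virtio_net"
directory comes from the attribute group name in the patch):

#include <stdio.h>

int main(void)
{
	const char *path =
		"/sys/class/net/eth0/queues/rx-0/virtio_net/mergeable_rx_buffer_size";
	char line[64];
	FILE *f = fopen(path, "r");

	if (!f) {
		perror("fopen");
		return 1;
	}
	if (fgets(line, sizeof(line), f))
		printf("rx-0 mergeable buffer size: %s", line);
	fclose(f);
	return 0;
}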

Signed-off-by: Michael Dalton mwdal...@google.com
---
 drivers/net/virtio_net.c | 66 +---
 1 file changed, 62 insertions(+), 4 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 3e82311..f315cbb 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -27,6 +27,7 @@
 #include <linux/slab.h>
 #include <linux/cpu.h>
 #include <linux/average.h>
+#include <linux/seqlock.h>
 
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
@@ -89,6 +90,12 @@ struct receive_queue {
/* Average packet length for mergeable receive buffers. */
struct ewma mrg_avg_pkt_len;
 
+   /* Sequence counter to allow sysfs readers to safely access stats.
+* Assumes a single virtio-net writer, which is enforced by virtio-net
+* and NAPI.
+*/
+   seqcount_t sysfs_seq;
+
/* Page frag for packet buffer allocation. */
struct page_frag alloc_frag;
 
@@ -416,7 +423,9 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
}
}
 
+	write_seqcount_begin(&rq->sysfs_seq);
 	ewma_add(&rq->mrg_avg_pkt_len, head_skb->len);
+	write_seqcount_end(&rq->sysfs_seq);
return head_skb;
 
 err_skb:
@@ -604,18 +613,29 @@ static int add_recvbuf_big(struct receive_queue *rq, 
gfp_t gfp)
return err;
 }
 
-static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
+static unsigned int get_mergeable_buf_len(struct ewma *avg_pkt_len)
 {
const size_t hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+   unsigned int len;
+
+   len = hdr_len + clamp_t(unsigned int, ewma_read(avg_pkt_len),
+   GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
+   return ALIGN(len, MERGEABLE_BUFFER_ALIGN);
+}
+
+static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
+{
	struct page_frag *alloc_frag = &rq->alloc_frag;
char *buf;
unsigned long ctx;
int err;
unsigned int len, hole;
 
-	len = hdr_len + clamp_t(unsigned int, ewma_read(&rq->mrg_avg_pkt_len),
-   GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
-   len = ALIGN(len, MERGEABLE_BUFFER_ALIGN);
+   /* avg_pkt_len is written only in NAPI rx softirq context. We may
+* read avg_pkt_len without using the sysfs_seq seqcount, as this code
+* is called only in NAPI rx softirq context or when NAPI is disabled.
+*/
+	len = get_mergeable_buf_len(&rq->mrg_avg_pkt_len);
if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp)))
return -ENOMEM;
 
@@ -1557,6 +1577,7 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
   napi_weight);
 
 		sg_init_table(vi->rq[i].sg, ARRAY_SIZE(vi->rq[i].sg));
+		seqcount_init(&vi->rq[i].sysfs_seq);
 		ewma_init(&vi->rq[i].mrg_avg_pkt_len, 1, RECEIVE_AVG_WEIGHT);
 		sg_init_table(vi->sq[i].sg, ARRAY_SIZE(vi->sq[i].sg));
}
@@ -1594,6 +1615,39 @@ err:
return ret;
 }
 
+#ifdef CONFIG_SYSFS
+static ssize_t mergeable_rx_buffer_size_show(struct netdev_rx_queue *queue,
+   struct rx_queue_attribute *attribute, char *buf)
+{
+	struct virtnet_info *vi = netdev_priv(queue->dev);
+	unsigned int queue_index = get_netdev_rx_queue_index(queue);
+	struct receive_queue *rq;
+	struct ewma avg;
+	unsigned int start;
+
+	BUG_ON(queue_index >= vi->max_queue_pairs);
+	rq = &vi->rq[queue_index];
+	do {
+		start = read_seqcount_begin(&rq->sysfs_seq);
+		avg = rq->mrg_avg_pkt_len;
+	} while (read_seqcount_retry(&rq->sysfs_seq, start));
+	return sprintf(buf, "%u\n", get_mergeable_buf_len(&avg));
+}
+
+static struct rx_queue_attribute mergeable_rx_buffer_size_attribute =
+   __ATTR_RO(mergeable_rx_buffer_size);
+
+static struct attribute *virtio_net_mrg_rx_attrs[] = {
+	&mergeable_rx_buffer_size_attribute.attr,
+   NULL
+};
+
+static const struct attribute_group virtio_net_mrg_rx_group = {
+	.name = "virtio_net",
+   .attrs = virtio_net_mrg_rx_attrs
+};
+#endif
+
 static int virtnet_probe(struct virtio_device *vdev)
 {
int i, err;
@@ -1708,6 +1762,10 @@ static int virtnet_probe(struct virtio_device *vdev)
if (err)
goto free_stats;
 
+#ifdef CONFIG_SYSFS
+	if (vi->mergeable_rx_bufs)
+		dev->sysfs_rx_queue_group = &virtio_net_mrg_rx_group;
+#endif
 	netif_set_real_num_tx_queues(dev, vi->curr_queue_pairs);
 	netif_set_real_num_rx_queues(dev, vi->curr_queue_pairs);
 
-- 
1.8.5.2


Re: [PATCH net-next v3 5/5] virtio-net: initial rx sysfs support, export mergeable rx buffer size

2014-01-16 Thread Michael Dalton
Sorry, just realized - I think disabling NAPI is necessary but not
sufficient. There is also the issue that refill_work() could be
scheduled. If refill_work() executes, it will re-enable NAPI. We'd need
to cancel the vi->refill delayed work to prevent this AFAICT, and also
ensure that no other function re-schedules vi->refill or re-enables NAPI
(virtnet_open/close, virtnet_set_queues, and virtnet_freeze/restore).

How about the following sequence of operations:
rtnl_lock();
cancel_delayed_work_sync(&vi->refill);
napi_disable(&rq->napi);
read rq->mrg_avg_pkt_len
virtnet_enable_napi();
rtnl_unlock();

Additionally, if we disable NAPI when reading this file, perhaps
the permissions should be changed to 400 so that an unprivileged
user cannot temporarily disable network RX processing by reading these
sysfs files. Does that sound reasonable?

Best,

Mike


Re: [PATCH net-next v3 4/5] net-sysfs: add support for device-specific rx queue sysfs attributes

2014-01-16 Thread Michael Dalton
On Jan 16, 2014 at 10:57 AM, Ben Hutchings bhutchi...@solarflare.com wrote:
> Why write a loop when you can do:
>	i = queue - dev->_rx;
Good point, the loop approach was done in get_netdev_queue_index --
I agree your fix is faster and simpler. I'll fix in next patchset.
Thanks!

Best,

Mike


[PATCH net-next v4 2/6] virtio-net: use per-receive queue page frag alloc for mergeable bufs

2014-01-16 Thread Michael Dalton
The virtio-net driver currently uses netdev_alloc_frag() for GFP_ATOMIC
mergeable rx buffer allocations. This commit migrates virtio-net to use
per-receive queue page frags for GFP_ATOMIC allocation. This change unifies
mergeable rx buffer memory allocation, which now will use skb_page_frag_refill()
for both atomic and GFP-WAIT buffer allocations.

To address fragmentation concerns, if after buffer allocation there
is too little space left in the page frag to allocate a subsequent
buffer, the remaining space is added to the current allocated buffer
so that the remaining space can be used to store packet data.
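
A standalone sketch of that anti-fragmentation rule, with offsets into a
fixed-size chunk standing in for the page frag (the sizes are illustrative):

#include <stdio.h>

#define CHUNK_SIZE	4096
#define BUF_LEN		1536

static unsigned int fill;	/* current fill level of the chunk */

/* Returns the usable length of the buffer placed at *buf_off, or 0. */
static unsigned int alloc_buf_sketch(unsigned int *buf_off)
{
	unsigned int len = BUF_LEN;
	unsigned int hole;

	if (fill + len > CHUNK_SIZE)
		return 0;		/* would need a fresh chunk */

	*buf_off = fill;
	fill += len;

	hole = CHUNK_SIZE - fill;
	if (hole < BUF_LEN) {
		/* leftover too small for another buffer: fold it in */
		len += hole;
		fill += hole;
	}
	return len;
}

int main(void)
{
	unsigned int off, len;

	while ((len = alloc_buf_sketch(&off)) != 0)
		printf("buffer at offset %u, len %u\n", off, len);
	return 0;
}

Here the last buffer carved from the chunk absorbs the leftover space, so
nothing in the page frag is stranded and the reported truesize still matches
what the buffer actually consumes.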

Signed-off-by: Michael Dalton mwdal...@google.com
---
v1->v2: Use GFP_COLD for RX buffer allocations (as in netdev_alloc_frag()).
Remove per-netdev GFP_KERNEL page_frag allocator.

 drivers/net/virtio_net.c | 69 
 1 file changed, 35 insertions(+), 34 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 7b17240..36cbf06 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -78,6 +78,9 @@ struct receive_queue {
/* Chain pages by the private ptr. */
struct page *pages;
 
+   /* Page frag for packet buffer allocation. */
+   struct page_frag alloc_frag;
+
/* RX: fragments + linear part + virtio header */
struct scatterlist sg[MAX_SKB_FRAGS + 2];
 
@@ -126,11 +129,6 @@ struct virtnet_info {
/* Lock for config space updates */
struct mutex config_lock;
 
-   /* Page_frag for GFP_KERNEL packet buffer allocation when we run
-* low on memory.
-*/
-   struct page_frag alloc_frag;
-
/* Does the affinity hint is set for virtqueues? */
bool affinity_hint_set;
 
@@ -336,8 +334,8 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
	int num_buf = hdr->mhdr.num_buffers;
struct page *page = virt_to_head_page(buf);
int offset = buf - page_address(page);
-   struct sk_buff *head_skb = page_to_skb(rq, page, offset, len,
-  MERGE_BUFFER_LEN);
+   unsigned int truesize = max_t(unsigned int, len, MERGE_BUFFER_LEN);
+   struct sk_buff *head_skb = page_to_skb(rq, page, offset, len, truesize);
struct sk_buff *curr_skb = head_skb;
 
if (unlikely(!curr_skb))
@@ -353,11 +351,6 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
 			dev->stats.rx_length_errors++;
 			goto err_buf;
 		}
-		if (unlikely(len > MERGE_BUFFER_LEN)) {
-			pr_debug("%s: rx error: merge buffer too long\n",
-				 dev->name);
-			len = MERGE_BUFFER_LEN;
-		}
 
 		page = virt_to_head_page(buf);
 		--rq->num;
@@ -376,19 +369,20 @@ static struct sk_buff *receive_mergeable(struct 
net_device *dev,
			head_skb->truesize += nskb->truesize;
num_skb_frags = 0;
}
+   truesize = max_t(unsigned int, len, MERGE_BUFFER_LEN);
if (curr_skb != head_skb) {
 			head_skb->data_len += len;
 			head_skb->len += len;
-			head_skb->truesize += MERGE_BUFFER_LEN;
+			head_skb->truesize += truesize;
}
offset = buf - page_address(page);
if (skb_can_coalesce(curr_skb, num_skb_frags, page, offset)) {
put_page(page);
skb_coalesce_rx_frag(curr_skb, num_skb_frags - 1,
-len, MERGE_BUFFER_LEN);
+len, truesize);
} else {
skb_add_rx_frag(curr_skb, num_skb_frags, page,
-   offset, len, MERGE_BUFFER_LEN);
+   offset, len, truesize);
}
}
 
@@ -578,25 +572,24 @@ static int add_recvbuf_big(struct receive_queue *rq, 
gfp_t gfp)
 
 static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
 {
-	struct virtnet_info *vi = rq->vq->vdev->priv;
-	char *buf = NULL;
+	struct page_frag *alloc_frag = &rq->alloc_frag;
+	char *buf;
 	int err;
+	unsigned int len, hole;
 
-	if (gfp & __GFP_WAIT) {
-		if (skb_page_frag_refill(MERGE_BUFFER_LEN, &vi->alloc_frag,
-					 gfp)) {
-			buf = (char *)page_address(vi->alloc_frag.page) +
-			      vi->alloc_frag.offset;
-			get_page(vi->alloc_frag.page);
-			vi->alloc_frag.offset += MERGE_BUFFER_LEN;
-		}
-	} else {
-		buf = netdev_alloc_frag(MERGE_BUFFER_LEN);
-	}
-   if (!buf)
+   if (unlikely

[PATCH net-next v4 3/6] virtio-net: auto-tune mergeable rx buffer size for improved performance

2014-01-16 Thread Michael Dalton
Commit 2613af0ed18a (virtio_net: migrate mergeable rx buffers to page frag
allocators) changed the mergeable receive buffer size from PAGE_SIZE to
MTU-size, introducing a single-stream regression for benchmarks with large
average packet size. There is no single optimal buffer size for all
workloads.  For workloads with packet size <= MTU bytes, MTU + virtio-net
header-sized buffers are preferred as larger buffers reduce the TCP window
due to SKB truesize. However, single-stream workloads with large average
packet sizes have higher throughput if larger (e.g., PAGE_SIZE) buffers
are used.

This commit auto-tunes the mergeable receiver buffer packet size by
choosing the packet buffer size based on an EWMA of the recent packet
sizes for the receive queue. Packet buffer sizes range from MTU_SIZE +
virtio-net header len to PAGE_SIZE. This improves throughput for
large packet workloads, as any workload with average packet size >=
PAGE_SIZE will use PAGE_SIZE buffers.

These optimizations interact positively with recent commit
ba275241030c (virtio-net: coalesce rx frags when possible during rx),
which coalesces adjacent RX SKB fragments in virtio_net. The coalescing
optimizations benefit buffers of any size.

Benchmarks taken from an average of 5 netperf 30-second TCP_STREAM runs
between two QEMU VMs on a single physical machine. Each VM has two VCPUs
with all offloads & vhost enabled. All VMs and vhost threads run in a
single 4 CPU cgroup cpuset, using cgroups to ensure that other processes
in the system will not be scheduled on the benchmark CPUs. Trunk includes
SKB rx frag coalescing.

net-next w/ virtio_net before 2613af0ed18a (PAGE_SIZE bufs): 14642.85Gb/s
net-next (MTU-size bufs):  13170.01Gb/s
net-next + auto-tune: 14555.94Gb/s

Jason Wang also reported a throughput increase on mlx4 from 22Gb/s
using MTU-sized buffers to about 26Gb/s using auto-tuning.

Signed-off-by: Michael Dalton mwdal...@google.com
---
v2->v3: Remove per-receive queue metadata ring. Encode packet buffer
base address and truesize into an unsigned long by requiring a
minimum packet size alignment of 256. Permit attempts to fill
an already-full RX ring (reverting the change in v2).
v1->v2: Add per-receive queue metadata ring to track precise truesize for
mergeable receive buffers. Remove all truesize approximation. Never
try to fill a full RX ring (required for metadata ring in v2).

 drivers/net/virtio_net.c | 99 
 1 file changed, 74 insertions(+), 25 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 36cbf06..3e82311 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -26,6 +26,7 @@
 #include <linux/if_vlan.h>
 #include <linux/slab.h>
 #include <linux/cpu.h>
+#include <linux/average.h>
 
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
@@ -36,11 +37,18 @@ module_param(gso, bool, 0444);
 
 /* FIXME: MTU in config. */
 #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
-#define MERGE_BUFFER_LEN (ALIGN(GOOD_PACKET_LEN + \
-sizeof(struct virtio_net_hdr_mrg_rxbuf), \
-L1_CACHE_BYTES))
 #define GOOD_COPY_LEN  128
 
+/* Weight used for the RX packet size EWMA. The average packet size is used to
+ * determine the packet buffer size when refilling RX rings. As the entire RX
+ * ring may be refilled at once, the weight is chosen so that the EWMA will be
+ * insensitive to short-term, transient changes in packet size.
+ */
+#define RECEIVE_AVG_WEIGHT 64
+
+/* Minimum alignment for mergeable packet buffers. */
+#define MERGEABLE_BUFFER_ALIGN max(L1_CACHE_BYTES, 256)
+
 #define VIRTNET_DRIVER_VERSION "1.0.0"
 
 struct virtnet_stats {
@@ -78,6 +86,9 @@ struct receive_queue {
/* Chain pages by the private ptr. */
struct page *pages;
 
+   /* Average packet length for mergeable receive buffers. */
+   struct ewma mrg_avg_pkt_len;
+
/* Page frag for packet buffer allocation. */
struct page_frag alloc_frag;
 
@@ -219,6 +230,23 @@ static void skb_xmit_done(struct virtqueue *vq)
	netif_wake_subqueue(vi->dev, vq2txq(vq));
 }
 
+static unsigned int mergeable_ctx_to_buf_truesize(unsigned long mrg_ctx)
+{
+	unsigned int truesize = mrg_ctx & (MERGEABLE_BUFFER_ALIGN - 1);
+   return truesize * MERGEABLE_BUFFER_ALIGN;
+}
+
+static void *mergeable_ctx_to_buf_address(unsigned long mrg_ctx)
+{
+	return (void *)(mrg_ctx & -MERGEABLE_BUFFER_ALIGN);
+
+}
+
+static unsigned long mergeable_buf_to_ctx(void *buf, unsigned int truesize)
+{
+   return (unsigned long)buf | (truesize / MERGEABLE_BUFFER_ALIGN);
+}
+
 /* Called from bottom half context */
 static struct sk_buff *page_to_skb(struct receive_queue *rq,
   struct page *page, unsigned int offset,
@@ -327,31 +355,33 @@ err:
 
 static struct sk_buff *receive_mergeable(struct net_device *dev

[PATCH net-next v4 1/6] net: allow > 0 order atomic page alloc in skb_page_frag_refill

2014-01-16 Thread Michael Dalton
skb_page_frag_refill currently permits only order-0 page allocs
unless GFP_WAIT is used. Change skb_page_frag_refill to attempt
higher-order page allocations whether or not GFP_WAIT is used. If
memory cannot be allocated, the allocator will fall back to
successively smaller page allocs (down to order-0 page allocs).

This change brings skb_page_frag_refill in line with the existing
page allocation strategy employed by netdev_alloc_frag, which attempts
higher-order page allocations whether or not GFP_WAIT is set, falling
back to successively lower-order page allocations on failure. Part
of migration of virtio-net to per-receive queue page frag allocators.

Acked-by: Michael S. Tsirkin m...@redhat.com
Acked-by: Eric Dumazet eduma...@google.com
Signed-off-by: Michael Dalton mwdal...@google.com
---
 net/core/sock.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 85ad6f0..b3f7ee3 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1836,9 +1836,7 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio)
 		put_page(pfrag->page);
 	}
 
-	/* We restrict high order allocations to users that can afford to wait */
-	order = (prio & __GFP_WAIT) ? SKB_FRAG_PAGE_ORDER : 0;
-
+	order = SKB_FRAG_PAGE_ORDER;
do {
gfp_t gfp = prio;
 
-- 
1.8.5.2



[PATCH net-next v4 5/6] lib: Ensure EWMA does not store wrong intermediate values

2014-01-16 Thread Michael Dalton
To ensure ewma_read() without a lock returns a valid but possibly
out of date average, modify ewma_add() by using ACCESS_ONCE to prevent
intermediate wrong values from being written to avg->internal.
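
A userspace illustration of the pattern (ACCESS_ONCE is modeled with a
volatile cast; the structure and helper are simplified stand-ins for
lib/average.c, not the kernel code):

#include <stdio.h>

#define ACCESS_ONCE_SKETCH(x) (*(volatile __typeof__(x) *)&(x))

struct ewma_sketch {
	unsigned long internal;
	unsigned long factor;	/* log2 of the scaling factor */
	unsigned long weight;	/* log2 of the averaging weight */
};

static void ewma_add_sketch(struct ewma_sketch *avg, unsigned long val)
{
	/* one load of the shared word ... */
	unsigned long internal = ACCESS_ONCE_SKETCH(avg->internal);

	/* ... and one store: a lockless reader sees either the old or the
	 * new average, never a half-computed intermediate value. */
	ACCESS_ONCE_SKETCH(avg->internal) = internal ?
		(((internal << avg->weight) - internal) +
			(val << avg->factor)) >> avg->weight :
		(val << avg->factor);
}

int main(void)
{
	struct ewma_sketch avg = { .internal = 0, .factor = 0, .weight = 6 };
	unsigned long i;

	for (i = 0; i < 100; i++)
		ewma_add_sketch(&avg, 1500);
	printf("avg = %lu\n", avg.internal >> avg.factor);
	return 0;
}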

Suggested-by: Eric Dumazet eric.duma...@gmail.com
Signed-off-by: Michael Dalton mwdal...@google.com
---
 lib/average.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/lib/average.c b/lib/average.c
index 99a67e6..114d1be 100644
--- a/lib/average.c
+++ b/lib/average.c
@@ -53,8 +53,10 @@ EXPORT_SYMBOL(ewma_init);
  */
 struct ewma *ewma_add(struct ewma *avg, unsigned long val)
 {
-	avg->internal = avg->internal  ?
-		(((avg->internal << avg->weight) - avg->internal) +
+	unsigned long internal = ACCESS_ONCE(avg->internal);
+
+	ACCESS_ONCE(avg->internal) = internal ?
+		(((internal << avg->weight) - internal) +
 			(val << avg->factor)) >> avg->weight :
 		(val << avg->factor);
return avg;
-- 
1.8.5.2



[PATCH net-next v4 4/6] net-sysfs: add support for device-specific rx queue sysfs attributes

2014-01-16 Thread Michael Dalton
Extend existing support for netdevice receive queue sysfs attributes to
permit a device-specific attribute group. Initial use case for this
support will be to allow the virtio-net device to export per-receive
queue mergeable receive buffer size.

Signed-off-by: Michael Dalton mwdal...@google.com
---
v3->v4: Simplify by removing loop in get_netdev_rx_queue_index.

 include/linux/netdevice.h | 35 +++
 net/core/dev.c| 12 ++--
 net/core/net-sysfs.c  | 33 -
 3 files changed, 53 insertions(+), 27 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5c88ab1..38929bc 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -668,15 +668,28 @@ extern struct rps_sock_flow_table __rcu 
*rps_sock_flow_table;
 bool rps_may_expire_flow(struct net_device *dev, u16 rxq_index, u32 flow_id,
 u16 filter_id);
 #endif
+#endif /* CONFIG_RPS */
 
 /* This structure contains an instance of an RX queue. */
 struct netdev_rx_queue {
+#ifdef CONFIG_RPS
struct rps_map __rcu*rps_map;
struct rps_dev_flow_table __rcu *rps_flow_table;
+#endif
struct kobject  kobj;
struct net_device   *dev;
 } cacheline_aligned_in_smp;
-#endif /* CONFIG_RPS */
+
+/*
+ * RX queue sysfs structures and functions.
+ */
+struct rx_queue_attribute {
+   struct attribute attr;
+   ssize_t (*show)(struct netdev_rx_queue *queue,
+   struct rx_queue_attribute *attr, char *buf);
+   ssize_t (*store)(struct netdev_rx_queue *queue,
+   struct rx_queue_attribute *attr, const char *buf, size_t len);
+};
 
 #ifdef CONFIG_XPS
 /*
@@ -1313,7 +1326,7 @@ struct net_device {
   unicast) */
 
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
struct netdev_rx_queue  *_rx;
 
/* Number of RX queues allocated at register_netdev() time */
@@ -1424,6 +1437,8 @@ struct net_device {
struct device   dev;
/* space for optional device, statistics, and wireless sysfs groups */
const struct attribute_group *sysfs_groups[4];
+   /* space for optional per-rx queue attributes */
+   const struct attribute_group *sysfs_rx_queue_group;
 
/* rtnetlink link ops */
const struct rtnl_link_ops *rtnl_link_ops;
@@ -2374,7 +2389,7 @@ static inline bool netif_is_multiqueue(const struct 
net_device *dev)
 
 int netif_set_real_num_tx_queues(struct net_device *dev, unsigned int txq);
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 int netif_set_real_num_rx_queues(struct net_device *dev, unsigned int rxq);
 #else
 static inline int netif_set_real_num_rx_queues(struct net_device *dev,
@@ -2393,7 +2408,7 @@ static inline int netif_copy_real_num_queues(struct 
net_device *to_dev,
					  from_dev->real_num_tx_queues);
if (err)
return err;
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
return netif_set_real_num_rx_queues(to_dev,
					    from_dev->real_num_rx_queues);
 #else
@@ -2401,6 +2416,18 @@ static inline int netif_copy_real_num_queues(struct 
net_device *to_dev,
 #endif
 }
 
+#ifdef CONFIG_SYSFS
+static inline unsigned int get_netdev_rx_queue_index(
+   struct netdev_rx_queue *queue)
+{
+	struct net_device *dev = queue->dev;
+	int index = queue - dev->_rx;
+
+	BUG_ON(index >= dev->num_rx_queues);
+   return index;
+}
+#endif
+
 #define DEFAULT_MAX_NUM_RSS_QUEUES (8)
 int netif_get_num_default_rss_queues(void);
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 20c834e..4be7931 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2080,7 +2080,7 @@ int netif_set_real_num_tx_queues(struct net_device *dev, 
unsigned int txq)
 }
 EXPORT_SYMBOL(netif_set_real_num_tx_queues);
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 /**
  * netif_set_real_num_rx_queues - set actual number of RX queues used
  * @dev: Network device
@@ -5727,7 +5727,7 @@ void netif_stacked_transfer_operstate(const struct 
net_device *rootdev,
 }
 EXPORT_SYMBOL(netif_stacked_transfer_operstate);
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 static int netif_alloc_rx_queues(struct net_device *dev)
 {
	unsigned int i, count = dev->num_rx_queues;
@@ -6272,7 +6272,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, 
const char *name,
return NULL;
}
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
	if (rxqs < 1) {
		pr_err("alloc_netdev: Unable to allocate device with zero RX queues\n");
return NULL;
@@ -6328,7 +6328,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, 
const char *name,
if (netif_alloc_netdev_queues(dev))
goto free_all;
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
	dev->num_rx_queues = rxqs;
	dev->real_num_rx_queues = rxqs

[PATCH net-next v4 6/6] virtio-net: initial rx sysfs support, export mergeable rx buffer size

2014-01-16 Thread Michael Dalton
Add initial support for per-rx queue sysfs attributes to virtio-net. If
mergeable packet buffers are enabled, adds a read-only mergeable packet
buffer size sysfs attribute for each RX queue.

Suggested-by: Michael S. Tsirkin m...@redhat.com
Signed-off-by: Michael Dalton mwdal...@google.com
---
v3->v4: Remove seqcount due to EWMA changes in patch 5.
Add missing Suggested-By.

 drivers/net/virtio_net.c | 46 ++
 1 file changed, 42 insertions(+), 4 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 3e82311..968eacd 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -604,18 +604,25 @@ static int add_recvbuf_big(struct receive_queue *rq, 
gfp_t gfp)
return err;
 }
 
-static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
+static unsigned int get_mergeable_buf_len(struct ewma *avg_pkt_len)
 {
const size_t hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+   unsigned int len;
+
+   len = hdr_len + clamp_t(unsigned int, ewma_read(avg_pkt_len),
+   GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
+   return ALIGN(len, MERGEABLE_BUFFER_ALIGN);
+}
+
+static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
+{
	struct page_frag *alloc_frag = &rq->alloc_frag;
char *buf;
unsigned long ctx;
int err;
unsigned int len, hole;
 
-	len = hdr_len + clamp_t(unsigned int, ewma_read(&rq->mrg_avg_pkt_len),
-   GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
-   len = ALIGN(len, MERGEABLE_BUFFER_ALIGN);
+	len = get_mergeable_buf_len(&rq->mrg_avg_pkt_len);
if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp)))
return -ENOMEM;
 
@@ -1594,6 +1601,33 @@ err:
return ret;
 }
 
+#ifdef CONFIG_SYSFS
+static ssize_t mergeable_rx_buffer_size_show(struct netdev_rx_queue *queue,
+   struct rx_queue_attribute *attribute, char *buf)
+{
+	struct virtnet_info *vi = netdev_priv(queue->dev);
+	unsigned int queue_index = get_netdev_rx_queue_index(queue);
+	struct ewma *avg;
+
+	BUG_ON(queue_index >= vi->max_queue_pairs);
+	avg = &vi->rq[queue_index].mrg_avg_pkt_len;
+	return sprintf(buf, "%u\n", get_mergeable_buf_len(avg));
+}
+
+static struct rx_queue_attribute mergeable_rx_buffer_size_attribute =
+   __ATTR_RO(mergeable_rx_buffer_size);
+
+static struct attribute *virtio_net_mrg_rx_attrs[] = {
+	&mergeable_rx_buffer_size_attribute.attr,
+   NULL
+};
+
+static const struct attribute_group virtio_net_mrg_rx_group = {
+	.name = "virtio_net",
+   .attrs = virtio_net_mrg_rx_attrs
+};
+#endif
+
 static int virtnet_probe(struct virtio_device *vdev)
 {
int i, err;
@@ -1708,6 +1742,10 @@ static int virtnet_probe(struct virtio_device *vdev)
if (err)
goto free_stats;
 
+#ifdef CONFIG_SYSFS
+	if (vi->mergeable_rx_bufs)
+		dev->sysfs_rx_queue_group = &virtio_net_mrg_rx_group;
+#endif
 	netif_set_real_num_tx_queues(dev, vi->curr_queue_pairs);
 	netif_set_real_num_rx_queues(dev, vi->curr_queue_pairs);
 
-- 
1.8.5.2



Re: [PATCH net-next v4 1/6] net: allow > 0 order atomic page alloc in skb_page_frag_refill

2014-01-16 Thread Michael Dalton
On Thu, Jan 16, 2014 at 3:30 PM, David Miller da...@davemloft.net wrote:
> Actually, I reverted, please resubmit this series with the following
> build warning corrected:

Thanks David, I will send out another patchset shortly with the warning
resolved and a header e-mail (and one other sysfs group fix that I just
found in the same file). Sorry I didn't include a header e-mail
initially.

Best,

Mike


[PATCH net-next v5 2/6] virtio-net: use per-receive queue page frag alloc for mergeable bufs

2014-01-16 Thread Michael Dalton
The virtio-net driver currently uses netdev_alloc_frag() for GFP_ATOMIC
mergeable rx buffer allocations. This commit migrates virtio-net to use
per-receive queue page frags for GFP_ATOMIC allocation. This change unifies
mergeable rx buffer memory allocation, which now will use skb_page_frag_refill()
for both atomic and GFP-WAIT buffer allocations.

To address fragmentation concerns, if after buffer allocation there
is too little space left in the page frag to allocate a subsequent
buffer, the remaining space is added to the current allocated buffer
so that the remaining space can be used to store packet data.

Acked-by: Michael S. Tsirkin m...@redhat.com
Signed-off-by: Michael Dalton mwdal...@google.com
---
v1->v2: Use GFP_COLD for RX buffer allocations (as in netdev_alloc_frag()).
Remove per-netdev GFP_KERNEL page_frag allocator.

 drivers/net/virtio_net.c | 69 
 1 file changed, 35 insertions(+), 34 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 7b17240..36cbf06 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -78,6 +78,9 @@ struct receive_queue {
/* Chain pages by the private ptr. */
struct page *pages;
 
+   /* Page frag for packet buffer allocation. */
+   struct page_frag alloc_frag;
+
/* RX: fragments + linear part + virtio header */
struct scatterlist sg[MAX_SKB_FRAGS + 2];
 
@@ -126,11 +129,6 @@ struct virtnet_info {
/* Lock for config space updates */
struct mutex config_lock;
 
-   /* Page_frag for GFP_KERNEL packet buffer allocation when we run
-* low on memory.
-*/
-   struct page_frag alloc_frag;
-
/* Does the affinity hint is set for virtqueues? */
bool affinity_hint_set;
 
@@ -336,8 +334,8 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
	int num_buf = hdr->mhdr.num_buffers;
struct page *page = virt_to_head_page(buf);
int offset = buf - page_address(page);
-   struct sk_buff *head_skb = page_to_skb(rq, page, offset, len,
-  MERGE_BUFFER_LEN);
+   unsigned int truesize = max_t(unsigned int, len, MERGE_BUFFER_LEN);
+   struct sk_buff *head_skb = page_to_skb(rq, page, offset, len, truesize);
struct sk_buff *curr_skb = head_skb;
 
if (unlikely(!curr_skb))
@@ -353,11 +351,6 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
 			dev->stats.rx_length_errors++;
 			goto err_buf;
 		}
-		if (unlikely(len > MERGE_BUFFER_LEN)) {
-			pr_debug("%s: rx error: merge buffer too long\n",
-				 dev->name);
-			len = MERGE_BUFFER_LEN;
-		}
 
 		page = virt_to_head_page(buf);
 		--rq->num;
@@ -376,19 +369,20 @@ static struct sk_buff *receive_mergeable(struct 
net_device *dev,
			head_skb->truesize += nskb->truesize;
num_skb_frags = 0;
}
+   truesize = max_t(unsigned int, len, MERGE_BUFFER_LEN);
if (curr_skb != head_skb) {
 			head_skb->data_len += len;
 			head_skb->len += len;
-			head_skb->truesize += MERGE_BUFFER_LEN;
+			head_skb->truesize += truesize;
}
offset = buf - page_address(page);
if (skb_can_coalesce(curr_skb, num_skb_frags, page, offset)) {
put_page(page);
skb_coalesce_rx_frag(curr_skb, num_skb_frags - 1,
-len, MERGE_BUFFER_LEN);
+len, truesize);
} else {
skb_add_rx_frag(curr_skb, num_skb_frags, page,
-   offset, len, MERGE_BUFFER_LEN);
+   offset, len, truesize);
}
}
 
@@ -578,25 +572,24 @@ static int add_recvbuf_big(struct receive_queue *rq, 
gfp_t gfp)
 
 static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
 {
-	struct virtnet_info *vi = rq->vq->vdev->priv;
-	char *buf = NULL;
+	struct page_frag *alloc_frag = &rq->alloc_frag;
+	char *buf;
 	int err;
+	unsigned int len, hole;
 
-	if (gfp & __GFP_WAIT) {
-		if (skb_page_frag_refill(MERGE_BUFFER_LEN, &vi->alloc_frag,
-					 gfp)) {
-			buf = (char *)page_address(vi->alloc_frag.page) +
-			      vi->alloc_frag.offset;
-			get_page(vi->alloc_frag.page);
-			vi->alloc_frag.offset += MERGE_BUFFER_LEN;
-   }
-   } else {
-   buf = netdev_alloc_frag(MERGE_BUFFER_LEN

[PATCH net-next v5 3/6] virtio-net: auto-tune mergeable rx buffer size for improved performance

2014-01-16 Thread Michael Dalton
Commit 2613af0ed18a (virtio_net: migrate mergeable rx buffers to page frag
allocators) changed the mergeable receive buffer size from PAGE_SIZE to
MTU-size, introducing a single-stream regression for benchmarks with large
average packet size. There is no single optimal buffer size for all
workloads.  For workloads with packet size <= MTU bytes, MTU + virtio-net
header-sized buffers are preferred as larger buffers reduce the TCP window
due to SKB truesize. However, single-stream workloads with large average
packet sizes have higher throughput if larger (e.g., PAGE_SIZE) buffers
are used.

This commit auto-tunes the mergeable receiver buffer packet size by
choosing the packet buffer size based on an EWMA of the recent packet
sizes for the receive queue. Packet buffer sizes range from MTU_SIZE +
virtio-net header len to PAGE_SIZE. This improves throughput for
large packet workloads, as any workload with average packet size >=
PAGE_SIZE will use PAGE_SIZE buffers.

These optimizations interact positively with recent commit
ba275241030c (virtio-net: coalesce rx frags when possible during rx),
which coalesces adjacent RX SKB fragments in virtio_net. The coalescing
optimizations benefit buffers of any size.

Benchmarks taken from an average of 5 netperf 30-second TCP_STREAM runs
between two QEMU VMs on a single physical machine. Each VM has two VCPUs
with all offloads & vhost enabled. All VMs and vhost threads run in a
single 4 CPU cgroup cpuset, using cgroups to ensure that other processes
in the system will not be scheduled on the benchmark CPUs. Trunk includes
SKB rx frag coalescing.

net-next w/ virtio_net before 2613af0ed18a (PAGE_SIZE bufs): 14642.85Gb/s
net-next (MTU-size bufs):  13170.01Gb/s
net-next + auto-tune: 14555.94Gb/s

Jason Wang also reported a throughput increase on mlx4 from 22Gb/s
using MTU-sized buffers to about 26Gb/s using auto-tuning.

Acked-by: Michael S. Tsirkin m...@redhat.com
Signed-off-by: Michael Dalton mwdal...@google.com
---
v2->v3: Remove per-receive queue metadata ring. Encode packet buffer
base address and truesize into an unsigned long by requiring a
minimum packet size alignment of 256. Permit attempts to fill
an already-full RX ring (reverting the change in v2).
v1->v2: Add per-receive queue metadata ring to track precise truesize for
mergeable receive buffers. Remove all truesize approximation. Never
try to fill a full RX ring (required for metadata ring in v2).

 drivers/net/virtio_net.c | 99 
 1 file changed, 74 insertions(+), 25 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 36cbf06..3e82311 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -26,6 +26,7 @@
 #include <linux/if_vlan.h>
 #include <linux/slab.h>
 #include <linux/cpu.h>
+#include <linux/average.h>
 
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
@@ -36,11 +37,18 @@ module_param(gso, bool, 0444);
 
 /* FIXME: MTU in config. */
 #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
-#define MERGE_BUFFER_LEN (ALIGN(GOOD_PACKET_LEN + \
-sizeof(struct virtio_net_hdr_mrg_rxbuf), \
-L1_CACHE_BYTES))
 #define GOOD_COPY_LEN  128
 
+/* Weight used for the RX packet size EWMA. The average packet size is used to
+ * determine the packet buffer size when refilling RX rings. As the entire RX
+ * ring may be refilled at once, the weight is chosen so that the EWMA will be
+ * insensitive to short-term, transient changes in packet size.
+ */
+#define RECEIVE_AVG_WEIGHT 64
+
+/* Minimum alignment for mergeable packet buffers. */
+#define MERGEABLE_BUFFER_ALIGN max(L1_CACHE_BYTES, 256)
+
 #define VIRTNET_DRIVER_VERSION "1.0.0"
 
 struct virtnet_stats {
@@ -78,6 +86,9 @@ struct receive_queue {
/* Chain pages by the private ptr. */
struct page *pages;
 
+   /* Average packet length for mergeable receive buffers. */
+   struct ewma mrg_avg_pkt_len;
+
/* Page frag for packet buffer allocation. */
struct page_frag alloc_frag;
 
@@ -219,6 +230,23 @@ static void skb_xmit_done(struct virtqueue *vq)
	netif_wake_subqueue(vi->dev, vq2txq(vq));
 }
 
+static unsigned int mergeable_ctx_to_buf_truesize(unsigned long mrg_ctx)
+{
+	unsigned int truesize = mrg_ctx & (MERGEABLE_BUFFER_ALIGN - 1);
+   return truesize * MERGEABLE_BUFFER_ALIGN;
+}
+
+static void *mergeable_ctx_to_buf_address(unsigned long mrg_ctx)
+{
+	return (void *)(mrg_ctx & -MERGEABLE_BUFFER_ALIGN);
+
+}
+
+static unsigned long mergeable_buf_to_ctx(void *buf, unsigned int truesize)
+{
+   return (unsigned long)buf | (truesize / MERGEABLE_BUFFER_ALIGN);
+}
+
 /* Called from bottom half context */
 static struct sk_buff *page_to_skb(struct receive_queue *rq,
   struct page *page, unsigned int offset,
@@ -327,31 +355,33 @@ err:
 
 static struct sk_buff

[PATCH net-next v5 1/6] net: allow > 0 order atomic page alloc in skb_page_frag_refill

2014-01-16 Thread Michael Dalton
skb_page_frag_refill currently permits only order-0 page allocs
unless GFP_WAIT is used. Change skb_page_frag_refill to attempt
higher-order page allocations whether or not GFP_WAIT is used. If
memory cannot be allocated, the allocator will fall back to
successively smaller page allocs (down to order-0 page allocs).

This change brings skb_page_frag_refill in line with the existing
page allocation strategy employed by netdev_alloc_frag, which attempts
higher-order page allocations whether or not GFP_WAIT is set, falling
back to successively lower-order page allocations on failure. Part
of migration of virtio-net to per-receive queue page frag allocators.

Acked-by: Michael S. Tsirkin m...@redhat.com
Acked-by: Eric Dumazet eduma...@google.com
Signed-off-by: Michael Dalton mwdal...@google.com
---
 net/core/sock.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 85ad6f0..b3f7ee3 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1836,9 +1836,7 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio)
 		put_page(pfrag->page);
 	}
 
-	/* We restrict high order allocations to users that can afford to wait */
-	order = (prio & __GFP_WAIT) ? SKB_FRAG_PAGE_ORDER : 0;
-
+	order = SKB_FRAG_PAGE_ORDER;
do {
gfp_t gfp = prio;
 
-- 
1.8.5.2



[PATCH net-next v5 0/6] virtio-net: mergeable rx buffer size auto-tuning

2014-01-16 Thread Michael Dalton
The virtio-net device currently uses aligned MTU-sized mergeable receive
packet buffers. Network throughput for workloads with large average
packet size can be improved by posting larger receive packet buffers.
However, due to SKB truesize effects, posting large (e.g, PAGE_SIZE)
buffers reduces the throughput of workloads that do not benefit from GRO
and have no large inbound packets.

This patchset introduces virtio-net mergeable buffer size auto-tuning,
with buffer sizes ranging from aligned MTU-size to PAGE_SIZE. Packet
buffer size is chosen based on a per-receive queue EWMA of incoming
packet size.

To unify mergeable receive buffer memory allocation and improve
SKB frag coalescing, all mergeable buffer memory allocation is
migrated to per-receive queue page frag allocators.

The per-receive queue mergeable packet buffer size is exported via
sysfs, and the network device sysfs layer has been extended to add
support for device-specific per-receive queue sysfs attribute groups.

Michael Dalton (6):
  net: allow > 0 order atomic page alloc in skb_page_frag_refill
  virtio-net: use per-receive queue page frag alloc for mergeable bufs
  virtio-net: auto-tune mergeable rx buffer size for improved
performance
  net-sysfs: add support for device-specific rx queue sysfs attributes
  lib: Ensure EWMA does not store wrong intermediate values
  virtio-net: initial rx sysfs support, export mergeable rx buffer size

 drivers/net/virtio_net.c  | 196 +-
 include/linux/netdevice.h |  35 -
 lib/average.c |   6 +-
 net/core/dev.c|  12 +--
 net/core/net-sysfs.c  |  50 +++-
 net/core/sock.c   |   4 +-
 6 files changed, 213 insertions(+), 90 deletions(-)

-- 
1.8.5.2



[PATCH net-next v5 4/6] net-sysfs: add support for device-specific rx queue sysfs attributes

2014-01-16 Thread Michael Dalton
Extend existing support for netdevice receive queue sysfs attributes to
permit a device-specific attribute group. Initial use case for this
support will be to allow the virtio-net device to export per-receive
queue mergeable receive buffer size.

Signed-off-by: Michael Dalton mwdal...@google.com
---
v4->v5: Handle sysfs_create_group failure. Call sysfs_remove_group when
removing a RX queue kobj if a device-specific group exists.
v3->v4: Simplify by removing loop in get_netdev_rx_queue_index.

 include/linux/netdevice.h | 35 +
 net/core/dev.c| 12 ++--
 net/core/net-sysfs.c  | 50 +++
 3 files changed, 66 insertions(+), 31 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5c88ab1..38929bc 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -668,15 +668,28 @@ extern struct rps_sock_flow_table __rcu 
*rps_sock_flow_table;
 bool rps_may_expire_flow(struct net_device *dev, u16 rxq_index, u32 flow_id,
 u16 filter_id);
 #endif
+#endif /* CONFIG_RPS */
 
 /* This structure contains an instance of an RX queue. */
 struct netdev_rx_queue {
+#ifdef CONFIG_RPS
struct rps_map __rcu*rps_map;
struct rps_dev_flow_table __rcu *rps_flow_table;
+#endif
struct kobject  kobj;
struct net_device   *dev;
 } cacheline_aligned_in_smp;
-#endif /* CONFIG_RPS */
+
+/*
+ * RX queue sysfs structures and functions.
+ */
+struct rx_queue_attribute {
+   struct attribute attr;
+   ssize_t (*show)(struct netdev_rx_queue *queue,
+   struct rx_queue_attribute *attr, char *buf);
+   ssize_t (*store)(struct netdev_rx_queue *queue,
+   struct rx_queue_attribute *attr, const char *buf, size_t len);
+};
 
 #ifdef CONFIG_XPS
 /*
@@ -1313,7 +1326,7 @@ struct net_device {
   unicast) */
 
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
struct netdev_rx_queue  *_rx;
 
/* Number of RX queues allocated at register_netdev() time */
@@ -1424,6 +1437,8 @@ struct net_device {
struct device   dev;
/* space for optional device, statistics, and wireless sysfs groups */
const struct attribute_group *sysfs_groups[4];
+   /* space for optional per-rx queue attributes */
+   const struct attribute_group *sysfs_rx_queue_group;
 
/* rtnetlink link ops */
const struct rtnl_link_ops *rtnl_link_ops;
@@ -2374,7 +2389,7 @@ static inline bool netif_is_multiqueue(const struct 
net_device *dev)
 
 int netif_set_real_num_tx_queues(struct net_device *dev, unsigned int txq);
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 int netif_set_real_num_rx_queues(struct net_device *dev, unsigned int rxq);
 #else
 static inline int netif_set_real_num_rx_queues(struct net_device *dev,
@@ -2393,7 +2408,7 @@ static inline int netif_copy_real_num_queues(struct 
net_device *to_dev,
					  from_dev->real_num_tx_queues);
if (err)
return err;
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
return netif_set_real_num_rx_queues(to_dev,
					    from_dev->real_num_rx_queues);
 #else
@@ -2401,6 +2416,18 @@ static inline int netif_copy_real_num_queues(struct 
net_device *to_dev,
 #endif
 }
 
+#ifdef CONFIG_SYSFS
+static inline unsigned int get_netdev_rx_queue_index(
+   struct netdev_rx_queue *queue)
+{
+	struct net_device *dev = queue->dev;
+	int index = queue - dev->_rx;
+
+	BUG_ON(index >= dev->num_rx_queues);
+   return index;
+}
+#endif
+
 #define DEFAULT_MAX_NUM_RSS_QUEUES (8)
 int netif_get_num_default_rss_queues(void);
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 20c834e..4be7931 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2080,7 +2080,7 @@ int netif_set_real_num_tx_queues(struct net_device *dev, 
unsigned int txq)
 }
 EXPORT_SYMBOL(netif_set_real_num_tx_queues);
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 /**
  * netif_set_real_num_rx_queues - set actual number of RX queues used
  * @dev: Network device
@@ -5727,7 +5727,7 @@ void netif_stacked_transfer_operstate(const struct 
net_device *rootdev,
 }
 EXPORT_SYMBOL(netif_stacked_transfer_operstate);
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 static int netif_alloc_rx_queues(struct net_device *dev)
 {
	unsigned int i, count = dev->num_rx_queues;
@@ -6272,7 +6272,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, 
const char *name,
return NULL;
}
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
	if (rxqs < 1) {
		pr_err("alloc_netdev: Unable to allocate device with zero RX queues\n");
return NULL;
@@ -6328,7 +6328,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, 
const char *name,
if (netif_alloc_netdev_queues(dev

[PATCH net-next v5 5/6] lib: Ensure EWMA does not store wrong intermediate values

2014-01-16 Thread Michael Dalton
To ensure ewma_read() without a lock returns a valid but possibly
out of date average, modify ewma_add() by using ACCESS_ONCE to prevent
intermediate wrong values from being written to avg->internal.

Suggested-by: Eric Dumazet eric.duma...@gmail.com
Acked-by: Michael S. Tsirkin m...@redhat.com
Acked-by: Eric Dumazet eduma...@google.com
Signed-off-by: Michael Dalton mwdal...@google.com
---
 lib/average.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/lib/average.c b/lib/average.c
index 99a67e6..114d1be 100644
--- a/lib/average.c
+++ b/lib/average.c
@@ -53,8 +53,10 @@ EXPORT_SYMBOL(ewma_init);
  */
 struct ewma *ewma_add(struct ewma *avg, unsigned long val)
 {
-	avg->internal = avg->internal  ?
-		(((avg->internal << avg->weight) - avg->internal) +
+	unsigned long internal = ACCESS_ONCE(avg->internal);
+
+	ACCESS_ONCE(avg->internal) = internal ?
+		(((internal << avg->weight) - internal) +
 			(val << avg->factor)) >> avg->weight :
 		(val << avg->factor);
return avg;
-- 
1.8.5.2



[PATCH net-next v5 6/6] virtio-net: initial rx sysfs support, export mergeable rx buffer size

2014-01-16 Thread Michael Dalton
Add initial support for per-rx queue sysfs attributes to virtio-net. If
mergeable packet buffers are enabled, adds a read-only mergeable packet
buffer size sysfs attribute for each RX queue.

Suggested-by: Michael S. Tsirkin m...@redhat.com
Acked-by: Michael S. Tsirkin m...@redhat.com
Signed-off-by: Michael Dalton mwdal...@google.com
---
v3->v4: Remove seqcount due to EWMA changes in patch 5.
Add missing Suggested-By. 

 drivers/net/virtio_net.c | 46 ++
 1 file changed, 42 insertions(+), 4 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 3e82311..968eacd 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -604,18 +604,25 @@ static int add_recvbuf_big(struct receive_queue *rq, 
gfp_t gfp)
return err;
 }
 
-static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
+static unsigned int get_mergeable_buf_len(struct ewma *avg_pkt_len)
 {
const size_t hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+   unsigned int len;
+
+   len = hdr_len + clamp_t(unsigned int, ewma_read(avg_pkt_len),
+   GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
+   return ALIGN(len, MERGEABLE_BUFFER_ALIGN);
+}
+
+static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
+{
struct page_frag *alloc_frag = &rq->alloc_frag;
char *buf;
unsigned long ctx;
int err;
unsigned int len, hole;
 
-   len = hdr_len + clamp_t(unsigned int, ewma_read(&rq->mrg_avg_pkt_len),
-   GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
-   len = ALIGN(len, MERGEABLE_BUFFER_ALIGN);
+   len = get_mergeable_buf_len(&rq->mrg_avg_pkt_len);
if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp)))
return -ENOMEM;
 
@@ -1594,6 +1601,33 @@ err:
return ret;
 }
 
+#ifdef CONFIG_SYSFS
+static ssize_t mergeable_rx_buffer_size_show(struct netdev_rx_queue *queue,
+   struct rx_queue_attribute *attribute, char *buf)
+{
+   struct virtnet_info *vi = netdev_priv(queue->dev);
+   unsigned int queue_index = get_netdev_rx_queue_index(queue);
+   struct ewma *avg;
+
+   BUG_ON(queue_index >= vi->max_queue_pairs);
+   avg = &vi->rq[queue_index].mrg_avg_pkt_len;
+   return sprintf(buf, "%u\n", get_mergeable_buf_len(avg));
+}
+
+static struct rx_queue_attribute mergeable_rx_buffer_size_attribute =
+   __ATTR_RO(mergeable_rx_buffer_size);
+
+static struct attribute *virtio_net_mrg_rx_attrs[] = {
+   &mergeable_rx_buffer_size_attribute.attr,
+   NULL
+};
+
+static const struct attribute_group virtio_net_mrg_rx_group = {
+   .name = "virtio_net",
+   .attrs = virtio_net_mrg_rx_attrs
+};
+#endif
+
 static int virtnet_probe(struct virtio_device *vdev)
 {
int i, err;
@@ -1708,6 +1742,10 @@ static int virtnet_probe(struct virtio_device *vdev)
if (err)
goto free_stats;
 
+#ifdef CONFIG_SYSFS
+   if (vi->mergeable_rx_bufs)
+   dev->sysfs_rx_queue_group = &virtio_net_mrg_rx_group;
+#endif
netif_set_real_num_tx_queues(dev, vi->curr_queue_pairs);
netif_set_real_num_rx_queues(dev, vi->curr_queue_pairs);
 
-- 
1.8.5.2



[PATCH net-next v6 0/6] virtio-net: mergeable rx buffer size auto-tuning

2014-01-16 Thread Michael Dalton
The virtio-net device currently uses aligned MTU-sized mergeable receive
packet buffers. Network throughput for workloads with large average
packet size can be improved by posting larger receive packet buffers.
However, due to SKB truesize effects, posting large (e.g, PAGE_SIZE)
buffers reduces the throughput of workloads that do not benefit from GRO
and have no large inbound packets.

This patchset introduces virtio-net mergeable buffer size auto-tuning,
with buffer sizes ranging from aligned MTU-size to PAGE_SIZE. Packet
buffer size is chosen based on a per-receive queue EWMA of incoming
packet size.

To unify mergeable receive buffer memory allocation and improve
SKB frag coalescing, all mergeable buffer memory allocation is
migrated to per-receive queue page frag allocators.

The per-receive queue mergeable packet buffer size is exported via
sysfs, and the network device sysfs layer has been extended to add
support for device-specific per-receive queue sysfs attribute groups.

Michael Dalton (6):
  net: allow > 0 order atomic page alloc in skb_page_frag_refill
  virtio-net: use per-receive queue page frag alloc for mergeable bufs
  virtio-net: auto-tune mergeable rx buffer size for improved
performance
  net-sysfs: add support for device-specific rx queue sysfs attributes
  lib: Ensure EWMA does not store wrong intermediate values
  virtio-net: initial rx sysfs support, export mergeable rx buffer size

 drivers/net/virtio_net.c  | 197 +-
 include/linux/netdevice.h |  35 +++-
 lib/average.c |   6 +-
 net/core/dev.c|  12 +--
 net/core/net-sysfs.c  |  50 +++-
 net/core/sock.c   |   4 +-
 6 files changed, 214 insertions(+), 90 deletions(-)

-- 
1.8.5.2



[PATCH net-next v6 1/6] net: allow > 0 order atomic page alloc in skb_page_frag_refill

2014-01-16 Thread Michael Dalton
skb_page_frag_refill currently permits only order-0 page allocs
unless GFP_WAIT is used. Change skb_page_frag_refill to attempt
higher-order page allocations whether or not GFP_WAIT is used. If
memory cannot be allocated, the allocator will fall back to
successively smaller page allocs (down to order-0 page allocs).

This change brings skb_page_frag_refill in line with the existing
page allocation strategy employed by netdev_alloc_frag, which attempts
higher-order page allocations whether or not GFP_WAIT is set, falling
back to successively lower-order page allocations on failure. Part
of migration of virtio-net to per-receive queue page frag allocators.

Acked-by: Michael S. Tsirkin m...@redhat.com
Acked-by: Eric Dumazet eduma...@google.com
Signed-off-by: Michael Dalton mwdal...@google.com
---
 net/core/sock.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 85ad6f0..b3f7ee3 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1836,9 +1836,7 @@ bool skb_page_frag_refill(unsigned int sz, struct 
page_frag *pfrag, gfp_t prio)
put_page(pfrag->page);
}
 
-   /* We restrict high order allocations to users that can afford to wait 
*/
-   order = (prio & __GFP_WAIT) ? SKB_FRAG_PAGE_ORDER : 0;
-
+   order = SKB_FRAG_PAGE_ORDER;
do {
gfp_t gfp = prio;
 
-- 
1.8.5.2



[PATCH net-next v6 3/6] virtio-net: auto-tune mergeable rx buffer size for improved performance

2014-01-16 Thread Michael Dalton
Commit 2613af0ed18a (virtio_net: migrate mergeable rx buffers to page frag
allocators) changed the mergeable receive buffer size from PAGE_SIZE to
MTU-size, introducing a single-stream regression for benchmarks with large
average packet size. There is no single optimal buffer size for all
workloads.  For workloads with packet size <= MTU bytes, MTU + virtio-net
header-sized buffers are preferred as larger buffers reduce the TCP window
due to SKB truesize. However, single-stream workloads with large average
packet sizes have higher throughput if larger (e.g., PAGE_SIZE) buffers
are used.

This commit auto-tunes the mergeable receiver buffer packet size by
choosing the packet buffer size based on an EWMA of the recent packet
sizes for the receive queue. Packet buffer sizes range from MTU_SIZE +
virtio-net header len to PAGE_SIZE. This improves throughput for
large packet workloads, as any workload with average packet size >=
PAGE_SIZE will use PAGE_SIZE buffers.

These optimizations interact positively with recent commit
ba275241030c (virtio-net: coalesce rx frags when possible during rx),
which coalesces adjacent RX SKB fragments in virtio_net. The coalescing
optimizations benefit buffers of any size.

Benchmarks taken from an average of 5 netperf 30-second TCP_STREAM runs
between two QEMU VMs on a single physical machine. Each VM has two VCPUs
with all offloads & vhost enabled. All VMs and vhost threads run in a
single 4 CPU cgroup cpuset, using cgroups to ensure that other processes
in the system will not be scheduled on the benchmark CPUs. Trunk includes
SKB rx frag coalescing.

net-next w/ virtio_net before 2613af0ed18a (PAGE_SIZE bufs): 14642.85Gb/s
net-next (MTU-size bufs):  13170.01Gb/s
net-next + auto-tune: 14555.94Gb/s

Jason Wang also reported a throughput increase on mlx4 from 22Gb/s
using MTU-sized buffers to about 26Gb/s using auto-tuning.

Signed-off-by: Michael Dalton mwdal...@google.com
---
v5-v6: Fix merge conflict. Subtract 1 before encoding the scaled truesize
for a mergeable buffer ctx to support 64KB PAGE_SIZE.
v2-v3: Remove per-receive queue metadata ring. Encode packet buffer
base address and truesize into an unsigned long by requiring a
minimum packet size alignment of 256. Permit attempts to fill
an already-full RX ring (reverting the change in v2).
v1-v2: Add per-receive queue metadata ring to track precise truesize for
mergeable receive buffers. Remove all truesize approximation. Never
try to fill a full RX ring (required for metadata ring in v2).
 drivers/net/virtio_net.c | 100 +++
 1 file changed, 75 insertions(+), 25 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 5ee71dc..dacd43b 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -26,6 +26,7 @@
 #include <linux/if_vlan.h>
 #include <linux/slab.h>
 #include <linux/cpu.h>
+#include <linux/average.h>
 
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
@@ -36,11 +37,18 @@ module_param(gso, bool, 0444);
 
 /* FIXME: MTU in config. */
 #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
-#define MERGE_BUFFER_LEN (ALIGN(GOOD_PACKET_LEN + \
-sizeof(struct virtio_net_hdr_mrg_rxbuf), \
-L1_CACHE_BYTES))
 #define GOOD_COPY_LEN  128
 
+/* Weight used for the RX packet size EWMA. The average packet size is used to
+ * determine the packet buffer size when refilling RX rings. As the entire RX
+ * ring may be refilled at once, the weight is chosen so that the EWMA will be
+ * insensitive to short-term, transient changes in packet size.
+ */
+#define RECEIVE_AVG_WEIGHT 64
+
+/* Minimum alignment for mergeable packet buffers. */
+#define MERGEABLE_BUFFER_ALIGN max(L1_CACHE_BYTES, 256)
+
 #define VIRTNET_DRIVER_VERSION "1.0.0"
 
 struct virtnet_stats {
@@ -75,6 +83,9 @@ struct receive_queue {
/* Chain pages by the private ptr. */
struct page *pages;
 
+   /* Average packet length for mergeable receive buffers. */
+   struct ewma mrg_avg_pkt_len;
+
/* Page frag for packet buffer allocation. */
struct page_frag alloc_frag;
 
@@ -216,6 +227,24 @@ static void skb_xmit_done(struct virtqueue *vq)
netif_wake_subqueue(vi->dev, vq2txq(vq));
 }
 
+static unsigned int mergeable_ctx_to_buf_truesize(unsigned long mrg_ctx)
+{
+   unsigned int truesize = mrg_ctx & (MERGEABLE_BUFFER_ALIGN - 1);
+   return (truesize + 1) * MERGEABLE_BUFFER_ALIGN;
+}
+
+static void *mergeable_ctx_to_buf_address(unsigned long mrg_ctx)
+{
+   return (void *)(mrg_ctx & -MERGEABLE_BUFFER_ALIGN);
+
+}
+
+static unsigned long mergeable_buf_to_ctx(void *buf, unsigned int truesize)
+{
+   unsigned int size = truesize / MERGEABLE_BUFFER_ALIGN;
+   return (unsigned long)buf | (size - 1);
+}
+
 /* Called from bottom half context */
 static struct sk_buff *page_to_skb(struct

[PATCH net-next v6 4/6] net-sysfs: add support for device-specific rx queue sysfs attributes

2014-01-16 Thread Michael Dalton
Extend existing support for netdevice receive queue sysfs attributes to
permit a device-specific attribute group. Initial use case for this
support will be to allow the virtio-net device to export per-receive
queue mergeable receive buffer size.

Signed-off-by: Michael Dalton mwdal...@google.com
---
v4-v5: Handle sysfs_create_group failure. Call sysfs_remove_group when
removing a RX queue kobj if a device-specific group exists.
v3-v4: Simplify by removing loop in get_netdev_rx_queue_index.

 include/linux/netdevice.h | 35 +
 net/core/dev.c| 12 ++--
 net/core/net-sysfs.c  | 50 +++
 3 files changed, 66 insertions(+), 31 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d7668b88..e985231 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -668,15 +668,28 @@ extern struct rps_sock_flow_table __rcu 
*rps_sock_flow_table;
 bool rps_may_expire_flow(struct net_device *dev, u16 rxq_index, u32 flow_id,
 u16 filter_id);
 #endif
+#endif /* CONFIG_RPS */
 
 /* This structure contains an instance of an RX queue. */
 struct netdev_rx_queue {
+#ifdef CONFIG_RPS
struct rps_map __rcu *rps_map;
struct rps_dev_flow_table __rcu *rps_flow_table;
+#endif
struct kobject  kobj;
struct net_device   *dev;
 } cacheline_aligned_in_smp;
-#endif /* CONFIG_RPS */
+
+/*
+ * RX queue sysfs structures and functions.
+ */
+struct rx_queue_attribute {
+   struct attribute attr;
+   ssize_t (*show)(struct netdev_rx_queue *queue,
+   struct rx_queue_attribute *attr, char *buf);
+   ssize_t (*store)(struct netdev_rx_queue *queue,
+   struct rx_queue_attribute *attr, const char *buf, size_t len);
+};
 
 #ifdef CONFIG_XPS
 /*
@@ -1313,7 +1326,7 @@ struct net_device {
   unicast) */
 
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
struct netdev_rx_queue  *_rx;
 
/* Number of RX queues allocated at register_netdev() time */
@@ -1424,6 +1437,8 @@ struct net_device {
struct device   dev;
/* space for optional device, statistics, and wireless sysfs groups */
const struct attribute_group *sysfs_groups[4];
+   /* space for optional per-rx queue attributes */
+   const struct attribute_group *sysfs_rx_queue_group;
 
/* rtnetlink link ops */
const struct rtnl_link_ops *rtnl_link_ops;
@@ -2375,7 +2390,7 @@ static inline bool netif_is_multiqueue(const struct 
net_device *dev)
 
 int netif_set_real_num_tx_queues(struct net_device *dev, unsigned int txq);
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 int netif_set_real_num_rx_queues(struct net_device *dev, unsigned int rxq);
 #else
 static inline int netif_set_real_num_rx_queues(struct net_device *dev,
@@ -2394,7 +2409,7 @@ static inline int netif_copy_real_num_queues(struct 
net_device *to_dev,
   from_dev->real_num_tx_queues);
if (err)
return err;
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
return netif_set_real_num_rx_queues(to_dev,
from_dev->real_num_rx_queues);
 #else
@@ -2402,6 +2417,18 @@ static inline int netif_copy_real_num_queues(struct 
net_device *to_dev,
 #endif
 }
 
+#ifdef CONFIG_SYSFS
+static inline unsigned int get_netdev_rx_queue_index(
+   struct netdev_rx_queue *queue)
+{
+   struct net_device *dev = queue->dev;
+   int index = queue - dev->_rx;
+
+   BUG_ON(index >= dev->num_rx_queues);
+   return index;
+}
+#endif
+
 #define DEFAULT_MAX_NUM_RSS_QUEUES (8)
 int netif_get_num_default_rss_queues(void);
 
diff --git a/net/core/dev.c b/net/core/dev.c
index f87bedd..288df62 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2083,7 +2083,7 @@ int netif_set_real_num_tx_queues(struct net_device *dev, 
unsigned int txq)
 }
 EXPORT_SYMBOL(netif_set_real_num_tx_queues);
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 /**
  * netif_set_real_num_rx_queues - set actual number of RX queues used
  * @dev: Network device
@@ -5764,7 +5764,7 @@ void netif_stacked_transfer_operstate(const struct 
net_device *rootdev,
 }
 EXPORT_SYMBOL(netif_stacked_transfer_operstate);
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 static int netif_alloc_rx_queues(struct net_device *dev)
 {
unsigned int i, count = dev->num_rx_queues;
@@ -6309,7 +6309,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, 
const char *name,
return NULL;
}
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
if (rxqs < 1) {
pr_err("alloc_netdev: Unable to allocate device with zero RX queues\n");
return NULL;
@@ -6365,7 +6365,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, 
const char *name,
if (netif_alloc_netdev_queues(dev

[PATCH net-next v6 5/6] lib: Ensure EWMA does not store wrong intermediate values

2014-01-16 Thread Michael Dalton
To ensure ewma_read() without a lock returns a valid but possibly
out of date average, modify ewma_add() by using ACCESS_ONCE to prevent
intermediate wrong values from being written to avg-internal.

Suggested-by: Eric Dumazet eric.duma...@gmail.com
Acked-by: Michael S. Tsirkin m...@redhat.com
Acked-by: Eric Dumazet eduma...@google.com
Signed-off-by: Michael Dalton mwdal...@google.com
---
 lib/average.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/lib/average.c b/lib/average.c
index 99a67e6..114d1be 100644
--- a/lib/average.c
+++ b/lib/average.c
@@ -53,8 +53,10 @@ EXPORT_SYMBOL(ewma_init);
  */
 struct ewma *ewma_add(struct ewma *avg, unsigned long val)
 {
-   avg->internal = avg->internal ?
-   (((avg->internal << avg->weight) - avg->internal) +
+   unsigned long internal = ACCESS_ONCE(avg->internal);
+
+   ACCESS_ONCE(avg->internal) = internal ?
+   (((internal << avg->weight) - internal) +
(val << avg->factor)) >> avg->weight :
(val << avg->factor);
return avg;
-- 
1.8.5.2



[PATCH net-next v6 6/6] virtio-net: initial rx sysfs support, export mergeable rx buffer size

2014-01-16 Thread Michael Dalton
Add initial support for per-rx queue sysfs attributes to virtio-net. If
mergeable packet buffers are enabled, adds a read-only mergeable packet
buffer size sysfs attribute for each RX queue.

Suggested-by: Michael S. Tsirkin m...@redhat.com
Acked-by: Michael S. Tsirkin m...@redhat.com
Signed-off-by: Michael Dalton mwdal...@google.com
---
v3-v4: Remove seqcount due to EWMA changes in patch 5.
Add missing Suggested-By. 

 drivers/net/virtio_net.c | 46 ++
 1 file changed, 42 insertions(+), 4 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index dacd43b..d75f8ed 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -600,18 +600,25 @@ static int add_recvbuf_big(struct receive_queue *rq, 
gfp_t gfp)
return err;
 }
 
-static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
+static unsigned int get_mergeable_buf_len(struct ewma *avg_pkt_len)
 {
const size_t hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+   unsigned int len;
+
+   len = hdr_len + clamp_t(unsigned int, ewma_read(avg_pkt_len),
+   GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
+   return ALIGN(len, MERGEABLE_BUFFER_ALIGN);
+}
+
+static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
+{
struct page_frag *alloc_frag = &rq->alloc_frag;
char *buf;
unsigned long ctx;
int err;
unsigned int len, hole;
 
-   len = hdr_len + clamp_t(unsigned int, ewma_read(&rq->mrg_avg_pkt_len),
-   GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
-   len = ALIGN(len, MERGEABLE_BUFFER_ALIGN);
+   len = get_mergeable_buf_len(&rq->mrg_avg_pkt_len);
if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp)))
return -ENOMEM;
 
@@ -1584,6 +1591,33 @@ err:
return ret;
 }
 
+#ifdef CONFIG_SYSFS
+static ssize_t mergeable_rx_buffer_size_show(struct netdev_rx_queue *queue,
+   struct rx_queue_attribute *attribute, char *buf)
+{
+   struct virtnet_info *vi = netdev_priv(queue->dev);
+   unsigned int queue_index = get_netdev_rx_queue_index(queue);
+   struct ewma *avg;
+
+   BUG_ON(queue_index >= vi->max_queue_pairs);
+   avg = &vi->rq[queue_index].mrg_avg_pkt_len;
+   return sprintf(buf, "%u\n", get_mergeable_buf_len(avg));
+}
+
+static struct rx_queue_attribute mergeable_rx_buffer_size_attribute =
+   __ATTR_RO(mergeable_rx_buffer_size);
+
+static struct attribute *virtio_net_mrg_rx_attrs[] = {
+   &mergeable_rx_buffer_size_attribute.attr,
+   NULL
+};
+
+static const struct attribute_group virtio_net_mrg_rx_group = {
+   .name = "virtio_net",
+   .attrs = virtio_net_mrg_rx_attrs
+};
+#endif
+
 static int virtnet_probe(struct virtio_device *vdev)
 {
int i, err;
@@ -1698,6 +1732,10 @@ static int virtnet_probe(struct virtio_device *vdev)
if (err)
goto free_stats;
 
+#ifdef CONFIG_SYSFS
+   if (vi->mergeable_rx_bufs)
+   dev->sysfs_rx_queue_group = &virtio_net_mrg_rx_group;
+#endif
netif_set_real_num_tx_queues(dev, vi->curr_queue_pairs);
netif_set_real_num_rx_queues(dev, vi->curr_queue_pairs);
 
-- 
1.8.5.2



Re: [PATCH net-next v2 4/4] virtio-net: initial debugfs support, export mergeable rx buffer size

2014-01-14 Thread Michael Dalton
I'd like to confirm the preferred sysfs path structure for mergeable
receive buffers. Is 'mergeable_rx_buffer_size' the right attribute name
to use or is there a strong preference for a different name?

I believe the current approach proposed for the next patchset is to use a
per-netdev attribute group which we will add to the receive
queue kobj (struct netdev_rx_queue). That leaves us with at
least two options:
  (1) Name the attribute group something, e.g., 'virtio-net', in which
  case all virtio-net attributes for eth0 queue N will be of
  the form:
/sys/class/net/eth0/queues/rx-N/virtio-net/<attribute name>

  (2) Do not name the attribute group (leave the name NULL), in which
  case AFAICT virtio-net and device-independent attributes would be
  mixed without any indication. For example, all virtio-net
  attributes for netdev eth0 queue N would be of the form:
/sys/class/net/eth0/queues/rx-N/<attribute name>

FWIW, the bonding netdev has a similar sysfs issue and uses a per-netdev
attribute group (stored in the 'sysfs_groups' field of struct netdevice)
In the case of bonding, the attribute group is named, so
device-independent netdev attributes are found in
/sys/class/net/eth0/<attribute name> while bonding attributes are placed
in /sys/class/net/eth0/bonding/<attribute name>.

So it seems like there is some precedent for using an attribute group
name corresponding to the driver name. Does using an attribute group
name of 'virtio-net' sound good or would an empty or different attribute
group name be preferred?
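
For concreteness, a minimal sketch of option (1) -- a named attribute group
attached to each rx queue kobject. Only the .name field differs between the
two options; the attribute and group names here are illustrative, and the
show callback is assumed to be defined elsewhere:

/* Sketch only: with .name set, the attribute appears under
 * .../queues/rx-N/virtio-net/; with .name = NULL it lands directly in
 * .../queues/rx-N/.
 */
static struct rx_queue_attribute mergeable_rx_buffer_size_attribute =
	__ATTR_RO(mergeable_rx_buffer_size);

static struct attribute *virtio_net_rx_attrs[] = {
	&mergeable_rx_buffer_size_attribute.attr,
	NULL
};

static const struct attribute_group virtio_net_rx_group = {
	.name  = "virtio-net",	/* option (1); NULL would give option (2) */
	.attrs = virtio_net_rx_attrs,
};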

Best,

Mike


Re: [PATCH net-next v2 4/4] virtio-net: initial debugfs support, export mergeable rx buffer size

2014-01-13 Thread Michael Dalton
On Mon, Jan 13, 2014 at 7:38 AM, Ben Hutchings bhutchi...@solarflare.com
wrote:
> I don't think RPS should own this structure.  It's just that there are
> currently no per-RX-queue attributes other than those defined by RPS.

Agreed, there is useful attribute-independent functionality already
built around netdev_rx_queue - e.g., dynamically resizing the rx queue
kobjs as the number of RX queues enabled for the netdev is changed. While
the current attributes happen to be used only by RPS, AFAICT it seems
RPS should not own netdev_rx_queue but rather should own the RPS-specific
fields themselves within netdev_rx_queue.

If there are no objections, it seems like I could modify
netdev_rx_queue and related functionality so that their existence does
not depend on CONFIG_RPS, and instead just have CONFIG_RPS control
whether or not the RPS-specific attributes/fields are present.

Best,

Mike


Re: [PATCH net-next v2 4/4] virtio-net: initial debugfs support, export mergeable rx buffer size

2014-01-13 Thread Michael Dalton
Sorry I missed this important piece of information, it appears that
netdev_queue (the TX equivalent of netdev_rx_queue) already has
decoupled itself from CONFIG_XPS due to an attribute,
queue_trans_timeout, that does not depend on XPS functionality. So it
seems that something somewhat equivalent has already happened on the
TX side.

Best,

Mike


Re: [PATCH net-next v2 4/4] virtio-net: initial debugfs support, export mergeable rx buffer size

2014-01-12 Thread Michael Dalton
Hi Michael,

On Sun, Jan 12, 2014 at 9:09 AM, Michael S. Tsirkin m...@redhat.com wrote:
> Can't we add struct attribute * to netdevice, and pass that in when
> creating the kobj?

I like that idea, I think that will work and should be better than
the alternatives. The actual kobjs for RX queues (struct netdev_rx_queue)
are allocated and deallocated by calls to net_rx_queue_update_kobjects,
which resizes RX queue kobjects when the netdev RX queues are resized.

Is this what you had in mind:
(1) Add a pointer to an attribute group to struct net_device, used for
per-netdev rx queue attributes and initialized before the call to
register_netdevice().
(2) Declare an attribute group containing the mergeable_rx_buffer_size
attribute in virtio-net, and initialize the per-netdevice group pointer
to the address of this group in virtnet_probe before register_netdevice
(3) In net-sysfs, modify net_rx_queue_update_kobjects
(or rx_queue_add_kobject) to call sysfs_create_group on the
per-netdev attribute group (if non-NULL), adding the attributes in
the group to the RX queue kobject.

That should allow us to have per-RX queue attributes that are
device-specific. I'm not a sysfs expert, but it seems that rx_queue_ktype
and rx_queue_sysfs_ops presume that all rx queue sysfs operations are
performed on attributes of type rx_queue_attribute. That type will need
to be moved from net-sysfs.c to a header file like netdevice.h so that
the type can be used in virtio-net when we declare the
mergeable_rx_buffer_size attribute.

The last issue is how the rx_queue_attribute 'show' function
implementation for mergeable_rx_buffer_size will access the appropriate
per-receive queue EWMA data. The arguments to the show function will be
the netdev_rx_queue and the attribute itself. We can get to the
struct net_device from the netdev_rx_queue.  If we extended
netdev_rx_queue to indicate the queue_index or to store a void *priv_data
pointer, that would be sufficient to allow us to resolve this issue.
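
As a rough sketch of that last point (assuming the queue index can be
recovered by pointer arithmetic against dev->_rx, one of the options above,
and that a helper like get_mergeable_buf_len() computes the buffer size from
the per-queue EWMA), the show function could look like:

static ssize_t mergeable_rx_buffer_size_show(struct netdev_rx_queue *queue,
					     struct rx_queue_attribute *attr,
					     char *buf)
{
	struct virtnet_info *vi = netdev_priv(queue->dev);
	unsigned int i = queue - queue->dev->_rx;	/* per-queue index */

	return sprintf(buf, "%u\n",
		       get_mergeable_buf_len(&vi->rq[i].mrg_avg_pkt_len));
}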

Please let me know if the above sounds good or if you see a better way
to accomplish this goal. Thanks!

Best,

Mike


Re: [PATCH net-next v2 4/4] virtio-net: initial debugfs support, export mergeable rx buffer size

2014-01-10 Thread Michael Dalton
Hi Jason, Michael

Sorry for the delay in response. Jason, I agree this patch ended up
being larger than expected. The major implementation parts are:
(1) Setup directory structure (driver/per-netdev/rx-queue directories)
(2) Network device renames (optional, so debugfs dir has the right name)
(3) Support resizing the # of RX queues (optional - we could just export
max_queue_pairs files and not delete files if an RX queue is disabled)
(4) Reference counting - used in case someone opens a debugfs
file and then removes the virtio-net device.
(5) The actual mergeable rx buffer file implementation itself. For now
I have added a seqcount for memory safety, but if a read-only race
condition is acceptable we could elide the seqcount. FWIW, the
seqcount write in receive_mergeable() should, on modern x86,
translate to two non-atomic adds and two compiler barriers, so
overhead is not expected to be meaningful.
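
As a rough reader-side sketch of the seqcount mentioned in (5) (field names
assumed from the debugfs patch), the debugfs read simply retries until it
observes a consistent value:

	unsigned int start;
	unsigned long avg;

	do {
		start = read_seqcount_begin(&rq_stats->dbg_seq);
		avg = ewma_read(&rq_stats->avg_pkt_len);
	} while (read_seqcount_retry(&rq_stats->dbg_seq, start));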

We can move to sysfs and this would simplify or eliminate much of the
above, including most of (1) - (4). I believe our choices for what to
do for the next patchset include:
(a) Use debugfs as is currently done, removing any optional features
listed above that are deemed unnecessary.

(b) Add a per-netdev sysfs attribute group to net_device->sysfs_groups.
Each attribute would display the mergeable packet buffer size for a given
RX queue, and there would be max_queue_pairs attributes in total. This
is already supported by net/core/net-sysfs.c:netdev_register_kobject(),
but means that we would have a static set of per-RX queue files for
all RX queues supported by the netdev, rather than dynamically displaying
only the files corresponding to enabled RX queues (e.g., when # of RX
queues is changed by ethtool -L <device>).  For an example of this
approach, see drivers/net/bonding/bond_sysfs.c.

(c) Modify struct netdev_rx_queue to add virtio-net EWMA fields directly,
and modify net-sysfs.c to manage the new fields. Unlike (b), this approach
supports the RX queue resizing in (3) but means putting virtio-net info
in netdev_rx_queue, which currently has only device-independent fields.

My preference would be (b): try using sysfs and adding a device-specific
attribute group to the virtio-net netdevice (stored in the existing
'sysfs_groups' field and supported by net-sysfs).  This would avoid
adding virtio-net specific information to net-sysfs. What would you
prefer (or is there a better way than the approaches above)? Thanks!

Best,

Mike


Re: [PATCH net-next v2 4/4] virtio-net: initial debugfs support, export mergeable rx buffer size

2014-01-10 Thread Michael Dalton
Also, one other note: if we use sysfs, the directory structure will
be different depending on our chosen sysfs strategy. If we augment
netdev_rx_queue, the new attributes will be found in the standard
'rx-N' netdev subdirectory, e.g.,
/sys/class/net/eth0/queues/rx-0/mergeable_rx_buffer_size

Whereas if we use per-netdev attributes, our attributes would be in
/sys/class/net/eth0/<group name>/<attribute name>, which may be
less intuitive as AFAICT we'd have to indicate both the queue # and
type of value being reported using the attribute name. E.g.,
/sys/class/net/eth0/virtio-net/rx-0_mergeable_buffer_size.
That's somewhat less elegant.

I don't see an easy way to add new attributes to the 'rx-N'
subdirectories without directly modifying struct netdev_rx_queue,
so I think this is another tradeoff between the two sysfs approaches.

Best,

Mike


Re: [PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance

2014-01-09 Thread Michael Dalton
Hi Michael,

Here's a quick sketch of some code that enforces a minimum buffer
alignment of only 64, and has a maximum theoretical buffer size of
aligned GOOD_PACKET_LEN + (BUF_ALIGN - 1) * BUF_ALIGN, which is at least
1536 + 63 * 64 = 5568. On x86, we already use a 64 byte alignment, and
this code supports all current buffer sizes, from 1536 to PAGE_SIZE.

#if L1_CACHE_BYTES < 64
#define MERGEABLE_BUFFER_ALIGN 64
#define MERGEABLE_BUFFER_SHIFT 6
#else
#define MERGEABLE_BUFFER_ALIGN L1_CACHE_BYTES
#define MERGEABLE_BUFFER_SHIFT L1_CACHE_SHIFT
#endif
#define MERGEABLE_BUFFER_MIN ALIGN(GOOD_PACKET_LEN + \
				   sizeof(struct virtio_net_hdr_mrg_rxbuf), \
				   MERGEABLE_BUFFER_ALIGN)
#define MERGEABLE_BUFFER_MAX min(MERGEABLE_BUFFER_MIN + \
				 (MERGEABLE_BUFFER_ALIGN - 1) * \
				 MERGEABLE_BUFFER_ALIGN, PAGE_SIZE)
/* Extract buffer length from a mergeable buffer context. */
static u16 get_mergeable_buf_ctx_len(void *ctx) {
	u16 len = (uintptr_t)ctx & (MERGEABLE_BUFFER_ALIGN - 1);
	return MERGEABLE_BUFFER_MIN + (len << MERGEABLE_BUFFER_SHIFT);
}
/* Extract buffer base address from a mergeable buffer context. */
static void *get_mergeable_buf_ctx_base(void *ctx) {
	return (void *)((uintptr_t)ctx & -MERGEABLE_BUFFER_ALIGN);
}
/* Convert a base address and length to a mergeable buffer context. */
static void *to_mergeable_buf_ctx(void *base, u16 len) {
	len -= MERGEABLE_BUFFER_MIN;
	return (void *)((uintptr_t)base | (len >> MERGEABLE_BUFFER_SHIFT));
}
/* Compute the packet buffer length for a receive queue. */
static u16 get_mergeable_buffer_len(struct receive_queue *rq) {
	u16 len = clamp_t(u16, ewma_read(&rq->avg_pkt_len),
			  MERGEABLE_BUFFER_MIN, MERGEABLE_BUFFER_MAX);
	return ALIGN(len, MERGEABLE_BUFFER_ALIGN);
}

Best,

Mike


Re: [PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance

2014-01-09 Thread Michael Dalton
If the prior code snippet looks good to you, I'll use something like
that as a baseline for a v3 patchset. I don't think we need a stricter
alignment than 64 to express values in the range (1536 ... 4096), as
the code snippet shows, which is great for x86 4KB pages.

On other architectures that have larger page sizes > 4KB with <= 64b
cachelines, we may want to increase the alignment so that the max buffer
size will be >= PAGE_SIZE (max size allowed by skb_page_frag_refill).

If we use a minimum alignment of 128, our maximum theoretical
packet buffer length is 1536 + 127 * 128 = 17792. With 256 byte
alignment, we can express a maximum packet buffer size > 65536.

Given the above, I think we want to select the min buffer alignment
based on the PAGE_SIZE:
<= 4KB PAGE_SIZE: 64b min alignment
<= 16KB PAGE_SIZE: 128b min alignment
> 16KB PAGE_SIZE: 256b min alignment
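
A macro sketch of that mapping (the constant name is illustrative; the
actual patch may express this differently):

#if PAGE_SHIFT <= 12			/* PAGE_SIZE <= 4KB  */
#define MERGEABLE_BUFFER_MIN_ALIGN 64
#elif PAGE_SHIFT <= 14			/* PAGE_SIZE <= 16KB */
#define MERGEABLE_BUFFER_MIN_ALIGN 128
#else					/* PAGE_SIZE > 16KB  */
#define MERGEABLE_BUFFER_MIN_ALIGN 256
#endif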

So the prior code snippet would be relatively unchanged, except that
references to the previous minimum alignment of 64 would be replaced by
a #define'd constant derived from PAGE_SIZE as shown above.

This would guarantee that we use the minimum alignment necessary to
ensure that virtio-net can post a max size (PAGE_SIZE) buffer, and for
x86 this means we won't increase the alignment beyond the x86's current
L1_CACHE_BYTES value (64). Also, sorry I haven't had a chance to respond
yet to the debugfs feedback, I will get to that soon (just wanted to do
a further deep dive on some of the sysfs/debugfs tradeoffs).

Best,

Mike


Re: [PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance

2014-01-09 Thread Michael Dalton
Hi Michael,

Your improvements (code changes, more consistent naming, and use of
256-byte alignment only) all sound good to me. I will get started on a v3
patchset in conformance with your recommendations after sorting out
what we want to do with the debugfs/sysfs issues. I will followup soon on
the thread for patch 4/4 so we can close on what changes are needed for
debugfs/sysfs. Thanks!

Best,

Mike


Re: [PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance

2014-01-08 Thread Michael Dalton
Hi Jason,

On Tue, Jan 7, 2014 at 10:23 PM, Jason Wang jasow...@redhat.com wrote:
> What's the reason that this extra space is not accounted for truesize?
The initial rationale was that this extra space is due to
internal fragmentation in the page frag allocator, but I agree with
you -- this code should be changed and the extra space accounted for.
Any internal fragmentation leading to a larger last packet allocated from
the page should be reflected in the SKB truesize of the last packet.

I will do a followup patchset that accounts correctly for the extra
space, which will also me to remove the two max statements you
indicated. Thanks for finding this issue.

> + if (err < 0) {
> + put_page(virt_to_head_page(ctx->buf));
> + return err;
> Should we also roll back the frag offset added above to avoid leaking frags?
I believe the put_page here is sufficient for correctness. When we
allocate a buffer using skb_page_frag_refill, we use get_page/put_page
to allocate/free respectively. For example, if the virtqueue_add_inbuf
succeeded, we would eventually call put_page either in virtio-net
(e.g., page_to_skb for packets <= GOOD_COPY_LEN bytes) or later in
__skb_frag_unref and other functions called during dev_kfree_skb.

However, an offset rollback does allow the space to be reused by the next
allocation, which could be a good optimization. I can do the offset
rollback (with a put_page) in the next patchset. What do you think?

> + /* Do not attempt to add a buffer if the RX ring is full. */
> + if (unlikely(!rq->vq->num_free))
> + return true;
> I haven't figured out why this is needed. It seems safe for
> virtqueue_add_inbuf() just fail in add_recv_xx()?
I think this is safe with one caveat -- we can't modify
rq->mrg_buf_ctx until we know the ring isn't full (otherwise, we
clobber an in-use entry). It is safe to modify rq->mrg_buf_ctx
after we know that virtqueue_add_inbuf has succeeded.

I can remove the rq_num_free check from try_fill_recv, and then
modify virtqueue_add_inbuf to use a local mergeable_receive_buf_ctx.
Once virtqueue_add_inbuf succeeds, the contents of the local variable
can be copied to rq->mrg_buf_ctx[rq->mrg_buf_ctx_head].
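
A minimal sketch of that ordering (field names assumed from the v2 patch;
header/scatterlist setup elided):

	struct mergeable_receive_buf_ctx local = { .buf = buf, .truesize = len };
	struct mergeable_receive_buf_ctx *slot =
		&rq->mrg_buf_ctx[rq->mrg_buf_ctx_head];

	err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, slot, gfp);
	if (err >= 0) {
		/* The ring was not full -- only now publish the context. */
		*slot = local;
		rq->mrg_buf_ctx_head =
			(rq->mrg_buf_ctx_head + 1) % rq->mrg_buf_ctx_size;
	}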

Best,

Mike


Re: [PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance

2014-01-08 Thread Michael Dalton
Hi Eric, Michael,

On Wed, Jan 8, 2014 at 11:16 AM, Michael S. Tsirkin m...@redhat.com wrote:
> Why should we select a frame at random and make it's truesize bigger?
> All frames are to blame for the extra space.
> Just ignoring it seems more symmetrical.
Sounds good, based on Eric's feedback and Michael's feedback above,
I will leave the 'extra space' handling as-is in the followup patchset
and will not track the extra space in ctx-truesize. AFAICT, The two
max() statements will need to remain (as buffer length may exceed
ctx-truesize).  Thanks for the feedback.

> If you intend to repost anyway (for the below wrinkle) then
> you can do it right here just as well I guess. Seems a bit prettier.
Will do.

> You don't have to fill in ctx before calling add_inbuf, do you?
> Just fill it afterwards.
Agreed, ctx does not need to be filled until after add_inbuf.

Best,

Mike


Re: [PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance

2014-01-08 Thread Michael Dalton
Hi Michael,

On Wed, Jan 8, 2014 at 5:42 PM, Michael S. Tsirkin m...@redhat.com wrote:
> Sorry that I didn't notice early, but there seems to be a bug here.
> See below.
Yes, that is definitely a bug. Virtio spec permits OOO completions,
but current code assumes in-order completion. Thanks for catching this.

> Don't need full int really, it's up to 4K/cache line size,
> 1 byte would be enough, maximum 2 ...
> So if all we want is extra 1-2 bytes per buffer, we don't really
> need this extra level of indirection I think.
> We can just allocate them before the header together with an skb.
I'm not sure if I'm parsing the above correctly, but do you mean using a
few bytes at the beginning of the packet buffer to store truesize? I
think that will break Jason's virtio-net RX frag coalescing
code. To coalesce consecutive RX packet buffers, our packet buffers must
be physically adjacent, and any extra bytes before the start of the
buffer would break that.

We could allocate an SKB per packet buffer, but if we have multi-buffer
packets often (e.g., netperf benefiting from GSO/GRO), we would be
allocating 1 SKB per packet buffer instead of 1 SKB per MAX_SKB_FRAGS
buffers. How do you feel about any of the below alternatives:

(1) Modify the existing mrg_buf_ctx to chain together free entries
We can use the 'buf' pointer in mergeable_receive_buf_ctx to chain
together free entries so that we can support OOO completions. This would
be similar to how virtio-queue manages free sg entries.

(2) Combine the buffer pointer and truesize into a single void* value
Your point about there only being a byte needed to encode truesize is
spot on, and I think we could leverage this to eliminate the out-of-band
metadata ring entirely. If we were willing to change the packet buffer
alignment from L1_CACHE_BYTES to 256 (or min (256, L1_CACHE_SIZE)), we
could encode the truesize in the least significant 8 bits of the buffer
address (encoded as truesize >> 8 as we know all sizes are a multiple
of 256). This would allow packet buffers up to 64KB in length.
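
A toy sketch of option (2) with the alignment fixed at 256 bytes (helper
names are illustrative, not from any posted patch): buffers are 256-byte
aligned, so the low 8 bits of the address are free to carry
(truesize / 256 - 1), and decoding just reverses the arithmetic.

static unsigned long buf_to_ctx(void *buf, unsigned int truesize)
{
	return (unsigned long)buf | (truesize / 256 - 1);
}

static unsigned int ctx_to_truesize(unsigned long ctx)
{
	return ((ctx & 0xff) + 1) * 256;
}

static void *ctx_to_buf(unsigned long ctx)
{
	return (void *)(ctx & ~0xffUL);
}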

Is there another approach you would prefer to any of these? If the
cleanliness issues and larger alignment aren't too bad, I think (2)
sounds promising and allow us to eliminate the metadata ring
entirely while still permitting RX frag coalescing.

Best,

Mike


Re: [PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance

2014-01-08 Thread Michael Dalton
Sorry, forgot to mention - if we want to explore combining the buffer
address and truesize into a single void *, we could also exploit the
fact that our size ranges from aligned GOOD_PACKET_LEN to PAGE_SIZE, and
potentially encode fewer values for truesize (and require a smaller
alignment than 256). The prior e-mail's discussion of 256-byte alignment
with 256 values is just one potential design point.

Best,

Mike


[PATCH net-next v2 2/4] virtio-net: use per-receive queue page frag alloc for mergeable bufs

2014-01-06 Thread Michael Dalton
The virtio-net driver currently uses netdev_alloc_frag() for GFP_ATOMIC
mergeable rx buffer allocations. This commit migrates virtio-net to use
per-receive queue page frags for GFP_ATOMIC allocation. This change unifies
mergeable rx buffer memory allocation, which now will use skb_page_frag_refill()
for both atomic and GFP-WAIT buffer allocations.

To address fragmentation concerns, if after buffer allocation there
is too little space left in the page frag to allocate a subsequent
buffer, the remaining space is added to the current allocated buffer
so that the remaining space can be used to store packet data.

Signed-off-by: Michael Dalton mwdal...@google.com
---
v2: Use GFP_COLD for RX buffer allocations (as in netdev_alloc_frag()).
Remove per-netdev GFP_KERNEL page_frag allocator.

 drivers/net/virtio_net.c | 69 
 1 file changed, 35 insertions(+), 34 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index c51a988..526dfd8 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -78,6 +78,9 @@ struct receive_queue {
/* Chain pages by the private ptr. */
struct page *pages;
 
+   /* Page frag for packet buffer allocation. */
+   struct page_frag alloc_frag;
+
/* RX: fragments + linear part + virtio header */
struct scatterlist sg[MAX_SKB_FRAGS + 2];
 
@@ -126,11 +129,6 @@ struct virtnet_info {
/* Lock for config space updates */
struct mutex config_lock;
 
-   /* Page_frag for GFP_KERNEL packet buffer allocation when we run
-* low on memory.
-*/
-   struct page_frag alloc_frag;
-
/* Does the affinity hint is set for virtqueues? */
bool affinity_hint_set;
 
@@ -336,8 +334,8 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
int num_buf = hdr->mhdr.num_buffers;
struct page *page = virt_to_head_page(buf);
int offset = buf - page_address(page);
-   struct sk_buff *head_skb = page_to_skb(rq, page, offset, len,
-  MERGE_BUFFER_LEN);
+   unsigned int truesize = max_t(unsigned int, len, MERGE_BUFFER_LEN);
+   struct sk_buff *head_skb = page_to_skb(rq, page, offset, len, truesize);
struct sk_buff *curr_skb = head_skb;
 
if (unlikely(!curr_skb))
@@ -353,11 +351,6 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
dev->stats.rx_length_errors++;
goto err_buf;
}
-   if (unlikely(len > MERGE_BUFFER_LEN)) {
-   pr_debug("%s: rx error: merge buffer too long\n",
-dev->name);
-   len = MERGE_BUFFER_LEN;
-   }
 
page = virt_to_head_page(buf);
--rq->num;
@@ -376,19 +369,20 @@ static struct sk_buff *receive_mergeable(struct 
net_device *dev,
head_skb->truesize += nskb->truesize;
num_skb_frags = 0;
}
+   truesize = max_t(unsigned int, len, MERGE_BUFFER_LEN);
if (curr_skb != head_skb) {
head_skb->data_len += len;
head_skb->len += len;
-   head_skb->truesize += MERGE_BUFFER_LEN;
+   head_skb->truesize += truesize;
}
offset = buf - page_address(page);
if (skb_can_coalesce(curr_skb, num_skb_frags, page, offset)) {
put_page(page);
skb_coalesce_rx_frag(curr_skb, num_skb_frags - 1,
-len, MERGE_BUFFER_LEN);
+len, truesize);
} else {
skb_add_rx_frag(curr_skb, num_skb_frags, page,
-   offset, len, MERGE_BUFFER_LEN);
+   offset, len, truesize);
}
}
 
@@ -578,25 +572,24 @@ static int add_recvbuf_big(struct receive_queue *rq, 
gfp_t gfp)
 
 static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
 {
-   struct virtnet_info *vi = rq->vq->vdev->priv;
-   char *buf = NULL;
+   struct page_frag *alloc_frag = &rq->alloc_frag;
+   char *buf;
int err;
+   unsigned int len, hole;
 
-   if (gfp & __GFP_WAIT) {
-   if (skb_page_frag_refill(MERGE_BUFFER_LEN, &vi->alloc_frag,
-gfp)) {
-   buf = (char *)page_address(vi->alloc_frag.page) +
- vi->alloc_frag.offset;
-   get_page(vi->alloc_frag.page);
-   vi->alloc_frag.offset += MERGE_BUFFER_LEN;
-   }
-   } else {
-   buf = netdev_alloc_frag(MERGE_BUFFER_LEN);
-   }
-   if (!buf)
+   if (unlikely

[PATCH net-next v2 1/4] net: allow > 0 order atomic page alloc in skb_page_frag_refill

2014-01-06 Thread Michael Dalton
skb_page_frag_refill currently permits only order-0 page allocs
unless GFP_WAIT is used. Change skb_page_frag_refill to attempt
higher-order page allocations whether or not GFP_WAIT is used. If
memory cannot be allocated, the allocator will fall back to
successively smaller page allocs (down to order-0 page allocs).

This change brings skb_page_frag_refill in line with the existing
page allocation strategy employed by netdev_alloc_frag, which attempts
higher-order page allocations whether or not GFP_WAIT is set, falling
back to successively lower-order page allocations on failure. Part
of migration of virtio-net to per-receive queue page frag allocators.

Acked-by: Michael S. Tsirkin m...@redhat.com
Acked-by: Eric Dumazet eduma...@google.com
Signed-off-by: Michael Dalton mwdal...@google.com
---
 net/core/sock.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 5393b4b..a0d522a 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1865,9 +1865,7 @@ bool skb_page_frag_refill(unsigned int sz, struct 
page_frag *pfrag, gfp_t prio)
put_page(pfrag->page);
}
 
-   /* We restrict high order allocations to users that can afford to wait 
*/
-   order = (prio & __GFP_WAIT) ? SKB_FRAG_PAGE_ORDER : 0;
-
+   order = SKB_FRAG_PAGE_ORDER;
do {
gfp_t gfp = prio;
 
-- 
1.8.5.1



[PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance

2014-01-06 Thread Michael Dalton
Commit 2613af0ed18a (virtio_net: migrate mergeable rx buffers to page frag
allocators) changed the mergeable receive buffer size from PAGE_SIZE to
MTU-size, introducing a single-stream regression for benchmarks with large
average packet size. There is no single optimal buffer size for all
workloads.  For workloads with packet size <= MTU bytes, MTU + virtio-net
header-sized buffers are preferred as larger buffers reduce the TCP window
due to SKB truesize. However, single-stream workloads with large average
packet sizes have higher throughput if larger (e.g., PAGE_SIZE) buffers
are used.

This commit auto-tunes the mergeable receiver buffer packet size by
choosing the packet buffer size based on an EWMA of the recent packet
sizes for the receive queue. Packet buffer sizes range from MTU_SIZE +
virtio-net header len to PAGE_SIZE. This improves throughput for
large packet workloads, as any workload with average packet size >=
PAGE_SIZE will use PAGE_SIZE buffers.

These optimizations interact positively with recent commit
ba275241030c (virtio-net: coalesce rx frags when possible during rx),
which coalesces adjacent RX SKB fragments in virtio_net. The coalescing
optimizations benefit buffers of any size.

Benchmarks taken from an average of 5 netperf 30-second TCP_STREAM runs
between two QEMU VMs on a single physical machine. Each VM has two VCPUs
with all offloads & vhost enabled. All VMs and vhost threads run in a
single 4 CPU cgroup cpuset, using cgroups to ensure that other processes
in the system will not be scheduled on the benchmark CPUs. Trunk includes
SKB rx frag coalescing.

net-next w/ virtio_net before 2613af0ed18a (PAGE_SIZE bufs): 14642.85Gb/s
net-next (MTU-size bufs):  13170.01Gb/s
net-next + auto-tune: 14555.94Gb/s

Jason Wang also reported a throughput increase on mlx4 from 22Gb/s
using MTU-sized buffers to about 26Gb/s using auto-tuning.

Signed-off-by: Michael Dalton mwdal...@google.com
---
v2: Add per-receive queue metadata ring to track precise truesize for
mergeable receive buffers. Remove all truesize approximation. Never
try to fill a full RX ring (required for metadata ring in v2).

 drivers/net/virtio_net.c | 145 ++-
 1 file changed, 107 insertions(+), 38 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 526dfd8..f6e1ee0 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -26,6 +26,7 @@
 #include <linux/if_vlan.h>
 #include <linux/slab.h>
 #include <linux/cpu.h>
+#include <linux/average.h>
 
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
@@ -36,11 +37,15 @@ module_param(gso, bool, 0444);
 
 /* FIXME: MTU in config. */
 #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
-#define MERGE_BUFFER_LEN (ALIGN(GOOD_PACKET_LEN + \
-sizeof(struct virtio_net_hdr_mrg_rxbuf), \
-L1_CACHE_BYTES))
 #define GOOD_COPY_LEN  128
 
+/* Weight used for the RX packet size EWMA. The average packet size is used to
+ * determine the packet buffer size when refilling RX rings. As the entire RX
+ * ring may be refilled at once, the weight is chosen so that the EWMA will be
+ * insensitive to short-term, transient changes in packet size.
+ */
+#define RECEIVE_AVG_WEIGHT 64
+
 #define VIRTNET_DRIVER_VERSION "1.0.0"
 
 struct virtnet_stats {
@@ -65,11 +70,30 @@ struct send_queue {
char name[40];
 };
 
+/* Per-packet buffer context for mergeable receive buffers. */
+struct mergeable_receive_buf_ctx {
+   /* Packet buffer base address. */
+   void *buf;
+
+   /* Original size of the packet buffer for use in SKB truesize. Does not
+* include any padding space used to avoid internal fragmentation.
+*/
+   unsigned int truesize;
+};
+
 /* Internal representation of a receive virtqueue */
 struct receive_queue {
/* Virtqueue associated with this receive_queue */
struct virtqueue *vq;
 
+   /* Circular buffer of mergeable rxbuf contexts. */
+   struct mergeable_receive_buf_ctx *mrg_buf_ctx;
+
+   /* Number of elements & head index of mrg_buf_ctx. Size must be
+* equal to the associated virtqueue's vring size.
+*/
+   unsigned int mrg_buf_ctx_size, mrg_buf_ctx_head;
+
struct napi_struct napi;
 
/* Number of input buffers, and max we've ever had. */
@@ -78,6 +102,9 @@ struct receive_queue {
/* Chain pages by the private ptr. */
struct page *pages;
 
+   /* Average packet length for mergeable receive buffers. */
+   struct ewma mrg_avg_pkt_len;
+
/* Page frag for packet buffer allocation. */
struct page_frag alloc_frag;
 
@@ -327,32 +354,32 @@ err:
 
 static struct sk_buff *receive_mergeable(struct net_device *dev,
 struct receive_queue *rq,
-void *buf,
+struct

[PATCH net-next v2 4/4] virtio-net: initial debugfs support, export mergeable rx buffer size

2014-01-06 Thread Michael Dalton
Add initial support for debugfs to virtio-net. Each virtio-net network
device will have a directory under /virtio-net in debugfs. The
per-network device directory will contain one sub-directory per active,
enabled receive queue. If mergeable receive buffers are enabled, each
receive queue directory will contain a read-only file that returns the
current packet buffer size for the receive queue.

Signed-off-by: Michael Dalton mwdal...@google.com
---
 drivers/net/virtio_net.c | 314 ---
 1 file changed, 296 insertions(+), 18 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index f6e1ee0..5da18d6 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -27,6 +27,9 @@
 #include <linux/slab.h>
 #include <linux/cpu.h>
 #include <linux/average.h>
+#include <linux/seqlock.h>
+#include <linux/kref.h>
+#include <linux/debugfs.h>
 
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
@@ -35,6 +38,9 @@ static bool csum = true, gso = true;
 module_param(csum, bool, 0444);
 module_param(gso, bool, 0444);
 
+/* Debugfs root directory for all virtio-net devices. */
+static struct dentry *virtnet_debugfs_root;
+
 /* FIXME: MTU in config. */
 #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
 #define GOOD_COPY_LEN  128
@@ -102,9 +108,6 @@ struct receive_queue {
/* Chain pages by the private ptr. */
struct page *pages;
 
-   /* Average packet length for mergeable receive buffers. */
-   struct ewma mrg_avg_pkt_len;
-
/* Page frag for packet buffer allocation. */
struct page_frag alloc_frag;
 
@@ -115,6 +118,28 @@ struct receive_queue {
char name[40];
 };
 
+/* Per-receive queue statistics exported via debugfs. */
+struct receive_queue_stats {
+   /* Average packet length of receive queue (for mergeable rx buffers). */
+   struct ewma avg_pkt_len;
+
+   /* Per-receive queue stats debugfs directory. */
+   struct dentry *dbg;
+
+   /* Reference count for the receive queue statistics, needed because
+* an open debugfs file may outlive the receive queue and netdevice.
+* Open files will remain in-use until all outstanding file descriptors
+* are closed, even after the underlying file is unlinked.
+*/
+   struct kref refcount;
+
+   /* Sequence counter to allow debugfs readers to safely access stats.
+* Assumes a single virtio-net writer, which is enforced by virtio-net
+* and NAPI.
+*/
+   seqcount_t dbg_seq;
+};
+
 struct virtnet_info {
struct virtio_device *vdev;
struct virtqueue *cvq;
@@ -147,6 +172,15 @@ struct virtnet_info {
/* Active statistics */
struct virtnet_stats __percpu *stats;
 
+   /* Per-receive queue statistics exported via debugfs. Stored in
+* virtnet_info to survive freeze/restore -- a task may have a per-rq
+* debugfs file open at the time of freeze.
+*/
+   struct receive_queue_stats **rq_stats;
+
+   /* Per-netdevice debugfs directory. */
+   struct dentry *dbg_dev_root;
+
/* Work struct for refilling if we run low on memory. */
struct delayed_work refill;
 
@@ -358,6 +392,8 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
 unsigned int len)
 {
struct skb_vnet_hdr *hdr = ctx->buf;
+   struct virtnet_info *vi = netdev_priv(dev);
+   struct receive_queue_stats *rq_stats = vi->rq_stats[vq2rxq(rq->vq)];
int num_buf = hdr->mhdr.num_buffers;
struct page *page = virt_to_head_page(ctx->buf);
int offset = ctx->buf - page_address(page);
@@ -413,7 +449,9 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
}
}
 
-   ewma_add(&rq->mrg_avg_pkt_len, head_skb->len);
+   write_seqcount_begin(&rq_stats->dbg_seq);
+   ewma_add(&rq_stats->avg_pkt_len, head_skb->len);
+   write_seqcount_end(&rq_stats->dbg_seq);
return head_skb;
 
 err_skb:
@@ -600,18 +638,30 @@ static int add_recvbuf_big(struct receive_queue *rq, 
gfp_t gfp)
return err;
 }
 
+static unsigned int get_mergeable_buf_len(struct ewma *avg_pkt_len)
+{
+   const size_t hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+   unsigned int len;
+
+   len = hdr_len + clamp_t(unsigned int, ewma_read(avg_pkt_len),
+   GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
+   return ALIGN(len, L1_CACHE_BYTES);
+}
+
 static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
 {
const unsigned int ring_size = rq->mrg_buf_ctx_size;
-   const size_t hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
struct page_frag *alloc_frag = &rq->alloc_frag;
+   struct virtnet_info *vi = rq->vq->vdev->priv;
struct mergeable_receive_buf_ctx *ctx;
int err;
unsigned int len, hole;
 
-   len = hdr_len + clamp_t(unsigned

Re: [PATCH net-next 3/3] net: auto-tune mergeable rx buffer size for improved performance

2013-12-27 Thread Michael Dalton
I'm working on a followup patchset to address current feedback. I think
it will be cleaner to do a debugfs implementation for per-receive queue
packet buffer size exporting, so I'm trying that out.

On Thu, Dec 26, 2013 at 7:04 PM, Jason Wang jasow...@redhat.com wrote:
> We can make this more accurate by using extra data structure to track
> the real buf size and using it as token.

I agree -- we can do precise buffer total len tracking. Something like
struct mergeable_packet_buffer_ctx {
   void *buf;
   unsigned int total_len;
};

Each receive queue could have a pointer to an array of N buffer contexts,
where N is queue size (kzalloc'd in init_vqs or similar). That would
allow us to allocate all of our buffer context data at startup.
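A rough sketch of that idea (illustrative names only - the mrg_buf_ctx field
and helper below are assumptions, not code from any posted patch): allocate
one context slot per ring entry up front, so refill never needs a per-buffer
allocation.

struct mergeable_packet_buffer_ctx {
	void *buf;
	unsigned int total_len;
};

static int alloc_mrg_buf_ctx(struct receive_queue *rq)
{
	unsigned int ring_size = virtqueue_get_vring_size(rq->vq);

	/* One context per ring entry, allocated once at init_vqs time. */
	rq->mrg_buf_ctx = kcalloc(ring_size, sizeof(*rq->mrg_buf_ctx),
				  GFP_KERNEL);
	return rq->mrg_buf_ctx ? 0 : -ENOMEM;
}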

Would this be preferred to the current approach or is there another
approach you would prefer? All other things being equal, having precise
length tracking is advantageous, so I'm inclined to try this out and
see how it goes.

I think this is a big design point - for example, if we have an extra
buffer context structure, then per-receive queue frag allocators are not
required for auto-tuning and we can reduce the number of patches in
this patchset.

I'm happy to implement either way.  Thanks!

Best,

Mike
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next 3/3] net: auto-tune mergeable rx buffer size for improved performance

2013-12-26 Thread Michael Dalton
On Mon, Dec 23, 2013 at 4:51 AM, Michael S. Tsirkin m...@redhat.com wrote:
 OK so a high level benchmark shows it's worth it,
 but how well does the logic work?
 I think we should make the buffer size accessible in sysfs
 or debugfs, and look at it, otherwise we don't really know.

Exporting the size sounds good to me, it is definitely an
important metric and would give more visibility to the admin.

Do you have a preference for implementation strategy? I was
thinking of just adding a DEVICE_ATTR to create a read-only sysfs file,
'mergeable_rx_buffer_size', and return a space-separated list of the
current buffer size (computed from the average packet size) for each
receive queue. -EINVAL or a similar error could be returned if the
netdev was not configured for mergeable rx buffers.
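A sketch of that sysfs approach (the attribute name comes from the paragraph
above; get_mergeable_buf_len() and the per-queue EWMA field are assumptions
about the eventual layout, not settled code):

static ssize_t mergeable_rx_buffer_size_show(struct device *d,
					     struct device_attribute *attr,
					     char *buf)
{
	struct virtnet_info *vi = netdev_priv(to_net_dev(d));
	ssize_t len = 0;
	int i;

	if (!vi->mergeable_rx_bufs)
		return -EINVAL;
	/* One space-separated estimate per receive queue. */
	for (i = 0; i < vi->curr_queue_pairs; i++)
		len += scnprintf(buf + len, PAGE_SIZE - len, "%u ",
				 get_mergeable_buf_len(&vi->rq[i].mrg_avg_pkt_len));
	len += scnprintf(buf + len, PAGE_SIZE - len, "\n");
	return len;
}
static DEVICE_ATTR_RO(mergeable_rx_buffer_size);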

 I don't get the real motivation for this.

 We have skbs A,B,C sharing a page, with chunk D being unused.
 This randomly charges chunk D to an skb that ended up last
 in the page.
 Correct?
 Why does this make sense?

The intent of this code is to adjust the SKB true size for
the packet. We should completely use each packet buffer except
for the last buffer. For all buffers except the last buffer, it
should be the case that 'len' (bytes received) equals the buffer size. For
the last buffer, this code adjusts the truesize by comparing the
approximated buffer size with the bytes received into the buffer,
and adding the difference to the SKB truesize if the buffer size
is greater than the number of bytes received.

We approximate the buffer size by using the last packet buffer size
from that same page, which as you have correctly noted may be a buffer
that belongs to a different packet on the same virtio-net device. This
buffer size should be very close to the actual buffer size because our
EWMA estimator uses a high weight (so the packet buffer size changes very
slowly) and there are only a handful of packets on a page (even at order-3).

 Why head_skb only? Why not full buffer size that comes from host?
 This is simply len.

Sorry, I believe this code fragment should be clearer. Basically, we
have a corner case in that for packets with size <= GOOD_COPY_LEN, there
are no frags because page_to_skb() already unref'd the page and the entire
packet contents are copied to skb->data. In this case, the SKB truesize
is already accurate and should not be updated (and it would be unsafe to
access page->private as the page is already unref'd).
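A sketch of the guard being discussed (assuming, per this thread, that the
last buffer size allocated from a page is kept in page->private, and that
head_skb only carries frags when the packet was larger than GOOD_COPY_LEN):

if (skb_shinfo(head_skb)->nr_frags) {
	/* Packet was not fully copied into skb->data, so the page is still
	 * referenced and page->private holds the last buffer size
	 * allocated from it.
	 */
	u32 est_buffer_len = page_private(page);

	if (est_buffer_len > len)
		head_skb->truesize += est_buffer_len - len;
}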

I'll look at the above code again and cleanup (please let me know if you
have a preference) and/or add a comment to clarify.

Best,

Mike
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next 2/3] virtio-net: use per-receive queue page frag alloc for mergeable bufs

2013-12-26 Thread Michael Dalton
On Mon, Dec 23, 2013 at 11:37 AM, Michael S. Tsirkin m...@redhat.com wrote:
 So there isn't a conflict with respect to locking.

 Is it problematic to use same page_frag with both GFP_ATOMIC and with
 GFP_KERNEL? If yes why?

I believe it is safe to use the same page_frag and I will send out a
followup patchset using just the per-receive page_frags. For future
consideration, Eric noted that disabling NAPI before GFP_KERNEL
allocs can potentially inhibit virtio-net network processing for some
time (e.g., during a blocking memory allocation or preemption).

Best,

Mike
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH stable 1/2] virtio_net: fix error handling for mergeable buffers

2013-12-25 Thread Michael Dalton
Hi Michael, quick question below:

On Wed, Dec 25, 2013 at 6:56 AM, Michael S. Tsirkin m...@redhat.com wrote:
 if (i >= MAX_SKB_FRAGS) {
 pr_debug("%s: packet too long\n", skb->dev->name);
 skb->dev->stats.rx_length_errors++;
 -   return -EINVAL;
 +   return NULL;
 }

Should this error handling path free the SKB before returning NULL?
It seems like if we just return NULL we may leak memory.
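If that is indeed a leak, a minimal sketch of the fix (assuming no other
reference to the partially built skb is held at this point) would be:

if (i >= MAX_SKB_FRAGS) {
	pr_debug("%s: packet too long\n", skb->dev->name);
	skb->dev->stats.rx_length_errors++;
	dev_kfree_skb(skb);	/* drop the partially built skb */
	return NULL;
}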

Best,

Mike
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH stable 1/2] virtio_net: fix error handling for mergeable buffers

2013-12-25 Thread Michael Dalton
Acked-by: Michael Dalton mwdal...@google.com
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH stable] virtio_net: don't leak memory or block when too many frags

2013-12-25 Thread Michael Dalton
Acked-by: Michael Dalton mwdal...@google.com
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH net-next 2/3] virtio-net: use per-receive queue page frag alloc for mergeable bufs

2013-12-16 Thread Michael Dalton
The virtio-net driver currently uses netdev_alloc_frag() for GFP_ATOMIC
mergeable rx buffer allocations. This commit migrates virtio-net to use
per-receive queue page frags for GFP_ATOMIC allocation. This change unifies
mergeable rx buffer memory allocation, which now will use skb_refill_frag()
for both atomic and GFP-WAIT buffer allocations.

To address fragmentation concerns, if after buffer allocation there
is too little space left in the page frag to allocate a subsequent
buffer, the remaining space is added to the current allocated buffer
so that the remaining space can be used to store packet data.
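The hunk that implements this is truncated further down; roughly, the idea
looks like the following sketch (field names follow the patch text, the exact
code may differ):

	buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
	get_page(alloc_frag->page);
	len = MERGE_BUFFER_LEN;
	alloc_frag->offset += len;
	hole = alloc_frag->size - alloc_frag->offset;
	if (hole < MERGE_BUFFER_LEN) {
		/* Too little room for another buffer: give the tail of the
		 * page frag to this buffer so it can still hold packet data.
		 */
		len += hole;
		alloc_frag->offset += hole;
	}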

Signed-off-by: Michael Dalton mwdal...@google.com
---
 drivers/net/virtio_net.c | 69 ++--
 1 file changed, 38 insertions(+), 31 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index c51a988..d38d130 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -78,6 +78,9 @@ struct receive_queue {
/* Chain pages by the private ptr. */
struct page *pages;
 
+   /* Page frag for GFP_ATOMIC packet buffer allocation. */
+   struct page_frag atomic_frag;
+
/* RX: fragments + linear part + virtio header */
struct scatterlist sg[MAX_SKB_FRAGS + 2];
 
@@ -127,9 +130,9 @@ struct virtnet_info {
struct mutex config_lock;
 
/* Page_frag for GFP_KERNEL packet buffer allocation when we run
-* low on memory.
+* low on memory. May sleep.
 */
-   struct page_frag alloc_frag;
+   struct page_frag sleep_frag;
 
/* Does the affinity hint is set for virtqueues? */
bool affinity_hint_set;
@@ -336,8 +339,8 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
	int num_buf = hdr->mhdr.num_buffers;
struct page *page = virt_to_head_page(buf);
int offset = buf - page_address(page);
-   struct sk_buff *head_skb = page_to_skb(rq, page, offset, len,
-  MERGE_BUFFER_LEN);
+   int truesize = max_t(int, len, MERGE_BUFFER_LEN);
+   struct sk_buff *head_skb = page_to_skb(rq, page, offset, len, truesize);
struct sk_buff *curr_skb = head_skb;
 
if (unlikely(!curr_skb))
@@ -353,11 +356,6 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
dev-stats.rx_length_errors++;
goto err_buf;
}
-   if (unlikely(len > MERGE_BUFFER_LEN)) {
-   pr_debug("%s: rx error: merge buffer too long\n",
-            dev->name);
-   len = MERGE_BUFFER_LEN;
-   }
 
page = virt_to_head_page(buf);
--rq-num;
@@ -376,19 +374,20 @@ static struct sk_buff *receive_mergeable(struct 
net_device *dev,
head_skb-truesize += nskb-truesize;
num_skb_frags = 0;
}
+   truesize = max_t(int, len, MERGE_BUFFER_LEN);
if (curr_skb != head_skb) {
head_skb-data_len += len;
head_skb-len += len;
-   head_skb-truesize += MERGE_BUFFER_LEN;
+   head_skb-truesize += truesize;
}
offset = buf - page_address(page);
if (skb_can_coalesce(curr_skb, num_skb_frags, page, offset)) {
put_page(page);
skb_coalesce_rx_frag(curr_skb, num_skb_frags - 1,
-len, MERGE_BUFFER_LEN);
+len, truesize);
} else {
skb_add_rx_frag(curr_skb, num_skb_frags, page,
-   offset, len, MERGE_BUFFER_LEN);
+   offset, len, truesize);
}
}
 
@@ -579,24 +578,24 @@ static int add_recvbuf_big(struct receive_queue *rq, 
gfp_t gfp)
 static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
 {
struct virtnet_info *vi = rq-vq-vdev-priv;
-   char *buf = NULL;
-   int err;
+   struct page_frag *alloc_frag;
+   char *buf;
+   int err, len, hole;
 
-   if (gfp & __GFP_WAIT) {
-   if (skb_page_frag_refill(MERGE_BUFFER_LEN, &vi->alloc_frag,
-                            gfp)) {
-   buf = (char *)page_address(vi->alloc_frag.page) +
-         vi->alloc_frag.offset;
-   get_page(vi->alloc_frag.page);
-   vi->alloc_frag.offset += MERGE_BUFFER_LEN;
-   }
-   } else {
-   buf = netdev_alloc_frag(MERGE_BUFFER_LEN);
-   }
-   if (!buf)
+   alloc_frag = (gfp & __GFP_WAIT) ? &vi->sleep_frag : &rq->atomic_frag;
+   if (unlikely(!skb_page_frag_refill(MERGE_BUFFER_LEN, alloc_frag, gfp)))
return

[PATCH net-next 1/3] net: allow 0 order atomic page alloc in skb_page_frag_refill

2013-12-16 Thread Michael Dalton
skb_page_frag_refill currently permits only order-0 page allocs
unless GFP_WAIT is used. Change skb_page_frag_refill to attempt
higher-order page allocations whether or not GFP_WAIT is used. If
memory cannot be allocated, the allocator will fall back to
successively smaller page allocs (down to order-0 page allocs).

This change brings skb_page_frag_refill in line with the existing
page allocation strategy employed by netdev_alloc_frag, which attempts
higher-order page allocations whether or not GFP_WAIT is set, falling
back to successively lower-order page allocations on failure. Part
of migration of virtio-net to per-receive queue page frag allocators.

Signed-off-by: Michael Dalton mwdal...@google.com
---
 net/core/sock.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index ab20ed9..7383d23 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1865,9 +1865,7 @@ bool skb_page_frag_refill(unsigned int sz, struct 
page_frag *pfrag, gfp_t prio)
	put_page(pfrag->page);
	}
 
-   /* We restrict high order allocations to users that can afford to wait */
-   order = (prio & __GFP_WAIT) ? SKB_FRAG_PAGE_ORDER : 0;
-
+   order = SKB_FRAG_PAGE_ORDER;
do {
gfp_t gfp = prio;
 
-- 
1.8.5.1
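[The hunk above is cut off in the archive. For reference, the fallback
behaviour the changelog describes looks roughly like the following sketch of
the existing skb_page_frag_refill() loop (not part of the patch itself):
high-order attempts are kept cheap with __GFP_NORETRY and the order drops on
each failure.

	int order = SKB_FRAG_PAGE_ORDER;

	do {
		gfp_t gfp = prio;

		if (order)
			gfp |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY;
		pfrag->page = alloc_pages(gfp, order);
		if (likely(pfrag->page)) {
			pfrag->offset = 0;
			pfrag->size = PAGE_SIZE << order;
			return true;
		}
	} while (--order >= 0);
]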

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH net-next 3/3] net: auto-tune mergeable rx buffer size for improved performance

2013-12-16 Thread Michael Dalton
Commit 2613af0ed18a (virtio_net: migrate mergeable rx buffers to page frag
allocators) changed the mergeable receive buffer size from PAGE_SIZE to
MTU-size, introducing a single-stream regression for benchmarks with large
average packet size. There is no single optimal buffer size for all
workloads.  For workloads with packet size <= MTU bytes, MTU + virtio-net
header-sized buffers are preferred as larger buffers reduce the TCP window
due to SKB truesize. However, single-stream workloads with large average
packet sizes have higher throughput if larger (e.g., PAGE_SIZE) buffers
are used.

This commit auto-tunes the mergeable receiver buffer packet size by
choosing the packet buffer size based on an EWMA of the recent packet
sizes for the receive queue. Packet buffer sizes range from MTU_SIZE +
virtio-net header len to PAGE_SIZE. This improves throughput for
large packet workloads, as any workload with average packet size >=
PAGE_SIZE will use PAGE_SIZE buffers.

These optimizations interact positively with recent commit
ba275241030c (virtio-net: coalesce rx frags when possible during rx),
which coalesces adjacent RX SKB fragments in virtio_net. The coalescing
optimizations benefit buffers of any size.

Benchmarks taken from an average of 5 netperf 30-second TCP_STREAM runs
between two QEMU VMs on a single physical machine. Each VM has two VCPUs
with all offloads & vhost enabled. All VMs and vhost threads run in a
single 4 CPU cgroup cpuset, using cgroups to ensure that other processes
in the system will not be scheduled on the benchmark CPUs. Trunk includes
SKB rx frag coalescing.

net-next w/ virtio_net before 2613af0ed18a (PAGE_SIZE bufs): 14642.85Gb/s
net-next (MTU-size bufs):  13170.01Gb/s
net-next + auto-tune: 14555.94Gb/s

Signed-off-by: Michael Dalton mwdal...@google.com
---
 drivers/net/virtio_net.c | 63 +++-
 1 file changed, 46 insertions(+), 17 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index d38d130..904af37 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -26,6 +26,7 @@
 #include <linux/if_vlan.h>
 #include <linux/slab.h>
 #include <linux/cpu.h>
+#include <linux/average.h>
 
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
@@ -36,11 +37,15 @@ module_param(gso, bool, 0444);
 
 /* FIXME: MTU in config. */
 #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
-#define MERGE_BUFFER_LEN (ALIGN(GOOD_PACKET_LEN + \
-sizeof(struct virtio_net_hdr_mrg_rxbuf), \
-L1_CACHE_BYTES))
 #define GOOD_COPY_LEN  128
 
+/* Weight used for the RX packet size EWMA. The average packet size is used to
+ * determine the packet buffer size when refilling RX rings. As the entire RX
+ * ring may be refilled at once, the weight is chosen so that the EWMA will be
+ * insensitive to short-term, transient changes in packet size.
+ */
+#define RECEIVE_AVG_WEIGHT 64
+
 #define VIRTNET_DRIVER_VERSION 1.0.0
 
 struct virtnet_stats {
@@ -78,6 +83,9 @@ struct receive_queue {
/* Chain pages by the private ptr. */
struct page *pages;
 
+   /* Average packet length for mergeable receive buffers. */
+   struct ewma mrg_avg_pkt_len;
+
/* Page frag for GFP_ATOMIC packet buffer allocation. */
struct page_frag atomic_frag;
 
@@ -339,13 +347,11 @@ static struct sk_buff *receive_mergeable(struct 
net_device *dev,
	int num_buf = hdr->mhdr.num_buffers;
struct page *page = virt_to_head_page(buf);
int offset = buf - page_address(page);
-   int truesize = max_t(int, len, MERGE_BUFFER_LEN);
-   struct sk_buff *head_skb = page_to_skb(rq, page, offset, len, truesize);
+   struct sk_buff *head_skb = page_to_skb(rq, page, offset, len, len);
struct sk_buff *curr_skb = head_skb;
 
if (unlikely(!curr_skb))
goto err_skb;
-
while (--num_buf) {
int num_skb_frags;
 
@@ -374,23 +380,40 @@ static struct sk_buff *receive_mergeable(struct 
net_device *dev,
	head_skb->truesize += nskb->truesize;
	num_skb_frags = 0;
	}
-   truesize = max_t(int, len, MERGE_BUFFER_LEN);
	if (curr_skb != head_skb) {
	head_skb->data_len += len;
	head_skb->len += len;
-   head_skb->truesize += truesize;
+   head_skb->truesize += len;
}
offset = buf - page_address(page);
if (skb_can_coalesce(curr_skb, num_skb_frags, page, offset)) {
put_page(page);
skb_coalesce_rx_frag(curr_skb, num_skb_frags - 1,
-len, truesize);
+len, len);
} else {
skb_add_rx_frag(curr_skb, num_skb_frags, page

Re: [PATCH v2] virtio-net: free bufs correctly on invalid packet length

2013-12-06 Thread Michael Dalton
Hi David,

This patch fixes a bug introduced by 2613af0ed18a (virtio_net:
migrate mergeable rx buffers to page frag allocators). The bug
is present in both net-next and net. Thanks

Best,

Mike

On Fri, Dec 6, 2013 at 1:32 PM, David Miller da...@davemloft.net wrote:
 From: Michael Dalton mwdal...@google.com
 Date: Thu,  5 Dec 2013 13:14:05 -0800

 When a packet with invalid length arrives, ensure that the packet
 is freed correctly if mergeable packet buffers and big packets
 (GUEST_TSO4) are both enabled.

 Signed-off-by: Michael Dalton mwdal...@google.com

 Applied, is this needed for -stable?
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 1/2] virtio-net: determine type of bufs correctly

2013-12-05 Thread Michael Dalton
Thanks Andrey, great catch. I believe this issue occurs in one more
place, when packets are dropped if they are too short. I will send
out a patch momentarily to fix that additional case.

Best,

Mike
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH] virtio-net: free bufs correctly on invalid packet length

2013-12-05 Thread Michael Dalton
When a packet with invalid length arrives, ensure that the packet
is freed correctly if mergeable packet buffers and big packets
(GUEST_TSO4) are both enabled.
---
 drivers/net/virtio_net.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 916241d..6a4665c 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -426,10 +426,10 @@ static void receive_buf(struct receive_queue *rq, void 
*buf, unsigned int len)
	if (unlikely(len < sizeof(struct virtio_net_hdr) + ETH_HLEN)) {
	pr_debug("%s: short packet %i\n", dev->name, len);
	dev->stats.rx_length_errors++;
-   if (vi->big_packets)
-   give_pages(rq, buf);
-   else if (vi->mergeable_rx_bufs)
+   if (vi->mergeable_rx_bufs)
	put_page(virt_to_head_page(buf));
+   else if (vi->big_packets)
+   give_pages(rq, buf);
else
dev_kfree_skb(buf);
return;
-- 
1.8.5.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] virtio-net: free bufs correctly on invalid packet length

2013-12-05 Thread Michael Dalton
Thanks Sergei,

Yes, this is a similar bugfix; the patch I saw from Andrey
fixed this issue in free_unused_bufs. The problem also occurs when
dropping a packet that is too short.  Apologies for forgetting to
sign off on the patch, I will re-send.

Best,

Mike
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v2] virtio-net: free bufs correctly on invalid packet length

2013-12-05 Thread Michael Dalton
When a packet with invalid length arrives, ensure that the packet
is freed correctly if mergeable packet buffers and big packets
(GUEST_TSO4) are both enabled.

Signed-off-by: Michael Dalton mwdal...@google.com
---
 drivers/net/virtio_net.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 916241d..6a4665c 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -426,10 +426,10 @@ static void receive_buf(struct receive_queue *rq, void 
*buf, unsigned int len)
	if (unlikely(len < sizeof(struct virtio_net_hdr) + ETH_HLEN)) {
	pr_debug("%s: short packet %i\n", dev->name, len);
	dev->stats.rx_length_errors++;
-   if (vi->big_packets)
-   give_pages(rq, buf);
-   else if (vi->mergeable_rx_bufs)
+   if (vi->mergeable_rx_bufs)
	put_page(virt_to_head_page(buf));
+   else if (vi->big_packets)
+   give_pages(rq, buf);
else
dev_kfree_skb(buf);
return;
-- 
1.8.5.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2] virtio-net: free bufs correctly on invalid packet length

2013-12-05 Thread Michael Dalton
Hi,

A quick note on this patch: I have confirmed that without this
patch a kernel crash occurs if we force a 'packet too short' error
sufficiently many times. This patch eliminates the kernel crash.

Since this crash would be triggered by a hypervisor bug, I made a
small change not reflected in the above patch to make the crash easier
to reproduce for testing purposes. I treated 1 out of every 128 packets
with len < MERGE_BUFFER_LEN as 'too short'. With this change in
place, just running netperf will cause the sender to crash very quickly
(the receiver will transmit pure data ACKs that meet the drop criteria).

If anyone would like to reproduce the crash using the above setup,
I added an unsigned int num_packets field to struct receive_queue and
changed the if condition for the packet too short check in receive_buf()
from:
if (unlikely(len < sizeof(struct virtio_net_hdr) + ETH_HLEN)) {
to:
if (unlikely((len < sizeof(struct virtio_net_hdr) + ETH_HLEN) ||
 (len < MERGE_BUFFER_LEN &&
  ((++rq->num_packets & 127) == 0)))) {

Best,

Mike
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net] virtio-net: fix page refcnt leaking when fail to allocate frag skb

2013-11-19 Thread Michael Dalton
Great catch Jason. I agree this now raises the larger issue of how to
handle a memory alloc failure in the middle of receive. As Eric mentioned,
we can drop the packet and free the remaining (num_buf) frags.

Michael, perhaps I'm missing something, but why would you prefer
pre-allocating buffers in this case? If the guest kernel is OOM'ing,
dropping packets should provide backpressure.

Also, we could just as easily fail the initial skb alloc in page_to_skb,
and I think that case also needs to be handled now in the same fashion as
a memory allocation failure in receive_mergeable.

Best,

Mike
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net] virtio-net: fix page refcnt leaking when fail to allocate frag skb

2013-11-19 Thread Michael Dalton
Hi,

After further reflection I think we're looking at two related issues:
(a) a memory leak that Jason has identified that occurs when a memory
allocation fails in receive_mergeable. Jason's commit solves this issue.
(b) virtio-net does not dequeue all buffers for a packet in the
case that an error occurs on receive and mergeable receive buffers is
enabled.

For (a), this bug is new and due to changes in 2613af0ed18a, and the
net impact is a memory leak of the physical page. However, I believe (b)
has always been possible in some form because if page_to_skb() returns
NULL (e.g., due to SKB allocation failure), receive_mergeable is never
called. AFAICT this is also the behavior prior to 2613af0ed18a.

The net impact of (b) would be that virtio-net would interpret a packet
buffer that is in the middle of a mergeable packet as the start of a
new packet, which is definitely also a bug (and the buffer contents
could contain bytes that resembled a valid virtio-net header).

A solution for (b) will require handling both the page_to_skb memory
allocation failures and the memory allocation failures in
receive_mergeable introduced by 2613af0ed18a.
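One way to address (b) - sketched here with illustrative names, not taken
from any posted patch - is to pop and drop the rest of the packet's buffers
whenever skb construction fails part-way, so a mid-packet buffer is never
mistaken for the start of a new packet:

static void drop_remaining_mrg_bufs(struct receive_queue *rq, int num_buf)
{
	void *buf;
	unsigned int len;

	while (--num_buf) {
		buf = virtqueue_get_buf(rq->vq, &len);
		if (unlikely(!buf))
			break;
		put_page(virt_to_head_page(buf));
		--rq->num;
	}
}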

Best,

Mike
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next 4/4] virtio-net: auto-tune mergeable rx buffer size for improved performance

2013-11-16 Thread Michael Dalton
Hi,

Apologies for the delay, I wanted to get answers together for all of the
open questions raised on this thread. The first patch in this patchset is
already merged, so after the merge window re-opens I'll send out new
patchsets covering the remaining 3 patches.

After reflecting on feedback from this thread, I think it makes sense to
separate out the per-receive queue page frag allocator patches from the
autotuning patch when the merge window re-opens. The per-receive queue
page frag allocator patches help deal with fragmentation (PAGE_SIZE does
not evenly divide MERGE_BUFFER_LEN), and provide benefits whether or not
auto-tuning is present. Auto-tuning can then be evaluated separately.

On Wed, 2013-11-13 at 15:10 +0800, Jason Wang wrote:
 There's one concern with EWMA. How well does it handle multiple streams
 each with different packet size? E.g there may be two flows, one with
 256 bytes each packet another is 64K.  Looks like it can result we
 allocate PAGE_SIZE buffer for 256 (which is bad since the
 payload/truesize is low) bytes or 1500+ for 64K buffer (which is ok
 since we can do coalescing).

If multiple streams of very different packet sizes are arriving on the
same receive queue, no single buffer size is ideal (e.g., large buffers
will cause small packets to take up too much memory, but small buffers
may reduce throughput somewhat for large packets). We don't know a
priori which packet will be delivered to a given receive queue packet
buffer, so any size we choose will not be optimal for all cases if we
have significant variance in packet sizes.

 Do you have perf numbers that just without this patch? We need to know
 how much EWMA help exactly.

Great point, I should have included that in my initial benchmarking. I ran
a benchmark in the same environment as my initial results, this time with
the first 3 patches in this patchset applied but without the autotuning
patch.  The average performance over 5 runs of 30-second netperf was
13760.85Gb/s.

 Is there a chance that est_buffer_len was smaller than or equal with len?

Yes, that is possible if the average packet length decreases.

 Not sure this is accurate, since buflen may change and several frags may
 share a single page. So the est_buffer_len we get in receive_mergeable()
 may not be the correct value.

I agree it may not be 100% accurate, but we can choose a weight that will
cause the average packet size to change slowly. Even with an order-3 page
there will not be too many packet buffers allocated from a single page
(with 4KB pages, an order-3 page holds only about 20 MTU-sized buffers).

On Wed, 2013-11-13 at 17:42 +0800, Michael S. Tsirkin wrote:
 I'm not sure it's useful - no one is likely to tune it in practice.
 But how about a comment explaining how was the number chosen?

That makes sense, I agree a comment is needed. The weight determines
how quickly we react to a change in packet size. As we attempt to fill
all free ring entries on refill (in try_fill_recv), I chose a large
weight so that a short burst of traffic with a different average packet
size will not substantially shift the packet buffer size for the entire
ring the next time try_fill_recv is called. I'll add a comment that
compares 64 to nearby values (32, 16).
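As a quick back-of-the-envelope comparison (userspace sketch, assuming the
usual update avg' = (avg*(weight-1) + sample)/weight), the weight controls
how many packets it takes the estimate to move halfway toward a new
steady-state packet size:

#include <stdio.h>

/* Count samples until the average covers half the distance from old_len to
 * new_len for a given EWMA weight (assumes new_len > old_len).
 */
static int samples_to_halfway(int weight, double old_len, double new_len)
{
	double avg = old_len;
	int n = 0;

	while (avg < (old_len + new_len) / 2) {
		avg = (avg * (weight - 1) + new_len) / weight;
		n++;
	}
	return n;
}

int main(void)
{
	int weights[] = { 16, 32, 64 };
	int i;

	for (i = 0; i < 3; i++)
		printf("weight %d: ~%d packets to move halfway\n",
		       weights[i], samples_to_halfway(weights[i], 100.0, 1500.0));
	return 0;
}

With weight 64 a burst of a few dozen unusually sized packets barely moves
the estimate, which is what we want when try_fill_recv refills the whole ring
at once.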

Best,

Mike
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH] virtio-net: mergeable buffer size should include virtio-net header

2013-11-14 Thread Michael Dalton
Commit 2613af0ed18a (virtio_net: migrate mergeable rx buffers to page
frag allocators) changed the mergeable receive buffer size from PAGE_SIZE
to MTU-size. However, the merge buffer size does not take into account the
size of the virtio-net header. Consequently, packets that are MTU-size
will take two buffers instead of one (to store the virtio-net header),
substantially decreasing the throughput of MTU-size traffic due to TCP
window / SKB truesize effects.

This commit changes the mergeable buffer size to include the virtio-net
header. The buffer size is cacheline-aligned because skb_page_frag_refill
will not automatically align the requested size.

Benchmarks taken from an average of 5 netperf 30-second TCP_STREAM runs
between two QEMU VMs on a single physical machine. Each VM has two VCPUs and
vhost enabled. All VMs and vhost threads run in a single 4 CPU cgroup
cpuset, using cgroups to ensure that other processes in the system will not
be scheduled on the benchmark CPUs. Transmit offloads and mergeable receive
buffers are enabled, but guest_tso4 / guest_csum are explicitly disabled to
force MTU-sized packets on the receiver.

net-next trunk before 2613af0ed18a (PAGE_SIZE buf): 3861.08Gb/s
net-next trunk (MTU 1500 - packet uses two buf due to size bug): 4076.62Gb/s
net-next trunk (MTU 1480 - packet fits in one buf): 6301.34Gb/s
net-next trunk w/ size fix (MTU 1500 - packet fits in one buf): 6445.44Gb/s

Suggested-by: Eric Northup digitale...@google.com
Signed-off-by: Michael Dalton mwdal...@google.com
---
 drivers/net/virtio_net.c | 30 --
 1 file changed, 16 insertions(+), 14 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 01f4eb5..69fb225 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -36,7 +36,10 @@ module_param(csum, bool, 0444);
 module_param(gso, bool, 0444);
 
 /* FIXME: MTU in config. */
-#define MAX_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
+#define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
+#define MERGE_BUFFER_LEN (ALIGN(GOOD_PACKET_LEN + \
+sizeof(struct virtio_net_hdr_mrg_rxbuf), \
+L1_CACHE_BYTES))
 #define GOOD_COPY_LEN  128
 
 #define VIRTNET_DRIVER_VERSION 1.0.0
@@ -314,10 +317,10 @@ static int receive_mergeable(struct receive_queue *rq, 
struct sk_buff *head_skb)
	head_skb->dev->stats.rx_length_errors++;
	return -EINVAL;
	}
-   if (unlikely(len > MAX_PACKET_LEN)) {
+   if (unlikely(len > MERGE_BUFFER_LEN)) {
	pr_debug("%s: rx error: merge buffer too long\n",
	 head_skb->dev->name);
-   len = MAX_PACKET_LEN;
+   len = MERGE_BUFFER_LEN;
}
if (unlikely(num_skb_frags == MAX_SKB_FRAGS)) {
struct sk_buff *nskb = alloc_skb(0, GFP_ATOMIC);
@@ -336,18 +339,17 @@ static int receive_mergeable(struct receive_queue *rq, 
struct sk_buff *head_skb)
if (curr_skb != head_skb) {
	head_skb->data_len += len;
	head_skb->len += len;
-   head_skb->truesize += MAX_PACKET_LEN;
+   head_skb->truesize += MERGE_BUFFER_LEN;
}
page = virt_to_head_page(buf);
offset = buf - (char *)page_address(page);
if (skb_can_coalesce(curr_skb, num_skb_frags, page, offset)) {
put_page(page);
skb_coalesce_rx_frag(curr_skb, num_skb_frags - 1,
-len, MAX_PACKET_LEN);
+len, MERGE_BUFFER_LEN);
} else {
skb_add_rx_frag(curr_skb, num_skb_frags, page,
-   offset, len,
-   MAX_PACKET_LEN);
+   offset, len, MERGE_BUFFER_LEN);
}
--rq-num;
}
@@ -383,7 +385,7 @@ static void receive_buf(struct receive_queue *rq, void 
*buf, unsigned int len)
struct page *page = virt_to_head_page(buf);
skb = page_to_skb(rq, page,
  (char *)buf - (char *)page_address(page),
- len, MAX_PACKET_LEN);
+ len, MERGE_BUFFER_LEN);
if (unlikely(!skb)) {
dev-stats.rx_dropped++;
put_page(page);
@@ -471,11 +473,11 @@ static int add_recvbuf_small(struct receive_queue *rq, 
gfp_t gfp)
struct skb_vnet_hdr *hdr;
int err;
 
-   skb = __netdev_alloc_skb_ip_align(vi-dev, MAX_PACKET_LEN, gfp);
+   skb = __netdev_alloc_skb_ip_align(vi-dev, GOOD_PACKET_LEN, gfp);
if (unlikely(!skb

[PATCH net-next 1/4] virtio-net: mergeable buffer size should include virtio-net header

2013-11-12 Thread Michael Dalton
Commit 2613af0ed18a (virtio_net: migrate mergeable rx buffers to page
frag allocators) changed the mergeable receive buffer size from PAGE_SIZE
to MTU-size. However, the merge buffer size does not take into account the
size of the virtio-net header. Consequently, packets that are MTU-size
will take two buffers instead of one (to store the virtio-net header),
substantially decreasing the throughput of MTU-size traffic due to TCP
window / SKB truesize effects.

This commit changes the mergeable buffer size to include the virtio-net
header. The buffer size is cacheline-aligned because skb_page_frag_refill
will not automatically align the requested size.

Benchmarks taken from an average of 5 netperf 30-second TCP_STREAM runs
between two QEMU VMs on a single physical machine. Each VM has two VCPUs and
vhost enabled. All VMs and vhost threads run in a single 4 CPU cgroup
cpuset, using cgroups to ensure that other processes in the system will not
be scheduled on the benchmark CPUs. Transmit offloads and mergeable receive
buffers are enabled, but guest_tso4 / guest_csum are explicitly disabled to
force MTU-sized packets on the receiver.

net-next trunk before 2613af0ed18a (PAGE_SIZE buf): 3861.08Gb/s
net-next trunk (MTU 1500 - packet uses two buf due to size bug): 4076.62Gb/s
net-next trunk (MTU 1480 - packet fits in one buf): 6301.34Gb/s
net-next trunk w/ size fix (MTU 1500 - packet fits in one buf): 6445.44Gb/s

Suggested-by: Eric Northup digitale...@google.com
Signed-off-by: Michael Dalton mwdal...@google.com
---
 drivers/net/virtio_net.c | 30 --
 1 file changed, 16 insertions(+), 14 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 01f4eb5..69fb225 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -36,7 +36,10 @@ module_param(csum, bool, 0444);
 module_param(gso, bool, 0444);
 
 /* FIXME: MTU in config. */
-#define MAX_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
+#define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
+#define MERGE_BUFFER_LEN (ALIGN(GOOD_PACKET_LEN + \
+sizeof(struct virtio_net_hdr_mrg_rxbuf), \
+L1_CACHE_BYTES))
 #define GOOD_COPY_LEN  128
 
 #define VIRTNET_DRIVER_VERSION 1.0.0
@@ -314,10 +317,10 @@ static int receive_mergeable(struct receive_queue *rq, 
struct sk_buff *head_skb)
	head_skb->dev->stats.rx_length_errors++;
	return -EINVAL;
	}
-   if (unlikely(len > MAX_PACKET_LEN)) {
+   if (unlikely(len > MERGE_BUFFER_LEN)) {
	pr_debug("%s: rx error: merge buffer too long\n",
	 head_skb->dev->name);
-   len = MAX_PACKET_LEN;
+   len = MERGE_BUFFER_LEN;
}
if (unlikely(num_skb_frags == MAX_SKB_FRAGS)) {
struct sk_buff *nskb = alloc_skb(0, GFP_ATOMIC);
@@ -336,18 +339,17 @@ static int receive_mergeable(struct receive_queue *rq, 
struct sk_buff *head_skb)
if (curr_skb != head_skb) {
	head_skb->data_len += len;
	head_skb->len += len;
-   head_skb->truesize += MAX_PACKET_LEN;
+   head_skb->truesize += MERGE_BUFFER_LEN;
}
page = virt_to_head_page(buf);
offset = buf - (char *)page_address(page);
if (skb_can_coalesce(curr_skb, num_skb_frags, page, offset)) {
put_page(page);
skb_coalesce_rx_frag(curr_skb, num_skb_frags - 1,
-len, MAX_PACKET_LEN);
+len, MERGE_BUFFER_LEN);
} else {
skb_add_rx_frag(curr_skb, num_skb_frags, page,
-   offset, len,
-   MAX_PACKET_LEN);
+   offset, len, MERGE_BUFFER_LEN);
}
--rq-num;
}
@@ -383,7 +385,7 @@ static void receive_buf(struct receive_queue *rq, void 
*buf, unsigned int len)
struct page *page = virt_to_head_page(buf);
skb = page_to_skb(rq, page,
  (char *)buf - (char *)page_address(page),
- len, MAX_PACKET_LEN);
+ len, MERGE_BUFFER_LEN);
if (unlikely(!skb)) {
dev-stats.rx_dropped++;
put_page(page);
@@ -471,11 +473,11 @@ static int add_recvbuf_small(struct receive_queue *rq, 
gfp_t gfp)
struct skb_vnet_hdr *hdr;
int err;
 
-   skb = __netdev_alloc_skb_ip_align(vi-dev, MAX_PACKET_LEN, gfp);
+   skb = __netdev_alloc_skb_ip_align(vi-dev, GOOD_PACKET_LEN, gfp);
if (unlikely(!skb

[PATCH net-next 3/4] virtio-net: use per-receive queue page frag alloc for mergeable bufs

2013-11-12 Thread Michael Dalton
The virtio-net driver currently uses netdev_alloc_frag() for GFP_ATOMIC
mergeable rx buffer allocations. This commit migrates virtio-net to use
per-receive queue page frags for GFP_ATOMIC allocation. This change unifies
mergeable rx buffer memory allocation, which now will use skb_refill_frag()
for both atomic and GFP-WAIT buffer allocations.

To address fragmentation concerns, if after buffer allocation there
is too little space left in the page frag to allocate a subsequent
buffer, the remaining space is added to the current allocated buffer
so that the remaining space can be used to store packet data.

Signed-off-by: Michael Dalton mwdal...@google.com
---
 drivers/net/virtio_net.c | 70 +++-
 1 file changed, 39 insertions(+), 31 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 69fb225..0c93054 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -79,6 +79,9 @@ struct receive_queue {
/* Chain pages by the private ptr. */
struct page *pages;
 
+   /* Page frag for GFP_ATOMIC packet buffer allocation. */
+   struct page_frag atomic_frag;
+
/* RX: fragments + linear part + virtio header */
struct scatterlist sg[MAX_SKB_FRAGS + 2];
 
@@ -128,9 +131,9 @@ struct virtnet_info {
struct mutex config_lock;
 
/* Page_frag for GFP_KERNEL packet buffer allocation when we run
-* low on memory.
+* low on memory. May sleep.
 */
-   struct page_frag alloc_frag;
+   struct page_frag sleep_frag;
 
/* Does the affinity hint is set for virtqueues? */
bool affinity_hint_set;
@@ -305,7 +308,7 @@ static int receive_mergeable(struct receive_queue *rq, 
struct sk_buff *head_skb)
struct sk_buff *curr_skb = head_skb;
char *buf;
struct page *page;
-   int num_buf, len, offset;
+   int num_buf, len, offset, truesize;
 
num_buf = hdr-mhdr.num_buffers;
while (--num_buf) {
@@ -317,11 +320,7 @@ static int receive_mergeable(struct receive_queue *rq, 
struct sk_buff *head_skb)
head_skb-dev-stats.rx_length_errors++;
return -EINVAL;
}
-   if (unlikely(len > MERGE_BUFFER_LEN)) {
-   pr_debug("%s: rx error: merge buffer too long\n",
-            head_skb->dev->name);
-   len = MERGE_BUFFER_LEN;
-   }
+   truesize = max_t(int, len, MERGE_BUFFER_LEN);
if (unlikely(num_skb_frags == MAX_SKB_FRAGS)) {
struct sk_buff *nskb = alloc_skb(0, GFP_ATOMIC);
if (unlikely(!nskb)) {
@@ -339,17 +338,17 @@ static int receive_mergeable(struct receive_queue *rq, 
struct sk_buff *head_skb)
if (curr_skb != head_skb) {
head_skb-data_len += len;
head_skb-len += len;
-   head_skb-truesize += MERGE_BUFFER_LEN;
+   head_skb-truesize += truesize;
}
page = virt_to_head_page(buf);
offset = buf - (char *)page_address(page);
if (skb_can_coalesce(curr_skb, num_skb_frags, page, offset)) {
put_page(page);
skb_coalesce_rx_frag(curr_skb, num_skb_frags - 1,
-len, MERGE_BUFFER_LEN);
+len, truesize);
} else {
skb_add_rx_frag(curr_skb, num_skb_frags, page,
-   offset, len, MERGE_BUFFER_LEN);
+   offset, len, truesize);
}
--rq-num;
}
@@ -383,9 +382,10 @@ static void receive_buf(struct receive_queue *rq, void 
*buf, unsigned int len)
skb_trim(skb, len);
} else if (vi-mergeable_rx_bufs) {
struct page *page = virt_to_head_page(buf);
+   int truesize = max_t(int, len, MERGE_BUFFER_LEN);
skb = page_to_skb(rq, page,
  (char *)buf - (char *)page_address(page),
- len, MERGE_BUFFER_LEN);
+ len, truesize);
if (unlikely(!skb)) {
dev-stats.rx_dropped++;
put_page(page);
@@ -540,24 +540,24 @@ static int add_recvbuf_big(struct receive_queue *rq, 
gfp_t gfp)
 static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
 {
struct virtnet_info *vi = rq-vq-vdev-priv;
-   char *buf = NULL;
-   int err;
+   struct page_frag *alloc_frag;
+   char *buf;
+   int err, len, hole;
 
-   if (gfp & __GFP_WAIT) {
-   if (skb_page_frag_refill(MERGE_BUFFER_LEN, &vi->alloc_frag,
-                            gfp

[PATCH net-next 2/4] net: allow 0 order atomic page alloc in skb_page_frag_refill

2013-11-12 Thread Michael Dalton
skb_page_frag_refill currently permits only order-0 page allocs
unless GFP_WAIT is used. Change skb_page_frag_refill to attempt
higher-order page allocations whether or not GFP_WAIT is used. If
memory cannot be allocated, the allocator will fall back to
successively smaller page allocs (down to order-0 page allocs).

This change brings skb_page_frag_refill in line with the existing
page allocation strategy employed by netdev_alloc_frag, which attempts
higher-order page allocations whether or not GFP_WAIT is set, falling
back to successively lower-order page allocations on failure. Part
of migration of virtio-net to per-receive queue page frag allocators.

Signed-off-by: Michael Dalton mwdal...@google.com
---
 net/core/sock.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index ab20ed9..7383d23 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1865,9 +1865,7 @@ bool skb_page_frag_refill(unsigned int sz, struct 
page_frag *pfrag, gfp_t prio)
	put_page(pfrag->page);
	}
 
-   /* We restrict high order allocations to users that can afford to wait */
-   order = (prio & __GFP_WAIT) ? SKB_FRAG_PAGE_ORDER : 0;
-
+   order = SKB_FRAG_PAGE_ORDER;
do {
gfp_t gfp = prio;
 
-- 
1.8.4.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH net-next 4/4] virtio-net: auto-tune mergeable rx buffer size for improved performance

2013-11-12 Thread Michael Dalton
Commit 2613af0ed18a (virtio_net: migrate mergeable rx buffers to page frag
allocators) changed the mergeable receive buffer size from PAGE_SIZE to
MTU-size, introducing a single-stream regression for benchmarks with large
average packet size. There is no single optimal buffer size for all workloads.
For workloads with packet size <= MTU bytes, MTU + virtio-net header-sized
buffers are preferred as larger buffers reduce the TCP window due to SKB
truesize. However, single-stream workloads with large average packet sizes
have higher throughput if larger (e.g., PAGE_SIZE) buffers are used.

This commit auto-tunes the mergeable receiver buffer packet size by choosing
the packet buffer size based on an EWMA of the recent packet sizes for the
receive queue. Packet buffer sizes range from MTU_SIZE + virtio-net header
len to PAGE_SIZE. This improves throughput for large packet workloads, as
any workload with average packet size >= PAGE_SIZE will use PAGE_SIZE
buffers.

These optimizations interact positively with recent commit
ba275241030c (virtio-net: coalesce rx frags when possible during rx),
which coalesces adjacent RX SKB fragments in virtio_net. The coalescing
optimizations benefit buffers of any size.

Benchmarks taken from an average of 5 netperf 30-second TCP_STREAM runs
between two QEMU VMs on a single physical machine. Each VM has two VCPUs
with all offloads & vhost enabled. All VMs and vhost threads run in a
single 4 CPU cgroup cpuset, using cgroups to ensure that other processes
in the system will not be scheduled on the benchmark CPUs. Trunk includes
SKB rx frag coalescing.

net-next trunk w/ virtio_net before 2613af0ed18a (PAGE_SIZE bufs): 14642.85Gb/s
net-next trunk (MTU-size bufs):  13170.01Gb/s
net-next trunk + auto-tune: 14555.94Gb/s

Signed-off-by: Michael Dalton mwdal...@google.com
---
 drivers/net/virtio_net.c | 73 +++-
 1 file changed, 53 insertions(+), 20 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 0c93054..b1086e0 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -27,6 +27,7 @@
 #include <linux/if_vlan.h>
 #include <linux/slab.h>
 #include <linux/cpu.h>
+#include <linux/average.h>
 
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
@@ -37,10 +38,8 @@ module_param(gso, bool, 0444);
 
 /* FIXME: MTU in config. */
 #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
-#define MERGE_BUFFER_LEN (ALIGN(GOOD_PACKET_LEN + \
-sizeof(struct virtio_net_hdr_mrg_rxbuf), \
-L1_CACHE_BYTES))
 #define GOOD_COPY_LEN  128
+#define RECEIVE_AVG_WEIGHT 64
 
 #define VIRTNET_DRIVER_VERSION 1.0.0
 
@@ -79,6 +78,9 @@ struct receive_queue {
/* Chain pages by the private ptr. */
struct page *pages;
 
+   /* Average packet length for mergeable receive buffers. */
+   struct ewma mrg_avg_pkt_len;
+
/* Page frag for GFP_ATOMIC packet buffer allocation. */
struct page_frag atomic_frag;
 
@@ -302,14 +304,17 @@ static struct sk_buff *page_to_skb(struct receive_queue 
*rq,
return skb;
 }
 
-static int receive_mergeable(struct receive_queue *rq, struct sk_buff 
*head_skb)
+static int receive_mergeable(struct receive_queue *rq, struct sk_buff 
*head_skb,
+struct page *head_page)
 {
struct skb_vnet_hdr *hdr = skb_vnet_hdr(head_skb);
struct sk_buff *curr_skb = head_skb;
+   struct page *page = head_page;
char *buf;
-   struct page *page;
-   int num_buf, len, offset, truesize;
+   int num_buf, len, offset;
+   u32 est_buffer_len;
 
+   len = head_skb-len;
num_buf = hdr-mhdr.num_buffers;
while (--num_buf) {
int num_skb_frags = skb_shinfo(curr_skb)-nr_frags;
@@ -320,7 +325,6 @@ static int receive_mergeable(struct receive_queue *rq, 
struct sk_buff *head_skb)
head_skb-dev-stats.rx_length_errors++;
return -EINVAL;
}
-   truesize = max_t(int, len, MERGE_BUFFER_LEN);
if (unlikely(num_skb_frags == MAX_SKB_FRAGS)) {
struct sk_buff *nskb = alloc_skb(0, GFP_ATOMIC);
if (unlikely(!nskb)) {
@@ -338,20 +342,38 @@ static int receive_mergeable(struct receive_queue *rq, 
struct sk_buff *head_skb)
if (curr_skb != head_skb) {
head_skb-data_len += len;
head_skb-len += len;
-   head_skb-truesize += truesize;
+   head_skb-truesize += len;
}
page = virt_to_head_page(buf);
offset = buf - (char *)page_address(page);
if (skb_can_coalesce(curr_skb, num_skb_frags, page, offset)) {
put_page(page);
skb_coalesce_rx_frag(curr_skb, num_skb_frags - 1

Re: [PATCH net-next] virtio_net: migrate mergeable rx buffers to page frag allocators

2013-10-29 Thread Michael Dalton
Agreed Eric, the buffer size should be increased so that we can accommodate an
MTU-sized packet + mergeable virtio-net header in a single buffer. I will send
a patch shortly that cleans up the #defines as Rusty indicated and increases
the buffer size slightly (by the virtio-net header size), per Eric.

Jason, I'll follow up with you directly - I'd like to know your exact workload
(single-stream or multi-stream netperf?), VM configuration, etc., and also see
if the nit that Eric has pointed out affects your results. It is also worth
noting that we may want to tune the queue sizes for your benchmarks: by
reducing the buffer size from 4KB to MTU-sized while keeping the queue length
constant, we implicitly decrease the number of bytes stored in the virtqueue
for the virtio-net device, so increasing the queue size may help.

Best,

Mike
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH net-next] virtio_net: migrate mergeable rx buffers to page frag allocators

2013-10-28 Thread Michael Dalton
The virtio_net driver's mergeable receive buffer allocator
uses 4KB packet buffers. For MTU-sized traffic, SKB truesize
is  4KB but only ~1500 bytes of the buffer is used to store
packet data, reducing the effective TCP window size
substantially. This patch addresses the performance concerns
with mergeable receive buffers by allocating MTU-sized packet
buffers using page frag allocators. If more than MAX_SKB_FRAGS
buffers are needed, the SKB frag_list is used.
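(The frag_list handling appears further down in the diff and is cut off in
this archive; in rough sketch form it chains a fresh zero-length skb once the
current one runs out of frag slots:

	if (num_skb_frags == MAX_SKB_FRAGS) {
		struct sk_buff *nskb = alloc_skb(0, GFP_ATOMIC);

		if (unlikely(!nskb))
			return -ENOMEM;	/* drop the packet */
		if (curr_skb == head_skb)
			skb_shinfo(curr_skb)->frag_list = nskb;
		else
			curr_skb->next = nskb;
		curr_skb = nskb;
		head_skb->truesize += nskb->truesize;
		num_skb_frags = 0;
	}
)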

Signed-off-by: Michael Dalton mwdal...@google.com
---
 drivers/net/virtio_net.c | 164 ++-
 1 file changed, 106 insertions(+), 58 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 9fbdfcd..113ee93 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -124,6 +124,11 @@ struct virtnet_info {
/* Lock for config space updates */
struct mutex config_lock;
 
+   /* Page_frag for GFP_KERNEL packet buffer allocation when we run
+* low on memory.
+*/
+   struct page_frag alloc_frag;
+
/* Does the affinity hint is set for virtqueues? */
bool affinity_hint_set;
 
@@ -217,33 +222,18 @@ static void skb_xmit_done(struct virtqueue *vq)
netif_wake_subqueue(vi-dev, vq2txq(vq));
 }
 
-static void set_skb_frag(struct sk_buff *skb, struct page *page,
-unsigned int offset, unsigned int *len)
-{
-   int size = min((unsigned)PAGE_SIZE - offset, *len);
-   int i = skb_shinfo(skb)-nr_frags;
-
-   __skb_fill_page_desc(skb, i, page, offset, size);
-
-   skb-data_len += size;
-   skb-len += size;
-   skb-truesize += PAGE_SIZE;
-   skb_shinfo(skb)-nr_frags++;
-   skb_shinfo(skb)-tx_flags |= SKBTX_SHARED_FRAG;
-   *len -= size;
-}
-
 /* Called from bottom half context */
 static struct sk_buff *page_to_skb(struct receive_queue *rq,
-  struct page *page, unsigned int len)
+  struct page *page, unsigned int offset,
+  unsigned int len, unsigned int truesize)
 {
	struct virtnet_info *vi = rq->vq->vdev->priv;
struct sk_buff *skb;
struct skb_vnet_hdr *hdr;
-   unsigned int copy, hdr_len, offset;
+   unsigned int copy, hdr_len, hdr_padded_len;
char *p;
 
-   p = page_address(page);
+   p = page_address(page) + offset;
 
/* copy small packet so we can reuse these pages for small data */
skb = netdev_alloc_skb_ip_align(vi-dev, GOOD_COPY_LEN);
@@ -254,16 +244,17 @@ static struct sk_buff *page_to_skb(struct receive_queue 
*rq,
 
	if (vi->mergeable_rx_bufs) {
	hdr_len = sizeof hdr->mhdr;
-   offset = hdr_len;
+   hdr_padded_len = sizeof hdr->mhdr;
	} else {
	hdr_len = sizeof hdr->hdr;
-   offset = sizeof(struct padded_vnet_hdr);
+   hdr_padded_len = sizeof(struct padded_vnet_hdr);
}
 
memcpy(hdr, p, hdr_len);
 
len -= hdr_len;
-   p += offset;
+   offset += hdr_padded_len;
+   p += hdr_padded_len;
 
copy = len;
if (copy  skb_tailroom(skb))
@@ -273,6 +264,14 @@ static struct sk_buff *page_to_skb(struct receive_queue 
*rq,
len -= copy;
offset += copy;
 
+   if (vi-mergeable_rx_bufs) {
+   if (len)
+   skb_add_rx_frag(skb, 0, page, offset, len, truesize);
+   else
+   put_page(page);
+   return skb;
+   }
+
/*
 * Verify that we can indeed put this data into a skb.
 * This is here to handle cases when the device erroneously
@@ -284,9 +283,12 @@ static struct sk_buff *page_to_skb(struct receive_queue 
*rq,
dev_kfree_skb(skb);
return NULL;
}
-
+   BUG_ON(offset >= PAGE_SIZE);
	while (len) {
-   set_skb_frag(skb, page, offset, len);
+   unsigned int frag_size = min((unsigned)PAGE_SIZE - offset, len);
+   skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, page, offset,
+   frag_size, truesize);
+   len -= frag_size;
	page = (struct page *)page->private;
offset = 0;
}
@@ -297,33 +299,52 @@ static struct sk_buff *page_to_skb(struct receive_queue 
*rq,
return skb;
 }
 
-static int receive_mergeable(struct receive_queue *rq, struct sk_buff *skb)
+static int receive_mergeable(struct receive_queue *rq, struct sk_buff 
*head_skb)
 {
-   struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
+   struct skb_vnet_hdr *hdr = skb_vnet_hdr(head_skb);
+   struct sk_buff *curr_skb = head_skb;
+   char *buf;
struct page *page;
-   int num_buf, i, len;
+   int num_buf, len;
 
num_buf = hdr-mhdr.num_buffers;
while (--num_buf) {
-   i = skb_shinfo(skb)-nr_frags