[ovs-dev] [PATCH v3 3/3] netdev-dpdk: Enable TSO when using multi-seg mbufs

2019-01-12 Thread Tiago Lam
TCP Segmentation Offload (TSO) is a feature which enables the TCP/IP
network stack to delegate segmentation of a TCP segment to the hardware
NIC, thus saving compute resources. This may improve performance
significantly for TCP workloads in virtualized environments.

While a previous commit already added the necessary logic to netdev-dpdk
to deal with packets marked for TSO, this set of changes enables TSO by
default when using multi-segment mbufs.

Thus, to enable TSO on the physical DPDK interfaces, only the following
command needs to be issued before starting OvS:
ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true

Co-authored-by: Mark Kavanagh 

Signed-off-by: Mark Kavanagh 
Signed-off-by: Tiago Lam 
---
 Documentation/automake.mk   |  1 +
 Documentation/topics/dpdk/index.rst |  1 +
 Documentation/topics/dpdk/tso.rst   | 99 +
 NEWS|  1 +
 lib/netdev-dpdk.c   | 70 --
 5 files changed, 167 insertions(+), 5 deletions(-)
 create mode 100644 Documentation/topics/dpdk/tso.rst

diff --git a/Documentation/automake.mk b/Documentation/automake.mk
index 082438e..a20deb8 100644
--- a/Documentation/automake.mk
+++ b/Documentation/automake.mk
@@ -39,6 +39,7 @@ DOC_SOURCE = \
Documentation/topics/dpdk/index.rst \
Documentation/topics/dpdk/bridge.rst \
Documentation/topics/dpdk/jumbo-frames.rst \
+   Documentation/topics/dpdk/tso.rst \
Documentation/topics/dpdk/memory.rst \
Documentation/topics/dpdk/pdump.rst \
Documentation/topics/dpdk/phy.rst \
diff --git a/Documentation/topics/dpdk/index.rst 
b/Documentation/topics/dpdk/index.rst
index cf24a7b..eb2a04d 100644
--- a/Documentation/topics/dpdk/index.rst
+++ b/Documentation/topics/dpdk/index.rst
@@ -40,4 +40,5 @@ The DPDK Datapath
/topics/dpdk/qos
/topics/dpdk/pdump
/topics/dpdk/jumbo-frames
+   /topics/dpdk/tso
/topics/dpdk/memory
diff --git a/Documentation/topics/dpdk/tso.rst 
b/Documentation/topics/dpdk/tso.rst
new file mode 100644
index 000..14f8c39
--- /dev/null
+++ b/Documentation/topics/dpdk/tso.rst
@@ -0,0 +1,99 @@
+..
+  Copyright 2018, Red Hat, Inc.
+
+  Licensed under the Apache License, Version 2.0 (the "License"); you may
+  not use this file except in compliance with the License. You may obtain
+  a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+  WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+  License for the specific language governing permissions and limitations
+  under the License.
+
+  Convention for heading levels in Open vSwitch documentation:
+
+  ===  Heading 0 (reserved for the title in a document)
+  ---  Heading 1
+  ~~~  Heading 2
+  +++  Heading 3
+  '''  Heading 4
+
+  Avoid deeper levels because they do not render well.
+
+===
+TSO
+===
+
+**Note:** This feature is considered experimental.
+
+TCP Segmentation Offload (TSO) is a mechanism which allows a TCP/IP stack to
+offload the segmentation of an oversized TCP segment to the underlying
+physical NIC, thus saving the cycles that would otherwise be required to
+perform this same segmentation in software. Offloading frame segmentation
+achieves computational savings in the core, freeing up CPU cycles for more
+useful work.
+
+A common use case for TSO is virtualization, where traffic coming in from a
+VM can have its TCP segmentation offloaded, thus avoiding segmentation in
+software. Additionally, if the traffic is headed to a VM within the same host,
+further optimization can be expected. As the traffic never leaves the machine,
+no MTU needs to be accounted for, and thus no segmentation and checksum
+calculations are required, which saves yet more cycles. Only when the traffic
+actually leaves the host does segmentation need to happen, in which case it
+will be performed by the egress NIC.
+
+When using TSO with DPDK, the implementation relies on the multi-segment mbufs
+feature, described in :doc:`/topics/dpdk/jumbo-frames`, where each mbuf
+contains ~2KiB of the entire packet's data and is linked to the next mbuf that
+contains the next portion of data.
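+
+For instance, a ~6000B packet handled by such a port is stored as a chain of
+segments rather than in one contiguous buffer (sizes are illustrative)::
+
+    pkt_len = 6000
+    mbuf #1 (~2KiB data) -> mbuf #2 (~2KiB data) -> mbuf #3 (rest) -> NULL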
+
+Enabling TSO
+------------
+
+.. Important::
+
+Once multi-segment mbufs are enabled, TSO will be enabled by default,
+provided there's support for it in the underlying physical NICs attached to
+OvS-DPDK.
+
+When using :doc:`vHost User ports `, TSO may be enabled in one of
+two ways, as follows.
+
+`TSO` is enabled in OvS by the DPDK vHost User backend; when a new guest
+connection is established, `TSO

[ovs-dev] [PATCH v3 2/3] netdev-dpdk: Consider packets marked for TSO.

2019-01-12 Thread Tiago Lam
Previously, TSO was being explicitly disabled on vhost interfaces,
meaning the guests wouldn't have TSO support negotiated. With TSO
negotiated and enabled, packets are now marked for TSO, through the
PKT_TX_TCP_SEG flag.

In order to deal with this type of packet, a new function,
netdev_dpdk_prep_tso_packet(), has been introduced, whose main purpose
is to correctly set the l2, l3 and l4 length members of the mbuf
struct, as well as the appropriate ol_flags. This function supports TSO
for both IPv4 and IPv6.

netdev_dpdk_prep_tso_packet() is then only called when packets are
marked with the PKT_TX_TCP_SEG flag, meaning they have been marked for
TSO, and when the packet will be traversing the NIC.

Additionally, if a packet is marked for TSO but the egress netdev
doesn't support it, the packet is dropped.
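
For illustration, the kind of preparation involved looks roughly as follows
(a hedged sketch, not the patch's code: 'prep_tso_sketch' and its parameters
are invented, while the mbuf fields and PKT_TX_* flags are standard DPDK):

    #include <stdbool.h>
    #include <stdint.h>
    #include <rte_mbuf.h>

    /* Sketch only: fill in the header lengths and offload flags an egress
     * NIC needs in order to segment an oversized TCP packet. */
    static void
    prep_tso_sketch(struct rte_mbuf *m, uint8_t l2_len, uint8_t l3_len,
                    uint8_t l4_len, uint16_t mss, bool is_ipv4)
    {
        m->l2_len = l2_len;    /* E.g. 14 for an untagged Ethernet header. */
        m->l3_len = l3_len;    /* IPv4/IPv6 header length. */
        m->l4_len = l4_len;    /* TCP header length, including options. */
        m->tso_segsz = mss;    /* Maximum payload per segment on the wire. */

        m->ol_flags |= PKT_TX_TCP_SEG | PKT_TX_TCP_CKSUM;
        if (is_ipv4) {
            m->ol_flags |= PKT_TX_IPV4 | PKT_TX_IP_CKSUM;
        } else {
            m->ol_flags |= PKT_TX_IPV6;
        }
    }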

Co-authored-by: Mark Kavanagh 

Signed-off-by: Mark Kavanagh 
Signed-off-by: Tiago Lam 
---
 lib/dp-packet.h|  16 +++
 lib/netdev-bsd.c   |  11 -
 lib/netdev-dpdk.c  | 122 ++---
 lib/netdev-dummy.c |  11 -
 lib/netdev-linux.c |  15 +++
 5 files changed, 149 insertions(+), 26 deletions(-)

diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index 1182b9c..8307c9d 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -104,6 +104,8 @@ static inline void dp_packet_set_size(struct dp_packet *, 
uint32_t);
 static inline uint16_t dp_packet_get_allocated(const struct dp_packet *);
 static inline void dp_packet_set_allocated(struct dp_packet *, uint16_t);
 
+static inline bool dp_packet_is_tso(struct dp_packet *b);
+
 void *dp_packet_resize_l2(struct dp_packet *, int increment);
 void *dp_packet_resize_l2_5(struct dp_packet *, int increment);
 static inline void *dp_packet_eth(const struct dp_packet *);
@@ -758,6 +760,14 @@ dp_packet_set_allocated(struct dp_packet *b, uint16_t s)
 b->mbuf.buf_len = s;
 }
 
+static inline bool
+dp_packet_is_tso(struct dp_packet *b)
+{
+return (b->mbuf.ol_flags & (PKT_TX_TCP_SEG | PKT_TX_L4_MASK))
+   ? true
+   : false;
+}
+
 static inline void
 dp_packet_copy_mbuf_flags(struct dp_packet *dst, const struct dp_packet *src)
 {
@@ -977,6 +987,12 @@ dp_packet_get_allocated(const struct dp_packet *b)
 return b->allocated_;
 }
 
+static inline bool
+dp_packet_is_tso(struct dp_packet *b OVS_UNUSED)
+{
+return false;
+}
+
 static inline void
 dp_packet_set_allocated(struct dp_packet *b, uint16_t s)
 {
diff --git a/lib/netdev-bsd.c b/lib/netdev-bsd.c
index cc4a87e..21cc52e 100644
--- a/lib/netdev-bsd.c
+++ b/lib/netdev-bsd.c
@@ -700,11 +700,20 @@ netdev_bsd_send(struct netdev *netdev_, int qid 
OVS_UNUSED,
 }
 
 DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
+size_t size = dp_packet_size(packet);
+
+/* TSO not supported in BSD netdev */
+if (dp_packet_is_tso(packet)) {
+VLOG_WARN_RL(&rl, "%s: No TSO support on port, TSO packet of size "
+ "%" PRIuSIZE " dropped", name, size);
+
+continue;
+}
+
 /* We need the whole data to send the packet on the device */
 dp_packet_linearize(packet);
 
 const void *data = dp_packet_data(packet);
-size_t size = dp_packet_size(packet);
 
 while (!error) {
 ssize_t retval;
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 942a1d2..b30d791 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -1378,14 +1378,16 @@ netdev_dpdk_vhost_construct(struct netdev *netdev)
 goto out;
 }
 
-err = rte_vhost_driver_disable_features(dev->vhost_id,
-1ULL << VIRTIO_NET_F_HOST_TSO4
-| 1ULL << VIRTIO_NET_F_HOST_TSO6
-| 1ULL << VIRTIO_NET_F_CSUM);
-if (err) {
-VLOG_ERR("rte_vhost_driver_disable_features failed for vhost user "
- "port: %s\n", name);
-goto out;
+if (!dpdk_multi_segment_mbufs) {
+err = rte_vhost_driver_disable_features(dev->vhost_id,
+1ULL << VIRTIO_NET_F_HOST_TSO4
+| 1ULL << VIRTIO_NET_F_HOST_TSO6
+| 1ULL << VIRTIO_NET_F_CSUM);
+if (err) {
+VLOG_ERR("rte_vhost_driver_disable_features failed for vhost user "
+ "client port: %s\n", dev->up.name);
+goto out;
+}
 }
 
 err = rte_vhost_driver_start(dev->vhost_id);
@@ -2030,6 +2032,44 @@ netdev_dpdk_rxq_dealloc(struct netdev_rxq *rxq)
 rte_free(rx);
 }
 
+/* Should only be called if PKT_TX_TCP_SEG is set in ol_flags.
+ * Also sets the PKT_TX_TCP_CKSUM and PKT_TX_IP_CKSUM flags, as well as
+ * PKT_TX_IPV4 or PKT_TX_IPV6, depending on whether the packet is IPv4 or
+ * IPv6. */
+static void
+netdev_dpdk_prep_t

[ovs-dev] [PATCH v3 1/3] netdev-dpdk: Validate packets burst before Tx.

2019-01-12 Thread Tiago Lam
Given that multi-segment mbufs might be sent between interfaces that
support different capabilities, and may even support different layouts
of mbufs, outgoing packets should be validated before being sent on the
egress interface. Thus, netdev_dpdk_eth_tx_burst() now calls DPDK's
rte_eth_tx_prepare() function, if and only if multi-segment mbufs are
enabled, in order to validate the following (taken from the DPDK
documentation) in a device-specific manner:
- Check if packet meets devices requirements for tx offloads.
- Check limitations about number of segments.
- Check additional requirements when debug is enabled.
- Update and/or reset required checksums when tx offload is set for
packet.
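
For reference, the validate-then-transmit pattern the change introduces can
be sketched as follows (hedged: assumes an already-initialised DPDK port, and
'tx_burst_validated' is an illustrative name, not OvS code):

    #include <rte_ethdev.h>
    #include <rte_errno.h>

    /* Validate a burst, then transmit only the packets that passed the
     * device's checks. */
    static uint16_t
    tx_burst_validated(uint16_t port_id, uint16_t queue_id,
                       struct rte_mbuf **pkts, uint16_t cnt)
    {
        /* Packets [0, nb_prep) passed; pkts[nb_prep] is the first failure,
         * with the reason left in rte_errno. */
        uint16_t nb_prep = rte_eth_tx_prepare(port_id, queue_id, pkts, cnt);

        return rte_eth_tx_burst(port_id, queue_id, pkts, nb_prep);
    }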

Signed-off-by: Tiago Lam 
---
 lib/netdev-dpdk.c | 21 +++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 4e96b37..942a1d2 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -2032,6 +2032,10 @@ netdev_dpdk_rxq_dealloc(struct netdev_rxq *rxq)
 
 /* Tries to transmit 'pkts' to txq 'qid' of device 'dev'.  Takes ownership of
  * 'pkts', even in case of failure.
+ * In case multi-segment mbufs / TSO is being used, it also prepares the burst
+ * for transmission. In such cases, only the prepared packets will be sent to
+ * Tx burst, meaning that if, e.g., an invalid packet appears at 'pkts'[3],
+ * only the validated packets in indices 0, 1 and 2 will be sent.
  *
  * Returns the number of packets that weren't transmitted. */
 static inline int
@@ -2039,11 +2043,24 @@ netdev_dpdk_eth_tx_burst(struct netdev_dpdk *dev, int 
qid,
  struct rte_mbuf **pkts, int cnt)
 {
 uint32_t nb_tx = 0;
+uint16_t nb_prep = cnt;
+
+/* If multi-segments is enabled, validate the burst of packets for Tx. */
+if (OVS_UNLIKELY(dpdk_multi_segment_mbufs)) {
+nb_prep = rte_eth_tx_prepare(dev->port_id, qid, pkts, cnt);
+if (nb_prep != cnt) {
+VLOG_WARN_RL(&rl, "%s: Preparing packet tx burst failed (%u/%u "
+ "packets valid): %s", dev->up.name, nb_prep, cnt,
+ rte_strerror(rte_errno));
+}
+}
 
-while (nb_tx != cnt) {
+/* Tx the validated burst of packets only. */
+while (nb_tx != nb_prep) {
 uint32_t ret;
 
-ret = rte_eth_tx_burst(dev->port_id, qid, pkts + nb_tx, cnt - nb_tx);
+ret = rte_eth_tx_burst(dev->port_id, qid, pkts + nb_tx,
+   nb_prep - nb_tx);
 if (!ret) {
 break;
 }
-- 
2.7.4



[ovs-dev] [PATCH v3 0/3] dpdk: Add support for TSO

2019-01-12 Thread Tiago Lam
Enabling TSO offload allows a host stack to delegate the segmentation of
oversized TCP packets to the underlying physical NIC, if supported. In the case
of a VM this means that the segmentation of the packets is not performed by the
guest kernel, but by the host NIC itself. In turn, since the TSO calculations
and checksums are being performed in hardware, this alleviates the CPU load on
the host system. In inter-VM communication this might amount to significant
savings, and higher throughput, even more so if the VMs are running on the
same host.

Thus, although inter-VM communication is already possible as is, it comes at
a cost in CPU cycles, which may affect the overall throughput.

This series adds support for TSO in OvS-DPDK, by making use of the TSO
offloading feature already supported by DPDK vhost backend, having the
following scenarios in mind:
- Inter VM communication on the same host;
- Inter VM communication on different hosts;
- The same two use cases above, but on a VLAN network.

The work is based on [1]; it has been rebased to run on top of the
multi-segment mbufs work (v14) [2] and re-worked to use DPDK v18.11.

[1] https://patchwork.ozlabs.org/patch/749564/
[2] https://mail.openvswitch.org/pipermail/ovs-dev/2019-January/355103.html 

v3: - Rebase on multi-segments v14;
- Fix dp_packet_is_tso() by checking for the PKT_TX_L4_MASK
  offload flag as well;
- In netdev_dpdk_eth_tx_burst(), revert the logic to
  minimize the impact for the default case (multi-segment
  mbufs disabled) (1/3, Ian Stokes);
- Fix warnings in VLOG_WARNs on 32-bit builds and when compiling
  dp_packet_is_tso() without DPDK (2/3, Ian Stokes);
- Fix docs and rename netdev_dpdk_filter_packet_len() to
  netdev_dpdk_filter_packet() as it now filters packet based
  on TSO (3/3, Ian Stokes).

Tiago Lam (3):
  netdev-dpdk: Validate packets burst before Tx.
  netdev-dpdk: Consider packets marked for TSO.
  netdev-dpdk: Enable TSO when using multi-seg mbufs

 Documentation/automake.mk   |   1 +
 Documentation/topics/dpdk/index.rst |   1 +
 Documentation/topics/dpdk/tso.rst   |  99 +
 NEWS|   1 +
 lib/dp-packet.h |  16 +++
 lib/netdev-bsd.c|  11 +-
 lib/netdev-dpdk.c   | 211 +++-
 lib/netdev-dummy.c  |  11 +-
 lib/netdev-linux.c  |  15 +++
 9 files changed, 334 insertions(+), 32 deletions(-)
 create mode 100644 Documentation/topics/dpdk/tso.rst

-- 
2.7.4



[ovs-dev] [PATCH v14 11/11] dpdk-tests: End-to-end tests for multi-seg mbufs.

2019-01-12 Thread Tiago Lam
The following tests are added to the DPDK testsuite to add some
coverage for the multi-segment mbufs:
- Check that multi-segment mbufs are disabled by default;
- Check that providing `other_config:dpdk-multi-seg-mbufs=true` indeed
  enables mbufs;
- Using a DPDK port, send a random packet out and check that `ofctl
  dump-flows` shows the correct amount of packets and bytes sent.

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 tests/system-dpdk.at | 65 
 1 file changed, 65 insertions(+)

diff --git a/tests/system-dpdk.at b/tests/system-dpdk.at
index 1da020a..ca66135 100644
--- a/tests/system-dpdk.at
+++ b/tests/system-dpdk.at
@@ -232,3 +232,68 @@ OVS_VSWITCHD_STOP(["\@does not exist. The Open vSwitch 
kernel module is probably
 \@EAL: No free hugepages reported in hugepages-1048576kB@d"])
 AT_CLEANUP
 dnl --
+
+AT_SETUP([Jumbo frames - Multi-segment disabled by default])
+OVS_DPDK_START()
+
+AT_CHECK([grep "multi-segment mbufs enabled" ovs-vswitchd.log], [1], [])
+OVS_VSWITCHD_STOP("/Global register is changed during/d
+/EAL: No free hugepages reported in hugepages-1048576kB/d
+")
+AT_CLEANUP
+
+AT_SETUP([Jumbo frames - Multi-segment enabled])
+OVS_DPDK_START([dpdk-multi-seg-mbufs=true])
+AT_CHECK([grep "multi-segment mbufs enabled" ovs-vswitchd.log], [], [stdout])
+OVS_VSWITCHD_STOP("/Global register is changed during/d
+/EAL: No free hugepages reported in hugepages-1048576kB/d
+")
+AT_CLEANUP
+
+AT_SETUP([Jumbo frames - Multi-segment mbufs Tx])
+OVS_DPDK_PRE_CHECK()
+OVS_DPDK_START([per-port-memory=true dpdk-multi-seg-mbufs=true])
+
+dnl Add userspace bridge and attach it to OVS
+AT_CHECK([ovs-vsctl add-br br10 -- set bridge br10 datapath_type=netdev])
+AT_CHECK([ovs-vsctl add-port br10 dpdk0 \
+-- set Interface dpdk0 type=dpdk options:dpdk-devargs=$(cat PCI_ADDR) \
+-- set Interface dpdk0 mtu_request=9000], [], [stdout], [stderr])
+
+AT_CHECK([ovs-vsctl show], [], [stdout])
+
+dnl Add flows to send packets out from the 'dpdk0' port
+AT_CHECK([
+ovs-ofctl del-flows br10
+ovs-ofctl add-flow br10 in_port=LOCAL,actions=output:dpdk0
+], [], [stdout])
+
+AT_CHECK([ovs-ofctl dump-flows br10], [], [stdout])
+
+dnl Send a packet out of the 'dpdk0' port
+AT_CHECK([
+ARP_HEADER="09000B0009000A000806000108000604000100010A\
+0100020A02"
+dnl Build a random hex string to append to the ARP_HEADER
+RANDOM_BODY=$(printf '0102030405%.0s' {1..1750})
+dnl 8792B ARP packet
+RANDOM_ARP="$ARP_HEADER$RANDOM_BODY"
+
+ovs-ofctl packet-out br10 "packet=$RANDOM_ARP,action=resubmit:LOCAL"
+], [], [stdout])
+
+AT_CHECK([ovs-ofctl dump-flows br10], [0], [stdout])
+
+dnl Confirm the single packet has been sent with the correct size
+AT_CHECK([ovs-ofctl dump-flows br10 | ofctl_strip | grep in_port], [0], [dnl
+ n_packets=1, n_bytes=8792, in_port=LOCAL actions=output:1
+])
+
+dnl Clean up
+OVS_VSWITCHD_STOP("/does not exist. The Open vSwitch kernel module is probably 
not loaded./d
+/Failed to enable flow control/d
+/failed to connect to \/tmp\/dpdkvhostclient0: No such file or directory/d
+/Global register is changed during/d
+/EAL: No free hugepages reported in hugepages-1048576kB/d
+")
+AT_CLEANUP
-- 
2.7.4



[ovs-dev] [PATCH v14 10/11] dpdk-tests: Accept other configs in OVS_DPDK_START

2019-01-12 Thread Tiago Lam
As it stands, OVS_DPDK_START() won't allow other configs to be set
before starting the ovs-vswitchd daemon. This is a problem since some
configs, such as "dpdk-multi-seg-mbufs=true" for enabling the
multi-segment mbufs, need to be set prior to starting OvS.

To support other options, OVS_DPDK_START() has been modified to accept
extra configs in the form "$config_name=$config_value". It then uses
ovs-vsctl to set the configs.
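
For example, the end-to-end multi-segment tests later in this series start
OvS with:

    OVS_DPDK_START([per-port-memory=true dpdk-multi-seg-mbufs=true])

Each space-separated "$config_name=$config_value" entry is applied with an
`ovs-vsctl --no-wait set Open_vSwitch . other_config:...` call before
ovs-vswitchd is started.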

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
Acked-by: Flavio Leitner 
---
 tests/system-dpdk-macros.at | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/tests/system-dpdk-macros.at b/tests/system-dpdk-macros.at
index f772a19..63d17c1 100644
--- a/tests/system-dpdk-macros.at
+++ b/tests/system-dpdk-macros.at
@@ -33,7 +33,7 @@ m4_define([OVS_DPDK_PRE_PHY_SKIP],
 ])
 
 
-# OVS_DPDK_START()
+# OVS_DPDK_START([other-conf-args])
 #
 # Create an empty database and start ovsdb-server. Add special configuration
 # dpdk-init to enable DPDK functionality. Start ovs-vswitchd connected to that
@@ -60,6 +60,10 @@ m4_define([OVS_DPDK_START],
AT_CHECK([lscpu], [], [stdout])
AT_CHECK([cat stdout | grep "NUMA node(s)" | awk '{c=1; while (c++<$(3)) 
{printf "1024,"}; print "1024"}' > SOCKET_MEM])
AT_CHECK([ovs-vsctl --no-wait set Open_vSwitch . 
other_config:dpdk-socket-mem="$(cat SOCKET_MEM)"])
+   dnl Iterate through $other-conf-args list and include them
+   m4_foreach_w(opt, $1, [
+   AT_CHECK([ovs-vsctl --no-wait set Open_vSwitch . other_config:opt])
+   ])
 
dnl Start ovs-vswitchd.
AT_CHECK([ovs-vswitchd --detach --no-chdir --pidfile --log-file -vvconn 
-vofproto_dpif -vunixctl], [0], [stdout], [stderr])
-- 
2.7.4



[ovs-dev] [PATCH v14 09/11] dpdk-tests: Add unit-tests for multi-seg mbufs.

2019-01-12 Thread Tiago Lam
In order to create a minimal environment that allows the tests to get
mbufs from an existing mempool, the following approach is taken:
- EAL is initialised (by using the main dpdk_init()) and a (very) small
  mempool is instantiated (mimicking the logic in dpdk_mp_create()).
  This mempool instance is global and used by all the tests;
- Packets are then allocated from the instantiated mempool, and tested
  on, by running some operations on them and manipulating data.

The tests introduced focus on testing DPDK dp_packets (where
source=DPBUF_DPDK), linked with a single or multiple mbufs, across
several operations, such as:
- dp_packet_put();
- dp_packet_shift();
- dp_packet_reserve();
- dp_packet_push_uninit();
- dp_packet_clear();
- dp_packet_equal();
- dp_packet_linear_data();
- And as a consequence of some of these, dp_packet_put_uninit() and
  dp_packet_resize__().

Finally, this has also been integrated with the new DPDK testsuite.
Thus, when running `$sudo make check-dpdk` one will also be running
these tests.
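
A minimal sketch of that setup, assuming DPDK's rte_pktmbuf_pool_create()
('create_test_mempool' is an illustrative name; EAL initialisation and error
handling are elided):

    #include <rte_mbuf.h>
    #include <rte_memory.h>
    #include <rte_mempool.h>
    #include "util.h"

    #define N_MBUFS 1024
    #define MBUF_DATA_LEN 2048

    static struct rte_mempool *mp;

    /* One (very) small mempool, created once and shared by all tests;
     * "test-mp" and the sizes mirror the test file. */
    static void
    create_test_mempool(void)
    {
        mp = rte_pktmbuf_pool_create("test-mp", N_MBUFS, 0 /* cache */,
                                     0 /* priv size */,
                                     MBUF_DATA_LEN + RTE_PKTMBUF_HEADROOM,
                                     SOCKET_ID_ANY);
        ovs_assert(mp != NULL);
    }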

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
Acked-by: Flavio Leitner 
---
 tests/automake.mk  |  10 +-
 tests/dpdk-packet-mbufs.at |   7 +
 tests/system-dpdk-testsuite.at |   1 +
 tests/test-dpdk-mbufs.c| 722 +
 4 files changed, 739 insertions(+), 1 deletion(-)
 create mode 100644 tests/dpdk-packet-mbufs.at
 create mode 100644 tests/test-dpdk-mbufs.c

diff --git a/tests/automake.mk b/tests/automake.mk
index 92d56b2..6e6237c 100644
--- a/tests/automake.mk
+++ b/tests/automake.mk
@@ -168,7 +168,8 @@ SYSTEM_DPDK_TESTSUITE_AT = \
tests/system-common-macros.at \
tests/system-dpdk-macros.at \
tests/system-dpdk-testsuite.at \
-   tests/system-dpdk.at
+   tests/system-dpdk.at \
+   tests/dpdk-packet-mbufs.at
 
 check_SCRIPTS += tests/atlocal
 
@@ -423,6 +424,10 @@ tests_ovstest_SOURCES = \
tests/test-vconn.c \
tests/test-aa.c \
tests/test-stopwatch.c
+if DPDK_NETDEV
+tests_ovstest_SOURCES += \
+   tests/test-dpdk-mbufs.c
+endif
 
 if !WIN32
 tests_ovstest_SOURCES += \
@@ -435,6 +440,9 @@ tests_ovstest_SOURCES += \
 endif
 
 tests_ovstest_LDADD = lib/libopenvswitch.la ovn/lib/libovn.la
+if DPDK_NETDEV
+tests_ovstest_LDFLAGS = $(AM_LDFLAGS) $(DPDK_vswitchd_LDFLAGS)
+endif
 
 noinst_PROGRAMS += tests/test-stream
 tests_test_stream_SOURCES = tests/test-stream.c
diff --git a/tests/dpdk-packet-mbufs.at b/tests/dpdk-packet-mbufs.at
new file mode 100644
index 000..f28e4fc
--- /dev/null
+++ b/tests/dpdk-packet-mbufs.at
@@ -0,0 +1,7 @@
+AT_BANNER([OVS-DPDK dp_packet unit tests])
+
+AT_SETUP([OVS-DPDK dp_packet - mbufs allocation])
+AT_KEYWORDS([dp_packet, multi-seg, mbufs])
+AT_CHECK(ovstest test-dpdk-packet, [], [ignore], [ignore])
+
+AT_CLEANUP
diff --git a/tests/system-dpdk-testsuite.at b/tests/system-dpdk-testsuite.at
index 382f09e..f5edf58 100644
--- a/tests/system-dpdk-testsuite.at
+++ b/tests/system-dpdk-testsuite.at
@@ -23,3 +23,4 @@ m4_include([tests/system-common-macros.at])
 m4_include([tests/system-dpdk-macros.at])
 
 m4_include([tests/system-dpdk.at])
+m4_include([tests/dpdk-packet-mbufs.at])
diff --git a/tests/test-dpdk-mbufs.c b/tests/test-dpdk-mbufs.c
new file mode 100644
index 000..0c152bf
--- /dev/null
+++ b/tests/test-dpdk-mbufs.c
@@ -0,0 +1,722 @@
+/*
+ * Copyright (c) 2018 Intel Corporation
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "dp-packet.h"
+#include "ovstest.h"
+#include "dpdk.h"
+#include "smap.h"
+#include "csum.h"
+#include "crc32c.h"
+
+#define N_MBUFS 1024
+#define MBUF_DATA_LEN 2048
+
+static int num_tests = 0;
+
+/* Global var to hold a mempool instance, "test-mp", used in all of the tests
+ * below. This instance is instantiated in dpdk_setup_eal_with_mp(). */
+static struct rte_mempool *mp;
+
+/* Test data used to fill the packets with data. Note that this isn't a string
+ * that repsents a valid packet, by any means. The pattern is generated in set_
+ * testing_pattern_str() and the sole purpose is to verify the data remains the
+ * same after inserting and operating on multi-segment mbufs. */
+static char *test_str;
+
+/* Asserts a dp_packet that holds a single mbuf, where:
+ * - nb_segs must be 1;
+ * - pkt_len must be eq

[ovs-dev] [PATCH v14 08/11] netdev-dpdk: support multi-segment jumbo frames

2019-01-12 Thread Tiago Lam
From: Mark Kavanagh 

Currently, jumbo frame support for OvS-DPDK is implemented by
increasing the size of mbufs within a mempool, such that each mbuf
within the pool is large enough to contain an entire jumbo frame of
a user-defined size. Typically, for each user-defined MTU,
'requested_mtu', a new mempool is created, containing mbufs of size
~requested_mtu.

With the multi-segment approach, a port uses a single mempool,
(containing standard/default-sized mbufs of ~2k bytes), irrespective
of the user-requested MTU value. To accommodate jumbo frames, mbufs
are chained together, where each mbuf in the chain stores a portion of
the jumbo frame. Each mbuf in the chain is termed a segment, hence the
name.

== Enabling multi-segment mbufs ==
Multi-segment and single-segment mbufs are mutually exclusive, and the
user must decide on which approach to adopt on init. The introduction
of a new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this. This
is a global boolean value, which determines how jumbo frames are
represented across all DPDK ports. In the absence of a user-supplied
value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment
mbufs must be explicitly enabled / single-segment mbufs remain the
default.

Setting the field is identical to setting existing DPDK-specific OVSDB
fields:

ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true

Co-authored-by: Tiago Lam 

Signed-off-by: Mark Kavanagh 
Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 Documentation/topics/dpdk/jumbo-frames.rst | 73 ++
 Documentation/topics/dpdk/memory.rst   | 36 +++
 NEWS   |  1 +
 lib/dpdk.c |  8 
 lib/netdev-dpdk.c  | 69 
 lib/netdev-dpdk.h  |  1 +
 vswitchd/vswitch.xml   | 22 +
 7 files changed, 202 insertions(+), 8 deletions(-)

diff --git a/Documentation/topics/dpdk/jumbo-frames.rst 
b/Documentation/topics/dpdk/jumbo-frames.rst
index 00360b4..9804bbb 100644
--- a/Documentation/topics/dpdk/jumbo-frames.rst
+++ b/Documentation/topics/dpdk/jumbo-frames.rst
@@ -71,3 +71,76 @@ Jumbo frame support has been validated against 9728B frames, 
which is the
 largest frame size supported by Fortville NIC using the DPDK i40e driver, but
 larger frames and other DPDK NIC drivers may be supported. These cases are
 common for use cases involving East-West traffic only.
+
+---
+Multi-segment mbufs
+---
+
+Instead of increasing the size of mbufs within a mempool, such that each mbuf
+within the pool is large enough to contain an entire jumbo frame of a
+user-defined size, mbufs can instead be chained together. In this approach each
+mbuf in the chain stores a portion of the jumbo frame, by default ~2K bytes,
+irrespective of the user-requested MTU value. Since each mbuf in the chain is
+termed a segment, this approach is named "multi-segment mbufs".
+
+This approach may bring more flexibility in use cases where the maximum packet
+length may be hard to guess. For example, in cases where packets originate from
+sources marked for offload (such as TSO), each packet may be larger than the
+MTU, and as such, when forwarding it to a DPDK port a single mbuf may not be
+enough to hold all of the packet's data.
+
+Multi-segment and single-segment mbufs are mutually exclusive, and the user
+must decide on which approach to adopt on initialisation. If multi-segment
+mbufs is to be enabled, it can be done so with the following command::
+
+$ ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
+
+Single-segment mbufs still remain the default when using OvS-DPDK, and the
+above option `dpdk-multi-seg-mbufs` must be explicitly set to `true` if
+multi-segment mbufs are to be used.
+
+~
+Performance notes
+~
+
+When using multi-segment mbufs some PMDs may not support vectorized Tx
+functions, due to the non-contiguous nature of chained mbufs. As a result
+this can hurt performance for smaller packet sizes. For example, on a setup
+sending 64B packets at line rate, a decrease of ~20% has been observed. The
+performance impact stops being noticeable for larger packet sizes, although
+the exact threshold will depend on each PMD, and vary between architectures.
+
+Tests performed with the i40e PMD driver only showed this limitation for 64B
+packets, and the same rate was observed when comparing multi-segment mbufs and
+single-segment mbufs for 128B packets. In other words, the 20% drop in
+performance was not observed for packets >= 128B during this test case.
+
+Because of this, the use of multi-segment mbufs is not advised with small

[ovs-dev] [PATCH v14 04/11] dp-packet: Handle multi-seg mbufs in shift() func.

2019-01-12 Thread Tiago Lam
In its current implementation dp_packet_shift() is also unaware of
multi-seg mbufs (which hold data in memory non-contiguously) and assumes
that data exists contiguously in memory, memmove'ing data to perform the
shift.

To add support for multi-seg mbufs a new set of functions was
introduced, dp_packet_mbuf_shift() and dp_packet_mbuf_write(). These
functions are used by dp_packet_shift(), when handling multi-seg mbufs,
to shift and write data within a chain of mbufs.

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 lib/dp-packet.c | 105 
 lib/dp-packet.h |   3 ++
 2 files changed, 108 insertions(+)

diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index f54119d..9ea3f78 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -23,6 +23,11 @@
 #include "openvswitch/dynamic-string.h"
 #include "util.h"
 
+#ifdef DPDK_NETDEV
+#define MBUF_BUF_END(BUF_ADDR, BUF_LEN) \
+(char *) (((char *) BUF_ADDR) + BUF_LEN)
+#endif
+
 static void
 dp_packet_init__(struct dp_packet *b, size_t allocated, enum dp_packet_source 
source)
 {
@@ -290,6 +295,100 @@ dp_packet_prealloc_headroom(struct dp_packet *b, size_t 
size)
 }
 }
 
+#ifdef DPDK_NETDEV
+/* Writes 'len' data bytes into an mbuf chain, starting at the specified
+ * offset.
+ *
+ * 'mbuf', pointer to the destination mbuf in which 'ofs' falls, and where the
+ * data will first be written.
+ * 'ofs', the offset within the provided 'mbuf' at which 'data' is to be
+ * written.
+ * 'len', the size of the to-be-written 'data'.
+ * 'data', pointer to the bytes to be written.
+ *
+ * Note: This function is the counterpart of the `rte_pktmbuf_read()` function
+ * available with DPDK, in rte_mbuf.h. */
+void
+dp_packet_mbuf_write(struct rte_mbuf *mbuf, int16_t ofs, uint32_t len,
+ const void *data)
+{
+char *dst_addr;
+uint16_t data_len;
+int len_copy;
+while (mbuf) {
+if (len == 0) {
+break;
+}
+
+dst_addr = rte_pktmbuf_mtod_offset(mbuf, char *, ofs);
+data_len = MBUF_BUF_END(mbuf->buf_addr, mbuf->buf_len) - dst_addr;
+
+len_copy = MIN(len, data_len);
+/* We don't know if 'data' is the result of a rte_pktmbuf_read() call,
+ * in which case we may end up writing to the same region of memory we
+ * are reading from and overlapping. Hence the use of memmove() here */
+memmove(dst_addr, data, len_copy);
+
+data = ((char *) data) + len_copy;
+len -= len_copy;
+ofs = 0;
+
+mbuf->data_len = len_copy;
+mbuf = mbuf->next;
+}
+}
+
+static void
+dp_packet_mbuf_shift_(struct rte_mbuf *dbuf, int16_t dst_ofs,
+  const struct rte_mbuf *sbuf, uint16_t src_ofs, int len)
+{
+char *rd = xmalloc(sizeof(*rd) * len);
+const char *wd = rte_pktmbuf_read(sbuf, src_ofs, len, rd);
+
+ovs_assert(wd);
+
+dp_packet_mbuf_write(dbuf, dst_ofs, len, wd);
+
+free(rd);
+}
+
+/* Similarly to dp_packet_shift(), shifts the data within the mbufs of a
+ * dp_packet of DPBUF_DPDK source by 'delta' bytes.
+ * Caller must make sure of the following conditions:
+ * - When shifting left, delta can't be bigger than the data_len available in
+ *   the last mbuf;
+ * - When shifting right, delta can't be bigger than the space available in the
+ *   first mbuf (buf_len - data_off).
+ * Both these conditions guarantee that a shift operation doesn't fall outside
+ * the bounds of the existing mbufs, so that the first and last mbufs (when
+ * using multi-segment mbufs), remain the same. */
+static void
+dp_packet_mbuf_shift(struct dp_packet *b, int delta)
+{
+uint16_t src_ofs;
+int16_t dst_ofs;
+
+struct rte_mbuf *mbuf = CONST_CAST(struct rte_mbuf *, &b->mbuf);
+struct rte_mbuf *tmbuf = rte_pktmbuf_lastseg(mbuf);
+
+if (delta < 0) {
+ovs_assert(-delta <= tmbuf->data_len);
+} else {
+ovs_assert(delta < (mbuf->buf_len - mbuf->data_off));
+}
+
+/* Set the destination and source offsets to copy to */
+dst_ofs = delta;
+src_ofs = 0;
+
+/* Shift data from src mbuf and offset to dst mbuf and offset */
+dp_packet_mbuf_shift_(mbuf, dst_ofs, mbuf, src_ofs,
+  rte_pktmbuf_pkt_len(mbuf));
+
+/* Update mbufs' properties, and if using multi-segment mbufs, first and
+ * last mbuf's data_len also needs to be adjusted */
+mbuf->data_off = mbuf->data_off + dst_ofs;
+}
+#endif
+
 /* Shifts all of the data within the allocated space in 'b' by 'delta' bytes.
  * For example, a 'delta' of 1 would cause each byte of data to move one byte
  * forward (from address 'p' to 'p+1'), and a 'delta' of -1 would cause each
@@ -302,6 +401,12 @@ dp_packet_shift(struct dp_packet *b, int delta)
: true);
 
 if (delta != 0) {
+#ifdef DPDK_NETDEV
+if (b->source == DPBUF_DPDK) {
+dp_packet_mbuf_shift(b, delta);
+return;
+

[ovs-dev] [PATCH v14 06/11] dp-packet: Add support for data "linearization".

2019-01-12 Thread Tiago Lam
Previous commits have added support to the dp_packet API to handle
multi-segmented packets, where data is not stored contiguously in
memory. However, in some cases, it is inevitable and data must be
provided contiguously. Examples of such cases are when performing csums
over the entire packet data, or when write()'ing to a file descriptor
(for a tap interface, for example). For such cases, the dp_packet API
has been extended to provide a way to transform a multi-segmented
DPBUF_DPDK packet into a DPBUF_MALLOC system packet (at the expense of a
copy of memory). If the packet's data is already stored in memory
contigously then there's no need to convert the packet.

Thus, the main use cases that were assuming that a dp_packet's data is
always held contiguously in memory were changed to make use of the new
"linear functions" in the dp_packet API when there's a need to traverse
the entire packet's data. Per the examples above, when the packet's data
needs to be written via write() to the tap's file descriptor, or when the
conntrack module needs to verify a packet's checksum, the data is now
linearized.

Additionally, the miniflow_extract() function has been modified to check
if the respective packet headers don't span across multiple mbufs. This
requirement is needed to guarantee that callers can assume headers are
always in contiguous memory.
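
As part of this, the crc32c API in the diff below is split into
continue/finish variants, so a checksum can be accumulated across the
non-contiguous segments of a multi-seg mbuf. A usage sketch
('crc32c_over_segments' and the segment pointers are illustrative):

    #include "crc32c.h"

    /* Accumulate a CRC32C over two non-contiguous chunks, the way a walk
     * over chained mbuf segments would. */
    static ovs_be32
    crc32c_over_segments(const uint8_t *seg1, size_t len1,
                         const uint8_t *seg2, size_t len2)
    {
        uint32_t partial = 0xffffffffL;    /* Same seed crc32c() uses. */

        partial = crc32c_continue(partial, seg1, len1);
        partial = crc32c_continue(partial, seg2, len2);

        return crc32c_finish(partial);
    }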

Signed-off-by: Tiago Lam 
---
 lib/conntrack.c   |   5 ++
 lib/crc32c.c  |  17 +++-
 lib/crc32c.h  |   2 +
 lib/dp-packet.c   |  18 +
 lib/dp-packet.h   | 184 +++---
 lib/dpif-netdev.c |  18 -
 lib/dpif-netlink.c|   3 +
 lib/dpif.c|   6 ++
 lib/flow.c| 113 ++
 lib/flow.h|   4 +-
 lib/mcast-snooping.c  |   2 +
 lib/netdev-bsd.c  |   3 +
 lib/netdev-dummy.c|   6 ++
 lib/netdev-linux.c|   6 ++
 lib/netdev-native-tnl.c   |  26 +++---
 lib/odp-execute.c |  24 +-
 lib/packets.c |  96 +++---
 lib/packets.h |   7 ++
 ofproto/ofproto-dpif-upcall.c |  21 +++--
 ofproto/ofproto-dpif-xlate.c  |  27 ++-
 tests/test-rstp.c |   9 ++-
 tests/test-stp.c  |   9 ++-
 22 files changed, 528 insertions(+), 78 deletions(-)

diff --git a/lib/conntrack.c b/lib/conntrack.c
index 6f6021a..f80b806 100644
--- a/lib/conntrack.c
+++ b/lib/conntrack.c
@@ -1318,6 +1318,11 @@ conntrack_execute(struct conntrack *ct, struct 
dp_packet_batch *pkt_batch,
 struct conn_lookup_ctx ctx;
 
 DP_PACKET_BATCH_FOR_EACH (i, packet, pkt_batch) {
+/* Linearize the packet to ensure conntrack has the whole data */
+if (!dp_packet_is_linear(packet)) {
+dp_packet_linearize(packet);
+}
+
 if (!conn_key_extract(ct, packet, dl_type, &ctx, zone)) {
 packet->md.ct_state = CS_INVALID;
 write_ct_md(packet, zone, NULL, NULL, NULL);
diff --git a/lib/crc32c.c b/lib/crc32c.c
index e8dd6ee..83beec7 100644
--- a/lib/crc32c.c
+++ b/lib/crc32c.c
@@ -141,19 +141,30 @@ ovs_be32
 crc32c(const uint8_t *data, size_t size)
 {
 uint32_t crc = 0xL;
+return crc32c_finish(crc32c_continue(crc, data, size));
+}
 
+uint32_t
+crc32c_continue(uint32_t partial, const uint8_t *data, size_t size)
+{
 while (size--) {
-crc = crc32Table[(crc ^ *data++) & 0xff] ^ (crc >> 8);
+partial = crc32Table[(partial ^ *data++) & 0xff] ^ (partial >> 8);
 }
 
+return partial;
+}
+
+ovs_be32
+crc32c_finish(uint32_t partial)
+{
 /* The result of this CRC calculation provides us a value in the reverse
  * byte-order as compared with our architecture. On big-endian systems,
  * this is opposite to our return type. So, to return a big-endian
  * value, we must swap the byte-order. */
 #if defined(WORDS_BIGENDIAN)
-crc = uint32_byteswap(crc);
+crc = uint32_byteswap(partial);
 #endif
 
 /* Our value is in network byte-order. OVS_FORCE keeps sparse happy. */
-return (OVS_FORCE ovs_be32) ~crc;
+return (OVS_FORCE ovs_be32) ~partial;
 }
diff --git a/lib/crc32c.h b/lib/crc32c.h
index 92c7d7f..17c8190 100644
--- a/lib/crc32c.h
+++ b/lib/crc32c.h
@@ -20,6 +20,8 @@
 
 #include "openvswitch/types.h"
 
+uint32_t crc32c_continue(uint32_t partial, const uint8_t *data, size_t size);
+ovs_be32 crc32c_finish(uint32_t partial);
 ovs_be32 crc32c(const uint8_t *data, size_t);
 
 #endif /* crc32c.h */
diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index 4e4b8fb..32cc881 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -103,6 +103,9 @@ void
 dp_packet_init_dpdk(struct dp_packet *b)
 {
 b->source = DPBUF_DPDK;
+#ifdef DPDK_NETDEV
+b->mstate = NULL;
+#endif
 }
 
 /* Initializes 'b' as an empty dp_packet with an initial capacity of 'size'
@@ -120,

[ovs-dev] [PATCH v14 05/11] dp-packet: copy data from multi-seg. DPDK mbuf

2019-01-12 Thread Tiago Lam
From: Michael Qiu 

When cloning a packet, if the packet source is the DPDK driver,
multi-segment mbufs must be considered, and each segment's data copied
one by one.

Also, a lot of the DPDK mbuf's info is lost during a copy, like packet
type, ol_flags, etc.  That information is very important for DPDK
packet processing.

Co-authored-by: Mark Kavanagh 
Co-authored-by: Tiago Lam 

Signed-off-by: Michael Qiu 
Signed-off-by: Mark Kavanagh 
Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 lib/dp-packet.c   | 62 ++-
 lib/dp-packet.h   | 29 ++
 lib/netdev-dpdk.c |  1 +
 3 files changed, 73 insertions(+), 19 deletions(-)

diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index 9ea3f78..4e4b8fb 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -159,39 +159,63 @@ dp_packet_clone(const struct dp_packet *buffer)
 return dp_packet_clone_with_headroom(buffer, 0);
 }
 
+#ifdef DPDK_NETDEV
+struct dp_packet *
+dp_packet_clone_with_headroom(const struct dp_packet *b, size_t headroom) {
+struct dp_packet *new_buffer;
+uint32_t pkt_len = dp_packet_size(b);
+
+/* Copy multi-seg data. */
+if (b->source == DPBUF_DPDK && !rte_pktmbuf_is_contiguous(&b->mbuf)) {
+void *dst = NULL;
+struct rte_mbuf *mbuf = CONST_CAST(struct rte_mbuf *, &b->mbuf);
+
+new_buffer = dp_packet_new_with_headroom(pkt_len, headroom);
+dst = dp_packet_data(new_buffer);
+dp_packet_set_size(new_buffer, pkt_len);
+
+if (!rte_pktmbuf_read(mbuf, 0, pkt_len, dst)) {
+dp_packet_delete(new_buffer);
+return NULL;
+}
+} else {
+new_buffer = dp_packet_clone_data_with_headroom(dp_packet_data(b),
+dp_packet_size(b),
+headroom);
+}
+
+dp_packet_copy_common_members(new_buffer, b);
+
+dp_packet_copy_mbuf_flags(new_buffer, b);
+if (dp_packet_rss_valid(new_buffer)) {
+new_buffer->mbuf.hash.rss = b->mbuf.hash.rss;
+}
+
+return new_buffer;
+}
+#else
 /* Creates and returns a new dp_packet whose data are copied from 'buffer'.
  * The returned dp_packet will additionally have 'headroom' bytes of
  * headroom. */
 struct dp_packet *
-dp_packet_clone_with_headroom(const struct dp_packet *buffer, size_t headroom)
+dp_packet_clone_with_headroom(const struct dp_packet *b, size_t headroom)
 {
 struct dp_packet *new_buffer;
+uint32_t pkt_len = dp_packet_size(b);
 
-new_buffer = dp_packet_clone_data_with_headroom(dp_packet_data(buffer),
- dp_packet_size(buffer),
- headroom);
-/* Copy the following fields into the returned buffer: l2_pad_size,
- * l2_5_ofs, l3_ofs, l4_ofs, cutlen, packet_type and md. */
-memcpy(&new_buffer->l2_pad_size, &buffer->l2_pad_size,
-sizeof(struct dp_packet) -
-offsetof(struct dp_packet, l2_pad_size));
+new_buffer = dp_packet_clone_data_with_headroom(dp_packet_data(b),
+pkt_len, headroom);
 
-#ifdef DPDK_NETDEV
-new_buffer->mbuf.ol_flags = buffer->mbuf.ol_flags;
-#else
-new_buffer->rss_hash_valid = buffer->rss_hash_valid;
-#endif
+dp_packet_copy_common_members(new_buffer, b);
 
+new_buffer->rss_hash_valid = b->rss_hash_valid;
 if (dp_packet_rss_valid(new_buffer)) {
-#ifdef DPDK_NETDEV
-new_buffer->mbuf.hash.rss = buffer->mbuf.hash.rss;
-#else
-new_buffer->rss_hash = buffer->rss_hash;
-#endif
+new_buffer->rss_hash = b->rss_hash;
 }
 
 return new_buffer;
 }
+#endif
 
 /* Creates and returns a new dp_packet that initially contains a copy of the
  * 'size' bytes of data starting at 'data' with no headroom or tailroom. */
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index 19dc911..2951fb8 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -129,6 +129,10 @@ struct dp_packet *dp_packet_clone_data_with_headroom(const 
void *, size_t,
  size_t headroom);
 static inline void dp_packet_delete(struct dp_packet *);
 
+static inline void
+dp_packet_copy_common_members(struct dp_packet *new_b,
+  const struct dp_packet *b);
+
 static inline void *dp_packet_at(const struct dp_packet *, size_t offset,
  size_t size);
 static inline void *dp_packet_at_assert(const struct dp_packet *,
@@ -139,6 +143,8 @@ dp_packet_mbuf_from_offset(const struct dp_packet *b, 
size_t *offset);
 void
 dp_packet_mbuf_write(struct rte_mbuf *mbuf, int16_t ofs, uint32_t len,
  const void *data);
+static inline void
+dp_packet_copy_mbuf_flags(struct dp_packet *dst, const struct dp_packet *src);
 #endif
 static inline void *dp_packet_tail(co

[ovs-dev] [PATCH v14 03/11] dp-packet: Handle multi-seg mbufs in helper funcs.

2019-01-12 Thread Tiago Lam
Most helper functions in dp-packet assume that the data held by a
dp_packet is contiguous, and perform operations such as pointer
arithmetic under that assumption. However, with the introduction of
multi-segment mbufs, where data is non-contiguous, such assumptions are
no longer possible. Some examples of such helper functions are
dp_packet_tail(), dp_packet_tailroom(), dp_packet_end(),
dp_packet_get_allocated() and dp_packet_at().

Thus, instead of assuming contiguous data in dp_packet, they now
iterate over the (non-contiguous) data in mbufs to perform their
calculations.

Finally, dp_packet_use__() has also been modified to perform the
initialisation of the packet (and setting the source) before continuing
to set its size and data length, which now depends on the type of
packet.

Co-authored-by: Mark Kavanagh 

Signed-off-by: Mark Kavanagh 
Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
Acked-by: Flavio Leitner 
---
 lib/dp-packet.c |   4 +-
 lib/dp-packet.h | 172 
 2 files changed, 152 insertions(+), 24 deletions(-)

diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index 93b0e9c..f54119d 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -41,11 +41,11 @@ static void
 dp_packet_use__(struct dp_packet *b, void *base, size_t allocated,
  enum dp_packet_source source)
 {
+dp_packet_init__(b, allocated, source);
+
 dp_packet_set_base(b, base);
 dp_packet_set_data(b, base);
 dp_packet_set_size(b, 0);
-
-dp_packet_init__(b, allocated, source);
 }
 
 /* Initializes 'b' as an empty dp_packet that contains the 'allocated' bytes of
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index 72a8043..814fc47 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -133,6 +133,10 @@ static inline void *dp_packet_at(const struct dp_packet *, 
size_t offset,
  size_t size);
 static inline void *dp_packet_at_assert(const struct dp_packet *,
 size_t offset, size_t size);
+#ifdef DPDK_NETDEV
+static inline const struct rte_mbuf *
+dp_packet_mbuf_from_offset(const struct dp_packet *b, size_t *offset);
+#endif
 static inline void *dp_packet_tail(const struct dp_packet *);
 static inline void *dp_packet_end(const struct dp_packet *);
 
@@ -181,13 +185,28 @@ dp_packet_delete(struct dp_packet *b)
 }
 
 /* If 'b' contains at least 'offset + size' bytes of data, returns a pointer to
- * byte 'offset'.  Otherwise, returns a null pointer. */
+ * byte 'offset'.  Otherwise, returns a null pointer. For DPDK packets, this
+ * means the 'offset' + 'size' must fall within the same mbuf (not necessarily
+ * the first mbuf), otherwise null is returned */
 static inline void *
 dp_packet_at(const struct dp_packet *b, size_t offset, size_t size)
 {
-return offset + size <= dp_packet_size(b)
-   ? (char *) dp_packet_data(b) + offset
-   : NULL;
+if (offset + size > dp_packet_size(b)) {
+return NULL;
+}
+
+#ifdef DPDK_NETDEV
+if (b->source == DPBUF_DPDK) {
+const struct rte_mbuf *mbuf = dp_packet_mbuf_from_offset(b, &offset);
+
+if (!mbuf || offset + size > mbuf->data_len) {
+return NULL;
+}
+
+return rte_pktmbuf_mtod_offset(mbuf, char *, offset);
+}
+#endif
+return (char *) dp_packet_data(b) + offset;
 }
 
 /* Returns a pointer to byte 'offset' in 'b', which must contain at least
@@ -196,13 +215,23 @@ static inline void *
 dp_packet_at_assert(const struct dp_packet *b, size_t offset, size_t size)
 {
 ovs_assert(offset + size <= dp_packet_size(b));
-return ((char *) dp_packet_data(b)) + offset;
+return dp_packet_at(b, offset, size);
 }
 
 /* Returns a pointer to byte following the last byte of data in use in 'b'. */
 static inline void *
 dp_packet_tail(const struct dp_packet *b)
 {
+#ifdef DPDK_NETDEV
+if (b->source == DPBUF_DPDK) {
+struct rte_mbuf *buf = CONST_CAST(struct rte_mbuf *, &b->mbuf);
+/* Find last segment where data ends, meaning the tail of the chained
+ *  mbufs must be there */
+buf = rte_pktmbuf_lastseg(buf);
+
+return rte_pktmbuf_mtod_offset(buf, void *, buf->data_len);
+}
+#endif
 return (char *) dp_packet_data(b) + dp_packet_size(b);
 }
 
@@ -211,6 +240,15 @@ dp_packet_tail(const struct dp_packet *b)
 static inline void *
 dp_packet_end(const struct dp_packet *b)
 {
+#ifdef DPDK_NETDEV
+if (b->source == DPBUF_DPDK) {
+struct rte_mbuf *buf = CONST_CAST(struct rte_mbuf *, &(b->mbuf));
+
+buf = rte_pktmbuf_lastseg(buf);
+
+return (char *) buf->buf_addr + buf->buf_len;
+}
+#endif
 return (char *) dp_packet_base(b) + dp_packet_get_allocated(b);
 }
 
@@ -236,6 +274,15 @@ dp_packet_tailroom(const struct dp_packet *b)
 static inline void
 dp_packet_clear(struct dp_packet *b)
 {
+#ifdef DPDK_NETDEV
+if (b->source == DPBUF_DPDK) {
+/* sets

[ovs-dev] [PATCH v14 02/11] dp-packet: Fix data_len handling multi-seg mbufs.

2019-01-12 Thread Tiago Lam
When a dp_packet is from a DPDK source, and it contains multi-segment
mbufs, the data_len is not equal to the packet size, pkt_len. Instead,
the data_len of each mbuf in the chain should be considered while
distributing the new (provided) size.

To account for the above, dp_packet_set_size() has been changed so that,
in the multi-segment mbufs case, only the data_len on the last mbuf of
the chain and the total size of the packet, pkt_len, are changed. The
data_len on the intermediate mbufs preceding the last mbuf is not
changed by dp_packet_set_size(). Furthermore, in some cases
dp_packet_set_size() may be used to set a smaller size than the current
packet size, thus effectively trimming the end of the packet. In the
multi-segment mbufs case this may lead to lingering mbufs that may need
freeing.

__dp_packet_set_data() now also updates an mbuf's data_len after setting
the data offset. This is so that both fields are always in sync for each
mbuf in a chain.
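
To make the trimming behaviour concrete, here is a worked example (the
segment sizes are hypothetical): consider a chain whose three mbufs hold
2048 | 2048 | 1000 bytes of data (pkt_len 5096). After
dp_packet_set_size(b, 3000):

    mbuf #1: data_len 2048  (untouched, fully used)
    mbuf #2: data_len  952  (trimmed; becomes the last mbuf, next = NULL)
    mbuf #3: freed          (no longer referenced by the chain)
    pkt_len: 3000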

Co-authored-by: Michael Qiu 
Co-authored-by: Mark Kavanagh 
Co-authored-by: Przemyslaw Lal 
Co-authored-by: Marcin Ksiadz 
Co-authored-by: Yuanhan Liu 

Signed-off-by: Michael Qiu 
Signed-off-by: Mark Kavanagh 
Signed-off-by: Przemyslaw Lal 
Signed-off-by: Marcin Ksiadz 
Signed-off-by: Yuanhan Liu 
Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 lib/dp-packet.h | 101 +---
 1 file changed, 89 insertions(+), 12 deletions(-)

diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index 7b85dd9..72a8043 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -426,20 +426,78 @@ dp_packet_size(const struct dp_packet *b)
 return b->mbuf.pkt_len;
 }
 
+/* Sets the size of the packet 'b' to 'v'. For non-DPDK packets this only means
+ * setting b->size_, but if used in a DPDK packet it means adjusting the first
+ * mbuf pkt_len and last mbuf data_len, to reflect the real size, which can
+ * lead to free'ing tail mbufs that are no longer used.
+ *
+ * This function should be used for setting the size only, under the
+ * assumption that it is the tail end of 'b' that will be trimmed. For
+ * adjusting the head end of 'b', dp_packet_pull() should be used instead. */
 static inline void
 dp_packet_set_size(struct dp_packet *b, uint32_t v)
 {
-/* netdev-dpdk does not currently support segmentation; consequently, for
- * all intents and purposes, 'data_len' (16 bit) and 'pkt_len' (32 bit) may
- * be used interchangably.
- *
- * On the datapath, it is expected that the size of packets
- * (and thus 'v') will always be <= UINT16_MAX; this means that there is no
- * loss of accuracy in assigning 'v' to 'data_len'.
- */
-b->mbuf.data_len = (uint16_t)v;  /* Current seg length. */
-b->mbuf.pkt_len = v; /* Total length of all segments linked to
-  * this segment. */
+if (b->source == DPBUF_DPDK) {
+struct rte_mbuf *mbuf = >mbuf;
+uint16_t new_len = v;
+uint16_t data_len;
+uint16_t nb_segs = 0;
+uint16_t pkt_len = 0;
+
+/* Trim 'v' length bytes from the end of the chained buffers, freeing
+ * any buffers that may be left floating.
+ *
+ * For that traverse over the entire mbuf chain and, for each mbuf,
+ * subtract its 'data_len' from 'new_len' (initially set to 'v'), which
+ * essentially spreads 'new_len' between all existing mbufs in the
+ * chain. While traversing the mbuf chain, we end the traversal if:
+ * - 'new_len' reaches 0, meaning the passed 'v' has been
+ *   appropriately spread over the mbuf chain. The remaining mbufs are
+ *   freed;
+ * - We reach the last mbuf in the chain, in which case we set the last
+ *   mbuf's 'data_len' to the minimum value between the current
+ *   'new_len' (what's leftover from 'v') size and the maximum data the
+ *   mbuf can hold (mbuf->buf_len - mbuf->data_off).
+ *
+ * The above formula will thus make sure that when a 'v' is smaller
+ * than the overall 'pkt_len' (sum of all 'data_len'), it sets the new
+ * size and frees the leftover mbufs. On the other hand, if 'v' is
+ * bigger, it sets the size to the maximum available space, but no more
+ * than that. */
+while (mbuf) {
+data_len = MIN(new_len, mbuf->data_len);
+mbuf->data_len = data_len;
+
+if (new_len - data_len <= 0) {
+/* Free the rest of chained mbufs */
+free_dpdk_buf(CONTAINER_OF(mbuf->next, struct dp_packet,
+   mbuf));
+mbuf->next = NULL;
+} else if (!mbuf->next) {
+/* Don't assign more than what we have available */
+mbuf->data_len = MIN(new_len,
+   

[ovs-dev] [PATCH v14 01/11] netdev-dpdk: Serialise non-pmds mbufs' alloc/free.

2019-01-12 Thread Tiago Lam
A new mutex, 'nonpmd_mp_mutex', has been introduced to serialise
allocation and free operations by non-pmd threads on a given mempool.

free_dpdk_buf() has been modified to make use of the introduced mutex.
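
The allocation side follows the same pattern; a hedged sketch of a
counterpart ('dpdk_buf_alloc_sketch' is an invented name; the diff below
only shows the free side):

    /* Non-pmd threads serialise mempool allocation on the same mutex. */
    static struct rte_mbuf *
    dpdk_buf_alloc_sketch(struct rte_mempool *mp)
    {
        struct rte_mbuf *mbuf;

        if (!dpdk_thread_is_pmd()) {
            ovs_mutex_lock(&nonpmd_mp_mutex);
            mbuf = rte_pktmbuf_alloc(mp);
            ovs_mutex_unlock(&nonpmd_mp_mutex);
        } else {
            mbuf = rte_pktmbuf_alloc(mp);
        }

        return mbuf;
    }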

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
Acked-by: Flavio Leitner 
---
 lib/netdev-dpdk.c | 31 +--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 320422b..ea84c41 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -318,6 +318,16 @@ static struct ovs_mutex dpdk_mp_mutex 
OVS_ACQ_AFTER(dpdk_mutex)
 static struct ovs_list dpdk_mp_list OVS_GUARDED_BY(dpdk_mp_mutex)
= OVS_LIST_INITIALIZER(&dpdk_mp_list);
 
+/* This mutex must be used by non-pmd threads when allocating or freeing
+ * mbufs through mempools, when outside of the `non_pmd_mutex` mutex, in struct
+ * dp_netdev.
+ * The reason, as pointed out in the "Known Issues" section of DPDK's EAL
+ * docs, is that the implementation mempool is based on is non-preemptible.
+ * Since non-pmds may end up not being pinned, this could lead to preemption
+ * between non-pmds performing operations on the same mempool, which could
+ * lead to memory corruption. */
+static struct ovs_mutex nonpmd_mp_mutex = OVS_MUTEX_INITIALIZER;
+
 struct dpdk_mp {
  struct rte_mempool *mp;
  int mtu;
@@ -490,6 +500,8 @@ struct netdev_rxq_dpdk {
 dpdk_port_t port_id;
 };
 
+static bool dpdk_thread_is_pmd(void);
+
 static void netdev_dpdk_destruct(struct netdev *netdev);
 static void netdev_dpdk_vhost_destruct(struct netdev *netdev);
 
@@ -523,6 +535,12 @@ dpdk_buf_size(int mtu)
 + RTE_PKTMBUF_HEADROOM;
 }
 
+static bool
+dpdk_thread_is_pmd(void)
+{
+ return rte_lcore_id() != NON_PMD_CORE_ID;
+}
+
 /* Allocates an area of 'sz' bytes from DPDK.  The memory is zero'ed.
  *
  * Unlike xmalloc(), this function can return NULL on failure. */
@@ -535,9 +553,18 @@ dpdk_rte_mzalloc(size_t sz)
 void
 free_dpdk_buf(struct dp_packet *p)
 {
-struct rte_mbuf *pkt = (struct rte_mbuf *) p;
+/* If a non-pmd thread, serialise on the nonpmd_mp_mutex. */
+if (!dpdk_thread_is_pmd()) {
+ovs_mutex_lock(&nonpmd_mp_mutex);
+
+rte_pktmbuf_free(&p->mbuf);
+
+ovs_mutex_unlock(&nonpmd_mp_mutex);
+
+return;
+}
 
-rte_pktmbuf_free(pkt);
+rte_pktmbuf_free(&p->mbuf);
 }
 
 static void
-- 
2.7.4



[ovs-dev] [PATCH v14 00/11] Support multi-segment mbufs

2019-01-12 Thread Tiago Lam
i-seg mbufs and
   two others for the shift() and put_uninit() functionality;
  - Move patch 4 to before patch 3, so that helper functions come
    before the functionality improvements that rely on those helpers.

v5: - Rebased on master e5e22dc ("datapath-windows: Prevent ct-counters
  from getting redundantly incremented");
- Sugesh's comments have been addressed:
  - Changed dp_packet_set_data() and dp_packet_set_size() logic to
make them independent of each other;
  - Dropped patch 3 now that dp_packet_set_data() and dp_packet_set_
size() are independent;
  - dp_packet_clone_with_headroom() now has split functions for
handling DPDK sourced packets and non-DPDK packets;
- Modified various functions in dp-packet.h to account for multi-seg
  mbufs - dp_packet_put_uninit(), dp_packet_tail(), dp_packet_tail()
  and dp_packet_at();
- Added support for shifting packet data in multi-seg mbufs, using
  dp_packet_shift();
- Fixed some minor inconsistencies.

Note that some of the changes in v5 have been contributed by Mark
Kavanagh as well.

v4: - restructure patchset
- account for 128B ARM cacheline when sizing mbufs

Mark Kavanagh (2):
  netdev-dpdk: copy large packet to multi-seg. mbufs
  netdev-dpdk: support multi-segment jumbo frames

Michael Qiu (1):
  dp-packet: copy data from multi-seg. DPDK mbuf

Tiago Lam (8):
  netdev-dpdk: Serialise non-pmds mbufs' alloc/free.
  dp-packet: Fix data_len handling multi-seg mbufs.
  dp-packet: Handle multi-seg mbufs in helper funcs.
  dp-packet: Handle multi-seg mbufs in shift() func.
  dp-packet: Add support for data "linearization".
  dpdk-tests: Add unit-tests for multi-seg mbufs.
  dpdk-tests: Accept other configs in OVS_DPDK_START
  dpdk-tests: End-to-end tests for multi-seg mbufs.

 Documentation/topics/dpdk/jumbo-frames.rst |  73 +++
 Documentation/topics/dpdk/memory.rst   |  36 ++
 NEWS   |   1 +
 lib/conntrack.c|   5 +
 lib/crc32c.c   |  17 +-
 lib/crc32c.h   |   2 +
 lib/dp-packet.c| 189 +++-
 lib/dp-packet.h| 485 +--
 lib/dpdk.c |   8 +
 lib/dpif-netdev.c  |  18 +-
 lib/dpif-netlink.c |   3 +
 lib/dpif.c |   6 +
 lib/flow.c | 113 -
 lib/flow.h |   4 +-
 lib/mcast-snooping.c   |   2 +
 lib/netdev-bsd.c   |   3 +
 lib/netdev-dpdk.c  | 190 +++-
 lib/netdev-dpdk.h  |   1 +
 lib/netdev-dummy.c |   6 +
 lib/netdev-linux.c |   6 +
 lib/netdev-native-tnl.c|  26 +-
 lib/odp-execute.c  |  24 +-
 lib/packets.c  |  96 +++-
 lib/packets.h  |   7 +
 ofproto/ofproto-dpif-upcall.c  |  21 +-
 ofproto/ofproto-dpif-xlate.c   |  27 +-
 tests/automake.mk  |  10 +-
 tests/dpdk-packet-mbufs.at |   7 +
 tests/system-dpdk-macros.at|   6 +-
 tests/system-dpdk-testsuite.at |   1 +
 tests/system-dpdk.at   |  65 +++
 tests/test-dpdk-mbufs.c| 722 +
 tests/test-rstp.c  |   9 +-
 tests/test-stp.c   |   9 +-
 vswitchd/vswitch.xml   |  22 +
 35 files changed, 2068 insertions(+), 152 deletions(-)
 create mode 100644 tests/dpdk-packet-mbufs.at
 create mode 100644 tests/test-dpdk-mbufs.c

-- 
2.7.4

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v2 3/3] netdev-dpdk: Enable TSO when using multi-seg mbufs

2019-01-10 Thread Tiago Lam
TCP Segmentation Offload (TSO) is a feature which enables the TCP/IP
network stack to delegate segmentation of a TCP segment to the hardware
NIC, thus saving compute resources. This may improve performance
significantly for TCP workload in virtualized environments.

While a previous commit already added the necessary logic to netdev-dpdk
to deal with packets marked for TSO, this set of changes enables TSO by
default when using multi-segment mbufs.

Thus, to enable TSO on the physical DPDK interfaces, only the following
command needs to be issued before starting OvS:
ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true

Co-authored-by: Mark Kavanagh 

Signed-off-by: Mark Kavanagh 
Signed-off-by: Tiago Lam 
---
 Documentation/automake.mk   |   1 +
 Documentation/topics/dpdk/index.rst |   1 +
 Documentation/topics/dpdk/tso.rst   | 111 
 NEWS|   1 +
 lib/netdev-dpdk.c   |  63 +---
 5 files changed, 168 insertions(+), 9 deletions(-)
 create mode 100644 Documentation/topics/dpdk/tso.rst

diff --git a/Documentation/automake.mk b/Documentation/automake.mk
index 082438e..a20deb8 100644
--- a/Documentation/automake.mk
+++ b/Documentation/automake.mk
@@ -39,6 +39,7 @@ DOC_SOURCE = \
Documentation/topics/dpdk/index.rst \
Documentation/topics/dpdk/bridge.rst \
Documentation/topics/dpdk/jumbo-frames.rst \
+   Documentation/topics/dpdk/tso.rst \
Documentation/topics/dpdk/memory.rst \
Documentation/topics/dpdk/pdump.rst \
Documentation/topics/dpdk/phy.rst \
diff --git a/Documentation/topics/dpdk/index.rst 
b/Documentation/topics/dpdk/index.rst
index cf24a7b..eb2a04d 100644
--- a/Documentation/topics/dpdk/index.rst
+++ b/Documentation/topics/dpdk/index.rst
@@ -40,4 +40,5 @@ The DPDK Datapath
/topics/dpdk/qos
/topics/dpdk/pdump
/topics/dpdk/jumbo-frames
+   /topics/dpdk/tso
/topics/dpdk/memory
diff --git a/Documentation/topics/dpdk/tso.rst 
b/Documentation/topics/dpdk/tso.rst
new file mode 100644
index 000..503354f
--- /dev/null
+++ b/Documentation/topics/dpdk/tso.rst
@@ -0,0 +1,111 @@
+..
+  Copyright 2018, Red Hat, Inc.
+
+  Licensed under the Apache License, Version 2.0 (the "License"); you may
+  not use this file except in compliance with the License. You may obtain
+  a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+  WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+  License for the specific language governing permissions and limitations
+  under the License.
+
+  Convention for heading levels in Open vSwitch documentation:
+
+  ===  Heading 0 (reserved for the title in a document)
+  ---  Heading 1
+  ~~~  Heading 2
+  +++  Heading 3
+  '''  Heading 4
+
+  Avoid deeper levels because they do not render well.
+
+===
+TSO
+===
+
+.. versionadded:: 2.11.0
+
+TCP Segmentation Offload (TSO) is a mechanism which allows a TCP/IP stack to
+offload the TCP segmentation into hardware, thus saving the cycles that would
+be required to perform this same segmentation in software.
+
+TCP Segmentation Offload (TSO) enables a network stack to delegate segmentation
+of an oversized TCP segment to the underlying physical NIC. Offload of frame
+segmentation achieves computational savings in the core, freeing up CPU cycles
+for more useful work.
+
+A common use case for TSO is in virtualized setups, where traffic that comes
+in from a VM can be marked for TSO, thus avoiding segmentation in software.
+Additionally, if the traffic is headed to a VM within the same host, further
+optimization can be expected. As the traffic never leaves the machine, no MTU
+needs to be accounted for, and thus no segmentation and checksum calculations
+are required, which saves yet more cycles. Only when the traffic actually
+leaves the host does the segmentation need to happen, in which case it will be
+performed by the egress NIC.
+
+When using TSO with DPDK, the implementation relies on the multi-segment mbufs
+feature, described in :doc:`/topics/dpdk/jumbo-frames`, where each mbuf
+contains ~2KiB of the entire packet's data and is linked to the next mbuf that
+contains the next portion of data.
+
+Enabling TSO
+------------
+Once multi-segment mbufs are enabled, TSO will be enabled by default, provided
+there's support for it in the underlying physical NICs attached to OvS-DPDK.
+
+When using :doc:`vHost User ports `, TSO may be enabled in one of
+two ways, as follows.
+
+`TSO` is enabled in OvS by the DPDK vHost User backend; when a new guest
+connection is established, `TSO` is thus advertised to the guest as an
+available feature:
+
+  

[ovs-dev] [PATCH v2 2/3] netdev-dpdk: Consider packets marked for TSO.

2019-01-10 Thread Tiago Lam
Previously, TSO was being explicitly disabled on vhost interfaces,
meaning the guests wouldn't have TSO support negotiated in. With TSO
negotiated and enabled, packets are now marked for TSO, through the
PKT_TX_TCP_SEG flag.

In order to deal with this type of packet, a new function,
netdev_dpdk_prep_tso_packet(), has been introduced, with the main
purpose of correctly setting the l2, l3 and l4 length members of the
mbuf struct, and the appropriate ol_flags. This function supports TSO
in both IPv4 and IPv6.

netdev_dpdk_prep_tso_packet() is then only called when packets are
marked with the PKT_TX_TCP_SEG flag, meaning they have been marked for
TSO, and when the packet will be traversing the NIC.

Additionally, if a packet is marked for TSO but the egress netdev
doesn't support it, the packet is dropped.
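
As a rough illustration (the function body is truncated further down in
this archive), preparing an IPv4/TCP mbuf for TSO in DPDK 18.11 looks
roughly like the sketch below. This is not the literal code of the
patch; the VLAN-less header layout and the 'mtu' parameter are
assumptions of the sketch:

    #include <rte_ether.h>
    #include <rte_ip.h>
    #include <rte_mbuf.h>
    #include <rte_tcp.h>

    /* Sketch only: fill in the offload metadata a PMD needs to segment
     * an IPv4/TCP packet in hardware. */
    static void
    prep_tso_ipv4_sketch(struct rte_mbuf *m, uint16_t mtu)
    {
        struct ether_hdr *eth = rte_pktmbuf_mtod(m, struct ether_hdr *);
        struct ipv4_hdr *ip = (struct ipv4_hdr *) (eth + 1);
        struct tcp_hdr *tcp;

        m->l2_len = sizeof *eth;                     /* No VLAN assumed. */
        m->l3_len = (ip->version_ihl & 0x0f) * 4;    /* IHL in bytes. */
        tcp = (struct tcp_hdr *) ((char *) ip + m->l3_len);
        m->l4_len = (tcp->data_off & 0xf0) >> 2;     /* TCP header len. */
        m->tso_segsz = mtu - m->l3_len - m->l4_len;  /* Payload per seg. */

        m->ol_flags |= PKT_TX_TCP_SEG | PKT_TX_IPV4
                       | PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM;
    }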

Co-authored-by: Mark Kavanagh 

Signed-off-by: Mark Kavanagh 
Signed-off-by: Tiago Lam 
---
 lib/dp-packet.h|  14 +++
 lib/netdev-bsd.c   |  11 -
 lib/netdev-dpdk.c  | 121 ++---
 lib/netdev-dummy.c |  11 -
 lib/netdev-linux.c |  15 +++
 5 files changed, 146 insertions(+), 26 deletions(-)

diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index 970aaf2..c384416 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -104,6 +104,8 @@ static inline void dp_packet_set_size(struct dp_packet *, 
uint32_t);
 static inline uint16_t dp_packet_get_allocated(const struct dp_packet *);
 static inline void dp_packet_set_allocated(struct dp_packet *, uint16_t);
 
+static inline bool dp_packet_is_tso(struct dp_packet *b);
+
 void *dp_packet_resize_l2(struct dp_packet *, int increment);
 void *dp_packet_resize_l2_5(struct dp_packet *, int increment);
 static inline void *dp_packet_eth(const struct dp_packet *);
@@ -761,6 +763,12 @@ dp_packet_set_allocated(struct dp_packet *b, uint16_t s)
 b->mbuf.buf_len = s;
 }
 
+static inline bool
+dp_packet_is_tso(struct dp_packet *b)
+{
+return (b->mbuf.ol_flags & PKT_TX_TCP_SEG) ? true : false;
+}
+
 static inline void
 dp_packet_copy_mbuf_flags(struct dp_packet *dst, const struct dp_packet *src)
 {
@@ -972,6 +980,12 @@ dp_packet_get_allocated(const struct dp_packet *b)
 return b->allocated_;
 }
 
+static inline bool
+dp_packet_is_tso(struct dp_packet *b)
+{
+return false;
+}
+
 static inline void
 dp_packet_set_allocated(struct dp_packet *b, uint16_t s)
 {
diff --git a/lib/netdev-bsd.c b/lib/netdev-bsd.c
index 278c8a9..5e8c5cc 100644
--- a/lib/netdev-bsd.c
+++ b/lib/netdev-bsd.c
@@ -700,13 +700,22 @@ netdev_bsd_send(struct netdev *netdev_, int qid 
OVS_UNUSED,
 }
 
 DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
+size_t size = dp_packet_size(packet);
+
+/* TSO not supported in BSD netdev */
+if (dp_packet_is_tso(packet)) {
+VLOG_WARN_RL(&rl, "%s: No TSO enabled on port, TSO packet dropped "
+ "%" PRIu32 " ", name, size);
+
+continue;
+}
+
 /* We need the whole data to send the packet on the device */
 if (!dp_packet_is_linear(packet)) {
 dp_packet_linearize(packet);
 }
 
 const void *data = dp_packet_data(packet);
-size_t size = dp_packet_size(packet);
 
 while (!error) {
 ssize_t retval;
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 77d04fc..ad7223a 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -1375,14 +1375,16 @@ netdev_dpdk_vhost_construct(struct netdev *netdev)
 goto out;
 }
 
-err = rte_vhost_driver_disable_features(dev->vhost_id,
-1ULL << VIRTIO_NET_F_HOST_TSO4
-| 1ULL << VIRTIO_NET_F_HOST_TSO6
-| 1ULL << VIRTIO_NET_F_CSUM);
-if (err) {
-VLOG_ERR("rte_vhost_driver_disable_features failed for vhost user "
- "port: %s\n", name);
-goto out;
+if (!dpdk_multi_segment_mbufs) {
+err = rte_vhost_driver_disable_features(dev->vhost_id,
+1ULL << VIRTIO_NET_F_HOST_TSO4
+| 1ULL << VIRTIO_NET_F_HOST_TSO6
+| 1ULL << VIRTIO_NET_F_CSUM);
+if (err) {
+VLOG_ERR("rte_vhost_driver_disable_features failed for vhost user "
+ "client port: %s\n", dev->up.name);
+goto out;
+}
 }
 
 err = rte_vhost_driver_start(dev->vhost_id);
@@ -2027,6 +2029,44 @@ netdev_dpdk_rxq_dealloc(struct netdev_rxq *rxq)
 rte_free(rx);
 }
 
+/* Should only be called if PKT_TX_TCP_SEG is set in ol_flags.
+ * Furthermore, it also sets the PKT_TX_TCP_CKSUM and PKT_TX_IP_CKSUM flags,
+ * and PKT_TX_IPV4 and PKT_TX_IPV6 in case the packet is IPv4 or IPv6,
+ * respectively. */
+static void
+netdev_dpdk_prep_t

[ovs-dev] [PATCH v2 1/3] netdev-dpdk: Validate packets burst before Tx.

2019-01-10 Thread Tiago Lam
Given that multi-segment mbufs might be sent between interfaces that
support different capabilities, and may even support different layouts
of mbufs, outgoing packets should be validated before sent on the egress
interface. Thus, netdev_dpdk_eth_tx_burst() now calls DPDK's
rte_eth_tx_prepare() function, if and only if multi-segment mbufs are
enabled, in order to validate the following (taken from the DPDK
documentation), in a device-specific manner:
- Check if packet meets devices requirements for tx offloads.
- Check limitations about number of segments.
- Check additional requirements when debug is enabled.
- Update and/or reset required checksums when tx offload is set for
packet.

Signed-off-by: Tiago Lam 
---
 lib/netdev-dpdk.c | 21 +++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index d6114ee..77d04fc 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -2029,6 +2029,10 @@ netdev_dpdk_rxq_dealloc(struct netdev_rxq *rxq)
 
 /* Tries to transmit 'pkts' to txq 'qid' of device 'dev'.  Takes ownership of
  * 'pkts', even in case of failure.
+ * In case multi-segment mbufs / TSO is being used, it also prepares the
+ * burst. In such cases, only the prepared packets will be sent to Tx burst,
+ * meaning that if an invalid packet appears in 'pkts'[3] only the validated
+ * packets in indices 0, 1 and 2 will be sent.
  *
  * Returns the number of packets that weren't transmitted. */
 static inline int
@@ -2036,11 +2040,24 @@ netdev_dpdk_eth_tx_burst(struct netdev_dpdk *dev, int 
qid,
  struct rte_mbuf **pkts, int cnt)
 {
 uint32_t nb_tx = 0;
+uint16_t nb_prep = cnt;
 
-while (nb_tx != cnt) {
+if (dpdk_multi_segment_mbufs) {
+/* Validate the burst of packets for Tx. */
+nb_prep = rte_eth_tx_prepare(dev->port_id, qid, pkts, cnt);
+if (nb_prep != cnt) {
+VLOG_WARN_RL(&rl, "%s: Preparing packet tx burst failed (%u/%u "
+ "packets valid): %s", dev->up.name, nb_prep, cnt,
+ rte_strerror(rte_errno));
+}
+}
+
+/* Tx the validated burst of packets only. */
+while (nb_tx != nb_prep) {
 uint32_t ret;
 
-ret = rte_eth_tx_burst(dev->port_id, qid, pkts + nb_tx, cnt - nb_tx);
+ret = rte_eth_tx_burst(dev->port_id, qid, pkts + nb_tx,
+   nb_prep - nb_tx);
 if (!ret) {
 break;
 }
-- 
2.7.4

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v2 0/3] dpdk: Add support for TSO

2019-01-10 Thread Tiago Lam
Enabling TSO offload allows a host stack to delegate the segmentation of
oversized TCP packets to the underlying physical NIC, if supported. In the case
of a VM this means that the segmentation of the packets is not performed by the
guest kernel, but by the host NIC itself. In turn, since the TSO calculations
and checksums are being performed in hardware, this alleviates the CPU load on
the host system. In inter-VM communication this may amount to significant
savings, and higher throughput, even more so if the VMs are running on the
same host.

Thus, although inter VM communication is already possible as is, there's a
sacrifice in terms of CPU, which may affect the overall throughput.

This series adds support for TSO in OvS-DPDK, by making use of the TSO
offloading feature already supported by DPDK vhost backend, having the
following scenarios in mind:
- Inter VM communication on the same host;
- Inter VM communication on different hosts;
- The same two use cases above, but on a VLAN network.

The work is based on [1]; it has been rebased to run on top of the
multi-segment mbufs work (v13) [2] and re-worked to use DPDK v18.11.

[1] https://patchwork.ozlabs.org/patch/749564/
[2] https://mail.openvswitch.org/pipermail/ovs-dev/2019-January/354950.html

Considerations:
- As mentioned above, this series depends on the multi-segment mbuf series
  (v13) and can't be applied on master as is;
- The `rte_eth_tx_prepare()` API in DPDK is marked experimental, and although
  I'm not getting any errors / warnings while compiling, do shout if you get
  into trouble while testing;
- I'm due to send v3 in the next few days, but I'm sending v2 now to enable
  early testing;

Tiago Lam (3):
  netdev-dpdk: Validate packets burst before Tx.
  netdev-dpdk: Consider packets marked for TSO.
  netdev-dpdk: Enable TSO when using multi-seg mbufs

 Documentation/automake.mk   |   1 +
 Documentation/topics/dpdk/index.rst |   1 +
 Documentation/topics/dpdk/tso.rst   | 111 
 NEWS|   1 +
 lib/dp-packet.h |  14 +++
 lib/netdev-bsd.c|  11 +-
 lib/netdev-dpdk.c   | 203 ++--
 lib/netdev-dummy.c  |  11 +-
 lib/netdev-linux.c  |  15 +++
 9 files changed, 332 insertions(+), 36 deletions(-)
 create mode 100644 Documentation/topics/dpdk/tso.rst

-- 
2.7.4

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v13 11/11] dpdk-tests: End-to-end tests for multi-seg mbufs.

2019-01-09 Thread Tiago Lam
The following tests are added to the DPDK testsuite to add some
coverage for the multi-segment mbufs:
- Check that multi-segment mbufs are disabled by default;
- Check that providing `other_config:dpdk-multi-seg-mbufs=true` indeed
  enables mbufs;
- Using a DPDK port, send a random packet out and check that `ofctl
  dump-flows` shows the correct amount of packets and bytes sent.

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 tests/system-dpdk.at | 65 
 1 file changed, 65 insertions(+)

diff --git a/tests/system-dpdk.at b/tests/system-dpdk.at
index 1da020a..ca66135 100644
--- a/tests/system-dpdk.at
+++ b/tests/system-dpdk.at
@@ -232,3 +232,68 @@ OVS_VSWITCHD_STOP(["\@does not exist. The Open vSwitch 
kernel module is probably
 \@EAL: No free hugepages reported in hugepages-1048576kB@d"])
 AT_CLEANUP
 dnl --
+
+AT_SETUP([Jumbo frames - Multi-segment disabled by default])
+OVS_DPDK_START()
+
+AT_CHECK([grep "multi-segment mbufs enabled" ovs-vswitchd.log], [1], [])
+OVS_VSWITCHD_STOP("/Global register is changed during/d
+/EAL: No free hugepages reported in hugepages-1048576kB/d
+")
+AT_CLEANUP
+
+AT_SETUP([Jumbo frames - Multi-segment enabled])
+OVS_DPDK_START([dpdk-multi-seg-mbufs=true])
+AT_CHECK([grep "multi-segment mbufs enabled" ovs-vswitchd.log], [], [stdout])
+OVS_VSWITCHD_STOP("/Global register is changed during/d
+/EAL: No free hugepages reported in hugepages-1048576kB/d
+")
+AT_CLEANUP
+
+AT_SETUP([Jumbo frames - Multi-segment mbufs Tx])
+OVS_DPDK_PRE_CHECK()
+OVS_DPDK_START([per-port-memory=true dpdk-multi-seg-mbufs=true])
+
+dnl Add userspace bridge and attach it to OVS
+AT_CHECK([ovs-vsctl add-br br10 -- set bridge br10 datapath_type=netdev])
+AT_CHECK([ovs-vsctl add-port br10 dpdk0 \
+-- set Interface dpdk0 type=dpdk options:dpdk-devargs=$(cat PCI_ADDR) \
+-- set Interface dpdk0 mtu_request=9000], [], [stdout], [stderr])
+
+AT_CHECK([ovs-vsctl show], [], [stdout])
+
+dnl Add flows to send packets out from the 'dpdk0' port
+AT_CHECK([
+ovs-ofctl del-flows br10
+ovs-ofctl add-flow br10 in_port=LOCAL,actions=output:dpdk0
+], [], [stdout])
+
+AT_CHECK([ovs-ofctl dump-flows br10], [], [stdout])
+
+dnl Send a packet out of the 'dpdk0' port
+AT_CHECK([
+ARP_HEADER="09000B0009000A000806000108000604000100010A\
+0100020A02"
+dnl Build a random hex string to append to the ARP_HEADER
+RANDOM_BODY=$(printf '0102030405%.0s' {1..1750})
+dnl 8792B ARP packet
+RANDOM_ARP="$ARP_HEADER$RANDOM_BODY"
+
+ovs-ofctl packet-out br10 "packet=$RANDOM_ARP,action=resubmit:LOCAL"
+], [], [stdout])
+
+AT_CHECK([ovs-ofctl dump-flows br10], [0], [stdout])
+
+dnl Confirm the single packet has been sent with the correct size
+AT_CHECK([ovs-ofctl dump-flows br10 | ofctl_strip | grep in_port], [0], [dnl
+ n_packets=1, n_bytes=8792, in_port=LOCAL actions=output:1
+])
+
+dnl Clean up
+OVS_VSWITCHD_STOP("/does not exist. The Open vSwitch kernel module is probably 
not loaded./d
+/Failed to enable flow control/d
+/failed to connect to \/tmp\/dpdkvhostclient0: No such file or directory/d
+/Global register is changed during/d
+/EAL: No free hugepages reported in hugepages-1048576kB/d
+")
+AT_CLEANUP
-- 
2.7.4

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v13 06/11] dp-packet: Add support for data "linearization".

2019-01-09 Thread Tiago Lam
Previous commits have added support to the dp_packet API to handle
multi-segmented packets, where data is not stored contiguously in
memory. However, in some cases, it is inevitable and data must be
provided contiguously. Examples of such cases are when performing csums
over the entire packet data, or when write()'ing to a file descriptor
(for a tap interface, for example). For such cases, the dp_packet API
has been extended to provide a way to transform a multi-segmented
DPBUF_DPDK packet into a DPBUF_MALLOC system packet (at the expense of a
copy of memory). If the packet's data is already stored in memory
contiguously, then there's no need to convert the packet.

Thus, the main use cases that were assuming that a dp_packet's data is
always held contiguously in memory were changed to make use of the new
"linear functions" in the dp_packet API when there's a need to traverse
the entire packet's data. Per the examples above, when the packet's data
needs to be written to the tap's file descriptor, or when the conntrack
module needs to verify a packet's checksum, the data is now linearized.
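
As a sketch of the resulting caller pattern (write_all() here is a
hypothetical stand-in for, e.g., a write() on a tap fd):

    /* Callers that need contiguous data follow this pattern; note that
     * linearizing may convert a DPBUF_DPDK packet to DPBUF_MALLOC. */
    if (!dp_packet_is_linear(pkt)) {
        dp_packet_linearize(pkt);
    }
    write_all(fd, dp_packet_data(pkt), dp_packet_size(pkt));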

Additionally, the miniflow_extract() function has been modified to check
if the respective packet headers don't span across multiple mbufs. This
requirement is needed to guarantee that callers can assume headers are
always in contiguous memory.

Signed-off-by: Tiago Lam 
---
 lib/bfd.c |   3 +-
 lib/conntrack-private.h   |   4 +-
 lib/conntrack.c   |  43 +--
 lib/crc32c.c  |  35 --
 lib/crc32c.h  |   2 +
 lib/dp-packet.c   |  18 +++
 lib/dp-packet.h   | 267 ++
 lib/dpif-netdev.c |  13 +-
 lib/dpif-netlink.c|   5 +
 lib/dpif.c|   9 ++
 lib/flow.c|  97 ---
 lib/flow.h|   2 +-
 lib/mcast-snooping.c  |   2 +
 lib/netdev-bsd.c  |   5 +
 lib/netdev-dummy.c|  13 +-
 lib/netdev-linux.c|  13 +-
 lib/netdev-native-tnl.c   |  26 ++--
 lib/odp-execute.c |  12 +-
 lib/ovs-lldp.c|   3 +-
 lib/packets.c |  85 --
 lib/packets.h |   7 ++
 ofproto/ofproto-dpif-upcall.c |  20 +++-
 ofproto/ofproto-dpif-xlate.c  |  32 -
 tests/test-rstp.c |   8 +-
 tests/test-stp.c  |   8 +-
 25 files changed, 579 insertions(+), 153 deletions(-)

diff --git a/lib/bfd.c b/lib/bfd.c
index cc8c685..12e076a 100644
--- a/lib/bfd.c
+++ b/lib/bfd.c
@@ -721,7 +721,8 @@ bfd_process_packet(struct bfd *bfd, const struct flow *flow,
 if (!msg) {
 VLOG_INFO_RL(, "%s: Received too-short BFD control message (only "
  "%"PRIdPTR" bytes long, at least %d required).",
- bfd->name, (uint8_t *) dp_packet_tail(p) - l7,
+ bfd->name, dp_packet_size(p) -
+ (l7 - (uint8_t *) dp_packet_data(p)),
  BFD_PACKET_LEN);
 goto out;
 }
diff --git a/lib/conntrack-private.h b/lib/conntrack-private.h
index a344801..1be2df1 100644
--- a/lib/conntrack-private.h
+++ b/lib/conntrack-private.h
@@ -159,8 +159,8 @@ tcp_payload_length(struct dp_packet *pkt)
 {
 const char *tcp_payload = dp_packet_get_tcp_payload(pkt);
 if (tcp_payload) {
-return ((char *) dp_packet_tail(pkt) - dp_packet_l2_pad_size(pkt)
-- tcp_payload);
+return dp_packet_l4_size(pkt) -
+   (tcp_payload - (char *) dp_packet_l4(pkt));
 } else {
 return 0;
 }
diff --git a/lib/conntrack.c b/lib/conntrack.c
index 6f6021a..0dd2dcc 100644
--- a/lib/conntrack.c
+++ b/lib/conntrack.c
@@ -636,12 +636,22 @@ reverse_pat_packet(struct dp_packet *pkt, const struct 
conn *conn)
 static void
 reverse_nat_packet(struct dp_packet *pkt, const struct conn *conn)
 {
-char *tail = dp_packet_tail(pkt);
-char pad = dp_packet_l2_pad_size(pkt);
+char *tail;
+char pad;
 struct conn_key inner_key;
 const char *inner_l4 = NULL;
-uint16_t orig_l3_ofs = pkt->l3_ofs;
-uint16_t orig_l4_ofs = pkt->l4_ofs;
+uint16_t orig_l3_ofs;
+uint16_t orig_l4_ofs;
+
+/* We need the whole packet contiguous in memory for the parsing below. */
+if (!dp_packet_is_linear(pkt)) {
+dp_packet_linearize(pkt);
+}
+
+tail = dp_packet_tail(pkt);
+pad = dp_packet_l2_pad_size(pkt);
+orig_l3_ofs = pkt->l3_ofs;
+orig_l4_ofs = pkt->l4_ofs;
 
 if (conn->key.dl_type == htons(ETH_TYPE_IP)) {
 struct ip_header *nh = dp_packet_l3(pkt);
@@ -1323,6 +1333,7 @@ conntrack_execute(struct conntrack *ct, struct 
dp_packet_batch *pkt_batch,
 write_ct_md(packet, zone, NULL, NULL, NULL);
 continue;
 }
+
 process_one(ct, packet, &ctx, zone, force, commit, now, setmark,
 

[ovs-dev] [PATCH v13 07/11] netdev-dpdk: copy large packet to multi-seg. mbufs

2019-01-09 Thread Tiago Lam
From: Mark Kavanagh 

Currently, packets are only copied to a single segment in the function
dpdk_do_tx_copy(). This could be an issue in the case of jumbo frames,
particularly when multi-segment mbufs are involved.

This patch calculates the number of segments needed by a packet and
copies the data to each segment.
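
To make the arithmetic concrete, the segment-count calculation used in
the diff below amounts to a ceiling division; the 2048 B data room is
only an example value:

    uint16_t max_data_len = mbuf->buf_len - mbuf->data_off;  /* e.g. 2048. */
    uint32_t nb_segs = size / max_data_len;
    if (size % max_data_len) {
        nb_segs += 1;   /* e.g. 9000 B: 4 full segments + 808 B -> 5. */
    }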

A new function, dpdk_buf_alloc(), has also been introduced as a wrapper
around the nonpmd_mp_mutex to serialise allocations from a non-pmd
context.

Co-authored-by: Michael Qiu 
Co-authored-by: Tiago Lam 

Signed-off-by: Mark Kavanagh 
Signed-off-by: Michael Qiu 
Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 lib/netdev-dpdk.c | 89 +--
 1 file changed, 80 insertions(+), 9 deletions(-)

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index a0df9fc..7a9add7 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -550,6 +550,25 @@ dpdk_rte_mzalloc(size_t sz)
 return rte_zmalloc(OVS_VPORT_DPDK, sz, OVS_CACHE_LINE_SIZE);
 }
 
+static struct rte_mbuf *
+dpdk_buf_alloc(struct rte_mempool *mp)
+{
+struct rte_mbuf *mbuf = NULL;
+
+/* If non-pmd we need to lock on nonpmd_mp_mutex mutex. */
+if (dpdk_thread_is_pmd()) {
+mbuf = rte_pktmbuf_alloc(mp);
+} else {
+ovs_mutex_lock(_mp_mutex);
+
+mbuf = rte_pktmbuf_alloc(mp);
+
+ovs_mutex_unlock(_mp_mutex);
+}
+
+return mbuf;
+}
+
 void
 free_dpdk_buf(struct dp_packet *p)
 {
@@ -2333,6 +2352,56 @@ out:
 }
 }
 
+static int
+dpdk_copy_dp_packet_to_mbuf(struct dp_packet *packet, struct rte_mbuf **head,
+struct rte_mempool *mp)
+{
+struct rte_mbuf *mbuf, *fmbuf;
+uint16_t max_data_len;
+uint32_t nb_segs = 0;
+uint32_t size = 0;
+
+/* We will need the whole data for copying below. */
+if (!dp_packet_is_linear(packet)) {
+dp_packet_linearize(packet);
+}
+
+/* Allocate first mbuf to know the size of data available. */
+fmbuf = mbuf = *head = dpdk_buf_alloc(mp);
+if (OVS_UNLIKELY(!mbuf)) {
+return ENOMEM;
+}
+
+size = dp_packet_size(packet);
+
+/* All newly allocated mbufs have the same max data length. */
+max_data_len = mbuf->buf_len - mbuf->data_off;
+
+/* Calculate # of output mbufs. */
+nb_segs = size / max_data_len;
+if (size % max_data_len) {
+nb_segs = nb_segs + 1;
+}
+
+/* Allocate additional mbufs, less the one already allocated above. */
+for (int i = 1; i < nb_segs; i++) {
+mbuf->next = dpdk_buf_alloc(mp);
+if (!mbuf->next) {
+free_dpdk_buf(CONTAINER_OF(fmbuf, struct dp_packet, mbuf));
+fmbuf = NULL;
+return ENOMEM;
+}
+mbuf = mbuf->next;
+}
+
+fmbuf->nb_segs = nb_segs;
+fmbuf->pkt_len = size;
+
+dp_packet_mbuf_write(fmbuf, 0, size, dp_packet_data(packet));
+
+return 0;
+}
+
 /* Tx function. Transmit packets indefinitely */
 static void
 dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
@@ -2349,6 +2418,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct 
dp_packet_batch *batch)
 struct rte_mbuf *pkts[PKT_ARRAY_SIZE];
 uint32_t cnt = batch_cnt;
 uint32_t dropped = 0;
+uint32_t i;
 
 if (dev->type != DPDK_DEV_VHOST) {
 /* Check if QoS has been configured for this netdev. */
@@ -2359,28 +2429,29 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct 
dp_packet_batch *batch)
 
 uint32_t txcnt = 0;
 
-for (uint32_t i = 0; i < cnt; i++) {
+for (i = 0; i < cnt; i++) {
 struct dp_packet *packet = batch->packets[i];
 uint32_t size = dp_packet_size(packet);
+int err = 0;
 
 if (OVS_UNLIKELY(size > dev->max_packet_len)) {
 VLOG_WARN_RL(, "Too big size %u max_packet_len %d",
  size, dev->max_packet_len);
-
 dropped++;
 continue;
 }
 
-pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
-if (OVS_UNLIKELY(!pkts[txcnt])) {
+err = dpdk_copy_dp_packet_to_mbuf(packet, &pkts[txcnt],
+  dev->dpdk_mp->mp);
+if (err != 0) {
+if (err == ENOMEM) {
+VLOG_ERR_RL(, "Failed to alloc mbufs! %u packets dropped",
+cnt - i);
+}
+
 dropped += cnt - i;
 break;
 }
-
-/* We have to do a copy for now */
-memcpy(rte_pktmbuf_mtod(pkts[txcnt], void *),
-   dp_packet_data(packet), size);
-dp_packet_set_size((struct dp_packet *)pkts[txcnt], size);
 dp_packet_copy_mbuf_flags((struct dp_packet *)pkts[txcnt], packet);
 
 txcnt++;
-- 
2.7.4

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v13 09/11] dpdk-tests: Add unit-tests for multi-seg mbufs.

2019-01-09 Thread Tiago Lam
In order to create a minimal environment that allows the tests to get
mbufs from an existing mempool, the following approach is taken:
- EAL is initialised (by using the main dpdk_init()) and a (very) small
  mempool is instantiated (mimicking the logic in dpdk_mp_create()).
  This mempool instance is global and used by all the tests;
- Packets are then allocated from the instantiated mempool, and tested
  on, by running some operations on them and manipulating data.
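
As a sketch, and assuming the global "test-mp" mempool ('mp') has been
set up as described above, an individual test obtains a DPBUF_DPDK
packet along these lines:

    /* In DPDK builds the rte_mbuf is the first member of struct
     * dp_packet, so the allocated mbuf doubles as the packet. */
    struct rte_mbuf *mbuf = rte_pktmbuf_alloc(mp);
    ovs_assert(mbuf);
    struct dp_packet *pkt = (struct dp_packet *) mbuf;
    /* ... exercise dp_packet_put(), dp_packet_shift(), etc. on 'pkt'. */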

The tests introduced focus on testing DPDK dp_packets (where
source=DPBUF_DPDK), linked with a single or multiple mbufs, across
several operations, such as:
- dp_packet_put();
- dp_packet_shift();
- dp_packet_reserve();
- dp_packet_push_uninit();
- dp_packet_clear();
- dp_packet_equal();
- dp_packet_linear_data();
- And as a consequence of some of these, dp_packet_put_uninit() and
  dp_packet_resize__().

Finally, this has also been integrated with the new DPDK testsuite.
Thus, when running `$sudo make check-dpdk` one will also be running
these tests.

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
Acked-by: Flavio Leitner 
---
 tests/automake.mk  |  10 +-
 tests/dpdk-packet-mbufs.at |   7 +
 tests/system-dpdk-testsuite.at |   1 +
 tests/test-dpdk-mbufs.c| 722 +
 4 files changed, 739 insertions(+), 1 deletion(-)
 create mode 100644 tests/dpdk-packet-mbufs.at
 create mode 100644 tests/test-dpdk-mbufs.c

diff --git a/tests/automake.mk b/tests/automake.mk
index 92d56b2..6e6237c 100644
--- a/tests/automake.mk
+++ b/tests/automake.mk
@@ -168,7 +168,8 @@ SYSTEM_DPDK_TESTSUITE_AT = \
tests/system-common-macros.at \
tests/system-dpdk-macros.at \
tests/system-dpdk-testsuite.at \
-   tests/system-dpdk.at
+   tests/system-dpdk.at \
+   tests/dpdk-packet-mbufs.at
 
 check_SCRIPTS += tests/atlocal
 
@@ -423,6 +424,10 @@ tests_ovstest_SOURCES = \
tests/test-vconn.c \
tests/test-aa.c \
tests/test-stopwatch.c
+if DPDK_NETDEV
+tests_ovstest_SOURCES += \
+   tests/test-dpdk-mbufs.c
+endif
 
 if !WIN32
 tests_ovstest_SOURCES += \
@@ -435,6 +440,9 @@ tests_ovstest_SOURCES += \
 endif
 
 tests_ovstest_LDADD = lib/libopenvswitch.la ovn/lib/libovn.la
+if DPDK_NETDEV
+tests_ovstest_LDFLAGS = $(AM_LDFLAGS) $(DPDK_vswitchd_LDFLAGS)
+endif
 
 noinst_PROGRAMS += tests/test-stream
 tests_test_stream_SOURCES = tests/test-stream.c
diff --git a/tests/dpdk-packet-mbufs.at b/tests/dpdk-packet-mbufs.at
new file mode 100644
index 000..f28e4fc
--- /dev/null
+++ b/tests/dpdk-packet-mbufs.at
@@ -0,0 +1,7 @@
+AT_BANNER([OVS-DPDK dp_packet unit tests])
+
+AT_SETUP([OVS-DPDK dp_packet - mbufs allocation])
+AT_KEYWORDS([dp_packet, multi-seg, mbufs])
+AT_CHECK(ovstest test-dpdk-packet, [], [ignore], [ignore])
+
+AT_CLEANUP
diff --git a/tests/system-dpdk-testsuite.at b/tests/system-dpdk-testsuite.at
index 382f09e..f5edf58 100644
--- a/tests/system-dpdk-testsuite.at
+++ b/tests/system-dpdk-testsuite.at
@@ -23,3 +23,4 @@ m4_include([tests/system-common-macros.at])
 m4_include([tests/system-dpdk-macros.at])
 
 m4_include([tests/system-dpdk.at])
+m4_include([tests/dpdk-packet-mbufs.at])
diff --git a/tests/test-dpdk-mbufs.c b/tests/test-dpdk-mbufs.c
new file mode 100644
index 000..0c152bf
--- /dev/null
+++ b/tests/test-dpdk-mbufs.c
@@ -0,0 +1,722 @@
+/*
+ * Copyright (c) 2018 Intel Corporation
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "dp-packet.h"
+#include "ovstest.h"
+#include "dpdk.h"
+#include "smap.h"
+#include "csum.h"
+#include "crc32c.h"
+
+#define N_MBUFS 1024
+#define MBUF_DATA_LEN 2048
+
+static int num_tests = 0;
+
+/* Global var to hold a mempool instance, "test-mp", used in all of the tests
+ * below. This instance is instantiated in dpdk_setup_eal_with_mp(). */
+static struct rte_mempool *mp;
+
+/* Test data used to fill the packets with data. Note that this isn't a string
+ * that represents a valid packet, by any means. The pattern is generated in
+ * set_testing_pattern_str() and the sole purpose is to verify the data remains
+ * the same after inserting and operating on multi-segment mbufs. */
+static char *test_str;
+
+/* Asserts a dp_packet that holds a single mbuf, where:
+ * - nb_segs must be 1;
+ * - pkt_len must be eq

[ovs-dev] [PATCH v13 08/11] netdev-dpdk: support multi-segment jumbo frames

2019-01-09 Thread Tiago Lam
From: Mark Kavanagh 

Currently, jumbo frame support for OvS-DPDK is implemented by
increasing the size of mbufs within a mempool, such that each mbuf
within the pool is large enough to contain an entire jumbo frame of
a user-defined size. Typically, for each user-defined MTU,
'requested_mtu', a new mempool is created, containing mbufs of size
~requested_mtu.

With the multi-segment approach, a port uses a single mempool,
(containing standard/default-sized mbufs of ~2k bytes), irrespective
of the user-requested MTU value. To accommodate jumbo frames, mbufs
are chained together, where each mbuf in the chain stores a portion of
the jumbo frame. Each mbuf in the chain is termed a segment, hence the
name.

== Enabling multi-segment mbufs ==
Multi-segment and single-segment mbufs are mutually exclusive, and the
user must decide on which approach to adopt on init. The introduction
of a new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this. This
is a global boolean value, which determines how jumbo frames are
represented across all DPDK ports. In the absence of a user-supplied
value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment
mbufs must be explicitly enabled / single-segment mbufs remain the
default.

Setting the field is identical to setting existing DPDK-specific OVSDB
fields:

ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true

Co-authored-by: Tiago Lam 

Signed-off-by: Mark Kavanagh 
Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 Documentation/topics/dpdk/jumbo-frames.rst | 73 ++
 Documentation/topics/dpdk/memory.rst   | 36 +++
 NEWS   |  1 +
 lib/dpdk.c |  8 
 lib/netdev-dpdk.c  | 66 +++
 lib/netdev-dpdk.h  |  1 +
 vswitchd/vswitch.xml   | 22 +
 7 files changed, 199 insertions(+), 8 deletions(-)

diff --git a/Documentation/topics/dpdk/jumbo-frames.rst 
b/Documentation/topics/dpdk/jumbo-frames.rst
index 00360b4..9804bbb 100644
--- a/Documentation/topics/dpdk/jumbo-frames.rst
+++ b/Documentation/topics/dpdk/jumbo-frames.rst
@@ -71,3 +71,76 @@ Jumbo frame support has been validated against 9728B frames, 
which is the
 largest frame size supported by Fortville NIC using the DPDK i40e driver, but
 larger frames and other DPDK NIC drivers may be supported. These cases are
 common for use cases involving East-West traffic only.
+
+-------------------
+Multi-segment mbufs
+-------------------
+
+Instead of increasing the size of mbufs within a mempool, such that each mbuf
+within the pool is large enough to contain an entire jumbo frame of a
+user-defined size, mbufs can be chained together instead. In this approach each
+mbuf in the chain stores a portion of the jumbo frame, by default ~2K bytes,
+irrespective of the user-requested MTU value. Since each mbuf in the chain is
+termed a segment, this approach is named "multi-segment mbufs".
+
+This approach may bring more flexibility in use cases where the maximum packet
+length may be hard to guess. For example, in cases where packets originate from
+sources marked for offload (such as TSO), each packet may be larger than the
+MTU, and as such, when forwarding it to a DPDK port a single mbuf may not be
+enough to hold all of the packet's data.
+
+Multi-segment and single-segment mbufs are mutually exclusive, and the user
+must decide on which approach to adopt on initialisation. If multi-segment
+mbufs is to be enabled, it can be done so with the following command::
+
+$ ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
+
+Single-segment mbufs still remain the default when using OvS-DPDK, and the
+above option `dpdk-multi-seg-mbufs` must be explicitly set to `true` if
+multi-segment mbufs are to be used.
+
+~~~~~~~~~~~~~~~~~
+Performance notes
+~~~~~~~~~~~~~~~~~
+
+When using multi-segment mbufs some PMDs may not support vectorized Tx
+functions, due to their non-contiguous nature. As a result this can hit
+performance for smaller packet sizes. For example, on a setup sending 64B
+packets at line rate, a decrease of ~20% has been observed. The performance
+impact stops being noticeable for larger packet sizes, although the exact size
+will depend on each PMD, and vary between architectures.
+
+Tests performed with the i40e PMD driver only showed this limitation for 64B
+packets, and the same rate was observed when comparing multi-segment mbufs and
+single-segment mbuf for 128B packets. In other words, the 20% drop in
+performance was not observed for packets >= 128B during this test case.
+
+Because of this, multi-segment mbufs are not advised to be used with small

[ovs-dev] [PATCH v13 05/11] dp-packet: copy data from multi-seg. DPDK mbuf

2019-01-09 Thread Tiago Lam
From: Michael Qiu 

When doing a packet clone, if the packet source is a DPDK driver,
multi-segment mbufs must be considered, and each segment's data copied
one by one.

Also, a lot of the DPDK mbuf's info is lost during a copy, like packet
type, ol_flags, etc.  That information is very important for DPDK to do
packet processing.

Co-authored-by: Mark Kavanagh 
Co-authored-by: Tiago Lam 

Signed-off-by: Michael Qiu 
Signed-off-by: Mark Kavanagh 
Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 lib/dp-packet.c   | 62 ++-
 lib/dp-packet.h   | 29 ++
 lib/netdev-dpdk.c |  1 +
 3 files changed, 73 insertions(+), 19 deletions(-)

diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index 9ea3f78..4e4b8fb 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -159,39 +159,63 @@ dp_packet_clone(const struct dp_packet *buffer)
 return dp_packet_clone_with_headroom(buffer, 0);
 }
 
+#ifdef DPDK_NETDEV
+struct dp_packet *
+dp_packet_clone_with_headroom(const struct dp_packet *b, size_t headroom) {
+struct dp_packet *new_buffer;
+uint32_t pkt_len = dp_packet_size(b);
+
+/* Copy multi-seg data. */
+if (b->source == DPBUF_DPDK && !rte_pktmbuf_is_contiguous(&b->mbuf)) {
+void *dst = NULL;
+struct rte_mbuf *mbuf = CONST_CAST(struct rte_mbuf *, &b->mbuf);
+
+new_buffer = dp_packet_new_with_headroom(pkt_len, headroom);
+dst = dp_packet_data(new_buffer);
+dp_packet_set_size(new_buffer, pkt_len);
+
+if (!rte_pktmbuf_read(mbuf, 0, pkt_len, dst)) {
+dp_packet_delete(new_buffer);
+return NULL;
+}
+} else {
+new_buffer = dp_packet_clone_data_with_headroom(dp_packet_data(b),
+dp_packet_size(b),
+headroom);
+}
+
+dp_packet_copy_common_members(new_buffer, b);
+
+dp_packet_copy_mbuf_flags(new_buffer, b);
+if (dp_packet_rss_valid(new_buffer)) {
+new_buffer->mbuf.hash.rss = b->mbuf.hash.rss;
+}
+
+return new_buffer;
+}
+#else
 /* Creates and returns a new dp_packet whose data are copied from 'buffer'.
  * The returned dp_packet will additionally have 'headroom' bytes of
  * headroom. */
 struct dp_packet *
-dp_packet_clone_with_headroom(const struct dp_packet *buffer, size_t headroom)
+dp_packet_clone_with_headroom(const struct dp_packet *b, size_t headroom)
 {
 struct dp_packet *new_buffer;
+uint32_t pkt_len = dp_packet_size(b);
 
-new_buffer = dp_packet_clone_data_with_headroom(dp_packet_data(buffer),
- dp_packet_size(buffer),
- headroom);
-/* Copy the following fields into the returned buffer: l2_pad_size,
- * l2_5_ofs, l3_ofs, l4_ofs, cutlen, packet_type and md. */
-memcpy(_buffer->l2_pad_size, >l2_pad_size,
-sizeof(struct dp_packet) -
-offsetof(struct dp_packet, l2_pad_size));
+new_buffer = dp_packet_clone_data_with_headroom(dp_packet_data(b),
+pkt_len, headroom);
 
-#ifdef DPDK_NETDEV
-new_buffer->mbuf.ol_flags = buffer->mbuf.ol_flags;
-#else
-new_buffer->rss_hash_valid = buffer->rss_hash_valid;
-#endif
+dp_packet_copy_common_members(new_buffer, b);
 
+new_buffer->rss_hash_valid = b->rss_hash_valid;
 if (dp_packet_rss_valid(new_buffer)) {
-#ifdef DPDK_NETDEV
-new_buffer->mbuf.hash.rss = buffer->mbuf.hash.rss;
-#else
-new_buffer->rss_hash = buffer->rss_hash;
-#endif
+new_buffer->rss_hash = b->rss_hash;
 }
 
 return new_buffer;
 }
+#endif
 
 /* Creates and returns a new dp_packet that initially contains a copy of the
  * 'size' bytes of data starting at 'data' with no headroom or tailroom. */
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index 08e79af..a003546 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -129,6 +129,10 @@ struct dp_packet *dp_packet_clone_data_with_headroom(const 
void *, size_t,
  size_t headroom);
 static inline void dp_packet_delete(struct dp_packet *);
 
+static inline void
+dp_packet_copy_common_members(struct dp_packet *new_b,
+  const struct dp_packet *b);
+
 static inline void *dp_packet_at(const struct dp_packet *, size_t offset,
  size_t size);
 static inline void *dp_packet_at_assert(const struct dp_packet *,
@@ -139,6 +143,8 @@ dp_packet_mbuf_from_offset(const struct dp_packet *b, 
size_t *offset);
 void
 dp_packet_mbuf_write(struct rte_mbuf *mbuf, int16_t ofs, uint32_t len,
  const void *data);
+static inline void
+dp_packet_copy_mbuf_flags(struct dp_packet *dst, const struct dp_packet *src);
 #endif
 static inline void *dp_packet_tail(co

[ovs-dev] [PATCH v13 03/11] dp-packet: Handle multi-seg mbufs in helper funcs.

2019-01-09 Thread Tiago Lam
Most helper functions in dp-packet assume that the data held by a
dp_packet is contiguous, and perform operations such as pointer
arithmetic under that assumption. However, with the introduction of
multi-segment mbufs, where data is non-contiguous, such assumptions are
no longer possible. Some examples of such helper functions are
dp_packet_tail(), dp_packet_tailroom(), dp_packet_end(),
dp_packet_get_allocated() and dp_packet_at().

Thus, instead of assuming contiguous data in dp_packet, they now
iterate over the (non-contiguous) data in mbufs to perform their
calculations.
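
The shared idiom is a walk over the mbuf chain that consumes the
requested offset segment by segment (sketch; it mirrors the
dp_packet_at() hunk below):

    const struct rte_mbuf *seg = &b->mbuf;
    while (seg && offset > seg->data_len) {
        offset -= seg->data_len;
        seg = seg->next;
    }
    /* 'seg' now holds byte 'offset', or is NULL if out of bounds. */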

Finally, dp_packet_use__() has also been modified to perform the
initialisation of the packet (and setting the source) before continuing
to set its size and data length, which now depends on the type of
packet.

Co-authored-by: Mark Kavanagh 

Signed-off-by: Mark Kavanagh 
Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
Acked-by: Flavio Leitner 
---
 lib/dp-packet.c |   4 +-
 lib/dp-packet.h | 192 +++-
 2 files changed, 178 insertions(+), 18 deletions(-)

diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index 93b0e9c..f54119d 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -41,11 +41,11 @@ static void
 dp_packet_use__(struct dp_packet *b, void *base, size_t allocated,
  enum dp_packet_source source)
 {
+dp_packet_init__(b, allocated, source);
+
 dp_packet_set_base(b, base);
 dp_packet_set_data(b, base);
 dp_packet_set_size(b, 0);
-
-dp_packet_init__(b, allocated, source);
 }
 
 /* Initializes 'b' as an empty dp_packet that contains the 'allocated' bytes of
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index 72a8043..8947477 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -133,6 +133,10 @@ static inline void *dp_packet_at(const struct dp_packet *, 
size_t offset,
  size_t size);
 static inline void *dp_packet_at_assert(const struct dp_packet *,
 size_t offset, size_t size);
+#ifdef DPDK_NETDEV
+static inline const struct rte_mbuf *
+dp_packet_mbuf_from_offset(const struct dp_packet *b, size_t *offset);
+#endif
 static inline void *dp_packet_tail(const struct dp_packet *);
 static inline void *dp_packet_end(const struct dp_packet *);
 
@@ -185,9 +189,25 @@ dp_packet_delete(struct dp_packet *b)
 static inline void *
 dp_packet_at(const struct dp_packet *b, size_t offset, size_t size)
 {
-return offset + size <= dp_packet_size(b)
-   ? (char *) dp_packet_data(b) + offset
-   : NULL;
+if (offset + size > dp_packet_size(b)) {
+return NULL;
+}
+
+#ifdef DPDK_NETDEV
+if (b->source == DPBUF_DPDK) {
+struct rte_mbuf *buf = CONST_CAST(struct rte_mbuf *, &b->mbuf);
+
+while (buf && offset > buf->data_len) {
+offset -= buf->data_len;
+
+buf = buf->next;
+}
+
+return buf ? rte_pktmbuf_mtod_offset(buf, char *, offset) : NULL;
+}
+#endif
+
+return (char *) dp_packet_data(b) + offset;
 }
 
 /* Returns a pointer to byte 'offset' in 'b', which must contain at least
@@ -196,13 +216,23 @@ static inline void *
 dp_packet_at_assert(const struct dp_packet *b, size_t offset, size_t size)
 {
 ovs_assert(offset + size <= dp_packet_size(b));
-return ((char *) dp_packet_data(b)) + offset;
+return dp_packet_at(b, offset, size);
 }
 
 /* Returns a pointer to byte following the last byte of data in use in 'b'. */
 static inline void *
 dp_packet_tail(const struct dp_packet *b)
 {
+#ifdef DPDK_NETDEV
+if (b->source == DPBUF_DPDK) {
+struct rte_mbuf *buf = CONST_CAST(struct rte_mbuf *, &b->mbuf);
+/* Find the last segment where data ends, meaning the tail of the chained
+ * mbufs must be there. */
+buf = rte_pktmbuf_lastseg(buf);
+
+return rte_pktmbuf_mtod_offset(buf, void *, buf->data_len);
+}
+#endif
 return (char *) dp_packet_data(b) + dp_packet_size(b);
 }
 
@@ -211,6 +241,15 @@ dp_packet_tail(const struct dp_packet *b)
 static inline void *
 dp_packet_end(const struct dp_packet *b)
 {
+#ifdef DPDK_NETDEV
+if (b->source == DPBUF_DPDK) {
+struct rte_mbuf *buf = CONST_CAST(struct rte_mbuf *, &(b->mbuf));
+
+buf = rte_pktmbuf_lastseg(buf);
+
+return (char *) buf->buf_addr + buf->buf_len;
+}
+#endif
 return (char *) dp_packet_base(b) + dp_packet_get_allocated(b);
 }
 
@@ -236,6 +275,15 @@ dp_packet_tailroom(const struct dp_packet *b)
 static inline void
 dp_packet_clear(struct dp_packet *b)
 {
+#ifdef DPDK_NETDEV
+if (b->source == DPBUF_DPDK) {
+/* sets pkt_len and data_len to zero and frees unused mbufs */
+dp_packet_set_size(b, 0);
+rte_pktmbuf_reset(&b->mbuf);
+
+return;
+}
+#endif
 dp_packet_set_data(b, dp_packet_base(b));
 dp_packet_set_size(b, 0);
 }
@@ -248,28 +296,47 @@ dp_packet_pull(st

[ovs-dev] [PATCH v13 00/11] Support multi-segment mbufs

2019-01-09 Thread Tiago Lam
and dp_packet_at();
- Added support for shifting packet data in multi-seg mbufs, using
  dp_packet_shift();
- Fixed some minor inconsistencies.

Note that some of the changes in v5 have been contributed by Mark
    Kavanagh as well.

v4: - restructure patchset
- account for 128B ARM cacheline when sizing mbufs

Mark Kavanagh (2):
  netdev-dpdk: copy large packet to multi-seg. mbufs
  netdev-dpdk: support multi-segment jumbo frames

Michael Qiu (1):
  dp-packet: copy data from multi-seg. DPDK mbuf

Tiago Lam (8):
  netdev-dpdk: Serialise non-pmds mbufs' alloc/free.
  dp-packet: Fix data_len handling multi-seg mbufs.
  dp-packet: Handle multi-seg mbufs in helper funcs.
  dp-packet: Handle multi-seg mbufs in shift() func.
  dp-packet: Add support for data "linearization".
  dpdk-tests: Add unit-tests for multi-seg mbufs.
  dpdk-tests: Accept other configs in OVS_DPDK_START
  dpdk-tests: End-to-end tests for multi-seg mbufs.

 Documentation/topics/dpdk/jumbo-frames.rst |  73 +++
 Documentation/topics/dpdk/memory.rst   |  36 ++
 NEWS   |   1 +
 lib/bfd.c  |   3 +-
 lib/conntrack-private.h|   4 +-
 lib/conntrack.c|  43 +-
 lib/crc32c.c   |  35 +-
 lib/crc32c.h   |   2 +
 lib/dp-packet.c| 189 +++-
 lib/dp-packet.h| 480 +--
 lib/dpdk.c |   8 +
 lib/dpif-netdev.c  |  13 +-
 lib/dpif-netlink.c |   5 +
 lib/dpif.c |   9 +
 lib/flow.c |  97 +++-
 lib/flow.h |   2 +-
 lib/mcast-snooping.c   |   2 +
 lib/netdev-bsd.c   |   5 +
 lib/netdev-dpdk.c  | 187 +++-
 lib/netdev-dpdk.h  |   1 +
 lib/netdev-dummy.c |  13 +-
 lib/netdev-linux.c |  13 +-
 lib/netdev-native-tnl.c|  26 +-
 lib/odp-execute.c  |  12 +-
 lib/ovs-lldp.c |   3 +-
 lib/packets.c  |  85 +++-
 lib/packets.h  |   7 +
 ofproto/ofproto-dpif-upcall.c  |  20 +-
 ofproto/ofproto-dpif-xlate.c   |  32 +-
 tests/automake.mk  |  10 +-
 tests/dpdk-packet-mbufs.at |   7 +
 tests/system-dpdk-macros.at|   6 +-
 tests/system-dpdk-testsuite.at |   1 +
 tests/system-dpdk.at   |  65 +++
 tests/test-dpdk-mbufs.c| 722 +
 tests/test-rstp.c  |   8 +-
 tests/test-stp.c   |   8 +-
 vswitchd/vswitch.xml   |  22 +
 38 files changed, 2088 insertions(+), 167 deletions(-)
 create mode 100644 tests/dpdk-packet-mbufs.at
 create mode 100644 tests/test-dpdk-mbufs.c

-- 
2.7.4

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v13 01/11] netdev-dpdk: Serialise non-pmds mbufs' alloc/free.

2019-01-09 Thread Tiago Lam
A new mutex, 'nonpmd_mp_mutex', has been introduced to serialise
allocation and free operations by non-pmd threads on a given mempool.

free_dpdk_buf() has been modified to make use of the introduced mutex.

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
Acked-by: Flavio Leitner 
---
 lib/netdev-dpdk.c | 31 +--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 320422b..ea84c41 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -318,6 +318,16 @@ static struct ovs_mutex dpdk_mp_mutex 
OVS_ACQ_AFTER(dpdk_mutex)
 static struct ovs_list dpdk_mp_list OVS_GUARDED_BY(dpdk_mp_mutex)
 = OVS_LIST_INITIALIZER(&dpdk_mp_list);
 
+/* This mutex must be used by non pmd threads when allocating or freeing
+ * mbufs through mempools, when outside of the `non_pmd_mutex` mutex, in struct
+ * dp_netdev.
+ * The reason, as pointed out in the "Known Issues" section in DPDK's EAL docs,
+ * is that the implementation mempools are based on is non-preemptible.
+ * Since non-pmds may end up not being pinned this could lead to the preemption
+ * between non-pmds performing operations on the same mempool, which could lead
+ * to memory corruption. */
+static struct ovs_mutex nonpmd_mp_mutex = OVS_MUTEX_INITIALIZER;
+
 struct dpdk_mp {
  struct rte_mempool *mp;
  int mtu;
@@ -490,6 +500,8 @@ struct netdev_rxq_dpdk {
 dpdk_port_t port_id;
 };
 
+static bool dpdk_thread_is_pmd(void);
+
 static void netdev_dpdk_destruct(struct netdev *netdev);
 static void netdev_dpdk_vhost_destruct(struct netdev *netdev);
 
@@ -523,6 +535,12 @@ dpdk_buf_size(int mtu)
 + RTE_PKTMBUF_HEADROOM;
 }
 
+static bool
+dpdk_thread_is_pmd(void)
+{
+ return rte_lcore_id() != NON_PMD_CORE_ID;
+}
+
 /* Allocates an area of 'sz' bytes from DPDK.  The memory is zero'ed.
  *
  * Unlike xmalloc(), this function can return NULL on failure. */
@@ -535,9 +553,18 @@ dpdk_rte_mzalloc(size_t sz)
 void
 free_dpdk_buf(struct dp_packet *p)
 {
-struct rte_mbuf *pkt = (struct rte_mbuf *) p;
+/* If non-pmd we need to lock on the nonpmd_mp_mutex. */
+if (!dpdk_thread_is_pmd()) {
+ovs_mutex_lock(&nonpmd_mp_mutex);
+
+rte_pktmbuf_free(&p->mbuf);
+
+ovs_mutex_unlock(&nonpmd_mp_mutex);
+
+return;
+}
 
-rte_pktmbuf_free(pkt);
+rte_pktmbuf_free(&p->mbuf);
 }
 
 static void
-- 
2.7.4

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v13 04/11] dp-packet: Handle multi-seg mbufs in shift() func.

2019-01-09 Thread Tiago Lam
In its current implementation dp_packet_shift() is also unaware of
multi-seg mbufs (which hold data in memory non-contiguously) and assumes
that data exists contiguously in memory, memmove'ing data to perform the
shift.

To add support for multi-seg mbufs a new set of functions was
introduced, dp_packet_mbuf_shift() and dp_packet_mbuf_write(). These
functions are used by dp_packet_shift(), when handling multi-seg mbufs,
to shift and write data within a chain of mbufs.

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 lib/dp-packet.c | 105 
 lib/dp-packet.h |   3 ++
 2 files changed, 108 insertions(+)

diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index f54119d..9ea3f78 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -23,6 +23,11 @@
 #include "openvswitch/dynamic-string.h"
 #include "util.h"
 
+#ifdef DPDK_NETDEV
+#define MBUF_BUF_END(BUF_ADDR, BUF_LEN) \
+(char *) (((char *) BUF_ADDR) + BUF_LEN)
+#endif
+
 static void
 dp_packet_init__(struct dp_packet *b, size_t allocated, enum dp_packet_source 
source)
 {
@@ -290,6 +295,100 @@ dp_packet_prealloc_headroom(struct dp_packet *b, size_t 
size)
 }
 }
 
+#ifdef DPDK_NETDEV
+/* Write len data bytes in a mbuf at specified offset.
+ *
+ * 'mbuf', pointer to the destination mbuf where 'ofs' is, and the mbuf where
+ * the data will first be written.
+ * 'ofs', the offset within the provided 'mbuf' where 'data' is to be written.
+ * 'len', the size of the to be written 'data'.
+ * 'data', pointer to the to be written bytes.
+ *
+ * Note: This function is the counterpart of the `rte_pktmbuf_read()` function
+ * available with DPDK, in the rte_mbuf.h */
+void
+dp_packet_mbuf_write(struct rte_mbuf *mbuf, int16_t ofs, uint32_t len,
+ const void *data)
+{
+char *dst_addr;
+uint16_t data_len;
+int len_copy;
+while (mbuf) {
+if (len == 0) {
+break;
+}
+
+dst_addr = rte_pktmbuf_mtod_offset(mbuf, char *, ofs);
+data_len = MBUF_BUF_END(mbuf->buf_addr, mbuf->buf_len) - dst_addr;
+
+len_copy = MIN(len, data_len);
+/* We don't know if 'data' is the result of a rte_pktmbuf_read() call,
+ * in which case we may end up writing to the same region of memory we
+ * are reading from and overlapping. Hence the use of memmove() here */
+memmove(dst_addr, data, len_copy);
+
+data = ((char *) data) + len_copy;
+len -= len_copy;
+ofs = 0;
+
+mbuf->data_len = len_copy;
+mbuf = mbuf->next;
+}
+}
+
+static void
+dp_packet_mbuf_shift_(struct rte_mbuf *dbuf, int16_t dst_ofs,
+  const struct rte_mbuf *sbuf, uint16_t src_ofs, int len)
+{
+char *rd = xmalloc(sizeof(*rd) * len);
+const char *wd = rte_pktmbuf_read(sbuf, src_ofs, len, rd);
+
+ovs_assert(wd);
+
+dp_packet_mbuf_write(dbuf, dst_ofs, len, wd);
+
+free(rd);
+}
+
+/* Similarly to dp_packet_shift(), shifts the data within the mbufs of a
+ * dp_packet of DPBUF_DPDK source by 'delta' bytes.
+ * Caller must make sure of the following conditions:
+ * - When shifting left, delta can't be bigger than the data_len available in
+ *   the last mbuf;
+ * - When shifting right, delta can't be bigger than the space available in the
+ *   first mbuf (buf_len - data_off).
+ * Both these conditions guarantee that a shift operation doesn't fall outside
+ * the bounds of the existing mbufs, so that the first and last mbufs (when
+ * using multi-segment mbufs), remain the same. */
+static void
+dp_packet_mbuf_shift(struct dp_packet *b, int delta)
+{
+uint16_t src_ofs;
+int16_t dst_ofs;
+
+struct rte_mbuf *mbuf = CONST_CAST(struct rte_mbuf *, &b->mbuf);
+struct rte_mbuf *tmbuf = rte_pktmbuf_lastseg(mbuf);
+
+if (delta < 0) {
+ovs_assert(-delta <= tmbuf->data_len);
+} else {
+ovs_assert(delta < (mbuf->buf_len - mbuf->data_off));
+}
+
+/* Set the destination and source offsets to copy to */
+dst_ofs = delta;
+src_ofs = 0;
+
+/* Shift data from src mbuf and offset to dst mbuf and offset */
+dp_packet_mbuf_shift_(mbuf, dst_ofs, mbuf, src_ofs,
+  rte_pktmbuf_pkt_len(mbuf));
+
+/* Update mbufs' properties, and if using multi-segment mbufs, first and
+ * last mbuf's data_len also needs to be adjusted */
+mbuf->data_off = mbuf->data_off + dst_ofs;
+}
+#endif
+
 /* Shifts all of the data within the allocated space in 'b' by 'delta' bytes.
  * For example, a 'delta' of 1 would cause each byte of data to move one byte
  * forward (from address 'p' to 'p+1'), and a 'delta' of -1 would cause each
@@ -302,6 +401,12 @@ dp_packet_shift(struct dp_packet *b, int delta)
: true);
 
 if (delta != 0) {
+#ifdef DPDK_NETDEV
+if (b->source == DPBUF_DPDK) {
+dp_packet_mbuf_shift(b, delta);
+return;
+

[ovs-dev] [PATCH v13 02/11] dp-packet: Fix data_len handling multi-seg mbufs.

2019-01-09 Thread Tiago Lam
When a dp_packet is from a DPDK source, and it contains multi-segment
mbufs, the data_len is not equal to the packet size, pkt_len. Instead,
the data_len of each mbuf in the chain should be considered while
distributing the new (provided) size.

To account for the above, dp_packet_set_size() has been changed so that,
in the multi-segment mbufs case, only the data_len on the last mbuf of
the chain and the total size of the packet, pkt_len, are changed. The
data_len on the intermediate mbufs preceding the last mbuf is not
changed by dp_packet_set_size(). Furthermore, in some cases
dp_packet_set_size() may be used to set a smaller size than the current
packet size, thus effectively trimming the end of the packet. In the
multi-segment mbufs case this may lead to lingering mbufs that may need
freeing.
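
A worked example of the trimming (the numbers are illustrative only):
given a chain of three mbufs with data_len 2048 / 2048 / 904 (pkt_len
5000), dp_packet_set_size(b, 3000) keeps the first mbuf at 2048, sets
the second to 952, frees the third via free_dpdk_buf(), and leaves
pkt_len at 3000 across two segments.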

__dp_packet_set_data() now also updates an mbuf's data_len after setting
the data offset. This is so that both fields are always in sync for each
mbuf in a chain.

Co-authored-by: Michael Qiu 
Co-authored-by: Mark Kavanagh 
Co-authored-by: Przemyslaw Lal 
Co-authored-by: Marcin Ksiadz 
Co-authored-by: Yuanhan Liu 

Signed-off-by: Michael Qiu 
Signed-off-by: Mark Kavanagh 
Signed-off-by: Przemyslaw Lal 
Signed-off-by: Marcin Ksiadz 
Signed-off-by: Yuanhan Liu 
Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 lib/dp-packet.h | 101 +---
 1 file changed, 89 insertions(+), 12 deletions(-)

diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index 7b85dd9..72a8043 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -426,20 +426,78 @@ dp_packet_size(const struct dp_packet *b)
 return b->mbuf.pkt_len;
 }
 
+/* Sets the size of the packet 'b' to 'v'. For non-DPDK packets this only means
+ * setting b->size_, but if used in a DPDK packet it means adjusting the first
+ * mbuf pkt_len and last mbuf data_len, to reflect the real size, which can
+ * lead to free'ing tail mbufs that are no longer used.
+ *
+ * This function should be used for setting the size only, and if there's an
+ * assumption that the tail end of 'b' will be trimmed. For adjusting the head
+ * 'end' of 'b', dp_packet_pull() should be used instead. */
 static inline void
 dp_packet_set_size(struct dp_packet *b, uint32_t v)
 {
-/* netdev-dpdk does not currently support segmentation; consequently, for
- * all intents and purposes, 'data_len' (16 bit) and 'pkt_len' (32 bit) may
- * be used interchangably.
- *
- * On the datapath, it is expected that the size of packets
- * (and thus 'v') will always be <= UINT16_MAX; this means that there is no
- * loss of accuracy in assigning 'v' to 'data_len'.
- */
-b->mbuf.data_len = (uint16_t)v;  /* Current seg length. */
-b->mbuf.pkt_len = v; /* Total length of all segments linked to
-  * this segment. */
+if (b->source == DPBUF_DPDK) {
+struct rte_mbuf *mbuf = &b->mbuf;
+uint16_t new_len = v;
+uint16_t data_len;
+uint16_t nb_segs = 0;
+uint16_t pkt_len = 0;
+
+/* Trim the chained buffers down to a total of 'v' bytes, freeing
+ * any buffers that may be left floating.
+ *
+ * For that traverse over the entire mbuf chain and, for each mbuf,
+ * subtract its 'data_len' from 'new_len' (initially set to 'v'), which
+ * essentially spreads 'new_len' between all existing mbufs in the
+ * chain. While traversing the mbuf chain, we end the traversal if:
+ * - 'new_len' reaches 0, meaning the passed 'v' has been
+ *   appropriately spread over the mbuf chain. The remaining mbufs are
+ *   freed;
+ * - We reach the last mbuf in the chain, in which case we set the last
+ *   mbuf's 'data_len' to the minimum value between the current
+ *   'new_len' (what's leftover from 'v') size and the maximum data the
+ *   mbuf can hold (mbuf->buf_len - mbuf->data_off).
+ *
+ * The above formula will thus make sure that when a 'v' is smaller
+ * than the overall 'pkt_len' (sum of all 'data_len'), it sets the new
+ * size and frees the leftover mbufs. On the other hand, if 'v' is
+ * bigger, it sets the size to the maximum available space, but no more
+ * than that. */
+while (mbuf) {
+data_len = MIN(new_len, mbuf->data_len);
+mbuf->data_len = data_len;
+
+if (new_len - data_len <= 0) {
+/* Free the rest of chained mbufs */
+free_dpdk_buf(CONTAINER_OF(mbuf->next, struct dp_packet,
+   mbuf));
+mbuf->next = NULL;
+} else if (!mbuf->next) {
+/* Don't assign more than what we have available */
+mbuf->data_len = MIN(new_len,
+   

[ovs-dev] [PATCH v12 11/11] dpdk-tests: End-to-end tests for multi-seg mbufs.

2019-01-04 Thread Tiago Lam
The following tests are added to the DPDK testsuite to add some
coverage for the multi-segment mbufs:
- Check that multi-segment mbufs are disabled by default;
- Check that providing `other_config:dpdk-multi-seg-mbufs=true` indeed
  enables mbufs;
- Using a DPDK port, send a random packet out and check that `ofctl
  dump-flows` shows the correct amount of packets and bytes sent.

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 tests/system-dpdk.at | 65 
 1 file changed, 65 insertions(+)

diff --git a/tests/system-dpdk.at b/tests/system-dpdk.at
index 1da020a..ca66135 100644
--- a/tests/system-dpdk.at
+++ b/tests/system-dpdk.at
@@ -232,3 +232,68 @@ OVS_VSWITCHD_STOP(["\@does not exist. The Open vSwitch 
kernel module is probably
 \@EAL: No free hugepages reported in hugepages-1048576kB@d"])
 AT_CLEANUP
dnl --------------------------------------------------------------------------
+
+AT_SETUP([Jumbo frames - Multi-segment disabled by default])
+OVS_DPDK_START()
+
+AT_CHECK([grep "multi-segment mbufs enabled" ovs-vswitchd.log], [1], [])
+OVS_VSWITCHD_STOP("/Global register is changed during/d
+/EAL: No free hugepages reported in hugepages-1048576kB/d
+")
+AT_CLEANUP
+
+AT_SETUP([Jumbo frames - Multi-segment enabled])
+OVS_DPDK_START([dpdk-multi-seg-mbufs=true])
+AT_CHECK([grep "multi-segment mbufs enabled" ovs-vswitchd.log], [], [stdout])
+OVS_VSWITCHD_STOP("/Global register is changed during/d
+/EAL: No free hugepages reported in hugepages-1048576kB/d
+")
+AT_CLEANUP
+
+AT_SETUP([Jumbo frames - Multi-segment mbufs Tx])
+OVS_DPDK_PRE_CHECK()
+OVS_DPDK_START([per-port-memory=true dpdk-multi-seg-mbufs=true])
+
+dnl Add userspace bridge and attach it to OVS
+AT_CHECK([ovs-vsctl add-br br10 -- set bridge br10 datapath_type=netdev])
+AT_CHECK([ovs-vsctl add-port br10 dpdk0 \
+-- set Interface dpdk0 type=dpdk options:dpdk-devargs=$(cat PCI_ADDR) \
+-- set Interface dpdk0 mtu_request=9000], [], [stdout], [stderr])
+
+AT_CHECK([ovs-vsctl show], [], [stdout])
+
+dnl Add flows to send packets out from the 'dpdk0' port
+AT_CHECK([
+ovs-ofctl del-flows br10
+ovs-ofctl add-flow br10 in_port=LOCAL,actions=output:dpdk0
+], [], [stdout])
+
+AT_CHECK([ovs-ofctl dump-flows br10], [], [stdout])
+
+dnl Send a packet out of the 'dpdk0' port
+AT_CHECK([
+ARP_HEADER="09000B0009000A000806000108000604000100010A\
+0100020A02"
+dnl Build a random hex string to append to the ARP_HEADER
+RANDOM_BODY=$(printf '0102030405%.0s' {1..1750})
+dnl 8792B ARP packet
+RANDOM_ARP="$ARP_HEADER$RANDOM_BODY"
+
+ovs-ofctl packet-out br10 "packet=$RANDOM_ARP,action=resubmit:LOCAL"
+], [], [stdout])
+
+AT_CHECK([ovs-ofctl dump-flows br10], [0], [stdout])
+
+dnl Confirm the single packet has been sent with the correct size
+AT_CHECK([ovs-ofctl dump-flows br10 | ofctl_strip | grep in_port], [0], [dnl
+ n_packets=1, n_bytes=8792, in_port=LOCAL actions=output:1
+])
+
+dnl Clean up
+OVS_VSWITCHD_STOP("/does not exist. The Open vSwitch kernel module is probably 
not loaded./d
+/Failed to enable flow control/d
+/failed to connect to \/tmp\/dpdkvhostclient0: No such file or directory/d
+/Global register is changed during/d
+/EAL: No free hugepages reported in hugepages-1048576kB/d
+")
+AT_CLEANUP
-- 
2.7.4



[ovs-dev] [PATCH v12 10/11] dpdk-tests: Accept other configs in OVS_DPDK_START

2019-01-04 Thread Tiago Lam
As it stands, OVS_DPDK_START() won't allow other configs to be set
before starting the ovs-vswitchd daemon. This is a problem since some
configs, such as the "dpdk-multi-seg-mbufs=true" for enabling the
multi-segment mbufs, need to be set prior to starting OvS.

To support other options, OVS_DPDK_START() has been modified to accept
extra configs in the form "$config_name=$config_value". It then uses
ovs-vsctl to set the configs.
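
For instance, a test needing multiple settings can now pass them in one go,
as the multi-segment Tx test does:

    OVS_DPDK_START([per-port-memory=true dpdk-multi-seg-mbufs=true])

Each whitespace-separated word is applied as one
"ovs-vsctl --no-wait set Open_vSwitch . other_config:<config>" call.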

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
Acked-by: Flavio Leitner 
---
 tests/system-dpdk-macros.at | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/tests/system-dpdk-macros.at b/tests/system-dpdk-macros.at
index f772a19..63d17c1 100644
--- a/tests/system-dpdk-macros.at
+++ b/tests/system-dpdk-macros.at
@@ -33,7 +33,7 @@ m4_define([OVS_DPDK_PRE_PHY_SKIP],
 ])
 
 
-# OVS_DPDK_START()
+# OVS_DPDK_START([other-conf-args])
 #
 # Create an empty database and start ovsdb-server. Add special configuration
 # dpdk-init to enable DPDK functionality. Start ovs-vswitchd connected to that
@@ -60,6 +60,10 @@ m4_define([OVS_DPDK_START],
AT_CHECK([lscpu], [], [stdout])
AT_CHECK([cat stdout | grep "NUMA node(s)" | awk '{c=1; while (c++<$(3)) {printf "1024,"}; print "1024"}' > SOCKET_MEM])
AT_CHECK([ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem="$(cat SOCKET_MEM)"])
+   dnl Iterate through $other-conf-args list and include them
+   m4_foreach_w(opt, $1, [
+   AT_CHECK([ovs-vsctl --no-wait set Open_vSwitch . other_config:opt])
+   ])
 
dnl Start ovs-vswitchd.
AT_CHECK([ovs-vswitchd --detach --no-chdir --pidfile --log-file -vvconn -vofproto_dpif -vunixctl], [0], [stdout], [stderr])
-- 
2.7.4



[ovs-dev] [PATCH v12 09/11] dpdk-tests: Add unit-tests for multi-seg mbufs.

2019-01-04 Thread Tiago Lam
In order to create a minimal environment that allows the tests to get
mbufs from an existing mempool, the following approach is taken:
- EAL is initialised (by using the main dpdk_init()) and a (very) small
  mempool is instantiated (mimicking the logic in dpdk_mp_create()).
  This mempool instance is global and used by all the tests;
- Packets are then allocated from the instantiated mempool, and tested
  on, by running some operations on them and manipulating data.

The tests introduced focus on testing DPDK dp_packets (where
source=DPBUF_DPDK), linked with a single or multiple mbufs, across
several operations, such as:
- dp_packet_put();
- dp_packet_shift();
- dp_packet_reserve();
- dp_packet_push_uninit();
- dp_packet_clear();
- dp_packet_equal();
- dp_packet_linear_data();
- And as a consequence of some of these, dp_packet_put_uninit() and
  dp_packet_resize__().

Finally, this has also been integrated with the new DPDK testsuite.
Thus, when running `$ sudo make check-dpdk` one will also be running
these tests.
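
As a rough sketch of the pattern the tests follow (illustrative only; the
real code lives in tests/test-dpdk-mbufs.c):

    /* Allocate a dp_packet backed by an mbuf from the global "test-mp"
     * mempool, exercise the API under test, then verify and free. */
    struct rte_mbuf *mbuf = rte_pktmbuf_alloc(mp);
    struct dp_packet *pkt = (struct dp_packet *) mbuf;
    char data[64];

    memset(data, 0xab, sizeof data);
    dp_packet_put(pkt, data, sizeof data);
    ovs_assert(dp_packet_size(pkt) == sizeof data);

    free_dpdk_buf(pkt);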

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
Acked-by: Flavio Leitner 
---
 tests/automake.mk  |  10 +-
 tests/dpdk-packet-mbufs.at |   7 +
 tests/system-dpdk-testsuite.at |   1 +
 tests/test-dpdk-mbufs.c| 722 +
 4 files changed, 739 insertions(+), 1 deletion(-)
 create mode 100644 tests/dpdk-packet-mbufs.at
 create mode 100644 tests/test-dpdk-mbufs.c

diff --git a/tests/automake.mk b/tests/automake.mk
index 92d56b2..6e6237c 100644
--- a/tests/automake.mk
+++ b/tests/automake.mk
@@ -168,7 +168,8 @@ SYSTEM_DPDK_TESTSUITE_AT = \
tests/system-common-macros.at \
tests/system-dpdk-macros.at \
tests/system-dpdk-testsuite.at \
-   tests/system-dpdk.at
+   tests/system-dpdk.at \
+   tests/dpdk-packet-mbufs.at
 
 check_SCRIPTS += tests/atlocal
 
@@ -423,6 +424,10 @@ tests_ovstest_SOURCES = \
tests/test-vconn.c \
tests/test-aa.c \
tests/test-stopwatch.c
+if DPDK_NETDEV
+tests_ovstest_SOURCES += \
+   tests/test-dpdk-mbufs.c
+endif
 
 if !WIN32
 tests_ovstest_SOURCES += \
@@ -435,6 +440,9 @@ tests_ovstest_SOURCES += \
 endif
 
 tests_ovstest_LDADD = lib/libopenvswitch.la ovn/lib/libovn.la
+if DPDK_NETDEV
+tests_ovstest_LDFLAGS = $(AM_LDFLAGS) $(DPDK_vswitchd_LDFLAGS)
+endif
 
 noinst_PROGRAMS += tests/test-stream
 tests_test_stream_SOURCES = tests/test-stream.c
diff --git a/tests/dpdk-packet-mbufs.at b/tests/dpdk-packet-mbufs.at
new file mode 100644
index 000..f28e4fc
--- /dev/null
+++ b/tests/dpdk-packet-mbufs.at
@@ -0,0 +1,7 @@
+AT_BANNER([OVS-DPDK dp_packet unit tests])
+
+AT_SETUP([OVS-DPDK dp_packet - mbufs allocation])
+AT_KEYWORDS([dp_packet, multi-seg, mbufs])
+AT_CHECK(ovstest test-dpdk-packet, [], [ignore], [ignore])
+
+AT_CLEANUP
diff --git a/tests/system-dpdk-testsuite.at b/tests/system-dpdk-testsuite.at
index 382f09e..f5edf58 100644
--- a/tests/system-dpdk-testsuite.at
+++ b/tests/system-dpdk-testsuite.at
@@ -23,3 +23,4 @@ m4_include([tests/system-common-macros.at])
 m4_include([tests/system-dpdk-macros.at])
 
 m4_include([tests/system-dpdk.at])
+m4_include([tests/dpdk-packet-mbufs.at])
diff --git a/tests/test-dpdk-mbufs.c b/tests/test-dpdk-mbufs.c
new file mode 100644
index 000..0c152bf
--- /dev/null
+++ b/tests/test-dpdk-mbufs.c
@@ -0,0 +1,722 @@
+/*
+ * Copyright (c) 2018 Intel Corporation
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "dp-packet.h"
+#include "ovstest.h"
+#include "dpdk.h"
+#include "smap.h"
+#include "csum.h"
+#include "crc32c.h"
+
+#define N_MBUFS 1024
+#define MBUF_DATA_LEN 2048
+
+static int num_tests = 0;
+
+/* Global var to hold a mempool instance, "test-mp", used in all of the tests
+ * below. This instance is instantiated in dpdk_setup_eal_with_mp(). */
+static struct rte_mempool *mp;
+
+/* Test data used to fill the packets with data. Note that this isn't a string
+ * that represents a valid packet, by any means. The pattern is generated in
+ * set_testing_pattern_str() and the sole purpose is to verify the data remains
+ * the same after inserting and operating on multi-segment mbufs. */
+static char *test_str;
+
+/* Asserts a dp_packet that holds a single mbuf, where:
+ * - nb_segs must be 1;
+ * - pkt_len must be eq

[ovs-dev] [PATCH v12 08/11] netdev-dpdk: support multi-segment jumbo frames

2019-01-04 Thread Tiago Lam
From: Mark Kavanagh 

Currently, jumbo frame support for OvS-DPDK is implemented by
increasing the size of mbufs within a mempool, such that each mbuf
within the pool is large enough to contain an entire jumbo frame of
a user-defined size. Typically, for each user-defined MTU,
'requested_mtu', a new mempool is created, containing mbufs of size
~requested_mtu.

With the multi-segment approach, a port uses a single mempool,
(containing standard/default-sized mbufs of ~2k bytes), irrespective
of the user-requested MTU value. To accommodate jumbo frames, mbufs
are chained together, where each mbuf in the chain stores a portion of
the jumbo frame. Each mbuf in the chain is termed a segment, hence the
name.

== Enabling multi-segment mbufs ==
Multi-segment and single-segment mbufs are mutually exclusive, and the
user must decide on which approach to adopt on init. The introduction
of a new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this. This
is a global boolean value, which determines how jumbo frames are
represented across all DPDK ports. In the absence of a user-supplied
value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment
mbufs must be explicitly enabled / single-segment mbufs remain the
default.

Setting the field is identical to setting existing DPDK-specific OVSDB
fields:

ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true

Co-authored-by: Tiago Lam 

Signed-off-by: Mark Kavanagh 
Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 Documentation/topics/dpdk/jumbo-frames.rst | 73 ++
 Documentation/topics/dpdk/memory.rst   | 36 +++
 NEWS   |  1 +
 lib/dpdk.c |  8 
 lib/netdev-dpdk.c  | 66 +++
 lib/netdev-dpdk.h  |  1 +
 vswitchd/vswitch.xml   | 22 +
 7 files changed, 199 insertions(+), 8 deletions(-)

diff --git a/Documentation/topics/dpdk/jumbo-frames.rst 
b/Documentation/topics/dpdk/jumbo-frames.rst
index 00360b4..9804bbb 100644
--- a/Documentation/topics/dpdk/jumbo-frames.rst
+++ b/Documentation/topics/dpdk/jumbo-frames.rst
@@ -71,3 +71,76 @@ Jumbo frame support has been validated against 9728B frames, 
which is the
 largest frame size supported by Fortville NIC using the DPDK i40e driver, but
 larger frames and other DPDK NIC drivers may be supported. These cases are
 common for use cases involving East-West traffic only.
+
+-------------------
+Multi-segment mbufs
+-------------------
+
+Instead of increasing the size of mbufs within a mempool, such that each mbuf
+within the pool is large enough to contain an entire jumbo frame of a
+user-defined size, mbufs can be chained together instead. In this approach each
+mbuf in the chain stores a portion of the jumbo frame, by default ~2K bytes,
+irrespective of the user-requested MTU value. Since each mbuf in the chain is
+termed a segment, this approach is named "multi-segment mbufs".
+
+This approach may bring more flexibility in use cases where the maximum packet
+length may be hard to guess. For example, in cases where packets originate from
+sources marked for offload (such as TSO), each packet may be larger than the
+MTU, and as such, when forwarding it to a DPDK port a single mbuf may not be
+enough to hold all of the packet's data.
+
+Multi-segment and single-segment mbufs are mutually exclusive, and the user
+must decide on which approach to adopt on initialisation. If multi-segment
+mbufs is to be enabled, it can be done so with the following command::
+
+$ ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
+
+Single-segment mbufs still remain the default when using OvS-DPDK, and the
+above option `dpdk-multi-seg-mbufs` must be explicitly set to `true` if
+multi-segment mbufs are to be used.
+
+~~~~~~~~~~~~~~~~~
+Performance notes
+~~~~~~~~~~~~~~~~~
+
+When using multi-segment mbufs some PMDs may not support vectorized Tx
+functions, due to the data's non-contiguous nature. As a result this can hurt
+performance for smaller packet sizes. For example, on a setup sending 64B
+packets at line rate, a decrease of ~20% has been observed. The performance
+impact stops being noticeable for larger packet sizes, although the exact size
+will depend on each PMD, and vary between architectures.
+
+Tests performed with the i40e PMD driver only showed this limitation for 64B
+packets, and the same rate was observed when comparing multi-segment mbufs and
+single-segment mbufs for 128B packets. In other words, the 20% drop in
+performance was not observed for packets >= 128B during this test case.
+
+Because of this, multi-segment mbufs are not advised to be used with small

[ovs-dev] [PATCH v12 07/11] netdev-dpdk: copy large packet to multi-seg. mbufs

2019-01-04 Thread Tiago Lam
From: Mark Kavanagh 

Currently, packets are only copied to a single segment in the function
dpdk_do_tx_copy(). This could be an issue in the case of jumbo frames,
particularly when multi-segment mbufs are involved.

This patch calculates the number of segments needed by a packet and
copies the data to each segment.

A new function, dpdk_buf_alloc(), has also been introduced as a wrapper
around the nonpmd_mp_mutex to serialise allocations from a non-pmd
context.
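
As a quick illustration of the segment arithmetic involved (segment room is
assumed here): copying an 8792B packet into mbufs with ~2048B of room each
gives

    nb_segs = 8792 / 2048 = 4, remainder 600  =>  5 mbufs in the chain,

which is what the calculation in dpdk_copy_dp_packet_to_mbuf() below yields.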

Co-authored-by: Michael Qiu 
Co-authored-by: Tiago Lam 

Signed-off-by: Mark Kavanagh 
Signed-off-by: Michael Qiu 
Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 lib/netdev-dpdk.c | 89 +--
 1 file changed, 80 insertions(+), 9 deletions(-)

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index a0df9fc..7a9add7 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -550,6 +550,25 @@ dpdk_rte_mzalloc(size_t sz)
 return rte_zmalloc(OVS_VPORT_DPDK, sz, OVS_CACHE_LINE_SIZE);
 }
 
+static struct rte_mbuf *
+dpdk_buf_alloc(struct rte_mempool *mp)
+{
+struct rte_mbuf *mbuf = NULL;
+
+/* If non-pmd we need to lock on nonpmd_mp_mutex mutex. */
+if (dpdk_thread_is_pmd()) {
+mbuf = rte_pktmbuf_alloc(mp);
+} else {
+ovs_mutex_lock(&nonpmd_mp_mutex);
+
+mbuf = rte_pktmbuf_alloc(mp);
+
+ovs_mutex_unlock(&nonpmd_mp_mutex);
+}
+
+return mbuf;
+}
+
 void
 free_dpdk_buf(struct dp_packet *p)
 {
@@ -2333,6 +2352,56 @@ out:
 }
 }
 
+static int
+dpdk_copy_dp_packet_to_mbuf(struct dp_packet *packet, struct rte_mbuf **head,
+struct rte_mempool *mp)
+{
+struct rte_mbuf *mbuf, *fmbuf;
+uint16_t max_data_len;
+uint32_t nb_segs = 0;
+uint32_t size = 0;
+
+/* We will need the whole data for copying below. */
+if (!dp_packet_is_linear(packet)) {
+dp_packet_linearize(packet);
+}
+
+/* Allocate first mbuf to know the size of data available. */
+fmbuf = mbuf = *head = dpdk_buf_alloc(mp);
+if (OVS_UNLIKELY(!mbuf)) {
+return ENOMEM;
+}
+
+size = dp_packet_size(packet);
+
+/* All new allocated mbuf's max data len is the same. */
+max_data_len = mbuf->buf_len - mbuf->data_off;
+
+/* Calculate # of output mbufs. */
+nb_segs = size / max_data_len;
+if (size % max_data_len) {
+nb_segs = nb_segs + 1;
+}
+
+/* Allocate additional mbufs, less the one already allocated above. */
+for (int i = 1; i < nb_segs; i++) {
+mbuf->next = dpdk_buf_alloc(mp);
+if (!mbuf->next) {
+free_dpdk_buf(CONTAINER_OF(fmbuf, struct dp_packet, mbuf));
+fmbuf = NULL;
+return ENOMEM;
+}
+mbuf = mbuf->next;
+}
+
+fmbuf->nb_segs = nb_segs;
+fmbuf->pkt_len = size;
+
+dp_packet_mbuf_write(fmbuf, 0, size, dp_packet_data(packet));
+
+return 0;
+}
+
 /* Tx function. Transmit packets indefinitely */
 static void
 dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
@@ -2349,6 +2418,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct 
dp_packet_batch *batch)
 struct rte_mbuf *pkts[PKT_ARRAY_SIZE];
 uint32_t cnt = batch_cnt;
 uint32_t dropped = 0;
+uint32_t i;
 
 if (dev->type != DPDK_DEV_VHOST) {
 /* Check if QoS has been configured for this netdev. */
@@ -2359,28 +2429,29 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct 
dp_packet_batch *batch)
 
 uint32_t txcnt = 0;
 
-for (uint32_t i = 0; i < cnt; i++) {
+for (i = 0; i < cnt; i++) {
 struct dp_packet *packet = batch->packets[i];
 uint32_t size = dp_packet_size(packet);
+int err = 0;
 
 if (OVS_UNLIKELY(size > dev->max_packet_len)) {
 VLOG_WARN_RL(, "Too big size %u max_packet_len %d",
  size, dev->max_packet_len);
-
 dropped++;
 continue;
 }
 
-pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
-if (OVS_UNLIKELY(!pkts[txcnt])) {
+err = dpdk_copy_dp_packet_to_mbuf(packet, &pkts[txcnt],
+  dev->dpdk_mp->mp);
+if (err != 0) {
+if (err == ENOMEM) {
+VLOG_ERR_RL(, "Failed to alloc mbufs! %u packets dropped",
+cnt - i);
+}
+
 dropped += cnt - i;
 break;
 }
-
-/* We have to do a copy for now */
-memcpy(rte_pktmbuf_mtod(pkts[txcnt], void *),
-   dp_packet_data(packet), size);
-dp_packet_set_size((struct dp_packet *)pkts[txcnt], size);
 dp_packet_copy_mbuf_flags((struct dp_packet *)pkts[txcnt], packet);
 
 txcnt++;
-- 
2.7.4



[ovs-dev] [PATCH v12 06/11] dp-packet: Add support for data "linearization".

2019-01-04 Thread Tiago Lam
Previous commits have added support to the dp_packet API to handle
multi-segmented packets, where data is not stored contiguously in
memory. However, in some cases, it is inevitable and data must be
provided contiguously. Examples of such cases are when performing csums
over the entire packet data, or when write()'ing to a file descriptor
(for a tap interface, for example). For such cases, the dp_packet API
has been extended to provide a way to transform a multi-segmented
DPBUF_DPDK packet into a DPBUF_MALLOC system packet (at the expense of a
copy of memory). If the packet's data is already stored in memory
contiguously then there's no need to convert the packet.

Thus, the main use cases that were assuming that a dp_packet's data is
always held contiguously in memory were changed to make use of the new
"linear functions" in the dp_packet API when there's a need to traverse
the entire packet's data. Per the example above, when the packet's data
needs to be written via write() to the tap's file descriptor, or when the
conntrack module needs to verify a packet's checksum, the data is now
linearized.

Additionally, the miniflow_extract() function has been modified to check
if the respective packet headers don't span across multiple mbufs. This
requirement is needed to guarantee that callers can assume headers are
always in contiguous memory.
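
The resulting caller pattern is short (a sketch; it mirrors the conntrack
and netdev changes below):

    /* Data may span several mbufs; make it contiguous before parsing. */
    if (!dp_packet_is_linear(pkt)) {
        dp_packet_linearize(pkt);  /* may turn a DPBUF_DPDK packet into a
                                    * DPBUF_MALLOC one, at the cost of a
                                    * copy */
    }
    /* From here on the packet's data can be accessed contiguously. */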

Signed-off-by: Tiago Lam 
---
 lib/bfd.c |   3 +-
 lib/conntrack-private.h   |   4 +-
 lib/conntrack.c   |  43 +--
 lib/crc32c.c  |  35 --
 lib/crc32c.h  |   2 +
 lib/dp-packet.c   |  18 +++
 lib/dp-packet.h   | 279 ++
 lib/dpif-netdev.c |  13 +-
 lib/dpif-netlink.c|   5 +
 lib/dpif.c|   9 ++
 lib/flow.c|  97 ---
 lib/flow.h|   2 +-
 lib/mcast-snooping.c  |   2 +
 lib/netdev-bsd.c  |   5 +
 lib/netdev-dummy.c|  13 +-
 lib/netdev-linux.c|  13 +-
 lib/netdev-native-tnl.c   |  26 ++--
 lib/odp-execute.c |  12 +-
 lib/ovs-lldp.c|   3 +-
 lib/packets.c |  85 +++--
 lib/packets.h |   7 ++
 ofproto/ofproto-dpif-upcall.c |  20 ++-
 ofproto/ofproto-dpif-xlate.c  |  32 -
 tests/test-rstp.c |   8 +-
 tests/test-stp.c  |   8 +-
 25 files changed, 583 insertions(+), 161 deletions(-)

diff --git a/lib/bfd.c b/lib/bfd.c
index cc8c685..12e076a 100644
--- a/lib/bfd.c
+++ b/lib/bfd.c
@@ -721,7 +721,8 @@ bfd_process_packet(struct bfd *bfd, const struct flow *flow,
 if (!msg) {
 VLOG_INFO_RL(, "%s: Received too-short BFD control message (only "
  "%"PRIdPTR" bytes long, at least %d required).",
- bfd->name, (uint8_t *) dp_packet_tail(p) - l7,
+ bfd->name, dp_packet_size(p) -
+ (l7 - (uint8_t *) dp_packet_data(p)),
  BFD_PACKET_LEN);
 goto out;
 }
diff --git a/lib/conntrack-private.h b/lib/conntrack-private.h
index a344801..1be2df1 100644
--- a/lib/conntrack-private.h
+++ b/lib/conntrack-private.h
@@ -159,8 +159,8 @@ tcp_payload_length(struct dp_packet *pkt)
 {
 const char *tcp_payload = dp_packet_get_tcp_payload(pkt);
 if (tcp_payload) {
-return ((char *) dp_packet_tail(pkt) - dp_packet_l2_pad_size(pkt)
-- tcp_payload);
+return dp_packet_l4_size(pkt) -
+   (tcp_payload - (char *) dp_packet_l4(pkt));
 } else {
 return 0;
 }
diff --git a/lib/conntrack.c b/lib/conntrack.c
index 6f6021a..0dd2dcc 100644
--- a/lib/conntrack.c
+++ b/lib/conntrack.c
@@ -636,12 +636,22 @@ reverse_pat_packet(struct dp_packet *pkt, const struct 
conn *conn)
 static void
 reverse_nat_packet(struct dp_packet *pkt, const struct conn *conn)
 {
-char *tail = dp_packet_tail(pkt);
-char pad = dp_packet_l2_pad_size(pkt);
+char *tail;
+char pad;
 struct conn_key inner_key;
 const char *inner_l4 = NULL;
-uint16_t orig_l3_ofs = pkt->l3_ofs;
-uint16_t orig_l4_ofs = pkt->l4_ofs;
+uint16_t orig_l3_ofs;
+uint16_t orig_l4_ofs;
+
+/* We need the whole packet to parse the packet below */
+if (!dp_packet_is_linear(pkt)) {
+dp_packet_linearize(pkt);
+}
+
+tail = dp_packet_tail(pkt);
+pad = dp_packet_l2_pad_size(pkt);
+orig_l3_ofs = pkt->l3_ofs;
+orig_l4_ofs = pkt->l4_ofs;
 
 if (conn->key.dl_type == htons(ETH_TYPE_IP)) {
 struct ip_header *nh = dp_packet_l3(pkt);
@@ -1323,6 +1333,7 @@ conntrack_execute(struct conntrack *ct, struct 
dp_packet_batch *pkt_batch,
 write_ct_md(packet, zone, NULL, NULL, NULL);
 continue;
 }
+
+process_one(ct, packet, &ctx, zone, force, commit, now, setmark,
 

[ovs-dev] [PATCH v12 05/11] dp-packet: copy data from multi-seg. DPDK mbuf

2019-01-04 Thread Tiago Lam
From: Michael Qiu 

When doing packet clone, if packet source is from DPDK driver,
multi-segment must be considered, and copy the segment's data one by
one.

Also, a lot of the DPDK mbuf's info is missed during a copy, like packet
type, ol_flags, etc. That information is very important for DPDK to do
packet processing.
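
The per-segment copy relies on DPDK's rte_pktmbuf_read(); a minimal sketch
of its contract as used in the hunk below:

    /* Returns 'dst' (or a pointer straight into the mbuf when the requested
     * span is contiguous); returns NULL only if the span falls outside the
     * packet, hence the dp_packet_delete() on failure below. */
    if (!rte_pktmbuf_read(mbuf, 0, pkt_len, dst)) {
        /* out of bounds: abort the clone */
    }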

Co-authored-by: Mark Kavanagh 
Co-authored-by: Tiago Lam 

Signed-off-by: Michael Qiu 
Signed-off-by: Mark Kavanagh 
Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 lib/dp-packet.c   | 57 ---
 lib/dp-packet.h   | 33 
 lib/netdev-dpdk.c |  1 +
 3 files changed, 72 insertions(+), 19 deletions(-)

diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index 9ea3f78..d9be2dc 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -159,39 +159,58 @@ dp_packet_clone(const struct dp_packet *buffer)
 return dp_packet_clone_with_headroom(buffer, 0);
 }
 
+#ifdef DPDK_NETDEV
+struct dp_packet *
+dp_packet_clone_with_headroom(const struct dp_packet *b, size_t headroom) {
+struct dp_packet *new_buffer;
+uint32_t pkt_len = dp_packet_size(b);
+
+/* Copy multi-seg data. */
+if (b->source == DPBUF_DPDK && !rte_pktmbuf_is_contiguous(&b->mbuf)) {
+void *dst = NULL;
+struct rte_mbuf *mbuf = CONST_CAST(struct rte_mbuf *, &b->mbuf);
+
+new_buffer = dp_packet_new_with_headroom(pkt_len, headroom);
+dst = dp_packet_data(new_buffer);
+dp_packet_set_size(new_buffer, pkt_len);
+
+if (!rte_pktmbuf_read(mbuf, 0, pkt_len, dst)) {
+dp_packet_delete(new_buffer);
+return NULL;
+}
+} else {
+new_buffer = dp_packet_clone_data_with_headroom(dp_packet_data(b),
+dp_packet_size(b),
+headroom);
+}
+
+dp_packet_copy_common_members(new_buffer, b);
+
+return new_buffer;
+}
+#else
 /* Creates and returns a new dp_packet whose data are copied from 'buffer'.
  * The returned dp_packet will additionally have 'headroom' bytes of
  * headroom. */
 struct dp_packet *
-dp_packet_clone_with_headroom(const struct dp_packet *buffer, size_t headroom)
+dp_packet_clone_with_headroom(const struct dp_packet *b, size_t headroom)
 {
 struct dp_packet *new_buffer;
+uint32_t pkt_len = dp_packet_size(b);
 
-new_buffer = dp_packet_clone_data_with_headroom(dp_packet_data(buffer),
- dp_packet_size(buffer),
- headroom);
-/* Copy the following fields into the returned buffer: l2_pad_size,
- * l2_5_ofs, l3_ofs, l4_ofs, cutlen, packet_type and md. */
-memcpy(&new_buffer->l2_pad_size, &buffer->l2_pad_size,
-sizeof(struct dp_packet) -
-offsetof(struct dp_packet, l2_pad_size));
+new_buffer = dp_packet_clone_data_with_headroom(dp_packet_data(b),
+pkt_len, headroom);
 
-#ifdef DPDK_NETDEV
-new_buffer->mbuf.ol_flags = buffer->mbuf.ol_flags;
-#else
-new_buffer->rss_hash_valid = buffer->rss_hash_valid;
-#endif
+dp_packet_copy_common_members(new_buffer, b);
 
+new_buffer->rss_hash_valid = b->rss_hash_valid;
 if (dp_packet_rss_valid(new_buffer)) {
-#ifdef DPDK_NETDEV
-new_buffer->mbuf.hash.rss = buffer->mbuf.hash.rss;
-#else
-new_buffer->rss_hash = buffer->rss_hash;
-#endif
+new_buffer->rss_hash = b->rss_hash;
 }
 
 return new_buffer;
 }
+#endif
 
 /* Creates and returns a new dp_packet that initially contains a copy of the
  * 'size' bytes of data starting at 'data' with no headroom or tailroom. */
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index 08e79af..3ebf494 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -91,6 +91,9 @@ static inline void dp_packet_set_size(struct dp_packet *, 
uint32_t);
 static inline uint16_t dp_packet_get_allocated(const struct dp_packet *);
 static inline void dp_packet_set_allocated(struct dp_packet *, uint16_t);
 
+static inline void
+dp_packet_copy_mbuf_flags(struct dp_packet *dst, const struct dp_packet *src);
+
 void *dp_packet_resize_l2(struct dp_packet *, int increment);
 void *dp_packet_resize_l2_5(struct dp_packet *, int increment);
 static inline void *dp_packet_eth(const struct dp_packet *);
@@ -119,6 +122,9 @@ void dp_packet_init_dpdk(struct dp_packet *);
 void dp_packet_init(struct dp_packet *, size_t);
 void dp_packet_uninit(struct dp_packet *);
 
+void dp_packet_copy_mbuf_flags(struct dp_packet *dst,
+   const struct dp_packet *src);
+
 struct dp_packet *dp_packet_new(size_t);
 struct dp_packet *dp_packet_new_with_headroom(size_t, size_t headroom);
 struct dp_packet *dp_packet_clone(const struct dp_packet *);
@@ -129,6 +135,10 @@ struct dp_packe

[ovs-dev] [PATCH v12 04/11] dp-packet: Handle multi-seg mubfs in shift() func.

2019-01-04 Thread Tiago Lam
In its current implementation dp_packet_shift() is also unaware of
multi-seg mbufs (that holds data in memory non-contiguously) and assumes
that data exists contiguously in memory, memmove'ing data to perform the
shift.

To add support for multi-seg mbufs a new set of functions was
introduced, dp_packet_mbuf_shift() and dp_packet_mbuf_write(). These
functions are used by dp_packet_shift(), when handling multi-seg mbufs,
to shift and write data within a chain of mbufs.

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 lib/dp-packet.c | 105 
 lib/dp-packet.h |   3 ++
 2 files changed, 108 insertions(+)

diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index f54119d..9ea3f78 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -23,6 +23,11 @@
 #include "openvswitch/dynamic-string.h"
 #include "util.h"
 
+#ifdef DPDK_NETDEV
+#define MBUF_BUF_END(BUF_ADDR, BUF_LEN) \
+(char *) (((char *) BUF_ADDR) + BUF_LEN)
+#endif
+
 static void
 dp_packet_init__(struct dp_packet *b, size_t allocated, enum dp_packet_source 
source)
 {
@@ -290,6 +295,100 @@ dp_packet_prealloc_headroom(struct dp_packet *b, size_t 
size)
 }
 }
 
+#ifdef DPDK_NETDEV
+/* Write len data bytes in a mbuf at specified offset.
+ *
+ * 'mbuf', pointer to the destination mbuf where 'ofs' is, and the mbuf where
+ * the data will first be written.
+ * 'ofs', the offset within the provided 'mbuf' where 'data' is to be written.
+ * 'len', the size of the to be written 'data'.
+ * 'data', pointer to the to be written bytes.
+ *
+ * Note: This function is the counterpart of the `rte_pktmbuf_read()` function
+ * available with DPDK, in the rte_mbuf.h */
+void
+dp_packet_mbuf_write(struct rte_mbuf *mbuf, int16_t ofs, uint32_t len,
+ const void *data)
+{
+char *dst_addr;
+uint16_t data_len;
+int len_copy;
+while (mbuf) {
+if (len == 0) {
+break;
+}
+
+dst_addr = rte_pktmbuf_mtod_offset(mbuf, char *, ofs);
+data_len = MBUF_BUF_END(mbuf->buf_addr, mbuf->buf_len) - dst_addr;
+
+len_copy = MIN(len, data_len);
+/* We don't know if 'data' is the result of a rte_pktmbuf_read() call,
+ * in which case we may end up writing to the same region of memory we
+ * are reading from and overlapping. Hence the use of memmove() here */
+memmove(dst_addr, data, len_copy);
+
+data = ((char *) data) + len_copy;
+len -= len_copy;
+ofs = 0;
+
+mbuf->data_len = len_copy;
+mbuf = mbuf->next;
+}
+}
+
+static void
+dp_packet_mbuf_shift_(struct rte_mbuf *dbuf, int16_t dst_ofs,
+  const struct rte_mbuf *sbuf, uint16_t src_ofs, int len)
+{
+char *rd = xmalloc(sizeof(*rd) * len);
+const char *wd = rte_pktmbuf_read(sbuf, src_ofs, len, rd);
+
+ovs_assert(wd);
+
+dp_packet_mbuf_write(dbuf, dst_ofs, len, wd);
+
+free(rd);
+}
+
+/* Similarly to dp_packet_shift(), shifts the data within the mbufs of a
+ * dp_packet of DPBUF_DPDK source by 'delta' bytes.
+ * Caller must make sure of the following conditions:
+ * - When shifting left, delta can't be bigger than the data_len available in
+ *   the last mbuf;
+ * - When shifting right, delta can't be bigger than the space available in the
+ *   first mbuf (buf_len - data_off).
+ * Both these conditions guarantee that a shift operation doesn't fall outside
+ * the bounds of the existing mbufs, so that the first and last mbufs (when
+ * using multi-segment mbufs) remain the same. */
+static void
+dp_packet_mbuf_shift(struct dp_packet *b, int delta)
+{
+uint16_t src_ofs;
+int16_t dst_ofs;
+
+struct rte_mbuf *mbuf = CONST_CAST(struct rte_mbuf *, &b->mbuf);
+struct rte_mbuf *tmbuf = rte_pktmbuf_lastseg(mbuf);
+
+if (delta < 0) {
+ovs_assert(-delta <= tmbuf->data_len);
+} else {
+ovs_assert(delta < (mbuf->buf_len - mbuf->data_off));
+}
+
+/* Set the destination and source offsets to copy to */
+dst_ofs = delta;
+src_ofs = 0;
+
+/* Shift data from src mbuf and offset to dst mbuf and offset */
+dp_packet_mbuf_shift_(mbuf, dst_ofs, mbuf, src_ofs,
+  rte_pktmbuf_pkt_len(mbuf));
+
+/* Update mbufs' properties, and if using multi-segment mbufs, first and
+ * last mbuf's data_len also needs to be adjusted */
+mbuf->data_off = mbuf->data_off + dst_ofs;
+}
+#endif
+
 /* Shifts all of the data within the allocated space in 'b' by 'delta' bytes.
  * For example, a 'delta' of 1 would cause each byte of data to move one byte
  * forward (from address 'p' to 'p+1'), and a 'delta' of -1 would cause each
@@ -302,6 +401,12 @@ dp_packet_shift(struct dp_packet *b, int delta)
: true);
 
 if (delta != 0) {
+#ifdef DPDK_NETDEV
+if (b->source == DPBUF_DPDK) {
+dp_packet_mbuf_shift(b, delta);
+return;
+

[ovs-dev] [PATCH v12 03/11] dp-packet: Handle multi-seg mbufs in helper funcs.

2019-01-04 Thread Tiago Lam
Most helper functions in dp-packet assume that the data held by a
dp_packet is contiguous, and perform operations such as pointer
arithmetic under that assumption. However, with the introduction of
multi-segment mbufs, where data is non-contiguous, such assumptions are
no longer possible. Some examples of such helper functions are
dp_packet_tail(), dp_packet_tailroom(), dp_packet_end(),
dp_packet_get_allocated() and dp_packet_at().

Thus, instead of assuming contiguous data in dp_packet, they now
iterate over the (non-contiguous) data in mbufs to perform their
calculations.

Finally, dp_packet_use__() has also been modified to perform the
initialisation of the packet (and setting the source) before continuing
to set its size and data length, which now depends on the type of
packet.

Co-authored-by: Mark Kavanagh 

Signed-off-by: Mark Kavanagh 
Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
Acked-by: Flavio Leitner 
---
 lib/dp-packet.c |   4 +-
 lib/dp-packet.h | 192 +++-
 2 files changed, 178 insertions(+), 18 deletions(-)

diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index 93b0e9c..f54119d 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -41,11 +41,11 @@ static void
 dp_packet_use__(struct dp_packet *b, void *base, size_t allocated,
  enum dp_packet_source source)
 {
+dp_packet_init__(b, allocated, source);
+
 dp_packet_set_base(b, base);
 dp_packet_set_data(b, base);
 dp_packet_set_size(b, 0);
-
-dp_packet_init__(b, allocated, source);
 }
 
 /* Initializes 'b' as an empty dp_packet that contains the 'allocated' bytes of
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index 72a8043..8947477 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -133,6 +133,10 @@ static inline void *dp_packet_at(const struct dp_packet *, 
size_t offset,
  size_t size);
 static inline void *dp_packet_at_assert(const struct dp_packet *,
 size_t offset, size_t size);
+#ifdef DPDK_NETDEV
+static inline const struct rte_mbuf *
+dp_packet_mbuf_from_offset(const struct dp_packet *b, size_t *offset);
+#endif
 static inline void *dp_packet_tail(const struct dp_packet *);
 static inline void *dp_packet_end(const struct dp_packet *);
 
@@ -185,9 +189,25 @@ dp_packet_delete(struct dp_packet *b)
 static inline void *
 dp_packet_at(const struct dp_packet *b, size_t offset, size_t size)
 {
-return offset + size <= dp_packet_size(b)
-   ? (char *) dp_packet_data(b) + offset
-   : NULL;
+if (offset + size > dp_packet_size(b)) {
+return NULL;
+}
+
+#ifdef DPDK_NETDEV
+if (b->source == DPBUF_DPDK) {
+struct rte_mbuf *buf = CONST_CAST(struct rte_mbuf *, &b->mbuf);
+
+while (buf && offset > buf->data_len) {
+offset -= buf->data_len;
+
+buf = buf->next;
+}
+
+return buf ? rte_pktmbuf_mtod_offset(buf, char *, offset) : NULL;
+}
+#endif
+
+return (char *) dp_packet_data(b) + offset;
 }
 
 /* Returns a pointer to byte 'offset' in 'b', which must contain at least
@@ -196,13 +216,23 @@ static inline void *
 dp_packet_at_assert(const struct dp_packet *b, size_t offset, size_t size)
 {
 ovs_assert(offset + size <= dp_packet_size(b));
-return ((char *) dp_packet_data(b)) + offset;
+return dp_packet_at(b, offset, size);
 }
 
 /* Returns a pointer to byte following the last byte of data in use in 'b'. */
 static inline void *
 dp_packet_tail(const struct dp_packet *b)
 {
+#ifdef DPDK_NETDEV
+if (b->source == DPBUF_DPDK) {
+struct rte_mbuf *buf = CONST_CAST(struct rte_mbuf *, &b->mbuf);
+/* Find last segment where data ends, meaning the tail of the chained
+ * mbufs must be there */
+buf = rte_pktmbuf_lastseg(buf);
+
+return rte_pktmbuf_mtod_offset(buf, void *, buf->data_len);
+}
+#endif
 return (char *) dp_packet_data(b) + dp_packet_size(b);
 }
 
@@ -211,6 +241,15 @@ dp_packet_tail(const struct dp_packet *b)
 static inline void *
 dp_packet_end(const struct dp_packet *b)
 {
+#ifdef DPDK_NETDEV
+if (b->source == DPBUF_DPDK) {
+struct rte_mbuf *buf = CONST_CAST(struct rte_mbuf *, &(b->mbuf));
+
+buf = rte_pktmbuf_lastseg(buf);
+
+return (char *) buf->buf_addr + buf->buf_len;
+}
+#endif
 return (char *) dp_packet_base(b) + dp_packet_get_allocated(b);
 }
 
@@ -236,6 +275,15 @@ dp_packet_tailroom(const struct dp_packet *b)
 static inline void
 dp_packet_clear(struct dp_packet *b)
 {
+#ifdef DPDK_NETDEV
+if (b->source == DPBUF_DPDK) {
+/* sets pkt_len and data_len to zero and frees unused mbufs */
+dp_packet_set_size(b, 0);
+rte_pktmbuf_reset(&b->mbuf);
+
+return;
+}
+#endif
 dp_packet_set_data(b, dp_packet_base(b));
 dp_packet_set_size(b, 0);
 }
@@ -248,28 +296,47 @@ dp_packet_pull(st

[ovs-dev] [PATCH v12 02/11] dp-packet: Fix data_len handling multi-seg mbufs.

2019-01-04 Thread Tiago Lam
When a dp_packet is from a DPDK source, and it contains multi-segment
mbufs, the data_len is not equal to the packet size, pkt_len. Instead,
the data_len of each mbuf in the chain should be considered while
distributing the new (provided) size.

To account for the above, dp_packet_set_size() has been changed so that,
in the multi-segment mbufs case, only the data_len on the last mbuf of
the chain and the total size of the packet, pkt_len, are changed. The
data_len on the intermediate mbufs preceding the last mbuf is not
changed by dp_packet_set_size(). Furthermore, in some cases
dp_packet_set_size() may be used to set a smaller size than the current
packet size, thus effectively trimming the end of the packet. In the
multi-segment mbufs case this may lead to lingering mbufs that may need
freeing.

__dp_packet_set_data() now also updates an mbuf's data_len after setting
the data offset. This is so that both fields are always in sync for each
mbuf in a chain.

Co-authored-by: Michael Qiu 
Co-authored-by: Mark Kavanagh 
Co-authored-by: Przemyslaw Lal 
Co-authored-by: Marcin Ksiadz 
Co-authored-by: Yuanhan Liu 

Signed-off-by: Michael Qiu 
Signed-off-by: Mark Kavanagh 
Signed-off-by: Przemyslaw Lal 
Signed-off-by: Marcin Ksiadz 
Signed-off-by: Yuanhan Liu 
Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 lib/dp-packet.h | 101 +---
 1 file changed, 89 insertions(+), 12 deletions(-)

diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index 7b85dd9..72a8043 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -426,20 +426,78 @@ dp_packet_size(const struct dp_packet *b)
 return b->mbuf.pkt_len;
 }
 
+/* Sets the size of the packet 'b' to 'v'. For non-DPDK packets this only means
+ * setting b->size_, but if used in a DPDK packet it means adjusting the first
+ * mbuf pkt_len and last mbuf data_len, to reflect the real size, which can
+ * lead to free'ing tail mbufs that are no longer used.
+ *
+ * This function should be used for setting the size only, under the
+ * assumption that it is the tail end of 'b' that will be trimmed. For
+ * adjusting the head end of 'b', dp_packet_pull() should be used instead. */
 static inline void
 dp_packet_set_size(struct dp_packet *b, uint32_t v)
 {
-/* netdev-dpdk does not currently support segmentation; consequently, for
- * all intents and purposes, 'data_len' (16 bit) and 'pkt_len' (32 bit) may
- * be used interchangably.
- *
- * On the datapath, it is expected that the size of packets
- * (and thus 'v') will always be <= UINT16_MAX; this means that there is no
- * loss of accuracy in assigning 'v' to 'data_len'.
- */
-b->mbuf.data_len = (uint16_t)v;  /* Current seg length. */
-b->mbuf.pkt_len = v; /* Total length of all segments linked to
-  * this segment. */
+if (b->source == DPBUF_DPDK) {
+struct rte_mbuf *mbuf = &b->mbuf;
+uint16_t new_len = v;
+uint16_t data_len;
+uint16_t nb_segs = 0;
+uint16_t pkt_len = 0;
+
+/* Trim the chained buffers down to a total of 'v' bytes, freeing
+ * any buffers that may be left floating.
+ *
+ * For that traverse over the entire mbuf chain and, for each mbuf,
+ * subtract its 'data_len' from 'new_len' (initially set to 'v'), which
+ * essentially spreads 'new_len' between all existing mbufs in the
+ * chain. While traversing the mbuf chain, we end the traversal if:
+ * - 'new_len' reaches 0, meaning the passed 'v' has been
+ *   appropriately spread over the mbuf chain. The remaining mbufs are
+ *   freed;
+ * - We reach the last mbuf in the chain, in which case we set the last
+ *   mbuf's 'data_len' to the minimum value between the current
+ *   'new_len' (what's leftover from 'v') size and the maximum data the
+ *   mbuf can hold (mbuf->buf_len - mbuf->data_off).
+ *
+ * The above formula will thus make sure that when a 'v' is smaller
+ * than the overall 'pkt_len' (sum of all 'data_len'), it sets the new
+ * size and frees the leftover mbufs. On the other hand, if 'v' is
+ * bigger, it sets the size to the maximum available space, but no more
+ * than that. */
+while (mbuf) {
+data_len = MIN(new_len, mbuf->data_len);
+mbuf->data_len = data_len;
+
+if (new_len - data_len <= 0) {
+/* Free the rest of chained mbufs */
+free_dpdk_buf(CONTAINER_OF(mbuf->next, struct dp_packet,
+   mbuf));
+mbuf->next = NULL;
+} else if (!mbuf->next) {
+/* Don't assign more than what we have available */
+mbuf->data_len = MIN(new_len,
+   

[ovs-dev] [PATCH v12 01/11] netdev-dpdk: Serialise non-pmds mbufs' alloc/free.

2019-01-04 Thread Tiago Lam
A new mutex, 'nonpmd_mp_mutex', has been introduced to serialise
allocation and free operations by non-pmd threads on a given mempool.

free_dpdk_buf() has been modified to make use of the introduced mutex.

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
Acked-by: Flavio Leitner 
---
 lib/netdev-dpdk.c | 31 +--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 320422b..ea84c41 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -318,6 +318,16 @@ static struct ovs_mutex dpdk_mp_mutex 
OVS_ACQ_AFTER(dpdk_mutex)
 static struct ovs_list dpdk_mp_list OVS_GUARDED_BY(dpdk_mp_mutex)
= OVS_LIST_INITIALIZER(&dpdk_mp_list);
 
+/* This mutex must be used by non pmd threads when allocating or freeing
+ * mbufs through mempools, when outside of the `non_pmd_mutex` mutex, in struct
+ * dp_netdev.
+ * The reason, as pointed out in the "Known Issues" section of DPDK's EAL docs,
+ * is that the implementation that mempool is built on is non-preemptable.
+ * Since non-pmds may end up not being pinned, one non-pmd could preempt
+ * another while both perform operations on the same mempool, which could
+ * lead to memory corruption. */
+static struct ovs_mutex nonpmd_mp_mutex = OVS_MUTEX_INITIALIZER;
+
 struct dpdk_mp {
  struct rte_mempool *mp;
  int mtu;
@@ -490,6 +500,8 @@ struct netdev_rxq_dpdk {
 dpdk_port_t port_id;
 };
 
+static bool dpdk_thread_is_pmd(void);
+
 static void netdev_dpdk_destruct(struct netdev *netdev);
 static void netdev_dpdk_vhost_destruct(struct netdev *netdev);
 
@@ -523,6 +535,12 @@ dpdk_buf_size(int mtu)
 + RTE_PKTMBUF_HEADROOM;
 }
 
+static bool
+dpdk_thread_is_pmd(void)
+{
+ return rte_lcore_id() != NON_PMD_CORE_ID;
+}
+
 /* Allocates an area of 'sz' bytes from DPDK.  The memory is zero'ed.
  *
  * Unlike xmalloc(), this function can return NULL on failure. */
@@ -535,9 +553,18 @@ dpdk_rte_mzalloc(size_t sz)
 void
 free_dpdk_buf(struct dp_packet *p)
 {
-struct rte_mbuf *pkt = (struct rte_mbuf *) p;
+/* If non-pmd we need to lock on nonpmd_mp_mutex mutex */
+if (!dpdk_thread_is_pmd()) {
+ovs_mutex_lock(&nonpmd_mp_mutex);
+
+rte_pktmbuf_free(&p->mbuf);
+
+ovs_mutex_unlock(&nonpmd_mp_mutex);
+
+return;
+}
 
-rte_pktmbuf_free(pkt);
+rte_pktmbuf_free(>mbuf);
 }
 
 static void
-- 
2.7.4



[ovs-dev] [PATCH v12 00/11] Support multi-segment mbufs

2019-01-04 Thread Tiago Lam
" in jumbo-frames.rst doc,
  after further testing;
- Structure changes:
  - Drop patch 07/13 which is now unneeded;
  - Two more patches added for extra test coverage. This is what accounts
for the increase in size (+1 patch) in the series.

v8 (non-RFC): 
- Rebase on master 88125d6 ("rhel: remove ovs-sim man page from
  temporary directory (also for RHEL)");
- Address Ciara's and Ilya's comments:
- Drop the dp_packet_mbuf_tail() function and use only the
  already existing dp_packet_tail();
- Fix bug in dpdk_do_tx_copy() where the error return from
  dpdk_prep_tx_buf() was being wrongly checked;
- Use dpdk_buf_alloc() and free_dpdk_buf() instead of
  rte_pktmbuf_alloc() and rte_pktmbuf_free();
- Fix some other code style and duplication issues pointed out.
- Refactored dp_packet_shift(), dp_packet_resize__() and
  dp_packet_put*() functions to work within the bounds of existing
  mbufs only;
- Fix dp_packet_clear() which wasn't correctly clearing / freeing
  other mbufs in the chain for chains with more than a single mbuf;
- dp_packet layer functions (such as dp_packet_l3()) now check if
  the header is within the first mbuf, when using mbufs;
- Move patch 08/13 to before patch 04/13, since dp_packet_set_size()
  was refactored to use free_dpdk_buf();
- Fix wrong rte_memcpy() when performing dp_packet_clone() which was
  leading to memory corruption; 
- Modified the added tests to account for some of the above changes;
- Run performance tests, compiling results and adding them to the
  cover letter;
- Add a multi-seg mbufs explanation to the jumbo-frames.rst doc,
  together with a "Performance notes" sub-section reflecting the
  findings mentioned above in the cover letter.

v7:  - Rebase on master 5e720da ("erspan: fix invalid erspan version.");
 - Address Ilya comments;
- Fix non-DPDK build;
- Serialise the access of non pmds to allocation and free of mbufs by
  using a newly introduced mutex.
 - Add a new set of tests that integrates with the recently added DPDK
   testsuite. These focus on allocating dp_packets, with a single or
   multiple mbufs, from an instantiated mempool and performing several
   operations on those, verifying if the data at the end matches what's
   expected;
 - Fix bugs found by the new tests:
- dp_packet_shift() wasn't taking into account shift lefts;
- dp_packet_resize__() was misusing and miscalculating the tailrooms
  and headrooms, ending up calculating the wrong number of segments
  that needed allocation;
- An mbuf's end was being miscalculated both in dp_packet_tail,
  dp_packet_mbuf_tail() and dp_packet_end();
- dp_packet_set_size() was not updating the number of chained segments
  'nb_segs';
 - Add support for multi-seg mbufs in dp_packet_clear().

v6:  - Rebase on master 7c0cb29 ("conntrack-tcp: Handle tcp session
   reuse.");
 - Further improve dp_packet_put_uninit() and dp_packet_shift() to
   support multi-seg mbufs;
 - Add support for multi-seg mbufs in dp_packet_l4_size() and
   improve other helper funcs, such as dp_packet_tail() and dp_
   packet_tailroom().
 - Add support for multi-seg mbufs in dp_packet_put(), dp_packet_
   put_zeros(), as well as dp_packet_resize__() - allocating new
   mbufs and linking them together;
 Restructured patchset:
 - Squash patch 5 into patch 6, since they were both related to
   copying data while handling multi-seg mbufs;
 - Split patch 4 into two separate patches - one that introduces the
   changes in helper functions to deal with multi-seg mbufs and
   two others for the shift() and put_uninit() functionality;
 - Move patch 4 to before patch 3, so that helper functions come
   before functionality improvement that rely on those helpers.

v5: - Rebased on master e5e22dc ("datapath-windows: Prevent ct-counters
  from getting redundantly incremented");
- Sugesh's comments have been addressed:
  - Changed dp_packet_set_data() and dp_packet_set_size() logic to
make them independent of each other;
  - Dropped patch 3 now that dp_packet_set_data() and dp_packet_set_
size() are independent;
  - dp_packet_clone_with_headroom() now has split functions for
handling DPDK sourced packets and non-DPDK packets;
- Modified various functions in dp-packet.h to account for multi-seg
  mbufs - dp_packet_put_uninit(), dp_packet_tail()
  and dp_packet_at();
- Added support for shifting packet data in multi-seg mbufs, using
  dp_packet_shift();
- Fixed some minor inconsistencies.

Note that some of the changes in v5 have been contributed by Mark
    Kavanagh as 

[ovs-dev] [PATCH v3] netdev-dpdk: Add mbuf HEADROOM after alignment.

2018-11-27 Thread Tiago Lam
Commit dfaf00e started using the result of dpdk_buf_size() to calculate
the available size on each mbuf, as opposed to using the previous
MBUF_SIZE macro. However, this was calculating the mbuf size by adding
up the MTU with RTE_PKTMBUF_HEADROOM and only then aligning to
NETDEV_DPDK_MBUF_ALIGN. Instead, the accounting for the
RTE_PKTMBUF_HEADROOM should only happen after alignment, as per below.

Before alignment:
ROUNDUP(MTU(1500) + RTE_PKTMBUF_HEADROOM(128), 1024) = 2048

After alignment:
ROUNDUP(MTU(1500), 1024) + 128 = 2176

This might seem insignificant, however, it might have performance
implications in DPDK, where each mbuf is expected to have 2k +
RTE_PKTMBUF_HEADROOM of available space. This is because not only do some
NICs have coarse-grained alignments of 1k, they will also take
RTE_PKTMBUF_HEADROOM bytes from the overall available space in an mbuf
when setting up their Rx requirements. Thus, only the "After alignment"
case above would guarantee 2k of available room, as the "Before
alignment" would report only 1920B.

Some extra information can be found at:
https://mails.dpdk.org/archives/dev/2018-November/119219.html

Note: This has been found by Ian Stokes while going through some
af_packet checks.

Reported-by: Ian Stokes 
Fixes: dfaf00e ("netdev-dpdk: fix mbuf sizing")
Signed-off-by: Tiago Lam 
---
v3:
   - Take trailer_size into account when calculating mbuf size - Ian.

v2:
   - Rebase to master 85706c3 ("ovn: Avoid tunneling for VLAN packets
 redirected to a gateway chassis").

   - Fix mbuf size calculations under Documentation/topics/dpdk/memory.rst
 to take into account the header_size added to each mempool element (64
 bytes) - Ian.
---

 Documentation/topics/dpdk/memory.rst | 28 ++--
 lib/netdev-dpdk.c|  6 --
 2 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/Documentation/topics/dpdk/memory.rst 
b/Documentation/topics/dpdk/memory.rst
index c9b739f..9ebfd11 100644
--- a/Documentation/topics/dpdk/memory.rst
+++ b/Documentation/topics/dpdk/memory.rst
@@ -107,8 +107,8 @@ Example 1
 
  MTU = 1500 Bytes
  Number of mbufs = 262144
- Mbuf size = 2752 Bytes
- Memory required = 262144 * 2752 = 721 MB
+ Mbuf size = 3008 Bytes
+ Memory required = 262144 * 3008 = 788 MB
 
 Example 2
+++++++++
@@ -116,8 +116,8 @@ Example 2
 
  MTU = 1800 Bytes
  Number of mbufs = 262144
- Mbuf size = 2752 Bytes
- Memory required = 262144 * 2752 = 721 MB
+ Mbuf size = 3008 Bytes
+ Memory required = 262144 * 3008 = 788 MB
 
 .. note::
 
@@ -130,8 +130,8 @@ Example 3
 
  MTU = 6000 Bytes
  Number of mbufs = 262144
- Mbuf size = 8000 Bytes
- Memory required = 262144 * 8000 = 2097 MB
+ Mbuf size = 7104 Bytes
+ Memory required = 262144 * 7104 = 1862 MB
 
 Example 4
+++++++++
@@ -139,8 +139,8 @@ Example 4
 
  MTU = 9000 Bytes
  Number of mbufs = 262144
- Mbuf size = 10048 Bytes
- Memory required = 262144 * 10048 = 2634 MB
+ Mbuf size = 10176 Bytes
+ Memory required = 262144 * 10176 = 2667 MB
 
 Per Port Memory Calculations
 
@@ -194,8 +194,8 @@ Example 1: (1 rxq, 1 PMD, 1500 MTU)
 
  MTU = 1500
  Number of mbufs = (1 * 2048) + (2 * 2048) + (1 * 32) + (16384) = 22560
- Mbuf size = 2752 Bytes
- Memory required = 22560 * 2752 = 62 MB
+ Mbuf size = 3008 Bytes
+ Memory required = 22560 * 3008 = 67 MB
 
 Example 2: (1 rxq, 2 PMD, 6000 MTU)
+++++++++++++++++++++++++++++++++++
@@ -203,8 +203,8 @@ Example 2: (1 rxq, 2 PMD, 6000 MTU)
 
  MTU = 6000
  Number of mbufs = (1 * 2048) + (3 * 2048) + (1 * 32) + (16384) = 24608
- Mbuf size = 8000 Bytes
- Memory required = 24608 * 8000 = 196 MB
+ Mbuf size = 7104 Bytes
+ Memory required = 24608 * 7104 = 175 MB
 
 Example 3: (2 rxq, 2 PMD, 9000 MTU)
+++++++++++++++++++++++++++++++++++
@@ -212,5 +212,5 @@ Example 3: (2 rxq, 2 PMD, 9000 MTU)
 
  MTU = 9000
  Number of mbufs = (2 * 2048) + (3 * 2048) + (1 * 32) + (16384) = 26656
- Mbuf size = 10048 Bytes
- Memory required = 26656 * 10048 = 267 MB
+ Mbuf size = 10176 Bytes
+ Memory required = 26656 * 10176 = 271 MB
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index e8618a6..a871743 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -521,8 +521,8 @@ is_dpdk_class(const struct netdev_class *class)
 static uint32_t
 dpdk_buf_size(int mtu)
 {
-return ROUND_UP((MTU_TO_MAX_FRAME_LEN(mtu) + RTE_PKTMBUF_HEADROOM),
- NETDEV_DPDK_MBUF_ALIGN);
+return ROUND_UP(MTU_TO_MAX_FRAME_LEN(mtu), NETDEV_DPDK_MBUF_ALIGN)
++ RTE_PKTMBUF_HEADROOM;
 }
 
 /* Allocates an area of 'sz' bytes from DPDK.  The memory is zero'ed.
@@ -681,6 +681,8 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool 
per_port_mp)
   dev->requested_n_rxq, dev->requested_n_txq,
   RTE_CACHE_LINE_SIZE);
 
+/* The size of the mbuf's private area (i.e. area that holds OvS'
+ * dp_packet data). */
 mbuf_priv_data_len = sizeof(struct dp_pack

[ovs-dev] [PATCH v3] netdev-dpdk: Add mbuf HEADROOM after alignment.

2018-11-27 Thread Tiago Lam
Commit dfaf00e started using the result of dpdk_buf_size() to calculate
the available size on each mbuf, as opposed to using the previous
MBUF_SIZE macro. However, this was calculating the mbuf size by adding
up the MTU with RTE_PKTMBUF_HEADROOM and only then aligning to
NETDEV_DPDK_MBUF_ALIGN. Instead, the accounting for the
RTE_PKTMBUF_HEADROOM should only happen after alignment, as per below.

Before alignment:
ROUNDUP(MTU(1500) + RTE_PKTMBUF_HEADROOM(128), 1024) = 2048

After alignment:
ROUNDUP(MTU(1500), 1024) + 128 = 2176

This might seem insignificant, however, it might have performance
implications in DPDK, where each mbuf is expected to have 2k +
RTE_PKTMBUF_HEADROOM of available space. This is because not only do some
NICs have coarse-grained alignments of 1k, they will also take
RTE_PKTMBUF_HEADROOM bytes from the overall available space in an mbuf
when setting up their Rx requirements. Thus, only the "After alignment"
case above would guarantee 2k of available room, as the "Before
alignment" would report only 1920B.

Some extra information can be found at:
https://mails.dpdk.org/archives/dev/2018-November/119219.html

Note: This has been found by Ian Stokes while going through some
af_packet checks.

Reported-by: Ian Stokes 
Fixes: dfaf00e ("netdev-dpdk: fix mbuf sizing")
Signed-off-by: Tiago Lam 
---
v3:
   - Take trailer_size into account when calculating mbuf size - Ian.

v2:
   - Rebase to master 85706c3 ("ovn: Avoid tunneling for VLAN packets
 redirected to a gateway chassis").

   - Fix mbuf size calculations under Documentation/topics/dpdk/memory.rst
  to take into account the header_size added to each mempool element (64
 bytes) - Ian.
---
 Documentation/topics/dpdk/memory.rst | 28 ++--
 lib/netdev-dpdk.c|  6 --
 2 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/Documentation/topics/dpdk/memory.rst 
b/Documentation/topics/dpdk/memory.rst
index c9b739f..c20dfed 100644
--- a/Documentation/topics/dpdk/memory.rst
+++ b/Documentation/topics/dpdk/memory.rst
@@ -107,8 +107,8 @@ Example 1
 
  MTU = 1500 Bytes
  Number of mbufs = 262144
- Mbuf size = 2752 Bytes
- Memory required = 262144 * 2752 = 721 MB
+ Mbuf size = 2944 Bytes
+ Memory required = 262144 * 2944 = 772 MB
 
 Example 2
 +
@@ -116,8 +116,8 @@ Example 2
 
  MTU = 1800 Bytes
  Number of mbufs = 262144
- Mbuf size = 2752 Bytes
- Memory required = 262144 * 2752 = 721 MB
+ Mbuf size = 2944 Bytes
+ Memory required = 262144 * 2944 = 772 MB
 
 .. note::
 
@@ -130,8 +130,8 @@ Example 3
 
  MTU = 6000 Bytes
  Number of mbufs = 262144
- Mbuf size = 8000 Bytes
- Memory required = 262144 * 8000 = 2097 MB
+ Mbuf size = 7040 Bytes
+ Memory required = 262144 * 7040 = 1845 MB
 
 Example 4
 +
@@ -139,8 +139,8 @@ Example 4
 
  MTU = 9000 Bytes
  Number of mbufs = 262144
- Mbuf size = 10048 Bytes
- Memory required = 262144 * 10048 = 2634 MB
+ Mbuf size = 10112 Bytes
+ Memory required = 262144 * 10112 = 2651 MB
 
 Per Port Memory Calculations
 
@@ -194,8 +194,8 @@ Example 1: (1 rxq, 1 PMD, 1500 MTU)
 
  MTU = 1500
  Number of mbufs = (1 * 2048) + (2 * 2048) + (1 * 32) + (16384) = 22560
- Mbuf size = 2752 Bytes
- Memory required = 22560 * 2752 = 62 MB
+ Mbuf size = 2944 Bytes
+ Memory required = 22560 * 2944 = 65 MB
 
 Example 2: (1 rxq, 2 PMD, 6000 MTU)
 +++
@@ -203,8 +203,8 @@ Example 2: (1 rxq, 2 PMD, 6000 MTU)
 
  MTU = 6000
  Number of mbufs = (1 * 2048) + (3 * 2048) + (1 * 32) + (16384) = 24608
- Mbuf size = 8000 Bytes
- Memory required = 24608 * 8000 = 196 MB
+ Mbuf size = 7040 Bytes
+ Memory required = 24608 * 7040 = 173 MB
 
 Example 3: (2 rxq, 2 PMD, 9000 MTU)
 +++
@@ -212,5 +212,5 @@ Example 3: (2 rxq, 2 PMD, 9000 MTU)
 
  MTU = 9000
  Number of mbufs = (2 * 2048) + (3 * 2048) + (1 * 32) + (16384) = 26656
- Mbuf size = 10048 Bytes
- Memory required = 26656 * 10048 = 267 MB
+ Mbuf size = 10112 Bytes
+ Memory required = 26656 * 10112 = 270 MB
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index e8618a6..a871743 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -521,8 +521,8 @@ is_dpdk_class(const struct netdev_class *class)
 static uint32_t
 dpdk_buf_size(int mtu)
 {
-return ROUND_UP((MTU_TO_MAX_FRAME_LEN(mtu) + RTE_PKTMBUF_HEADROOM),
- NETDEV_DPDK_MBUF_ALIGN);
+return ROUND_UP(MTU_TO_MAX_FRAME_LEN(mtu), NETDEV_DPDK_MBUF_ALIGN)
++ RTE_PKTMBUF_HEADROOM;
 }
 
 /* Allocates an area of 'sz' bytes from DPDK.  The memory is zero'ed.
@@ -681,6 +681,8 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool 
per_port_mp)
   dev->requested_n_rxq, dev->requested_n_txq,
   RTE_CACHE_LINE_SIZE);
 
+/* The size of the mbuf's private area (i.e. area that holds OvS'
+ * dp_packet data). */
 mbuf_priv_data_len = sizeof(struct dp_pack

[ovs-dev] [PATCH v2] netdev-dpdk: Add mbuf HEADROOM after alignment.

2018-11-27 Thread Tiago Lam
Commit dfaf00e started using the result of dpdk_buf_size() to calculate
the available size on each mbuf, as opposed to using the previous
MBUF_SIZE macro. However, this was calculating the mbuf size by adding
up the MTU with RTE_PKTMBUF_HEADROOM and only then aligning to
NETDEV_DPDK_MBUF_ALIGN. Instead, the accounting for the
RTE_PKTMBUF_HEADROOM should only happen after alignment, as per below.

Before alignment:
ROUNDUP(MTU(1500) + RTE_PKTMBUF_HEADROOM(128), 1024) = 2048

After alignment:
ROUNDUP(MTU(1500), 1024) + 128 = 2176

This might seem insignificant, but it can have performance implications
in DPDK, where each mbuf is expected to have 2k + RTE_PKTMBUF_HEADROOM
of available space. This is because some NICs not only have
coarse-grained alignments of 1k, they also take RTE_PKTMBUF_HEADROOM
bytes from the overall available space in an mbuf when setting up their
Rx requirements. Thus, only the "After alignment" case above guarantees
2k of available room, while the "Before alignment" case would leave
only 1920B.

Some extra information can be found at:
https://mails.dpdk.org/archives/dev/2018-November/119219.html

Note: This has been found by Ian Stokes while going through some
af_packet checks.

Reported-by: Ian Stokes 
Fixes: dfaf00e ("netdev-dpdk: fix mbuf sizing")
Signed-off-by: Tiago Lam 
---
v2:
   - Rebase to master 85706c3 ("ovn: Avoid tunneling for VLAN packets
 redirected to a gateway chassis").

   - Fix mbuf size calculations under Documentation/topics/dpdk/memory.rst
  to take into account the header_size added to each mempool element (64
 bytes) - Ian.
---


 Documentation/topics/dpdk/memory.rst | 28 ++--
 lib/netdev-dpdk.c|  6 --
 2 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/Documentation/topics/dpdk/memory.rst 
b/Documentation/topics/dpdk/memory.rst
index c9b739f..c20dfed 100644
--- a/Documentation/topics/dpdk/memory.rst
+++ b/Documentation/topics/dpdk/memory.rst
@@ -107,8 +107,8 @@ Example 1
 
  MTU = 1500 Bytes
  Number of mbufs = 262144
- Mbuf size = 2752 Bytes
- Memory required = 262144 * 2752 = 721 MB
+ Mbuf size = 2944 Bytes
+ Memory required = 262144 * 2944 = 772 MB
 
 Example 2
 +
@@ -116,8 +116,8 @@ Example 2
 
  MTU = 1800 Bytes
  Number of mbufs = 262144
- Mbuf size = 2752 Bytes
- Memory required = 262144 * 2752 = 721 MB
+ Mbuf size = 2944 Bytes
+ Memory required = 262144 * 2944 = 772 MB
 
 .. note::
 
@@ -130,8 +130,8 @@ Example 3
 
  MTU = 6000 Bytes
  Number of mbufs = 262144
- Mbuf size = 8000 Bytes
- Memory required = 262144 * 8000 = 2097 MB
+ Mbuf size = 7040 Bytes
+ Memory required = 262144 * 7040 = 1845 MB
 
 Example 4
 +
@@ -139,8 +139,8 @@ Example 4
 
  MTU = 9000 Bytes
  Number of mbufs = 262144
- Mbuf size = 10048 Bytes
- Memory required = 262144 * 10048 = 2634 MB
+ Mbuf size = 10112 Bytes
+ Memory required = 262144 * 10112 = 2651 MB
 
 Per Port Memory Calculations
 
@@ -194,8 +194,8 @@ Example 1: (1 rxq, 1 PMD, 1500 MTU)
 
  MTU = 1500
  Number of mbufs = (1 * 2048) + (2 * 2048) + (1 * 32) + (16384) = 22560
- Mbuf size = 2752 Bytes
- Memory required = 22560 * 2752 = 62 MB
+ Mbuf size = 2944 Bytes
+ Memory required = 22560 * 2944 = 65 MB
 
 Example 2: (1 rxq, 2 PMD, 6000 MTU)
 +++
@@ -203,8 +203,8 @@ Example 2: (1 rxq, 2 PMD, 6000 MTU)
 
  MTU = 6000
  Number of mbufs = (1 * 2048) + (3 * 2048) + (1 * 32) + (16384) = 24608
- Mbuf size = 8000 Bytes
- Memory required = 24608 * 8000 = 196 MB
+ Mbuf size = 7040 Bytes
+ Memory required = 24608 * 7040 = 173 MB
 
 Example 3: (2 rxq, 2 PMD, 9000 MTU)
 +++
@@ -212,5 +212,5 @@ Example 3: (2 rxq, 2 PMD, 9000 MTU)
 
  MTU = 9000
  Number of mbufs = (2 * 2048) + (3 * 2048) + (1 * 32) + (16384) = 26656
- Mbuf size = 10048 Bytes
- Memory required = 26656 * 10048 = 267 MB
+ Mbuf size = 10112 Bytes
+ Memory required = 26656 * 10112 = 270 MB
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index e8618a6..a871743 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -521,8 +521,8 @@ is_dpdk_class(const struct netdev_class *class)
 static uint32_t
 dpdk_buf_size(int mtu)
 {
-return ROUND_UP((MTU_TO_MAX_FRAME_LEN(mtu) + RTE_PKTMBUF_HEADROOM),
- NETDEV_DPDK_MBUF_ALIGN);
+return ROUND_UP(MTU_TO_MAX_FRAME_LEN(mtu), NETDEV_DPDK_MBUF_ALIGN)
++ RTE_PKTMBUF_HEADROOM;
 }
 
 /* Allocates an area of 'sz' bytes from DPDK.  The memory is zero'ed.
@@ -681,6 +681,8 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool 
per_port_mp)
   dev->requested_n_rxq, dev->requested_n_txq,
   RTE_CACHE_LINE_SIZE);
 
+/* The size of the mbuf's private area (i.e. area that holds OvS'
+ * dp_packet data). */
 mbuf_priv_data_len = sizeof(struct dp_packet) -
  sizeof(struct rte_mbuf)

[ovs-dev] [PATCH] netdev-dpdk: Add mbuf HEADROOM after alignment.

2018-11-16 Thread Tiago Lam
Commit dfaf00e started using the result of dpdk_buf_size() to calculate
the available size on each mbuf, as opposed to using the previous
MBUF_SIZE macro. However, this was calculating the mbuf size by adding
up the MTU with RTE_PKTMBUF_HEADROOM and only then aligning to
NETDEV_DPDK_MBUF_ALIGN. Instead, the accounting for the
RTE_PKTMBUF_HEADROOM should only happen after alignment, as per below.

Before alignment:
ROUNDUP(MTU(1500) + RTE_PKTMBUF_HEADROOM(128), 1024) = 2048

After alignment:
ROUNDUP(MTU(1500), 1024) + 128 = 2176

This might seem insignificant, but it can have performance implications
in DPDK, where each mbuf is expected to have 2k + RTE_PKTMBUF_HEADROOM
of available space. This is because some NICs not only have
coarse-grained alignments of 1k, they also take RTE_PKTMBUF_HEADROOM
bytes from the overall available space in an mbuf when setting up their
Rx requirements. Thus, only the "After alignment" case above guarantees
2k of available room, while the "Before alignment" case would leave
only 1920B.

Some extra information can be found at:
https://mails.dpdk.org/archives/dev/2018-November/119219.html

Note: This has been found by Ian Stokes while going through some
af_packet checks.

Reported-by: Ian Stokes 
Fixes: dfaf00e ("netdev-dpdk: fix mbuf sizing")
Signed-off-by: Tiago Lam 
---
 Documentation/topics/dpdk/memory.rst | 20 ++--
 lib/netdev-dpdk.c|  6 --
 2 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/Documentation/topics/dpdk/memory.rst 
b/Documentation/topics/dpdk/memory.rst
index c9b739f..3c4ee17 100644
--- a/Documentation/topics/dpdk/memory.rst
+++ b/Documentation/topics/dpdk/memory.rst
@@ -107,8 +107,8 @@ Example 1
 
  MTU = 1500 Bytes
  Number of mbufs = 262144
- Mbuf size = 2752 Bytes
- Memory required = 262144 * 2752 = 721 MB
+ Mbuf size = 2880 Bytes
+ Memory required = 262144 * 2880 = 755 MB
 
 Example 2
 +
@@ -116,8 +116,8 @@ Example 2
 
  MTU = 1800 Bytes
  Number of mbufs = 262144
- Mbuf size = 2752 Bytes
- Memory required = 262144 * 2752 = 721 MB
+ Mbuf size = 2880 Bytes
+ Memory required = 262144 * 2880 = 755 MB
 
 .. note::
 
@@ -130,8 +130,8 @@ Example 3
 
  MTU = 6000 Bytes
  Number of mbufs = 262144
- Mbuf size = 8000 Bytes
- Memory required = 262144 * 8000 = 2097 MB
+ Mbuf size = 6976 Bytes
+ Memory required = 262144 * 6976 = 1829 MB
 
 Example 4
 +
@@ -194,8 +194,8 @@ Example 1: (1 rxq, 1 PMD, 1500 MTU)
 
  MTU = 1500
  Number of mbufs = (1 * 2048) + (2 * 2048) + (1 * 32) + (16384) = 22560
- Mbuf size = 2752 Bytes
- Memory required = 22560 * 2752 = 62 MB
+ Mbuf size = 2880 Bytes
+ Memory required = 22560 * 2880 = 65 MB
 
 Example 2: (1 rxq, 2 PMD, 6000 MTU)
 +++
@@ -203,8 +203,8 @@ Example 2: (1 rxq, 2 PMD, 6000 MTU)
 
  MTU = 6000
  Number of mbufs = (1 * 2048) + (3 * 2048) + (1 * 32) + (16384) = 24608
- Mbuf size = 8000 Bytes
- Memory required = 24608 * 8000 = 196 MB
+ Mbuf size = 6976 Bytes
+ Memory required = 24608 * 6976 = 171 MB
 
 Example 3: (2 rxq, 2 PMD, 9000 MTU)
 +++
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index e8618a6..a871743 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -521,8 +521,8 @@ is_dpdk_class(const struct netdev_class *class)
 static uint32_t
 dpdk_buf_size(int mtu)
 {
-return ROUND_UP((MTU_TO_MAX_FRAME_LEN(mtu) + RTE_PKTMBUF_HEADROOM),
- NETDEV_DPDK_MBUF_ALIGN);
+return ROUND_UP(MTU_TO_MAX_FRAME_LEN(mtu), NETDEV_DPDK_MBUF_ALIGN)
++ RTE_PKTMBUF_HEADROOM;
 }
 
 /* Allocates an area of 'sz' bytes from DPDK.  The memory is zero'ed.
@@ -681,6 +681,8 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool 
per_port_mp)
   dev->requested_n_rxq, dev->requested_n_txq,
   RTE_CACHE_LINE_SIZE);
 
+/* The size of the mbuf's private area (i.e. area that holds OvS'
+ * dp_packet data). */
 mbuf_priv_data_len = sizeof(struct dp_packet) -
  sizeof(struct rte_mbuf);
 /* The size of the entire dp_packet. */
-- 
2.7.4



[ovs-dev] [PATCH v12 3/3] dp-packet: Fix allocated size on DPDK init.

2018-11-02 Thread Tiago Lam
When enabled with DPDK, OvS deals with two types of packets: the ones
coming from the mempool and the ones locally created by OvS - which are
copied to mempool mbufs before output. In the latter, the space is
allocated from the system, while in the former the mbufs are allocated
from a mempool, which takes care of initialising them appropriately.

In the current implementation, during mempool's initialisation of mbufs,
dp_packet_set_allocated() is called from dp_packet_init_dpdk() without
considering that the allocated space, in the case of multi-segment
mbufs, might be greater than a single mbuf.  Furthermore, given that
dp_packet_init_dpdk() is on the code path that's called upon mempool's
initialisation, a call to dp_packet_set_allocated() is redundant, since
mempool takes care of initialising it.

To fix this, dp_packet_set_allocated() is no longer called after
initialisation of a mempool, only in dp_packet_init__(), which is still
called by OvS when initialising locally created packets.

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
Acked-by: Flavio Leitner 
---
 lib/dp-packet.c   | 11 +++
 lib/dp-packet.h   |  2 +-
 lib/netdev-dpdk.c |  2 +-
 3 files changed, 5 insertions(+), 10 deletions(-)

diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index 443c225..93b0e9c 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -92,16 +92,11 @@ dp_packet_use_const(struct dp_packet *b, const void *data, 
size_t size)
 dp_packet_set_size(b, size);
 }
 
-/* Initializes 'b' as an empty dp_packet that contains the 'allocated' bytes.
- * DPDK allocated dp_packet and *data is allocated from one continous memory
- * region as part of memory pool, so in memory data start right after
- * dp_packet.  Therefore, there is a special method to free this type of
- * buffer.  Here, non-transient ovs dp-packet fields are initialized for
- * packets that are part of a DPDK memory pool. */
+/* Initializes 'b' as a DPDK dp-packet, which must have been allocated from a
+ * DPDK memory pool. */
 void
-dp_packet_init_dpdk(struct dp_packet *b, size_t allocated)
+dp_packet_init_dpdk(struct dp_packet *b)
 {
-dp_packet_set_allocated(b, allocated);
 b->source = DPBUF_DPDK;
 }
 
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index fe34d04..7b85dd9 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -114,7 +114,7 @@ void dp_packet_use(struct dp_packet *, void *, size_t);
 void dp_packet_use_stub(struct dp_packet *, void *, size_t);
 void dp_packet_use_const(struct dp_packet *, const void *, size_t);
 
-void dp_packet_init_dpdk(struct dp_packet *, size_t allocated);
+void dp_packet_init_dpdk(struct dp_packet *);
 
 void dp_packet_init(struct dp_packet *, size_t);
 void dp_packet_uninit(struct dp_packet *);
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index a72438a..ff88331 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -550,7 +550,7 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp OVS_UNUSED,
 {
 struct rte_mbuf *pkt = _p;
 
-dp_packet_init_dpdk((struct dp_packet *) pkt, pkt->buf_len);
+dp_packet_init_dpdk((struct dp_packet *) pkt);
 }
 
 static int
-- 
2.7.4



[ovs-dev] [PATCH v12 2/3] dp-packet: Init specific mbuf fields.

2018-11-02 Thread Tiago Lam
From: Mark Kavanagh 

dp_packets are created using xmalloc(); in the case of OvS-DPDK, it's
possible that the resultant mbuf portion of the dp_packet contains
random data. For some mbuf fields, specifically those related to
multi-segment mbufs and/or offload features, random values may cause
unexpected behaviour, should the dp_packet's contents be later copied
to a DPDK mbuf. It is critical therefore, that these fields should be
initialized to 0.

This patch ensures that the following mbuf fields are initialized to
appropriate values on creation of a new dp_packet:
   - ol_flags=0
   - nb_segs=1
   - tx_offload=0
   - packet_type=0
   - next=NULL

Adapted from an idea by Michael Qiu :
https://patchwork.ozlabs.org/patch/777570/

Co-authored-by: Tiago Lam 

Signed-off-by: Mark Kavanagh 
Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
Acked-by: Flavio Leitner 
---
 lib/dp-packet.h | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index 5b4c9c7..fe34d04 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -498,14 +498,14 @@ dp_packet_mbuf_rss_flag_reset(struct dp_packet *p)
 p->mbuf.ol_flags &= ~PKT_RX_RSS_HASH;
 }
 
-/* This initialization is needed for packets that do not come
- * from DPDK interfaces, when vswitchd is built with --with-dpdk.
- * The DPDK rte library will still otherwise manage the mbuf.
- * We only need to initialize the mbuf ol_flags. */
+/* This initialization is needed for packets that do not come from DPDK
+ * interfaces, when vswitchd is built with --with-dpdk. */
 static inline void
 dp_packet_mbuf_init(struct dp_packet *p)
 {
-p->mbuf.ol_flags = 0;
+p->mbuf.ol_flags = p->mbuf.tx_offload = p->mbuf.packet_type = 0;
+p->mbuf.nb_segs = 1;
+p->mbuf.next = NULL;
 }
 
 static inline bool
-- 
2.7.4



[ovs-dev] [PATCH v12 1/3] netdev-dpdk: fix mbuf sizing

2018-11-02 Thread Tiago Lam
From: Mark Kavanagh 

There are numerous factors that must be considered when calculating
the size of an mbuf:
- the data portion of the mbuf must be sized in accordance with Rx
  buffer alignment (typically 1024B). So, for example, in order to
  successfully receive and capture a 1500B packet, mbufs with a
  data portion of size 2048B must be used.
- in OvS, the elements that comprise an mbuf are:
  * the dp packet, which includes a struct rte mbuf (704B)
  * RTE_PKTMBUF_HEADROOM (128B)
  * packet data (aligned to 1k, as previously described)
  * RTE_PKTMBUF_TAILROOM (typically 0)

Some PMDs require that the total mbuf size (i.e. the total sum of all
of the above-listed components' lengths) is cache-aligned. To satisfy
this requirement, it may be necessary to round up the total mbuf size
with respect to cacheline size. In doing so, it's possible that the
dp_packet's data portion is inadvertently increased in size, such that
it no longer adheres to Rx buffer alignment. Consequently, the
following property of the mbuf no longer holds true:

mbuf.data_len == mbuf.buf_len - mbuf.data_off

This creates a problem in the case of multi-segment mbufs, where that
assumption is assumed to be true for all but the final segment in an
mbuf chain. Resolve this issue by adjusting the size of the mbuf's
private data portion, as opposed to the packet data portion when
aligning mbuf size to cachelines.
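
As a worked illustration (a sketch only, using the component sizes
listed above), the 2752B "Mbuf size" in the updated memory.rst examples
below follows for a 1500B MTU:

    ROUNDUP(frame len 1526 + RTE_PKTMBUF_HEADROOM 128, 1024) = 2048
    2048 + 704 (dp packet, incl. struct rte mbuf)            = 2752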

Co-authored-by: Tiago Lam 

Fixes: 4be4d22 ("netdev-dpdk: clean up mbuf initialization")
Fixes: 31b88c9 ("netdev-dpdk: round up mbuf_size to cache_line_size")
CC: Santosh Shukla 
Signed-off-by: Mark Kavanagh 
Signed-off-by: Tiago Lam 
Acked-by: Santosh Shukla 
Acked-by: Eelco Chaudron 
---
 Documentation/topics/dpdk/memory.rst | 28 +-
 lib/netdev-dpdk.c| 55 
 2 files changed, 51 insertions(+), 32 deletions(-)

diff --git a/Documentation/topics/dpdk/memory.rst 
b/Documentation/topics/dpdk/memory.rst
index e5fb166..c9b739f 100644
--- a/Documentation/topics/dpdk/memory.rst
+++ b/Documentation/topics/dpdk/memory.rst
@@ -107,8 +107,8 @@ Example 1
 
  MTU = 1500 Bytes
  Number of mbufs = 262144
- Mbuf size = 3008 Bytes
- Memory required = 262144 * 3008 = 788 MB
+ Mbuf size = 2752 Bytes
+ Memory required = 262144 * 2752 = 721 MB
 
 Example 2
 +
@@ -116,8 +116,8 @@ Example 2
 
  MTU = 1800 Bytes
  Number of mbufs = 262144
- Mbuf size = 3008 Bytes
- Memory required = 262144 * 3008 = 788 MB
+ Mbuf size = 2752 Bytes
+ Memory required = 262144 * 2752 = 721 MB
 
 .. note::
 
@@ -130,8 +130,8 @@ Example 3
 
  MTU = 6000 Bytes
  Number of mbufs = 262144
- Mbuf size = 8128 Bytes
- Memory required = 262144 * 8128 = 2130 MB
+ Mbuf size = 8000 Bytes
+ Memory required = 262144 * 8000 = 2097 MB
 
 Example 4
 +
@@ -139,8 +139,8 @@ Example 4
 
  MTU = 9000 Bytes
  Number of mbufs = 262144
- Mbuf size = 10176 Bytes
- Memory required = 262144 * 10176 = 2667 MB
+ Mbuf size = 10048 Bytes
+ Memory required = 262144 * 10048 = 2634 MB
 
 Per Port Memory Calculations
 
@@ -194,8 +194,8 @@ Example 1: (1 rxq, 1 PMD, 1500 MTU)
 
  MTU = 1500
  Number of mbufs = (1 * 2048) + (2 * 2048) + (1 * 32) + (16384) = 22560
- Mbuf size = 3008 Bytes
- Memory required = 22560 * 3008 = 67 MB
+ Mbuf size = 2752 Bytes
+ Memory required = 22560 * 2752 = 62 MB
 
 Example 2: (1 rxq, 2 PMD, 6000 MTU)
 +++
@@ -203,8 +203,8 @@ Example 2: (1 rxq, 2 PMD, 6000 MTU)
 
  MTU = 6000
  Number of mbufs = (1 * 2048) + (3 * 2048) + (1 * 32) + (16384) = 24608
- Mbuf size = 8128 Bytes
- Memory required = 24608 * 8128 = 200 MB
+ Mbuf size = 8000 Bytes
+ Memory required = 24608 * 8000 = 196 MB
 
 Example 3: (2 rxq, 2 PMD, 9000 MTU)
 +++
@@ -212,5 +212,5 @@ Example 3: (2 rxq, 2 PMD, 9000 MTU)
 
  MTU = 9000
  Number of mbufs = (2 * 2048) + (3 * 2048) + (1 * 32) + (16384) = 26656
- Mbuf size = 10176 Bytes
- Memory required = 26656 * 10176 = 271 MB
+ Mbuf size = 10048 Bytes
+ Memory required = 26656 * 10048 = 267 MB
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index f91aa27..a72438a 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -88,10 +88,6 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 
20);
 #define MTU_TO_MAX_FRAME_LEN(mtu)   ((mtu) + ETHER_HDR_MAX_LEN)
 #define FRAME_LEN_TO_MTU(frame_len) ((frame_len)\
  - ETHER_HDR_LEN - ETHER_CRC_LEN)
-#define MBUF_SIZE(mtu)  ROUND_UP((MTU_TO_MAX_FRAME_LEN(mtu) \
- + sizeof(struct dp_packet) \
- + RTE_PKTMBUF_HEADROOM),   \
- RTE_CACHE_LINE_SIZE)
 #define NETDEV_DPDK_MBUF_ALIGN  1024
 #define NETDEV_DPDK_MAX_PKT_LEN 9728
 
@@ -637,7 +633,11 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool 
per_port_m

[ovs-dev] [PATCH v12 0/3] Misc fixes in dp_packet and netdev-dpdk.

2018-11-02 Thread Tiago Lam
This series is a split from the multi-seg mbuf series, "v11 - Support
multi-segment mbufs"; hence it starts at v12.

v12: - Rebase on master af26093 ("connmgr: Improve interface for setting
   controllers.");
 - Address Ian's comments:
   - Updated memory model calculations in dpdk/memory.rst docs to reflect
 the mbufs sizes;
   - Avoid casting in dp_packet_mbuf_init() to comply with the style in the
 same file;
   - Reword comment dp_packet_init_dpdk() which weren't making much sense
 after the code changes.
 - Address Ilya's comments:
   - Use MTU_TO_FRAME_LEN macro in dpdk_mp_create() instead of
 recalculating the mbuf size from the original dev->mtu;
   - Remove RTE_PKTMBUF_HEADROOM when calculating from 'pkt_size' in
 dpdk_mp_create(). 

Mark Kavanagh (2):
  netdev-dpdk: fix mbuf sizing
  dp-packet: Init specific mbuf fields.

Tiago Lam (1):
  dp-packet: Fix allocated size on DPDK init.

 Documentation/topics/dpdk/memory.rst | 28 +-
 lib/dp-packet.c  | 11 ++-
 lib/dp-packet.h  | 12 
 lib/netdev-dpdk.c| 57 
 4 files changed, 61 insertions(+), 47 deletions(-)

-- 
2.7.4

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH] Python: Build C JSON extension conditionally.

2018-10-15 Thread Tiago Lam
Commit a7be68a ("Test the Python C JSON extension") enables the building
of the C JSON parser Python extension by default, which tries to link to
a shared library of libopenvswitch. On installations where
"--enable-shared" is not specified, although the build ultimately
succeeds, it prints the following messages:

/usr/bin/ld: ../ovs/ovs/lib/.libs/libopenvswitch.a(json.o): relocation 
R_X86_64_32 against `.rodata.str1.1' can not be used when making a shared 
object; recompile with -fPIC
../ovs/ovs/lib/.libs/libopenvswitch.a: error adding symbols: Bad value
collect2: error: ld returned 1 exit status

***

WARNING: The C extension could not be compiled, speedups are not enabled.

Failure information, if any, is above.

Retrying the build without the C extension.

***

This pollutes the compilation logs and can cause some CIs to consider
the build as "FAILED", since words such as "WARNING" and "error" will be
interpreted as such while parsing the logs.
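
With this change the extension simply follows the configure flags, e.g.:

    $ ./configure --enable-shared   # C JSON extension is built
    $ ./configure                   # static-only; extension is skipped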

Fixes: a7be68a ("Test the Python C JSON extension")
CC: Terry Wilson 
Signed-off-by: Tiago Lam 
---
 configure.ac   | 1 +
 m4/openvswitch.m4  | 3 +++
 python/automake.mk | 2 ++
 3 files changed, 6 insertions(+)

diff --git a/configure.ac b/configure.ac
index 2b38003..5d52461 100644
--- a/configure.ac
+++ b/configure.ac
@@ -136,6 +136,7 @@ OVS_LIBTOOL_VERSIONS
 OVS_CHECK_CXX
 AX_FUNC_POSIX_MEMALIGN
 OVS_CHECK_UNBOUND
+OVS_CHECK_SHARED
 
 OVS_CHECK_INCLUDE_NEXT([stdio.h string.h])
 AC_CONFIG_FILES([
diff --git a/m4/openvswitch.m4 b/m4/openvswitch.m4
index 41042c9..989cbbb 100644
--- a/m4/openvswitch.m4
+++ b/m4/openvswitch.m4
@@ -698,3 +698,6 @@ AC_DEFUN([OVS_CHECK_UNBOUND],
fi
AM_CONDITIONAL([HAVE_UNBOUND], [test "$HAVE_UNBOUND" = yes])
AC_SUBST([HAVE_UNBOUND])])
+
+AC_DEFUN([OVS_CHECK_SHARED],
+   AM_CONDITIONAL([SHARED_LIBRARY], [test "x$enable_shared" = xyes]))
diff --git a/python/automake.mk b/python/automake.mk
index 21547f7..149b2cd 100644
--- a/python/automake.mk
+++ b/python/automake.mk
@@ -93,6 +93,7 @@ python-sdist: $(srcdir)/python/ovs/version.py $(ovs_pyfiles) 
python/ovs/dirs.py
 pypi-upload: $(srcdir)/python/ovs/version.py $(ovs_pyfiles) python/ovs/dirs.py
(cd python/ && $(PYTHON) setup.py sdist upload)
 
+if SHARED_LIBRARY
 ALL_LOCAL += python-build
 python-build:$(lib_LTLIBRARIES)
(cd $(srcdir)/python/ && $(PYTHON) setup.py build_py -d 
$(abs_top_builddir)/python && $(PYTHON) setup.py build_ext -b 
$(abs_top_builddir)/python 
-I$(abs_top_srcdir)/include:$(abs_top_builddir)/include 
-L$(abs_top_builddir)/lib/.libs)
@@ -103,6 +104,7 @@ CLEAN_LOCAL += python-clean
 python-clean:
rm -rf $(abs_top_builddir)/python
 endif
+endif # SHARED_LIBRARY
 else
 ovs-install-data-local:
@:
-- 
2.7.4



[ovs-dev] [PATCH v11 14/14] dpdk-tests: End-to-end tests for multi-seg mbufs.

2018-10-10 Thread Tiago Lam
The following tests are added to the DPDK testsuite to add some
coverage for the multi-segment mbufs:
- Check that multi-segment mbufs are disabled by default;
- Check that providing `other_config:dpdk-multi-seg-mbufs=true` indeed
  enables multi-segment mbufs;
- Using a DPDK port, send a random packet out and check that `ofctl
  dump-flows` shows the correct amount of packets and bytes sent.

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 tests/system-dpdk.at | 65 
 1 file changed, 65 insertions(+)

diff --git a/tests/system-dpdk.at b/tests/system-dpdk.at
index 3d21b01..af8de8c 100644
--- a/tests/system-dpdk.at
+++ b/tests/system-dpdk.at
@@ -71,3 +71,68 @@ OVS_VSWITCHD_STOP("/does not exist. The Open vSwitch kernel 
module is probably n
 ")
 AT_CLEANUP
 dnl --
+
+AT_SETUP([Jumbo frames - Multi-segment disabled by default])
+OVS_DPDK_START()
+
+AT_CHECK([grep "multi-segment mbufs enabled" ovs-vswitchd.log], [1], [])
+OVS_VSWITCHD_STOP("/Global register is changed during/d
+/EAL: No free hugepages reported in hugepages-1048576kB/d
+")
+AT_CLEANUP
+
+AT_SETUP([Jumbo frames - Multi-segment enabled])
+OVS_DPDK_START([dpdk-multi-seg-mbufs=true])
+AT_CHECK([grep "multi-segment mbufs enabled" ovs-vswitchd.log], [], [stdout])
+OVS_VSWITCHD_STOP("/Global register is changed during/d
+/EAL: No free hugepages reported in hugepages-1048576kB/d
+")
+AT_CLEANUP
+
+AT_SETUP([Jumbo frames - Multi-segment mbufs Tx])
+OVS_DPDK_PRE_CHECK()
+OVS_DPDK_START([per-port-memory=true dpdk-multi-seg-mbufs=true])
+
+dnl Add userspace bridge and attach it to OVS
+AT_CHECK([ovs-vsctl add-br br10 -- set bridge br10 datapath_type=netdev])
+AT_CHECK([ovs-vsctl add-port br10 dpdk0 \
+-- set Interface dpdk0 type=dpdk options:dpdk-devargs=$(cat PCI_ADDR) \
+-- set Interface dpdk0 mtu_request=9000], [], [stdout], [stderr])
+
+AT_CHECK([ovs-vsctl show], [], [stdout])
+
+dnl Add flows to send packets out from the 'dpdk0' port
+AT_CHECK([
+ovs-ofctl del-flows br10
+ovs-ofctl add-flow br10 in_port=LOCAL,actions=output:dpdk0
+], [], [stdout])
+
+AT_CHECK([ovs-ofctl dump-flows br10], [], [stdout])
+
+dnl Send packet out, of the 'dpdk0' port
+AT_CHECK([
+ARP_HEADER="09000B0009000A000806000108000604000100010A\
+0100020A02"
+dnl Build a random hex string to append to the ARP_HEADER
+RANDOM_BODY=$(printf '0102030405%.0s' {1..1750})
+dnl 8792B ARP packet
+RANDOM_ARP="$ARP_HEADER$RANDOM_BODY"
+
+ovs-ofctl packet-out br10 "packet=$RANDOM_ARP,action=resubmit:LOCAL"
+], [], [stdout])
+
+AT_CHECK([ovs-ofctl dump-flows br10], [0], [stdout])
+
+dnl Confirm the single packet has been sent with the correct size
+AT_CHECK([ovs-ofctl dump-flows br10 | ofctl_strip | grep in_port], [0], [dnl
+ n_packets=1, n_bytes=8792, in_port=LOCAL actions=output:1
+])
+
+dnl Clean up
+OVS_VSWITCHD_STOP("/does not exist. The Open vSwitch kernel module is probably 
not loaded./d
+/Failed to enable flow control/d
+/failed to connect to \/tmp\/dpdkvhostclient0: No such file or directory/d
+/Global register is changed during/d
+/EAL: No free hugepages reported in hugepages-1048576kB/d
+")
+AT_CLEANUP
-- 
2.7.4



[ovs-dev] [PATCH v11 11/14] netdev-dpdk: support multi-segment jumbo frames

2018-10-10 Thread Tiago Lam
From: Mark Kavanagh 

Currently, jumbo frame support for OvS-DPDK is implemented by
increasing the size of mbufs within a mempool, such that each mbuf
within the pool is large enough to contain an entire jumbo frame of
a user-defined size. Typically, for each user-defined MTU,
'requested_mtu', a new mempool is created, containing mbufs of size
~requested_mtu.

With the multi-segment approach, a port uses a single mempool
(containing standard/default-sized mbufs of ~2k bytes), irrespective
of the user-requested MTU value. To accommodate jumbo frames, mbufs
are chained together, where each mbuf in the chain stores a portion of
the jumbo frame. Each mbuf in the chain is termed a segment, hence the
name.
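
As an aside (an illustrative sketch only, not code from this patch), the
total frame length of such a chain is the sum of each segment's
data_len, which DPDK also caches in the head segment's pkt_len:

    static uint32_t
    chain_data_len(const struct rte_mbuf *m)
    {
        uint32_t len = 0;

        for (; m != NULL; m = m->next) {
            len += m->data_len;   /* Each segment holds ~2K of the frame. */
        }
        return len;               /* Equals the head mbuf's pkt_len. */
    }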

== Enabling multi-segment mbufs ==
Multi-segment and single-segment mbufs are mutually exclusive, and the
user must decide on which approach to adopt on init. The introduction
of a new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this. This
is a global boolean value, which determines how jumbo frames are
represented across all DPDK ports. In the absence of a user-supplied
value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment
mbufs must be explicitly enabled / single-segment mbufs remain the
default.

Setting the field is identical to setting existing DPDK-specific OVSDB
fields:

ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true

Co-authored-by: Tiago Lam 

Signed-off-by: Mark Kavanagh 
Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 Documentation/topics/dpdk/jumbo-frames.rst | 73 ++
 Documentation/topics/dpdk/memory.rst   | 36 +++
 NEWS   |  1 +
 lib/dpdk.c |  8 
 lib/netdev-dpdk.c  | 60 
 lib/netdev-dpdk.h  |  1 +
 vswitchd/vswitch.xml   | 22 +
 7 files changed, 193 insertions(+), 8 deletions(-)

diff --git a/Documentation/topics/dpdk/jumbo-frames.rst 
b/Documentation/topics/dpdk/jumbo-frames.rst
index 00360b4..9804bbb 100644
--- a/Documentation/topics/dpdk/jumbo-frames.rst
+++ b/Documentation/topics/dpdk/jumbo-frames.rst
@@ -71,3 +71,76 @@ Jumbo frame support has been validated against 9728B frames, 
which is the
 largest frame size supported by Fortville NIC using the DPDK i40e driver, but
 larger frames and other DPDK NIC drivers may be supported. These cases are
 common for use cases involving East-West traffic only.
+
+-------------------
+Multi-segment mbufs
+-------------------
+
+Instead of increasing the size of mbufs within a mempool, such that each mbuf
+within the pool is large enough to contain an entire jumbo frame of a
+user-defined size, mbufs can be chained together. In this approach each
+mbuf in the chain stores a portion of the jumbo frame, by default ~2K bytes,
+irrespective of the user-requested MTU value. Since each mbuf in the chain is
+termed a segment, this approach is named "multi-segment mbufs".
+
+This approach may bring more flexibility in use cases where the maximum packet
+length may be hard to guess. For example, in cases where packets originate from
+sources marked for offload (such as TSO), each packet may be larger than the
+MTU, and as such, when forwarding it to a DPDK port a single mbuf may not be
+enough to hold all of the packet's data.
+
+Multi-segment and single-segment mbufs are mutually exclusive, and the user
+must decide on which approach to adopt on initialisation. If multi-segment
+mbufs is to be enabled, it can be done so with the following command::
+
+$ ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
+
+Single-segment mbufs still remain the default when using OvS-DPDK, and the
+above option `dpdk-multi-seg-mbufs` must be explicitly set to `true` if
+multi-segment mbufs are to be used.
+
+~~~~~~~~~~~~~~~~~
+Performance notes
+~~~~~~~~~~~~~~~~~
+
+When using multi-segment mbufs some PMDs may not support vectorized Tx
+functions, due to their non-contiguous nature. As a result this can hit
+performance for smaller packet sizes. For example, on a setup sending 64B
+packets at line rate, a decrease of ~20% has been observed. The performance
+impact stops being noticeable for larger packet sizes, although the exact size
+will depend on each PMD, and vary between architectures.
+
+Tests performed with the i40e PMD driver only showed this limitation for 64B
+packets, and the same rate was observed when comparing multi-segment mbufs and
+single-segment mbufs for 128B packets. In other words, the 20% drop in
+performance was not observed for packets >= 128B during this test case.
+
+Because of this, multi-segment mbufs are not advised to be used with smaller
+packet si

[ovs-dev] [PATCH v11 13/14] dpdk-tests: Accept other configs in OVS_DPDK_START

2018-10-10 Thread Tiago Lam
As it stands, OVS_DPDK_START() won't allow other configs to be set
before starting the ovs-vswitchd daemon. This is a problem since some
configs, such as the "dpdk-multi-seg-mbufs=true" for enabling the
multi-segment mbufs, need to be set prior to starting OvS.

To support other options, OVS_DPDK_START() has been modified to accept
extra configs in the form "$config_name=$config_value". It then uses
ovs-vsctl to set the configs.
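
For example, the multi-segment tests later in this series can now start
OvS with all the relevant options in one go:

    OVS_DPDK_START([per-port-memory=true dpdk-multi-seg-mbufs=true])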

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
Acked-by: Flavio Leitner 
---
 tests/system-dpdk-macros.at | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/tests/system-dpdk-macros.at b/tests/system-dpdk-macros.at
index 0762ee0..7c65834 100644
--- a/tests/system-dpdk-macros.at
+++ b/tests/system-dpdk-macros.at
@@ -21,7 +21,7 @@ m4_define([OVS_DPDK_PRE_CHECK],
 ])
 
 
-# OVS_DPDK_START()
+# OVS_DPDK_START([other-conf-args])
 #
 # Create an empty database and start ovsdb-server. Add special configuration
 # dpdk-init to enable DPDK functionality. Start ovs-vswitchd connected to that
@@ -48,6 +48,10 @@ m4_define([OVS_DPDK_START],
AT_CHECK([lscpu], [], [stdout])
AT_CHECK([cat stdout | grep "NUMA node(s)" | awk '{c=1; while (c++<$(3)) 
{printf "1024,"}; print "1024"}' > SOCKET_MEM])
AT_CHECK([ovs-vsctl --no-wait set Open_vSwitch . 
other_config:dpdk-socket-mem="$(cat SOCKET_MEM)"])
+   dnl Iterate through $other-conf-args list and include them
+   m4_foreach_w(opt, $1, [
+   AT_CHECK([ovs-vsctl --no-wait set Open_vSwitch . other_config:opt])
+   ])
 
dnl Start ovs-vswitchd.
AT_CHECK([ovs-vswitchd --detach --no-chdir --pidfile --log-file -vvconn 
-vofproto_dpif -vunixctl], [0], [stdout], [stderr])
-- 
2.7.4



[ovs-dev] [PATCH v11 12/14] dpdk-tests: Add unit-tests for multi-seg mbufs.

2018-10-10 Thread Tiago Lam
In order to create a minimal environment that allows the tests to get
mbufs from an existing mempool, the following approach is taken:
- EAL is initialised (by using the main dpdk_init()) and a (very) small
  mempool is instantiated (mimicking the logic in dpdk_mp_create()).
  This mempool instance is global and used by all the tests;
- Packets are then allocated from the instantiated mempool, and tested
  on, by running some operations on them and manipulating data.

The tests introduced focus on testing DPDK dp_packets (where
source=DPBUF_DPDK), linked with a single or multiple mbufs, across
several operations, such as:
- dp_packet_put();
- dp_packet_shift();
- dp_packet_reserve();
- dp_packet_push_uninit();
- dp_packet_clear();
- dp_packet_equal();
- dp_packet_linear_data();
- And as a consequence of some of these, dp_packet_put_uninit() and
  dp_packet_resize__().

Finally, this has also been integrated with the new DPDK testsuite.
Thus, when running `$ sudo make check-dpdk` one will also be running
these tests.

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
Acked-by: Flavio Leitner 
---
 tests/automake.mk  |  10 +-
 tests/dpdk-packet-mbufs.at |   7 +
 tests/system-dpdk-testsuite.at |   1 +
 tests/test-dpdk-mbufs.c| 722 +
 4 files changed, 739 insertions(+), 1 deletion(-)
 create mode 100644 tests/dpdk-packet-mbufs.at
 create mode 100644 tests/test-dpdk-mbufs.c

diff --git a/tests/automake.mk b/tests/automake.mk
index 97312cf..fca35d9 100644
--- a/tests/automake.mk
+++ b/tests/automake.mk
@@ -168,7 +168,8 @@ SYSTEM_DPDK_TESTSUITE_AT = \
tests/system-common-macros.at \
tests/system-dpdk-macros.at \
tests/system-dpdk-testsuite.at \
-   tests/system-dpdk.at
+   tests/system-dpdk.at \
+   tests/dpdk-packet-mbufs.at
 
 check_SCRIPTS += tests/atlocal
 
@@ -425,6 +426,10 @@ tests_ovstest_SOURCES = \
tests/test-vconn.c \
tests/test-aa.c \
tests/test-stopwatch.c
+if DPDK_NETDEV
+tests_ovstest_SOURCES += \
+   tests/test-dpdk-mbufs.c
+endif
 
 if !WIN32
 tests_ovstest_SOURCES += \
@@ -437,6 +442,9 @@ tests_ovstest_SOURCES += \
 endif
 
 tests_ovstest_LDADD = lib/libopenvswitch.la ovn/lib/libovn.la
+if DPDK_NETDEV
+tests_ovstest_LDFLAGS = $(AM_LDFLAGS) $(DPDK_vswitchd_LDFLAGS)
+endif
 
 noinst_PROGRAMS += tests/test-strtok_r
 tests_test_strtok_r_SOURCES = tests/test-strtok_r.c
diff --git a/tests/dpdk-packet-mbufs.at b/tests/dpdk-packet-mbufs.at
new file mode 100644
index 000..f28e4fc
--- /dev/null
+++ b/tests/dpdk-packet-mbufs.at
@@ -0,0 +1,7 @@
+AT_BANNER([OVS-DPDK dp_packet unit tests])
+
+AT_SETUP([OVS-DPDK dp_packet - mbufs allocation])
+AT_KEYWORDS([dp_packet, multi-seg, mbufs])
+AT_CHECK(ovstest test-dpdk-packet, [], [ignore], [ignore])
+
+AT_CLEANUP
diff --git a/tests/system-dpdk-testsuite.at b/tests/system-dpdk-testsuite.at
index 382f09e..f5edf58 100644
--- a/tests/system-dpdk-testsuite.at
+++ b/tests/system-dpdk-testsuite.at
@@ -23,3 +23,4 @@ m4_include([tests/system-common-macros.at])
 m4_include([tests/system-dpdk-macros.at])
 
 m4_include([tests/system-dpdk.at])
+m4_include([tests/dpdk-packet-mbufs.at])
diff --git a/tests/test-dpdk-mbufs.c b/tests/test-dpdk-mbufs.c
new file mode 100644
index 000..0c152bf
--- /dev/null
+++ b/tests/test-dpdk-mbufs.c
@@ -0,0 +1,722 @@
+/*
+ * Copyright (c) 2018 Intel Corporation
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "dp-packet.h"
+#include "ovstest.h"
+#include "dpdk.h"
+#include "smap.h"
+#include "csum.h"
+#include "crc32c.h"
+
+#define N_MBUFS 1024
+#define MBUF_DATA_LEN 2048
+
+static int num_tests = 0;
+
+/* Global var to hold a mempool instance, "test-mp", used in all of the tests
+ * below. This instance is instantiated in dpdk_setup_eal_with_mp(). */
+static struct rte_mempool *mp;
+
+/* Test data used to fill the packets with data. Note that this isn't a string
+ * that represents a valid packet, by any means. The pattern is generated in set_
+ * testing_pattern_str() and the sole purpose is to verify the data remains the
+ * same after inserting and operating on multi-segment mbufs. */
+static char *test_str;
+
+/* Asserts a dp_packet that holds a single mbuf, where:
+ * - nb_segs must be 1;
+ 

[ovs-dev] [PATCH v11 09/14] dp-packet: Add support for data "linearization".

2018-10-10 Thread Tiago Lam
Previous commits have added support to the dp_packet API to handle
multi-segmented packets, where data is not stored contiguously in
memory. However, in some cases, it is inevitable and data must be
provided contiguously. Examples of such cases are when performing csums
over the entire packet data, or when write()'ing to a file descriptor
(for a tap interface, for example). For such cases, the dp_packet API
has been extended to provide a way to transform a multi-segmented
DPBUF_DPDK packet into a DPBUF_MALLOC system packet (at the expense of a
copy of memory). If the packet's data is already stored in memory
contigously then there's no need to convert the packet.

Thus, the main use cases that were assuming that a dp_packet's data is
always held contiguously in memory were changed to make use of the new
"linear functions" in the dp_packet API when there's a need to traverse
the entire's packet data. Per the example above, when the packet's data
needs to be write() to the tap's file descriptor, or when the conntrack
module needs to verify a packet's checksum, the data is now linearized.

Additionally, the layer functions, such as dp_packet_l3() and variants,
have been modified to check if there's enough data in the packet before
returning a pointer to the data (and callers have been modified
accordingly). This requirement is needed to guarantee that a caller
doesn't read beyond the available memory.
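
The typical call-site pattern is the check-then-linearize sequence used
later in this series (e.g. in dpdk_copy_dp_packet_to_mbuf()):

    /* We will need the whole data for copying below. */
    if (!dp_packet_is_linear(packet)) {
        dp_packet_linearize(packet);
    }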

Signed-off-by: Tiago Lam 
---
 lib/bfd.c |   3 +-
 lib/cfm.c |   5 +-
 lib/conntrack-icmp.c  |   4 +-
 lib/conntrack-private.h   |   4 +-
 lib/conntrack-tcp.c   |   6 +-
 lib/conntrack.c   | 109 ++--
 lib/crc32c.c  |  35 +++--
 lib/crc32c.h  |   2 +
 lib/dp-packet.c   |  18 +++
 lib/dp-packet.h   | 288 +-
 lib/dpif-netdev.c |   5 +
 lib/dpif-netlink.c|   5 +
 lib/dpif.c|   9 ++
 lib/flow.c|  44 ---
 lib/lacp.c|   3 +-
 lib/mcast-snooping.c  |   8 +-
 lib/netdev-bsd.c  |   5 +
 lib/netdev-dummy.c|  13 +-
 lib/netdev-linux.c|  13 +-
 lib/netdev-native-tnl.c   |  38 +++---
 lib/odp-execute.c |  28 ++--
 lib/ofp-print.c   |  10 +-
 lib/ovs-lldp.c|   3 +-
 lib/packets.c | 151 --
 lib/packets.h |   7 +
 lib/pcap-file.c   |   2 +-
 ofproto/ofproto-dpif-upcall.c |  20 ++-
 ofproto/ofproto-dpif-xlate.c  |  42 --
 ovn/controller/pinctrl.c  |  29 +++--
 tests/test-conntrack.c|   2 +-
 tests/test-rstp.c |   8 +-
 tests/test-stp.c  |   8 +-
 32 files changed, 637 insertions(+), 290 deletions(-)

diff --git a/lib/bfd.c b/lib/bfd.c
index cc8c685..12e076a 100644
--- a/lib/bfd.c
+++ b/lib/bfd.c
@@ -721,7 +721,8 @@ bfd_process_packet(struct bfd *bfd, const struct flow *flow,
 if (!msg) {
 VLOG_INFO_RL(, "%s: Received too-short BFD control message (only "
  "%"PRIdPTR" bytes long, at least %d required).",
- bfd->name, (uint8_t *) dp_packet_tail(p) - l7,
+ bfd->name, dp_packet_size(p) -
+ (l7 - (uint8_t *) dp_packet_data(p)),
  BFD_PACKET_LEN);
 goto out;
 }
diff --git a/lib/cfm.c b/lib/cfm.c
index 71d2c02..83baf2a 100644
--- a/lib/cfm.c
+++ b/lib/cfm.c
@@ -584,7 +584,7 @@ cfm_compose_ccm(struct cfm *cfm, struct dp_packet *packet,
 
 atomic_read_relaxed(>extended, );
 
-ccm = dp_packet_l3(packet);
+ccm = dp_packet_l3(packet, sizeof(*ccm));
 ccm->mdlevel_version = 0;
 ccm->opcode = CCM_OPCODE;
 ccm->tlv_offset = 70;
@@ -759,8 +759,7 @@ cfm_process_heartbeat(struct cfm *cfm, const struct 
dp_packet *p)
 atomic_read_relaxed(>extended, );
 
 eth = dp_packet_eth(p);
-ccm = dp_packet_at(p, (uint8_t *)dp_packet_l3(p) - (uint8_t 
*)dp_packet_data(p),
-CCM_ACCEPT_LEN);
+ccm = dp_packet_l3(p, CCM_ACCEPT_LEN);
 
 if (!ccm) {
 VLOG_INFO_RL(, "%s: Received an unparseable 802.1ag CCM heartbeat.",
diff --git a/lib/conntrack-icmp.c b/lib/conntrack-icmp.c
index 40fd1d8..0575d0e 100644
--- a/lib/conntrack-icmp.c
+++ b/lib/conntrack-icmp.c
@@ -63,7 +63,7 @@ icmp_conn_update(struct conn *conn_, struct conntrack_bucket 
*ctb,
 static bool
 icmp4_valid_new(struct dp_packet *pkt)
 {
-struct icmp_header *icmp = dp_packet_l4(pkt);
+struct icmp_header *icmp = dp_packet_l4(pkt, sizeof *icmp);
 
 return icmp->icmp_type == ICMP4_ECHO_REQUEST
|| icmp->icmp_type == ICMP4_INFOREQUEST
@@ -73,7 +73,7 @@ icmp4_valid_new(struct dp_packet *pkt)
 static bool
 icmp6_valid_new(struct dp_packet *pkt)
 {
-struct icmp6_head

[ovs-dev] [PATCH v11 10/14] netdev-dpdk: copy large packet to multi-seg. mbufs

2018-10-10 Thread Tiago Lam
From: Mark Kavanagh 

Currently, packets are only copied to a single segment in the function
dpdk_do_tx_copy(). This could be an issue in the case of jumbo frames,
particularly when multi-segment mbufs are involved.

This patch calculates the number of segments needed by a packet and
copies the data to each segment.
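
As a worked example (assuming ~2048B of data room per segment), the
8792B test frame used later in this series needs:

    nb_segs = 8792 / 2048 = 4, remainder 600  =>  4 + 1 = 5 segments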

A new function, dpdk_buf_alloc(), has also been introduced as a wrapper
around the nonpmd_mp_mutex to serialise allocations from a non-pmd
context.

Co-authored-by: Michael Qiu 
Co-authored-by: Tiago Lam 

Signed-off-by: Mark Kavanagh 
Signed-off-by: Michael Qiu 
Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 lib/netdev-dpdk.c | 89 +--
 1 file changed, 80 insertions(+), 9 deletions(-)

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index d5e70d3..3ef6606 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -552,6 +552,25 @@ dpdk_rte_mzalloc(size_t sz)
 return rte_zmalloc(OVS_VPORT_DPDK, sz, OVS_CACHE_LINE_SIZE);
 }
 
+static struct rte_mbuf *
+dpdk_buf_alloc(struct rte_mempool *mp)
+{
+struct rte_mbuf *mbuf = NULL;
+
+/* If non-pmd we need to lock on nonpmd_mp_mutex mutex */
+if (dpdk_thread_is_pmd()) {
+mbuf = rte_pktmbuf_alloc(mp);
+} else {
+ovs_mutex_lock(_mp_mutex);
+
+mbuf = rte_pktmbuf_alloc(mp);
+
+ovs_mutex_unlock(_mp_mutex);
+}
+
+return mbuf;
+}
+
 void
 free_dpdk_buf(struct dp_packet *packet)
 {
@@ -2316,6 +2335,56 @@ out:
 }
 }
 
+static int
+dpdk_copy_dp_packet_to_mbuf(struct dp_packet *packet, struct rte_mbuf **head,
+struct rte_mempool *mp)
+{
+struct rte_mbuf *mbuf, *fmbuf;
+uint16_t max_data_len;
+uint32_t nb_segs = 0;
+uint32_t size = 0;
+
+/* We will need the whole data for copying below */
+if (!dp_packet_is_linear(packet)) {
+dp_packet_linearize(packet);
+}
+
+/* Allocate first mbuf to know the size of data available */
+fmbuf = mbuf = *head = dpdk_buf_alloc(mp);
+if (OVS_UNLIKELY(!mbuf)) {
+return ENOMEM;
+}
+
+size = dp_packet_size(packet);
+
+/* All new allocated mbuf's max data len is the same */
+max_data_len = mbuf->buf_len - mbuf->data_off;
+
+/* Calculate # of output mbufs. */
+nb_segs = size / max_data_len;
+if (size % max_data_len) {
+nb_segs = nb_segs + 1;
+}
+
+/* Allocate additional mbufs, less the one already allocated above */
+for (int i = 1; i < nb_segs; i++) {
+mbuf->next = dpdk_buf_alloc(mp);
+if (!mbuf->next) {
+free_dpdk_buf(CONTAINER_OF(fmbuf, struct dp_packet, mbuf));
+fmbuf = NULL;
+return ENOMEM;
+}
+mbuf = mbuf->next;
+}
+
+fmbuf->nb_segs = nb_segs;
+fmbuf->pkt_len = size;
+
+dp_packet_mbuf_write(fmbuf, 0, size, dp_packet_data(packet));
+
+return 0;
+}
+
 /* Tx function. Transmit packets indefinitely */
 static void
 dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
@@ -2332,6 +2401,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct 
dp_packet_batch *batch)
 struct rte_mbuf *pkts[PKT_ARRAY_SIZE];
 uint32_t cnt = batch_cnt;
 uint32_t dropped = 0;
+uint32_t i;
 
 if (dev->type != DPDK_DEV_VHOST) {
 /* Check if QoS has been configured for this netdev. */
@@ -2342,28 +2412,29 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct 
dp_packet_batch *batch)
 
 uint32_t txcnt = 0;
 
-for (uint32_t i = 0; i < cnt; i++) {
+for (i = 0; i < cnt; i++) {
 struct dp_packet *packet = batch->packets[i];
 uint32_t size = dp_packet_size(packet);
+int err = 0;
 
 if (OVS_UNLIKELY(size > dev->max_packet_len)) {
 VLOG_WARN_RL(, "Too big size %u max_packet_len %d",
  size, dev->max_packet_len);
-
 dropped++;
 continue;
 }
 
-pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
-if (OVS_UNLIKELY(!pkts[txcnt])) {
+err = dpdk_copy_dp_packet_to_mbuf(packet, [txcnt],
+  dev->dpdk_mp->mp);
+if (err != 0) {
+if (err == ENOMEM) {
+VLOG_ERR_RL(, "Failed to alloc mbufs! %u packets dropped",
+cnt - i);
+}
+
 dropped += cnt - i;
 break;
 }
-
-/* We have to do a copy for now */
-memcpy(rte_pktmbuf_mtod(pkts[txcnt], void *),
-   dp_packet_data(packet), size);
-dp_packet_set_size((struct dp_packet *)pkts[txcnt], size);
 dp_packet_copy_mbuf_flags((struct dp_packet *)pkts[txcnt], packet);
 
 txcnt++;
-- 
2.7.4



[ovs-dev] [PATCH v11 08/14] dp-packet: copy data from multi-seg. DPDK mbuf

2018-10-10 Thread Tiago Lam
From: Michael Qiu 

When doing packet clone, if packet source is from DPDK driver,
multi-segment must be considered, and copy the segment's data one by
one.

Also, a lot of the DPDK mbuf's info is lost during a copy, like the
packet type, ol_flags, etc. That information is very important for DPDK
to process packets correctly.

Co-authored-by: Mark Kavanagh 
Co-authored-by: Tiago Lam 

Signed-off-by: Michael Qiu 
Signed-off-by: Mark Kavanagh 
Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 lib/dp-packet.c   | 57 ---
 lib/dp-packet.h   | 33 
 lib/netdev-dpdk.c |  1 +
 3 files changed, 72 insertions(+), 19 deletions(-)

diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index 4c197b6..1b9503c 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -163,39 +163,58 @@ dp_packet_clone(const struct dp_packet *buffer)
 return dp_packet_clone_with_headroom(buffer, 0);
 }
 
+#ifdef DPDK_NETDEV
+struct dp_packet *
+dp_packet_clone_with_headroom(const struct dp_packet *b, size_t headroom) {
+struct dp_packet *new_buffer;
+uint32_t pkt_len = dp_packet_size(b);
+
+/* copy multi-seg data */
+if (b->source == DPBUF_DPDK && !rte_pktmbuf_is_contiguous(>mbuf)) {
+void *dst = NULL;
+struct rte_mbuf *mbuf = CONST_CAST(struct rte_mbuf *, >mbuf);
+
+new_buffer = dp_packet_new_with_headroom(pkt_len, headroom);
+dst = dp_packet_data(new_buffer);
+dp_packet_set_size(new_buffer, pkt_len);
+
+if (!rte_pktmbuf_read(mbuf, 0, pkt_len, dst)) {
+dp_packet_delete(new_buffer);
+return NULL;
+}
+} else {
+new_buffer = dp_packet_clone_data_with_headroom(dp_packet_data(b),
+dp_packet_size(b),
+headroom);
+}
+
+dp_packet_copy_common_members(new_buffer, b);
+
+return new_buffer;
+}
+#else
 /* Creates and returns a new dp_packet whose data are copied from 'buffer'.
  * The returned dp_packet will additionally have 'headroom' bytes of
  * headroom. */
 struct dp_packet *
-dp_packet_clone_with_headroom(const struct dp_packet *buffer, size_t headroom)
+dp_packet_clone_with_headroom(const struct dp_packet *b, size_t headroom)
 {
 struct dp_packet *new_buffer;
+uint32_t pkt_len = dp_packet_size(b);
 
-new_buffer = dp_packet_clone_data_with_headroom(dp_packet_data(buffer),
- dp_packet_size(buffer),
- headroom);
-/* Copy the following fields into the returned buffer: l2_pad_size,
- * l2_5_ofs, l3_ofs, l4_ofs, cutlen, packet_type and md. */
-memcpy(_buffer->l2_pad_size, >l2_pad_size,
-sizeof(struct dp_packet) -
-offsetof(struct dp_packet, l2_pad_size));
+new_buffer = dp_packet_clone_data_with_headroom(dp_packet_data(b),
+pkt_len, headroom);
 
-#ifdef DPDK_NETDEV
-new_buffer->mbuf.ol_flags = buffer->mbuf.ol_flags;
-#else
-new_buffer->rss_hash_valid = buffer->rss_hash_valid;
-#endif
+dp_packet_copy_common_members(new_buffer, b);
 
+new_buffer->rss_hash_valid = b->rss_hash_valid;
 if (dp_packet_rss_valid(new_buffer)) {
-#ifdef DPDK_NETDEV
-new_buffer->mbuf.hash.rss = buffer->mbuf.hash.rss;
-#else
-new_buffer->rss_hash = buffer->rss_hash;
-#endif
+new_buffer->rss_hash = b->rss_hash;
 }
 
 return new_buffer;
 }
+#endif
 
 /* Creates and returns a new dp_packet that initially contains a copy of the
  * 'size' bytes of data starting at 'data' with no headroom or tailroom. */
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index 38f11c4..bae1882 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -91,6 +91,9 @@ static inline void dp_packet_set_size(struct dp_packet *, 
uint32_t);
 static inline uint16_t dp_packet_get_allocated(const struct dp_packet *);
 static inline void dp_packet_set_allocated(struct dp_packet *, uint16_t);
 
+static inline void
+dp_packet_copy_mbuf_flags(struct dp_packet *dst, const struct dp_packet *src);
+
 void *dp_packet_resize_l2(struct dp_packet *, int increment);
 void *dp_packet_resize_l2_5(struct dp_packet *, int increment);
 static inline void *dp_packet_eth(const struct dp_packet *);
@@ -119,6 +122,9 @@ void dp_packet_init_dpdk(struct dp_packet *);
 void dp_packet_init(struct dp_packet *, size_t);
 void dp_packet_uninit(struct dp_packet *);
 
+void dp_packet_copy_mbuf_flags(struct dp_packet *dst,
+   const struct dp_packet *src);
+
 struct dp_packet *dp_packet_new(size_t);
 struct dp_packet *dp_packet_new_with_headroom(size_t, size_t headroom);
 struct dp_packet *dp_packet_clone(const struct dp_packet *);
@@ -129,6 +135,10 @@ struct dp_packe

[ovs-dev] [PATCH v11 07/14] dp-packet: Handle multi-seg mubfs in shift() func.

2018-10-10 Thread Tiago Lam
In its current implementation dp_packet_shift() is also unaware of
multi-seg mbufs (that holds data in memory non-contiguously) and assumes
that data exists contiguously in memory, memmove'ing data to perform the
shift.

To add support for multi-seg mbufs a new set of functions was
introduced, dp_packet_mbuf_shift() and dp_packet_mbuf_write(). These
functions are used by dp_packet_shift(), when handling multi-seg mbufs,
to shift and write data within a chain of mbufs.

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 lib/dp-packet.c | 105 
 lib/dp-packet.h |   3 ++
 2 files changed, 108 insertions(+)

diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index 2aaeaae..4c197b6 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -23,6 +23,11 @@
 #include "openvswitch/dynamic-string.h"
 #include "util.h"
 
+#ifdef DPDK_NETDEV
+#define MBUF_BUF_END(BUF_ADDR, BUF_LEN) \
+(char *) (((char *) BUF_ADDR) + BUF_LEN)
+#endif
+
 static void
 dp_packet_init__(struct dp_packet *b, size_t allocated, enum dp_packet_source 
source)
 {
@@ -294,6 +299,100 @@ dp_packet_prealloc_headroom(struct dp_packet *b, size_t 
size)
 }
 }
 
+#ifdef DPDK_NETDEV
+/* Write len data bytes in a mbuf at specified offset.
+ *
+ * 'mbuf', pointer to the destination mbuf where 'ofs' is, and the mbuf where
+ * the data will first be written.
+ * 'ofs', the offset within the provided 'mbuf' where 'data' is to be written.
+ * 'len', the size of the to be written 'data'.
+ * 'data', pointer to the to be written bytes.
+ *
+ * XXX: This function is the counterpart of the `rte_pktmbuf_read()` function
+ * available with DPDK, in the rte_mbuf.h */
+void
+dp_packet_mbuf_write(struct rte_mbuf *mbuf, int16_t ofs, uint32_t len,
+ const void *data)
+{
+char *dst_addr;
+uint16_t data_len;
+int len_copy;
+while (mbuf) {
+if (len == 0) {
+break;
+}
+
+dst_addr = rte_pktmbuf_mtod_offset(mbuf, char *, ofs);
+data_len = MBUF_BUF_END(mbuf->buf_addr, mbuf->buf_len) - dst_addr;
+
+len_copy = MIN(len, data_len);
+/* We don't know if 'data' is the result of a rte_pktmbuf_read() call,
+ * in which case we may end up writing to the same region of memory we
+ * are reading from and overlapping. Hence the use of memmove() here */
+memmove(dst_addr, data, len_copy);
+
+data = ((char *) data) + len_copy;
+len -= len_copy;
+ofs = 0;
+
+mbuf->data_len = len_copy;
+mbuf = mbuf->next;
+}
+}
+
+static void
+dp_packet_mbuf_shift_(struct rte_mbuf *dbuf, int16_t dst_ofs,
+  const struct rte_mbuf *sbuf, uint16_t src_ofs, int len)
+{
+char *rd = xmalloc(sizeof(*rd) * len);
+const char *wd = rte_pktmbuf_read(sbuf, src_ofs, len, rd);
+
+ovs_assert(wd);
+
+dp_packet_mbuf_write(dbuf, dst_ofs, len, wd);
+
+free(rd);
+}
+
+/* Similarly to dp_packet_shift(), shifts the data within the mbufs of a
+ * dp_packet of DPBUF_DPDK source by 'delta' bytes.
+ * Caller must make sure of the following conditions:
+ * - When shifting left, delta can't be bigger than the data_len available in
+ *   the last mbuf;
+ * - When shifting right, delta can't be bigger than the space available in the
+ *   first mbuf (buf_len - data_off).
+ * Both these conditions guarantee that a shift operation doesn't fall outside
+ * the bounds of the existing mbufs, so that the first and last mbufs (when
+ * using multi-segment mbufs), remain the same. */
+static void
+dp_packet_mbuf_shift(struct dp_packet *b, int delta)
+{
+uint16_t src_ofs;
+int16_t dst_ofs;
+
+struct rte_mbuf *mbuf = CONST_CAST(struct rte_mbuf *, &b->mbuf);
+struct rte_mbuf *tmbuf = rte_pktmbuf_lastseg(mbuf);
+
+if (delta < 0) {
+ovs_assert(-delta <= tmbuf->data_len);
+} else {
+ovs_assert(delta < (mbuf->buf_len - mbuf->data_off));
+}
+
+/* Set the destination and source offsets to copy to */
+dst_ofs = delta;
+src_ofs = 0;
+
+/* Shift data from src mbuf and offset to dst mbuf and offset */
+dp_packet_mbuf_shift_(mbuf, dst_ofs, mbuf, src_ofs,
+  rte_pktmbuf_pkt_len(mbuf));
+
+/* Update mbufs' properties, and if using multi-segment mbufs, first and
+ * last mbuf's data_len also needs to be adjusted */
+mbuf->data_off = mbuf->data_off + dst_ofs;
+}
+#endif
+
 /* Shifts all of the data within the allocated space in 'b' by 'delta' bytes.
  * For example, a 'delta' of 1 would cause each byte of data to move one byte
  * forward (from address 'p' to 'p+1'), and a 'delta' of -1 would cause each
@@ -306,6 +405,12 @@ dp_packet_shift(struct dp_packet *b, int delta)
: true);
 
 if (delta != 0) {
+#ifdef DPDK_NETDEV
+if (b->source == DPBUF_DPDK) {
+dp_packet_mbuf_shift(b, delta);
+return;
+  

[ovs-dev] [PATCH v11 06/14] dp-packet: Handle multi-seg mbufs in helper funcs.

2018-10-10 Thread Tiago Lam
Most helper functions in dp-packet assume that the data held by a
dp_packet is contiguous, and perform operations such as pointer
arithmetic under that assumption. However, with the introduction of
multi-segment mbufs, where data is non-contiguous, such assumptions are
no longer possible. Some examples of such helper functions are
dp_packet_tail(), dp_packet_tailroom(), dp_packet_end(),
dp_packet_get_allocated() and dp_packet_at().

Thus, instead of assuming contiguous data in dp_packet, they now
iterate over the (non-contiguous) data in mbufs to perform their
calculations.

Finally, dp_packet_use__() has also been modified to perform the
initialisation of the packet (and setting the source) before continuing
to set its size and data length, which now depends on the type of
packet.
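
As an illustration, the offset lookup that dp_packet_at() now performs
reduces to a walk like the sketch below; 'struct seg' is a hypothetical,
simplified stand-in for struct rte_mbuf, not the real type.

#include <stddef.h>

struct seg {
    char buf[2048];      /* segment data area */
    size_t data_len;     /* bytes in use in this segment */
    struct seg *next;    /* next segment in the chain, or NULL */
};

/* Returns a pointer to byte 'offset' within the chain, or NULL if the
 * chain holds fewer than 'offset' bytes. */
static void *
chain_at(struct seg *seg, size_t offset)
{
    /* Skip whole segments until 'offset' falls inside one. */
    while (seg && offset > seg->data_len) {
        offset -= seg->data_len;
        seg = seg->next;
    }
    return seg ? seg->buf + offset : NULL;
}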

Co-authored-by: Mark Kavanagh 

Signed-off-by: Mark Kavanagh 
Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
Acked-by: Flavio Leitner 
---
 lib/dp-packet.c |   4 +-
 lib/dp-packet.h | 191 +++-
 2 files changed, 177 insertions(+), 18 deletions(-)

diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index 782e7c2..2aaeaae 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -41,11 +41,11 @@ static void
 dp_packet_use__(struct dp_packet *b, void *base, size_t allocated,
  enum dp_packet_source source)
 {
+dp_packet_init__(b, allocated, source);
+
 dp_packet_set_base(b, base);
 dp_packet_set_data(b, base);
 dp_packet_set_size(b, 0);
-
-dp_packet_init__(b, allocated, source);
 }
 
 /* Initializes 'b' as an empty dp_packet that contains the 'allocated' bytes of
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index 223efe2..78c6f7f 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -133,6 +133,10 @@ static inline void *dp_packet_at(const struct dp_packet *, size_t offset,
  size_t size);
 static inline void *dp_packet_at_assert(const struct dp_packet *,
 size_t offset, size_t size);
+#ifdef DPDK_NETDEV
+static inline const struct rte_mbuf *
+dp_packet_mbuf_from_offset(const struct dp_packet *b, size_t *offset);
+#endif
 static inline void *dp_packet_tail(const struct dp_packet *);
 static inline void *dp_packet_end(const struct dp_packet *);
 
@@ -185,9 +189,25 @@ dp_packet_delete(struct dp_packet *b)
 static inline void *
 dp_packet_at(const struct dp_packet *b, size_t offset, size_t size)
 {
-return offset + size <= dp_packet_size(b)
-   ? (char *) dp_packet_data(b) + offset
-   : NULL;
+if (offset + size > dp_packet_size(b)) {
+return NULL;
+}
+
+#ifdef DPDK_NETDEV
+if (b->source == DPBUF_DPDK) {
+struct rte_mbuf *buf = CONST_CAST(struct rte_mbuf *, &b->mbuf);
+
+while (buf && offset > buf->data_len) {
+offset -= buf->data_len;
+
+buf = buf->next;
+}
+
+return buf ? rte_pktmbuf_mtod_offset(buf, char *, offset) : NULL;
+}
+#endif
+
+return (char *) dp_packet_data(b) + offset;
 }
 
 /* Returns a pointer to byte 'offset' in 'b', which must contain at least
@@ -196,13 +216,23 @@ static inline void *
 dp_packet_at_assert(const struct dp_packet *b, size_t offset, size_t size)
 {
 ovs_assert(offset + size <= dp_packet_size(b));
-return ((char *) dp_packet_data(b)) + offset;
+return dp_packet_at(b, offset, size);
 }
 
 /* Returns a pointer to byte following the last byte of data in use in 'b'. */
 static inline void *
 dp_packet_tail(const struct dp_packet *b)
 {
+#ifdef DPDK_NETDEV
+if (b->source == DPBUF_DPDK) {
+struct rte_mbuf *buf = CONST_CAST(struct rte_mbuf *, &b->mbuf);
+/* Find last segment where data ends, meaning the tail of the chained
+ *  mbufs must be there */
+buf = rte_pktmbuf_lastseg(buf);
+
+return rte_pktmbuf_mtod_offset(buf, void *, buf->data_len);
+}
+#endif
 return (char *) dp_packet_data(b) + dp_packet_size(b);
 }
 
@@ -211,6 +241,15 @@ dp_packet_tail(const struct dp_packet *b)
 static inline void *
 dp_packet_end(const struct dp_packet *b)
 {
+#ifdef DPDK_NETDEV
+if (b->source == DPBUF_DPDK) {
+struct rte_mbuf *buf = CONST_CAST(struct rte_mbuf *, &(b->mbuf));
+
+buf = rte_pktmbuf_lastseg(buf);
+
+return (char *) buf->buf_addr + buf->buf_len;
+}
+#endif
 return (char *) dp_packet_base(b) + dp_packet_get_allocated(b);
 }
 
@@ -236,6 +275,15 @@ dp_packet_tailroom(const struct dp_packet *b)
 static inline void
 dp_packet_clear(struct dp_packet *b)
 {
+#ifdef DPDK_NETDEV
+if (b->source == DPBUF_DPDK) {
+/* sets pkt_len and data_len to zero and frees unused mbufs */
+dp_packet_set_size(b, 0);
+rte_pktmbuf_reset(&b->mbuf);
+
+return;
+}
+#endif
 dp_packet_set_data(b, dp_packet_base(b));
 dp_packet_set_size(b, 0);
 }
@@ -248,28 +296,47 @@ dp_packet_pull(st

[ovs-dev] [PATCH v11 03/14] dp-packet: Fix allocated size on DPDK init.

2018-10-10 Thread Tiago Lam
When enabled with DPDK, OvS deals with two types of packets: the ones
coming from the mempool and the ones locally created by OvS - which are
copied to mempool mbufs before output. In the latter, the space is
allocated from the system, while in the former the mbufs are allocated
from a mempool, which takes care of initialising them appropriately.

In the current implementation, during mempool's initialisation of mbufs,
dp_packet_set_allocated() is called from dp_packet_init_dpdk() without
considering that the allocated space, in the case of multi-segment
mbufs, might be greater than a single mbuf.  Furthermore, given that
dp_packet_init_dpdk() is on the code path that's called upon mempool's
initialisation, a call to dp_packet_set_allocated() is redundant, since
mempool takes care of initialising it.

To fix this, dp_packet_set_allocated() is no longer called after
initialisation of a mempool, only in dp_packet_init__(), which is still
called by OvS when initialising locally created packets.

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
Acked-by: Flavio Leitner 
---
 lib/dp-packet.c   | 3 +--
 lib/dp-packet.h   | 2 +-
 lib/netdev-dpdk.c | 2 +-
 3 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index 443c225..782e7c2 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -99,9 +99,8 @@ dp_packet_use_const(struct dp_packet *b, const void *data, 
size_t size)
  * buffer.  Here, non-transient ovs dp-packet fields are initialized for
  * packets that are part of a DPDK memory pool. */
 void
-dp_packet_init_dpdk(struct dp_packet *b, size_t allocated)
+dp_packet_init_dpdk(struct dp_packet *b)
 {
-dp_packet_set_allocated(b, allocated);
 b->source = DPBUF_DPDK;
 }
 
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index b948fe1..6376039 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -114,7 +114,7 @@ void dp_packet_use(struct dp_packet *, void *, size_t);
 void dp_packet_use_stub(struct dp_packet *, void *, size_t);
 void dp_packet_use_const(struct dp_packet *, const void *, size_t);
 
-void dp_packet_init_dpdk(struct dp_packet *, size_t allocated);
+void dp_packet_init_dpdk(struct dp_packet *);
 
 void dp_packet_init(struct dp_packet *, size_t);
 void dp_packet_uninit(struct dp_packet *);
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 11659eb..b6a22bd 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -550,7 +550,7 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp OVS_UNUSED,
 {
 struct rte_mbuf *pkt = _p;
 
-dp_packet_init_dpdk((struct dp_packet *) pkt, pkt->buf_len);
+dp_packet_init_dpdk((struct dp_packet *) pkt);
 }
 
 static int
-- 
2.7.4



[ovs-dev] [PATCH v11 04/14] netdev-dpdk: Serialise non-pmds mbufs' alloc/free.

2018-10-10 Thread Tiago Lam
A new mutex, 'nonpmd_mp_mutex', has been introduced to serialise
allocation and free operations by non-pmd threads on a given mempool.

free_dpdk_buf() has been modified to make use of the introduced mutex.

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
Acked-by: Flavio Leitner 
---
 lib/netdev-dpdk.c | 33 ++---
 1 file changed, 30 insertions(+), 3 deletions(-)

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index b6a22bd..f5044ca 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -322,6 +322,16 @@ static struct ovs_mutex dpdk_mp_mutex OVS_ACQ_AFTER(dpdk_mutex)
 static struct ovs_list dpdk_mp_list OVS_GUARDED_BY(dpdk_mp_mutex)
 = OVS_LIST_INITIALIZER(&dpdk_mp_list);
 
+/* This mutex must be used by non pmd threads when allocating or freeing
+ * mbufs through mempools, when not holding the `non_pmd_mutex` mutex in
+ * struct dp_netdev.
+ * The reason, as pointed out in the "Known Issues" section of DPDK's EAL docs,
+ * is that the implementation on which mempool is based is non-preemptable.
+ * Since non-pmds may end up not being pinned, one non-pmd could preempt
+ * another performing operations on the same mempool, which could lead
+ * to memory corruption. */
+static struct ovs_mutex nonpmd_mp_mutex = OVS_MUTEX_INITIALIZER;
+
 struct dpdk_mp {
  struct rte_mempool *mp;
  int mtu;
@@ -492,6 +502,8 @@ struct netdev_rxq_dpdk {
 dpdk_port_t port_id;
 };
 
+static bool dpdk_thread_is_pmd(void);
+
 static void netdev_dpdk_destruct(struct netdev *netdev);
 static void netdev_dpdk_vhost_destruct(struct netdev *netdev);
 
@@ -525,6 +537,12 @@ dpdk_buf_size(int mtu)
  NETDEV_DPDK_MBUF_ALIGN);
 }
 
+static bool
+dpdk_thread_is_pmd(void)
+{
+ return rte_lcore_id() != NON_PMD_CORE_ID;
+}
+
 /* Allocates an area of 'sz' bytes from DPDK.  The memory is zero'ed.
  *
  * Unlike xmalloc(), this function can return NULL on failure. */
@@ -535,11 +553,20 @@ dpdk_rte_mzalloc(size_t sz)
 }
 
 void
-free_dpdk_buf(struct dp_packet *p)
+free_dpdk_buf(struct dp_packet *packet)
 {
-struct rte_mbuf *pkt = (struct rte_mbuf *) p;
+/* If non-pmd we need to lock on nonpmd_mp_mutex mutex */
+if (!dpdk_thread_is_pmd()) {
+ovs_mutex_lock(&nonpmd_mp_mutex);
+
+rte_pktmbuf_free(&packet->mbuf);
+
+ovs_mutex_unlock(&nonpmd_mp_mutex);
+
+return;
+}
 
-rte_pktmbuf_free(pkt);
-rte_pktmbuf_free(&packet->mbuf);
 }
 
 static void
-- 
2.7.4



[ovs-dev] [PATCH v11 02/14] dp-packet: Init specific mbuf fields.

2018-10-10 Thread Tiago Lam
From: Mark Kavanagh 

dp_packets are created using xmalloc(); in the case of OvS-DPDK, it's
possible that the resultant mbuf portion of the dp_packet contains
random data. For some mbuf fields, specifically those related to
multi-segment mbufs and/or offload features, random values may cause
unexpected behaviour, should the dp_packet's contents be later copied
to a DPDK mbuf. It is critical therefore, that these fields should be
initialized to 0.

This patch ensures that the following mbuf fields are initialized to
appropriate values on creation of a new dp_packet:
   - ol_flags=0
   - nb_segs=1
   - tx_offload=0
   - packet_type=0
   - next=NULL

Adapted from an idea by Michael Qiu :
https://patchwork.ozlabs.org/patch/777570/

Co-authored-by: Tiago Lam 

Signed-off-by: Mark Kavanagh 
Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
Acked-by: Flavio Leitner 
---
 lib/dp-packet.h | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index ba91e58..b948fe1 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -625,14 +625,15 @@ dp_packet_mbuf_rss_flag_reset(struct dp_packet *p OVS_UNUSED)
 }
 
 /* This initialization is needed for packets that do not come
- * from DPDK interfaces, when vswitchd is built with --with-dpdk.
- * The DPDK rte library will still otherwise manage the mbuf.
- * We only need to initialize the mbuf ol_flags. */
+ * from DPDK interfaces, when vswitchd is built with --with-dpdk. */
 static inline void
 dp_packet_mbuf_init(struct dp_packet *p OVS_UNUSED)
 {
 #ifdef DPDK_NETDEV
-p->mbuf.ol_flags = 0;
+struct rte_mbuf *mbuf = &(p->mbuf);
+mbuf->ol_flags = mbuf->tx_offload = mbuf->packet_type = 0;
+mbuf->nb_segs = 1;
+mbuf->next = NULL;
 #endif
 }
 
-- 
2.7.4



[ovs-dev] [PATCH v11 01/14] netdev-dpdk: fix mbuf sizing

2018-10-10 Thread Tiago Lam
From: Mark Kavanagh 

There are numerous factors that must be considered when calculating
the size of an mbuf:
- the data portion of the mbuf must be sized in accordance with Rx
  buffer alignment (typically 1024B). So, for example, in order to
  successfully receive and capture a 1500B packet, mbufs with a
  data portion of size 2048B must be used.
- in OvS, the elements that comprise an mbuf are:
  * the dp packet, which includes a struct rte mbuf (704B)
  * RTE_PKTMBUF_HEADROOM (128B)
  * packet data (aligned to 1k, as previously described)
  * RTE_PKTMBUF_TAILROOM (typically 0)

Some PMDs require that the total mbuf size (i.e. the total sum of all
of the above-listed components' lengths) is cache-aligned. To satisfy
this requirement, it may be necessary to round up the total mbuf size
with respect to cacheline size. In doing so, it's possible that the
dp_packet's data portion is inadvertently increased in size, such that
it no longer adheres to Rx buffer alignment. Consequently, the
following property of the mbuf no longer holds true:

mbuf.data_len == mbuf.buf_len - mbuf.data_off

This creates a problem in the case of multi-segment mbufs, where that
property is assumed to hold for all but the final segment in an
mbuf chain. Resolve this issue by adjusting the size of the mbuf's
private data portion, as opposed to the packet data portion when
aligning mbuf size to cachelines.
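
As a worked example of the arithmetic above (the constants are illustrative:
an 18B Ethernet overhead and a 128B ARM cacheline are assumed, rather than
the macros the code actually uses):

#include <stdio.h>

#define ROUND_UP(x, y) (((x) + (y) - 1) / (y) * (y))

int
main(void)
{
    unsigned int data = ROUND_UP(1500 + 18, 1024); /* frame -> 2048B data */
    unsigned int total = 704 + 128 + data;         /* dp_packet + headroom
                                                    * + data area */
    unsigned int aligned = ROUND_UP(total, 128);   /* PMD cacheline req. */

    /* 2880 -> 2944: the 64B of padding must be absorbed by the mbuf's
     * private area, not the data area, or the 2048B Rx-buffer alignment
     * of the data portion is lost. */
    printf("total=%u aligned=%u pad=%u\n", total, aligned, aligned - total);
    return 0;
}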

Fixes: 4be4d22 ("netdev-dpdk: clean up mbuf initialization")
Fixes: 31b88c9 ("netdev-dpdk: round up mbuf_size to cache_line_size")
CC: Santosh Shukla 
Signed-off-by: Mark Kavanagh 
Acked-by: Santosh Shukla 
Acked-by: Eelco Chaudron 
---
 lib/netdev-dpdk.c | 56 +--
 1 file changed, 38 insertions(+), 18 deletions(-)

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index f91aa27..11659eb 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -88,10 +88,6 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
 #define MTU_TO_MAX_FRAME_LEN(mtu)   ((mtu) + ETHER_HDR_MAX_LEN)
 #define FRAME_LEN_TO_MTU(frame_len) ((frame_len)\
  - ETHER_HDR_LEN - ETHER_CRC_LEN)
-#define MBUF_SIZE(mtu)  ROUND_UP((MTU_TO_MAX_FRAME_LEN(mtu) \
- + sizeof(struct dp_packet) \
- + RTE_PKTMBUF_HEADROOM),   \
- RTE_CACHE_LINE_SIZE)
 #define NETDEV_DPDK_MBUF_ALIGN  1024
 #define NETDEV_DPDK_MAX_PKT_LEN 9728
 
@@ -637,7 +633,11 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
 char mp_name[RTE_MEMPOOL_NAMESIZE];
 const char *netdev_name = netdev_get_name(&dev->up);
 int socket_id = dev->requested_socket_id;
-uint32_t n_mbufs;
+uint32_t n_mbufs = 0;
+uint32_t mbuf_size = 0;
+uint32_t aligned_mbuf_size = 0;
+uint32_t mbuf_priv_data_len = 0;
+uint32_t pkt_size = 0;
 uint32_t hash = hash_string(netdev_name, 0);
 struct dpdk_mp *dmp = NULL;
 int ret;
@@ -650,6 +650,9 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
 dmp->mtu = mtu;
 dmp->refcount = 1;
 
+/* Get the size of each mbuf, based on the MTU */
+mbuf_size = dpdk_buf_size(dev->requested_mtu);
+
 n_mbufs = dpdk_calculate_mbufs(dev, mtu, per_port_mp);
 
 do {
@@ -661,8 +664,8 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
  * so this is not an issue for tasks such as debugging.
  */
 ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
-   "ovs%08x%02d%05d%07u",
-   hash, socket_id, mtu, n_mbufs);
+   "ovs%08x%02d%05d%07u",
+hash, socket_id, mtu, n_mbufs);
 if (ret < 0 || ret >= RTE_MEMPOOL_NAMESIZE) {
 VLOG_DBG("snprintf returned %d. "
  "Failed to generate a mempool name for \"%s\". "
@@ -671,17 +674,34 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
 break;
 }
 
-VLOG_DBG("Port %s: Requesting a mempool of %u mbufs "
-  "on socket %d for %d Rx and %d Tx queues.",
-  netdev_name, n_mbufs, socket_id,
-  dev->requested_n_rxq, dev->requested_n_txq);
-
-dmp->mp = rte_pktmbuf_pool_create(mp_name, n_mbufs,
-  MP_CACHE_SZ,
-  sizeof (struct dp_packet)
-  - sizeof (struct rte_mbuf),
-  MBUF_SIZE(mtu)
-  - sizeof(struct dp_packet),
+VLOG_DBG("Port %s: Requesting a mempool of %u mbufs of size %u "
+  "on socket %d for %d Rx and %d Tx queues, "
+  "cache line size of %u",
+  netdev_name, n_mbufs, 

[ovs-dev] [PATCH v11 00/14] Support multi-segment mbufs

2018-10-10 Thread Tiago Lam
hin the bounds of existing
  mbufs only;
- Fix dp_packet_clear() which wasn't correctly clearing / freeing
  other mbufs in the chain for chains with more than a single mbuf;
- dp_packet layer functions (such as dp_packet_l3()) now check if
  the header is within the first mbuf, when using mbufs;
- Move patch 08/13 to before patch 04/13, since dp_packet_set_size()
  was refactored to use free_dpdk_buf();
- Fix wrong rte_memcpy() when performing dp_packet_clone() which was
  leading to memory corruption; 
- Modified the added tests to account for some of the above changes;
- Run performance tests, compiling results and adding them to the
  cover letter;
- Add a multi-seg mbufs explanation to the jumbo-frames.rst doc,
  together with a "Performance notes" sub-section reflecting the
  findings mentioned above in the cover letter.

v7:  - Rebase on master 5e720da ("erspan: fix invalid erspan version.");
 - Address Ilya comments;
- Fix non-DPDK build;
- Serialise the access of non pmds to allocation and free of mbufs by
  using a newly introduced mutex.
 - Add a new set of tests that integrates with the recently added DPDK
   testsuite. These focus on allocating dp_packets, with a single or
   multiple mbufs, from an instantiated mempool and performing several
   operations on those, verifying if the data at the end matches what's
   expected;
 - Fix bugs found by the new tests:
- dp_packet_shift() wasn't taking into account left shifts;
- dp_packet_resize__() was misusing and miscalculating the tailrooms
  and headrooms, ending up calculating the wrong number of segments
  that needed allocation;
- An mbuf's end was being miscalculated both in dp_packet_tail,
  dp_packet_mbuf_tail() and dp_packet_end();
- dp_packet_set_size() was not updating the number of chained segments
  'nb_segs';
 - Add support for multi-seg mbufs in dp_packet_clear().

v6:  - Rebase on master 7c0cb29 ("conntrack-tcp: Handle tcp session
   reuse.");
 - Further improve dp_packet_put_uninit() and dp_packet_shift() to
   support multi-seg mbufs;
 - Add support for multi-seg mbufs in dp_packet_l4_size() and
   improve other helper funcs, such as dp_packet_tail() and dp_
   packet_tailroom().
 - Add support for multi-seg mbufs in dp_packet_put(), dp_packet_
   put_zeros(), as well as dp_packet_resize__() - allocating new
   mbufs and linking them together;
 Restructured patchset:
 - Squash patch 5 into patch 6, since they were both related to
   copying data while handling multi-seg mbufs;
 - Split patch 4 into two separate patches - one that introduces the
   changes in helper functions to deal with multi-seg mbufs and
   two others for the shift() and put_uninit() functionality;
 - Move patch 4 to before patch 3, so that helper functions come
   before the functionality improvements that rely on those helpers.

v5: - Rebased on master e5e22dc ("datapath-windows: Prevent ct-counters
  from getting redundantly incremented");
- Sugesh's comments have been addressed:
  - Changed dp_packet_set_data() and dp_packet_set_size() logic to
make them independent of each other;
  - Dropped patch 3 now that dp_packet_set_data() and dp_packet_set_
size() are independent;
  - dp_packet_clone_with_headroom() now has split functions for
handling DPDK sourced packets and non-DPDK packets;
- Modified various functions in dp-packet.h to account for multi-seg
  mbufs - dp_packet_put_uninit(), dp_packet_tail(), dp_packet_tail()
  and dp_packet_at();
- Added support for shifting packet data in multi-seg mbufs, using
  dp_packet_shift();
- Fixed some minor inconsistencies.

Note that some of the changes in v5 have been contributed by Mark
Kavanagh as well.

v4: - restructure patchset
- account for 128B ARM cacheline when sizing mbufs

Mark Kavanagh (4):
  netdev-dpdk: fix mbuf sizing
  dp-packet: Init specific mbuf fields.
  netdev-dpdk: copy large packet to multi-seg. mbufs
  netdev-dpdk: support multi-segment jumbo frames

Michael Qiu (1):
  dp-packet: copy data from multi-seg. DPDK mbuf

Tiago Lam (9):
  dp-packet: Fix allocated size on DPDK init.
  netdev-dpdk: Serialise non-pmds mbufs' alloc/free.
  dp-packet: Fix data_len handling multi-seg mbufs.
  dp-packet: Handle multi-seg mbufs in helper funcs.
  dp-packet: Handle multi-seg mubfs in shift() func.
  dp-packet: Add support for data "linearization".
  dpdk-tests: Add unit-tests for multi-seg mbufs.
  dpdk-tests: Accept other configs in OVS_DPDK_START
  dpdk-tests: End-to-end tests for multi-seg mbufs.

 Documentation/topics/dpdk/jumbo-frames.rst |  73 +++
 Documentation/topics/dpdk/memory.rst   |  36 ++
 NEWS 

[ovs-dev] [PATCH v10 14/14] dpdk-tests: End-to-end tests for multi-seg mbufs.

2018-09-28 Thread Tiago Lam
The following tests are added to the DPDK testsuite to add some
coverage for the multi-segment mbufs:
- Check that multi-segment mbufs are disabled by default;
- Check that providing `other_config:dpdk-multi-seg-mbufs=true` indeed
  enables mbufs;
- Using a DPDK port, send a random packet out and check that `ofctl
  dump-flows` shows the correct amount of packets and bytes sent.

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 tests/system-dpdk.at | 65 
 1 file changed, 65 insertions(+)

diff --git a/tests/system-dpdk.at b/tests/system-dpdk.at
index 3d21b01..af8de8c 100644
--- a/tests/system-dpdk.at
+++ b/tests/system-dpdk.at
@@ -71,3 +71,68 @@ OVS_VSWITCHD_STOP("/does not exist. The Open vSwitch kernel module is probably not loaded./d
 ")
 AT_CLEANUP
 dnl --
+
+AT_SETUP([Jumbo frames - Multi-segment disabled by default])
+OVS_DPDK_START()
+
+AT_CHECK([grep "multi-segment mbufs enabled" ovs-vswitchd.log], [1], [])
+OVS_VSWITCHD_STOP("/Global register is changed during/d
+/EAL: No free hugepages reported in hugepages-1048576kB/d
+")
+AT_CLEANUP
+
+AT_SETUP([Jumbo frames - Multi-segment enabled])
+OVS_DPDK_START([dpdk-multi-seg-mbufs=true])
+AT_CHECK([grep "multi-segment mbufs enabled" ovs-vswitchd.log], [], [stdout])
+OVS_VSWITCHD_STOP("/Global register is changed during/d
+/EAL: No free hugepages reported in hugepages-1048576kB/d
+")
+AT_CLEANUP
+
+AT_SETUP([Jumbo frames - Multi-segment mbufs Tx])
+OVS_DPDK_PRE_CHECK()
+OVS_DPDK_START([per-port-memory=true dpdk-multi-seg-mbufs=true])
+
+dnl Add userspace bridge and attach it to OVS
+AT_CHECK([ovs-vsctl add-br br10 -- set bridge br10 datapath_type=netdev])
+AT_CHECK([ovs-vsctl add-port br10 dpdk0 \
+-- set Interface dpdk0 type=dpdk options:dpdk-devargs=$(cat PCI_ADDR) \
+-- set Interface dpdk0 mtu_request=9000], [], [stdout], [stderr])
+
+AT_CHECK([ovs-vsctl show], [], [stdout])
+
+dnl Add flows to send packets out from the 'dpdk0' port
+AT_CHECK([
+ovs-ofctl del-flows br10
+ovs-ofctl add-flow br10 in_port=LOCAL,actions=output:dpdk0
+], [], [stdout])
+
+AT_CHECK([ovs-ofctl dump-flows br10], [], [stdout])
+
+dnl Send packet out, of the 'dpdk0' port
+AT_CHECK([
+ARP_HEADER="09000B0009000A000806000108000604000100010A\
+0100020A02"
+dnl Build a random hex string to append to the ARP_HEADER
+RANDOM_BODY=$(printf '0102030405%.0s' {1..1750})
+dnl 8792B ARP packet
+RANDOM_ARP="$ARP_HEADER$RANDOM_BODY"
+
+ovs-ofctl packet-out br10 "packet=$RANDOM_ARP,action=resubmit:LOCAL"
+], [], [stdout])
+
+AT_CHECK([ovs-ofctl dump-flows br10], [0], [stdout])
+
dnl Confirm the single packet has been sent with the correct size
+AT_CHECK([ovs-ofctl dump-flows br10 | ofctl_strip | grep in_port], [0], [dnl
+ n_packets=1, n_bytes=8792, in_port=LOCAL actions=output:1
+])
+
+dnl Clean up
+OVS_VSWITCHD_STOP("/does not exist. The Open vSwitch kernel module is probably 
not loaded./d
+/Failed to enable flow control/d
+/failed to connect to \/tmp\/dpdkvhostclient0: No such file or directory/d
+/Global register is changed during/d
+/EAL: No free hugepages reported in hugepages-1048576kB/d
+")
+AT_CLEANUP
-- 
2.7.4



[ovs-dev] [PATCH v10 13/14] dpdk-tests: Accept other configs in OVS_DPDK_START

2018-09-28 Thread Tiago Lam
As it stands, OVS_DPDK_START() won't allow other configs to be set
before starting the ovs-vswitchd daemon. This is a problem since some
configs, such as the "dpdk-multi-seg-mbufs=true" for enabling the
multi-segment mbufs, need to be set prior to starting OvS.

To support other options, OVS_DPDK_START() has been modified to accept
extra configs in the form "$config_name=$config_value". It then uses
ovs-vsctl to set the configs.

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 tests/system-dpdk-macros.at | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/tests/system-dpdk-macros.at b/tests/system-dpdk-macros.at
index 0762ee0..7c65834 100644
--- a/tests/system-dpdk-macros.at
+++ b/tests/system-dpdk-macros.at
@@ -21,7 +21,7 @@ m4_define([OVS_DPDK_PRE_CHECK],
 ])
 
 
-# OVS_DPDK_START()
+# OVS_DPDK_START([other-conf-args])
 #
 # Create an empty database and start ovsdb-server. Add special configuration
 # dpdk-init to enable DPDK functionality. Start ovs-vswitchd connected to that
@@ -48,6 +48,10 @@ m4_define([OVS_DPDK_START],
AT_CHECK([lscpu], [], [stdout])
AT_CHECK([cat stdout | grep "NUMA node(s)" | awk '{c=1; while (c++<$(3)) {printf "1024,"}; print "1024"}' > SOCKET_MEM])
AT_CHECK([ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem="$(cat SOCKET_MEM)"])
+   dnl Iterate through $other-conf-args list and include them
+   m4_foreach_w(opt, $1, [
+   AT_CHECK([ovs-vsctl --no-wait set Open_vSwitch . other_config:opt])
+   ])
 
dnl Start ovs-vswitchd.
AT_CHECK([ovs-vswitchd --detach --no-chdir --pidfile --log-file -vvconn -vofproto_dpif -vunixctl], [0], [stdout], [stderr])
-- 
2.7.4



[ovs-dev] [PATCH v10 12/14] dpdk-tests: Add unit-tests for multi-seg mbufs.

2018-09-28 Thread Tiago Lam
In order to create a minimal environment that allows the tests to get
mbufs from an existing mempool, the following approach is taken:
- EAL is initialised (by using the main dpdk_init()) and a (very) small
  mempool is instantiated (mimicking the logic in dpdk_mp_create()).
  This mempool instance is global and used by all the tests;
- Packets are then allocated from the instantiated mempool and tested
  by running some operations on them and manipulating their data.

The tests introduced focus on testing DPDK dp_packets (where
source=DPBUF_DPDK), linked with a single or multiple mbufs, across
several operations, such as:
- dp_packet_put();
- dp_packet_shift();
- dp_packet_reserve();
- dp_packet_push_uninit();
- dp_packet_clear();
- dp_packet_equal();
- dp_packet_linear_data();
- And as a consequence of some of these, dp_packet_put_uninit() and
  dp_packet_resize__().

Finally, this has also been integrated with the new DPDK testsuite.
Thus, when running `$sudo make check-dpdk` one will also be running
these tests.
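
A hedged sketch of that scaffolding is below; the helper name and the pool
parameters are illustrative (not the ones the patch uses), and EAL is
assumed to have been brought up already via dpdk_init().

#include <rte_errno.h>
#include <rte_mbuf.h>
#include <rte_memory.h>

#define N_MBUFS 1024
#define MBUF_DATA_LEN 2048

static struct rte_mempool *mp;

static int
setup_test_mempool(void)
{
    /* A tiny pool shared by all the tests; no per-lcore cache needed. */
    mp = rte_pktmbuf_pool_create("test-mp", N_MBUFS, 0 /* cache_size */,
                                 0 /* priv_size */,
                                 MBUF_DATA_LEN + RTE_PKTMBUF_HEADROOM,
                                 SOCKET_ID_ANY);
    return mp ? 0 : rte_errno;
}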

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 tests/automake.mk  |  10 +-
 tests/dpdk-packet-mbufs.at |   7 +
 tests/system-dpdk-testsuite.at |   1 +
 tests/test-dpdk-mbufs.c| 620 +
 4 files changed, 637 insertions(+), 1 deletion(-)
 create mode 100644 tests/dpdk-packet-mbufs.at
 create mode 100644 tests/test-dpdk-mbufs.c

diff --git a/tests/automake.mk b/tests/automake.mk
index 97312cf..fca35d9 100644
--- a/tests/automake.mk
+++ b/tests/automake.mk
@@ -168,7 +168,8 @@ SYSTEM_DPDK_TESTSUITE_AT = \
tests/system-common-macros.at \
tests/system-dpdk-macros.at \
tests/system-dpdk-testsuite.at \
-   tests/system-dpdk.at
+   tests/system-dpdk.at \
+   tests/dpdk-packet-mbufs.at
 
 check_SCRIPTS += tests/atlocal
 
@@ -425,6 +426,10 @@ tests_ovstest_SOURCES = \
tests/test-vconn.c \
tests/test-aa.c \
tests/test-stopwatch.c
+if DPDK_NETDEV
+tests_ovstest_SOURCES += \
+   tests/test-dpdk-mbufs.c
+endif
 
 if !WIN32
 tests_ovstest_SOURCES += \
@@ -437,6 +442,9 @@ tests_ovstest_SOURCES += \
 endif
 
 tests_ovstest_LDADD = lib/libopenvswitch.la ovn/lib/libovn.la
+if DPDK_NETDEV
+tests_ovstest_LDFLAGS = $(AM_LDFLAGS) $(DPDK_vswitchd_LDFLAGS)
+endif
 
 noinst_PROGRAMS += tests/test-strtok_r
 tests_test_strtok_r_SOURCES = tests/test-strtok_r.c
diff --git a/tests/dpdk-packet-mbufs.at b/tests/dpdk-packet-mbufs.at
new file mode 100644
index 000..f28e4fc
--- /dev/null
+++ b/tests/dpdk-packet-mbufs.at
@@ -0,0 +1,7 @@
+AT_BANNER([OVS-DPDK dp_packet unit tests])
+
+AT_SETUP([OVS-DPDK dp_packet - mbufs allocation])
+AT_KEYWORDS([dp_packet, multi-seg, mbufs])
+AT_CHECK(ovstest test-dpdk-packet, [], [ignore], [ignore])
+
+AT_CLEANUP
diff --git a/tests/system-dpdk-testsuite.at b/tests/system-dpdk-testsuite.at
index 382f09e..f5edf58 100644
--- a/tests/system-dpdk-testsuite.at
+++ b/tests/system-dpdk-testsuite.at
@@ -23,3 +23,4 @@ m4_include([tests/system-common-macros.at])
 m4_include([tests/system-dpdk-macros.at])
 
 m4_include([tests/system-dpdk.at])
+m4_include([tests/dpdk-packet-mbufs.at])
diff --git a/tests/test-dpdk-mbufs.c b/tests/test-dpdk-mbufs.c
new file mode 100644
index 000..50e41ea
--- /dev/null
+++ b/tests/test-dpdk-mbufs.c
@@ -0,0 +1,620 @@
+/*
+ * Copyright (c) 2018 Intel Corporation
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "dp-packet.h"
+#include "ovstest.h"
+#include "dpdk.h"
+#include "smap.h"
+
+#define N_MBUFS 1024
+#define MBUF_DATA_LEN 2048
+
+static int num_tests = 0;
+
+/* Global var to hold a mempool instance, "test-mp", used in all of the tests
+ * below. This instance is instantiated in dpdk_setup_eal_with_mp(). */
+static struct rte_mempool *mp;
+
+/* Test data used to fill the packets with data. Note that this isn't a string
+ * that represents a valid packet, by any means. The pattern is generated in set_
+ * testing_pattern_str() and the sole purpose is to verify the data remains the
+ * same after inserting and operating on multi-segment mbufs. */
+static char *test_str;
+
+/* Asserts a dp_packet that holds a single mbuf, where:
+ * - nb_segs must be 1;
+ * - pkt_len must be equal to data_len which in turn must equal the provided
+ *   'pkt_len';
+ * - d

[ovs-dev] [PATCH v10 11/14] netdev-dpdk: support multi-segment jumbo frames

2018-09-28 Thread Tiago Lam
From: Mark Kavanagh 

Currently, jumbo frame support for OvS-DPDK is implemented by
increasing the size of mbufs within a mempool, such that each mbuf
within the pool is large enough to contain an entire jumbo frame of
a user-defined size. Typically, for each user-defined MTU,
'requested_mtu', a new mempool is created, containing mbufs of size
~requested_mtu.

With the multi-segment approach, a port uses a single mempool,
(containing standard/default-sized mbufs of ~2k bytes), irrespective
of the user-requested MTU value. To accommodate jumbo frames, mbufs
are chained together, where each mbuf in the chain stores a portion of
the jumbo frame. Each mbuf in the chain is termed a segment, hence the
name.

== Enabling multi-segment mbufs ==
Multi-segment and single-segment mbufs are mutually exclusive, and the
user must decide on which approach to adopt on init. The introduction
of a new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this. This
is a global boolean value, which determines how jumbo frames are
represented across all DPDK ports. In the absence of a user-supplied
value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment
mbufs must be explicitly enabled / single-segment mbufs remain the
default.

Setting the field is identical to setting existing DPDK-specific OVSDB
fields:

ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true

Co-authored-by: Tiago Lam 

Signed-off-by: Mark Kavanagh 
Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 Documentation/topics/dpdk/jumbo-frames.rst | 73 ++
 Documentation/topics/dpdk/memory.rst   | 36 +++
 NEWS   |  1 +
 lib/dpdk.c |  8 
 lib/netdev-dpdk.c  | 60 
 lib/netdev-dpdk.h  |  1 +
 vswitchd/vswitch.xml   | 22 +
 7 files changed, 193 insertions(+), 8 deletions(-)

diff --git a/Documentation/topics/dpdk/jumbo-frames.rst b/Documentation/topics/dpdk/jumbo-frames.rst
index 00360b4..5888f1e 100644
--- a/Documentation/topics/dpdk/jumbo-frames.rst
+++ b/Documentation/topics/dpdk/jumbo-frames.rst
@@ -71,3 +71,76 @@ Jumbo frame support has been validated against 9728B frames, which is the
 largest frame size supported by Fortville NIC using the DPDK i40e driver, but
 larger frames and other DPDK NIC drivers may be supported. These cases are
 common for use cases involving East-West traffic only.
+
+---
+Multi-segment mbufs
+---
+
+Instead of increasing the size of mbufs within a mempool, such that each mbuf
+within the pool is large enough to contain an entire jumbo frame of a
+user-defined size, mbufs can be chained together. In this approach each
+mbuf in the chain stores a portion of the jumbo frame, by default ~2K bytes,
+irrespective of the user-requested MTU value. Since each mbuf in the chain is
+termed a segment, this approach is named "multi-segment mbufs".
+
+This approach may bring more flexibility in use cases where the maximum packet
+length may be hard to guess. For example, in cases where packets originate from
+sources marked for offload (such as TSO), each packet may be larger than the
+MTU, and as such, when forwarding it to a DPDK port a single mbuf may not be
+enough to hold all of the packet's data.
+
+Multi-segment and single-segment mbufs are mutually exclusive, and the user
+must decide on which approach to adopt on initialisation. If multi-segment
+mbufs is to be enabled, it can be done so with the following command::
+
+$ ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
+
+Single-segment mbufs still remain the default when using OvS-DPDK, and the
+above option `dpdk-multi-seg-mbufs` must be explicitly set to `true` if
+multi-segment mbufs are to be used.
+
+~
+Performance notes
+~
+
+When using multi-segment mbufs some PMDs may not support vectorized Tx
+functions, due to its non-contiguous nature. As a result this can hit
+performance for smaller packet sizes. For example, on a setup sending 64B
+packets at line rate, a decrease of ~20% has been observed. The performance
+impact stops being noticeable for larger packet sizes, although the exact size
+will vary between PMDs, and depends on the architecture in use.
+
+Tests performed with the i40e PMD driver only showed this limitation for 64B
+packets, and the same rate was observed when comparing multi-segment mbufs and
+single-segment mbuf for 128B packets. In other words, the 20% drop in
+performance was not observed for packets >= 128B during this test case.
+
+Because of this, multi-segment mbufs are not advised to be used with small

[ovs-dev] [PATCH v10 09/14] dp-packet: Add support for data "linearization".

2018-09-28 Thread Tiago Lam
Previous commits have added support to the dp_packet API to handle
multi-segmented packets, where data is not stored contiguously in
memory. However, in some cases, it is inevitable and data must be
provided contiguously. Examples of such cases are when performing csums
over the entire packet data, or when write()'ing to a file descriptor
(for a tap interface, for example). For such cases, the dp_packet API
has been extended to provide a way to transform a multi-segmented
DPBUF_DPDK packet into a DPBUF_MALLOC system packet (at the expense of a
memory copy). If the packet's data is already stored contiguously in
memory then there's no need to convert the packet.

Thus, the main use cases that were assuming that a dp_packet's data is
always held contiguously in memory were changed to make use of the new
"linear functions" in the dp_packet API when there's a need to traverse
the entire's packet data. Per the example above, when the packet's data
needs to be write() to the tap's file descriptor, or when the conntrack
module needs to verify a packet's checksum, the data is now linearized.

Additionally, the layer functions, such as dp_packet_l3() and variants,
have been modified to check if there's enough data in the packet before
returning a pointer to the data (and callers have been modified
accordingly). This requirement is needed to guarantee that a caller
doesn't read beyond the available memory.
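
The caller-side pattern then looks roughly like the sketch below;
dp_packet_is_linear() and dp_packet_linearize() are the functions this
series adds, while the wrapper itself is hypothetical.

#include "dp-packet.h"

static const void *
packet_linear_bytes(struct dp_packet *pkt)
{
    /* Multi-segment DPBUF_DPDK data may be scattered in memory; force a
     * contiguous (DPBUF_MALLOC) copy before traversing all of it. */
    if (!dp_packet_is_linear(pkt)) {
        dp_packet_linearize(pkt);
    }
    return dp_packet_data(pkt);
}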

Signed-off-by: Tiago Lam 
---
 lib/bfd.c |   3 +-
 lib/cfm.c |   5 +-
 lib/conntrack-icmp.c  |   4 +-
 lib/conntrack-private.h   |   4 +-
 lib/conntrack-tcp.c   |   6 +-
 lib/conntrack.c   | 109 +
 lib/dp-packet.c   |  18 
 lib/dp-packet.h   | 217 +++---
 lib/dpif-netdev.c |   5 +
 lib/dpif-netlink.c|   5 +
 lib/dpif.c|   9 ++
 lib/flow.c|  29 +++---
 lib/lacp.c|   3 +-
 lib/mcast-snooping.c  |   8 +-
 lib/netdev-bsd.c  |   5 +
 lib/netdev-dummy.c|  13 ++-
 lib/netdev-linux.c|  13 ++-
 lib/netdev-native-tnl.c   |  39 +---
 lib/odp-execute.c |  28 --
 lib/ofp-print.c   |  10 +-
 lib/ovs-lldp.c|   3 +-
 lib/packets.c |  81 +---
 lib/pcap-file.c   |   2 +-
 ofproto/ofproto-dpif-upcall.c |  20 +++-
 ofproto/ofproto-dpif-xlate.c  |  42 ++--
 ovn/controller/pinctrl.c  |  29 +++---
 tests/test-conntrack.c|   2 +-
 tests/test-rstp.c |   8 +-
 tests/test-stp.c  |   8 +-
 29 files changed, 483 insertions(+), 245 deletions(-)

diff --git a/lib/bfd.c b/lib/bfd.c
index 5308262..d50d2da 100644
--- a/lib/bfd.c
+++ b/lib/bfd.c
@@ -722,7 +722,8 @@ bfd_process_packet(struct bfd *bfd, const struct flow *flow,
 if (!msg) {
 VLOG_INFO_RL(&rl, "%s: Received too-short BFD control message (only "
  "%"PRIdPTR" bytes long, at least %d required).",
- bfd->name, (uint8_t *) dp_packet_tail(p) - l7,
+ bfd->name, dp_packet_size(p) -
+ (l7 - (uint8_t *) dp_packet_data(p)),
  BFD_PACKET_LEN);
 goto out;
 }
diff --git a/lib/cfm.c b/lib/cfm.c
index 71d2c02..83baf2a 100644
--- a/lib/cfm.c
+++ b/lib/cfm.c
@@ -584,7 +584,7 @@ cfm_compose_ccm(struct cfm *cfm, struct dp_packet *packet,
 
 atomic_read_relaxed(&cfm->extended, &extended);
 
-ccm = dp_packet_l3(packet);
+ccm = dp_packet_l3(packet, sizeof(*ccm));
 ccm->mdlevel_version = 0;
 ccm->opcode = CCM_OPCODE;
 ccm->tlv_offset = 70;
@@ -759,8 +759,7 @@ cfm_process_heartbeat(struct cfm *cfm, const struct dp_packet *p)
 atomic_read_relaxed(&cfm->extended, &extended);
 
 eth = dp_packet_eth(p);
-ccm = dp_packet_at(p, (uint8_t *)dp_packet_l3(p) - (uint8_t *)dp_packet_data(p),
-CCM_ACCEPT_LEN);
+ccm = dp_packet_l3(p, CCM_ACCEPT_LEN);
 
 if (!ccm) {
 VLOG_INFO_RL(, "%s: Received an unparseable 802.1ag CCM heartbeat.",
diff --git a/lib/conntrack-icmp.c b/lib/conntrack-icmp.c
index 40fd1d8..0575d0e 100644
--- a/lib/conntrack-icmp.c
+++ b/lib/conntrack-icmp.c
@@ -63,7 +63,7 @@ icmp_conn_update(struct conn *conn_, struct conntrack_bucket *ctb,
 static bool
 icmp4_valid_new(struct dp_packet *pkt)
 {
-struct icmp_header *icmp = dp_packet_l4(pkt);
+struct icmp_header *icmp = dp_packet_l4(pkt, sizeof *icmp);
 
 return icmp->icmp_type == ICMP4_ECHO_REQUEST
|| icmp->icmp_type == ICMP4_INFOREQUEST
@@ -73,7 +73,7 @@ icmp4_valid_new(struct dp_packet *pkt)
 static bool
 icmp6_valid_new(struct dp_packet *pkt)
 {
-struct icmp6_header *icmp6 = dp_packet_l4(pkt);
+struct icmp6_header *icmp6 = dp_packet_l4(pkt, sizeof *icmp6);
 
 

[ovs-dev] [PATCH v10 10/14] netdev-dpdk: copy large packet to multi-seg. mbufs

2018-09-28 Thread Tiago Lam
From: Mark Kavanagh 

Currently, packets are only copied to a single segment in the function
dpdk_do_tx_copy(). This could be an issue in the case of jumbo frames,
particularly when multi-segment mbufs are involved.

This patch calculates the number of segments needed by a packet and
copies the data to each segment.
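
As a worked example of the segment calculation (8792B matches the jumbo test
packet used later in this series; a 2048B mbuf data area less 128B of
headroom is an assumption):

#include <stdio.h>

int
main(void)
{
    unsigned int size = 8792;               /* dp_packet_size(packet) */
    unsigned int max_data_len = 2048 - 128; /* buf_len - data_off */
    unsigned int nb_segs = size / max_data_len;

    if (size % max_data_len) {
        nb_segs++;                          /* partial tail segment */
    }

    printf("nb_segs=%u\n", nb_segs);        /* 8792 / 1920 -> 5 segments */
    return 0;
}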

A new function, dpdk_buf_alloc(), has also been introduced as a wrapper
around the nonpmd_mp_mutex to serialise allocations from a non-pmd
context.

Co-authored-by: Michael Qiu 
Co-authored-by: Tiago Lam 

Signed-off-by: Mark Kavanagh 
Signed-off-by: Michael Qiu 
Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 lib/netdev-dpdk.c | 91 +--
 1 file changed, 82 insertions(+), 9 deletions(-)

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 8484239..e58e7ac 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -552,6 +552,27 @@ dpdk_rte_mzalloc(size_t sz)
 return rte_zmalloc(OVS_VPORT_DPDK, sz, OVS_CACHE_LINE_SIZE);
 }
 
+static struct rte_mbuf *
+dpdk_buf_alloc(struct rte_mempool *mp)
+{
+struct rte_mbuf *mbuf = NULL;
+
+/* If non-pmd we need to lock on nonpmd_mp_mutex mutex */
+if (!dpdk_thread_is_pmd()) {
+ovs_mutex_lock(&nonpmd_mp_mutex);
+
+mbuf = rte_pktmbuf_alloc(mp);
+
+ovs_mutex_unlock(&nonpmd_mp_mutex);
+
+return mbuf;
+}
+
+mbuf = rte_pktmbuf_alloc(mp);
+
+return mbuf;
+}
+
 void
 free_dpdk_buf(struct dp_packet *packet)
 {
@@ -2316,6 +2337,56 @@ out:
 }
 }
 
+static int
+dpdk_copy_dp_packet_to_mbuf(struct dp_packet *packet, struct rte_mbuf **head,
+struct rte_mempool *mp)
+{
+struct rte_mbuf *mbuf, *fmbuf;
+uint16_t max_data_len;
+uint32_t nb_segs = 0;
+uint32_t size = 0;
+
+/* We will need the whole data for copying below */
+if (!dp_packet_is_linear(packet)) {
+dp_packet_linearize(packet);
+}
+
+/* Allocate first mbuf to know the size of data available */
+fmbuf = mbuf = *head = dpdk_buf_alloc(mp);
+if (OVS_UNLIKELY(!mbuf)) {
+return ENOMEM;
+}
+
+size = dp_packet_size(packet);
+
+/* All newly allocated mbufs have the same max data length */
+max_data_len = mbuf->buf_len - mbuf->data_off;
+
+/* Calculate # of output mbufs. */
+nb_segs = size / max_data_len;
+if (size % max_data_len) {
+nb_segs = nb_segs + 1;
+}
+
+/* Allocate additional mbufs, less the one already allocated above */
+for (int i = 1; i < nb_segs; i++) {
+mbuf->next = dpdk_buf_alloc(mp);
+if (!mbuf->next) {
+free_dpdk_buf(CONTAINER_OF(fmbuf, struct dp_packet, mbuf));
+fmbuf = NULL;
+return ENOMEM;
+}
+mbuf = mbuf->next;
+}
+
+fmbuf->nb_segs = nb_segs;
+fmbuf->pkt_len = size;
+
+dp_packet_mbuf_write(fmbuf, 0, size, dp_packet_data(packet));
+
+return 0;
+}
+
 /* Tx function. Transmit packets indefinitely */
 static void
 dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
@@ -2332,6 +2403,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
 struct rte_mbuf *pkts[PKT_ARRAY_SIZE];
 uint32_t cnt = batch_cnt;
 uint32_t dropped = 0;
+uint32_t i;
 
 if (dev->type != DPDK_DEV_VHOST) {
 /* Check if QoS has been configured for this netdev. */
@@ -2342,28 +2414,29 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
 
 uint32_t txcnt = 0;
 
-for (uint32_t i = 0; i < cnt; i++) {
+for (i = 0; i < cnt; i++) {
 struct dp_packet *packet = batch->packets[i];
 uint32_t size = dp_packet_size(packet);
+int err = 0;
 
 if (OVS_UNLIKELY(size > dev->max_packet_len)) {
 VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d",
  size, dev->max_packet_len);
-
 dropped++;
 continue;
 }
 
-pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
-if (OVS_UNLIKELY(!pkts[txcnt])) {
+err = dpdk_copy_dp_packet_to_mbuf(packet, &pkts[txcnt],
+  dev->dpdk_mp->mp);
+if (err != 0) {
+if (err == ENOMEM) {
+VLOG_ERR_RL(&rl, "Failed to alloc mbufs! %u packets dropped",
+cnt - i);
+}
+
 dropped += cnt - i;
 break;
 }
-
-/* We have to do a copy for now */
-memcpy(rte_pktmbuf_mtod(pkts[txcnt], void *),
-   dp_packet_data(packet), size);
-dp_packet_set_size((struct dp_packet *)pkts[txcnt], size);
 dp_packet_copy_mbuf_flags((struct dp_packet *)pkts[txcnt], packet);
 
 txcnt++;
-- 
2.7.4



[ovs-dev] [PATCH v10 08/14] dp-packet: copy data from multi-seg. DPDK mbuf

2018-09-28 Thread Tiago Lam
From: Michael Qiu 

When cloning a packet whose source is the DPDK driver, multi-segment
mbufs must be considered, and each segment's data copied one by one.

Also, much of the DPDK mbuf's metadata is lost during a copy, such as
the packet type, ol_flags, etc. That information is very important for
DPDK packet processing.

Co-authored-by: Mark Kavanagh 
Co-authored-by: Tiago Lam 

Signed-off-by: Michael Qiu 
Signed-off-by: Mark Kavanagh 
Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 lib/dp-packet.c   | 69 ++-
 lib/dp-packet.h   |  3 +++
 lib/netdev-dpdk.c |  1 +
 3 files changed, 62 insertions(+), 11 deletions(-)

diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index 167bf43..806640b 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -48,6 +48,22 @@ dp_packet_use__(struct dp_packet *b, void *base, size_t allocated,
 dp_packet_set_size(b, 0);
 }
 
+#ifdef DPDK_NETDEV
+void
+dp_packet_copy_mbuf_flags(struct dp_packet *dst, const struct dp_packet *src)
+{
+ovs_assert(dst != NULL && src != NULL);
+struct rte_mbuf *buf_dst = &(dst->mbuf);
+struct rte_mbuf buf_src = src->mbuf;
+
+buf_dst->ol_flags = buf_src.ol_flags;
+buf_dst->packet_type = buf_src.packet_type;
+buf_dst->tx_offload = buf_src.tx_offload;
+}
+#else
+#define dp_packet_copy_mbuf_flags(arg1, arg2)
+#endif
+
 /* Initializes 'b' as an empty dp_packet that contains the 'allocated' bytes of
  * memory starting at 'base'.  'base' should be the first byte of a region
  * obtained from malloc().  It will be freed (with free()) if 'b' is resized or
@@ -158,6 +174,44 @@ dp_packet_clone(const struct dp_packet *buffer)
 return dp_packet_clone_with_headroom(buffer, 0);
 }
 
+#ifdef DPDK_NETDEV
+struct dp_packet *
+dp_packet_clone_with_headroom(const struct dp_packet *b, size_t headroom) {
+struct dp_packet *new_buffer;
+uint32_t pkt_len = dp_packet_size(b);
+
+/* copy multi-seg data */
+if (b->source == DPBUF_DPDK && !rte_pktmbuf_is_contiguous(&b->mbuf)) {
+void *dst = NULL;
+struct rte_mbuf *mbuf = CONST_CAST(struct rte_mbuf *, &b->mbuf);
+
+new_buffer = dp_packet_new_with_headroom(pkt_len, headroom);
+dst = dp_packet_data(new_buffer);
+dp_packet_set_size(new_buffer, pkt_len);
+
+if (!rte_pktmbuf_read(mbuf, 0, pkt_len, dst)) {
+return NULL;
+}
+} else {
+new_buffer = dp_packet_clone_data_with_headroom(dp_packet_data(b),
+dp_packet_size(b),
+headroom);
+}
+
+/* Copy the following fields into the returned buffer: l2_pad_size,
+ * l2_5_ofs, l3_ofs, l4_ofs, cutlen, packet_type and md. */
+memcpy(&new_buffer->l2_pad_size, &b->l2_pad_size,
+   sizeof(struct dp_packet) -
+   offsetof(struct dp_packet, l2_pad_size));
+
+dp_packet_copy_mbuf_flags(new_buffer, b);
+if (dp_packet_rss_valid(new_buffer)) {
+new_buffer->mbuf.hash.rss = b->mbuf.hash.rss;
+}
+
+return new_buffer;
+}
+#else
 /* Creates and returns a new dp_packet whose data are copied from 'buffer'.
  * The returned dp_packet will additionally have 'headroom' bytes of
  * headroom. */
@@ -165,32 +219,25 @@ struct dp_packet *
 dp_packet_clone_with_headroom(const struct dp_packet *buffer, size_t headroom)
 {
 struct dp_packet *new_buffer;
+uint32_t pkt_len = dp_packet_size(buffer);
 
 new_buffer = dp_packet_clone_data_with_headroom(dp_packet_data(buffer),
- dp_packet_size(buffer),
- headroom);
+ pkt_len, headroom);
+
 /* Copy the following fields into the returned buffer: l2_pad_size,
  * l2_5_ofs, l3_ofs, l4_ofs, cutlen, packet_type and md. */
 memcpy(&new_buffer->l2_pad_size, &buffer->l2_pad_size,
 sizeof(struct dp_packet) -
 offsetof(struct dp_packet, l2_pad_size));
 
-#ifdef DPDK_NETDEV
-new_buffer->mbuf.ol_flags = buffer->mbuf.ol_flags;
-#else
 new_buffer->rss_hash_valid = buffer->rss_hash_valid;
-#endif
-
 if (dp_packet_rss_valid(new_buffer)) {
-#ifdef DPDK_NETDEV
-new_buffer->mbuf.hash.rss = buffer->mbuf.hash.rss;
-#else
 new_buffer->rss_hash = buffer->rss_hash;
-#endif
 }
 
 return new_buffer;
 }
+#endif
 
 /* Creates and returns a new dp_packet that initially contains a copy of the
  * 'size' bytes of data starting at 'data' with no headroom or tailroom. */
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index aab8f62..cbf002c 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -124,6 +124,9 @@ void dp_packet_init_dpdk(struct dp_packet *);
 void dp_packet_init(struct dp_packet *, size_t);
 void dp_packet_uninit(struct dp_packet *);
 
+void dp_packet_copy_mbuf_fl

[ovs-dev] [PATCH v10 07/14] dp-packet: Handle multi-seg mbufs in shift() func.

2018-09-28 Thread Tiago Lam
In its current implementation dp_packet_shift() is also unaware of
multi-seg mbufs (which hold data non-contiguously in memory) and assumes
that data exists contiguously in memory, memmove'ing data to perform the
shift.

To add support for multi-seg mbufs a new set of functions was
introduced, dp_packet_mbuf_shift() and dp_packet_mbuf_write(). These
functions are used by dp_packet_shift(), when handling multi-seg mbufs,
to shift and write data within a chain of mbufs.

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 lib/dp-packet.c | 100 
 lib/dp-packet.h |  10 ++
 2 files changed, 110 insertions(+)

diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index 2aaeaae..167bf43 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -294,6 +294,100 @@ dp_packet_prealloc_headroom(struct dp_packet *b, size_t size)
 }
 }
 
+#ifdef DPDK_NETDEV
+/* Writes 'len' data bytes into a mbuf chain at the specified offset.
+ *
+ * 'mbuf', pointer to the destination mbuf containing 'ofs', i.e. the mbuf
+ * where the data will first be written.
+ * 'ofs', the offset within the provided 'mbuf' where 'data' is to be written.
+ * 'len', the size of the data to be written.
+ * 'data', pointer to the bytes to be written.
+ *
+ * XXX: This function is the counterpart of the `rte_pktmbuf_read()` function
+ * available in DPDK's rte_mbuf.h */
+void
+dp_packet_mbuf_write(struct rte_mbuf *mbuf, int16_t ofs, uint32_t len,
+ const void *data)
+{
+char *dst_addr;
+uint16_t data_len;
+int len_copy;
+while (mbuf) {
+if (len == 0) {
+break;
+}
+
+dst_addr = rte_pktmbuf_mtod_offset(mbuf, char *, ofs);
+data_len = MBUF_BUF_END(mbuf->buf_addr, mbuf->buf_len) - dst_addr;
+
+len_copy = MIN(len, data_len);
+/* We don't know if 'data' is the result of a rte_pktmbuf_read() call,
+ * in which case we may end up writing to the same region of memory we
+ * are reading from and overlapping. Hence the use of memmove() here */
+memmove(dst_addr, data, len_copy);
+
+data = ((char *) data) + len_copy;
+len -= len_copy;
+ofs = 0;
+
+mbuf->data_len = len_copy;
+mbuf = mbuf->next;
+}
+}
+
+static void
+dp_packet_mbuf_shift_(struct rte_mbuf *dbuf, int16_t dst_ofs,
+  const struct rte_mbuf *sbuf, uint16_t src_ofs, int len)
+{
+char *rd = xmalloc(sizeof(*rd) * len);
+const char *wd = rte_pktmbuf_read(sbuf, src_ofs, len, rd);
+
+ovs_assert(wd);
+
+dp_packet_mbuf_write(dbuf, dst_ofs, len, wd);
+
+free(rd);
+}
+
+/* Similarly to dp_packet_shift(), shifts the data within the mbufs of a
+ * dp_packet of DPBUF_DPDK source by 'delta' bytes.
+ * Caller must make sure of the following conditions:
+ * - When shifting left, delta can't be bigger than the data_len available in
+ *   the last mbuf;
+ * - When shifting right, delta can't be bigger than the space available in the
+ *   first mbuf (buf_len - data_off).
+ * Both these conditions guarantee that a shift operation doesn't fall outside
+ * the bounds of the existing mbufs, so that the first and last mbufs (when
+ * using multi-segment mbufs), remain the same. */
+static void
+dp_packet_mbuf_shift(struct dp_packet *b, int delta)
+{
+uint16_t src_ofs;
+int16_t dst_ofs;
+
+struct rte_mbuf *mbuf = CONST_CAST(struct rte_mbuf *, &b->mbuf);
+struct rte_mbuf *tmbuf = rte_pktmbuf_lastseg(mbuf);
+
+if (delta < 0) {
+ovs_assert(-delta <= tmbuf->data_len);
+} else {
+ovs_assert(delta < (mbuf->buf_len - mbuf->data_off));
+}
+
+/* Set the destination and source offsets to copy to */
+dst_ofs = delta;
+src_ofs = 0;
+
+/* Shift data from src mbuf and offset to dst mbuf and offset */
+dp_packet_mbuf_shift_(mbuf, dst_ofs, mbuf, src_ofs,
+  rte_pktmbuf_pkt_len(mbuf));
+
+/* Update mbufs' properties, and if using multi-segment mbufs, first and
+ * last mbuf's data_len also needs to be adjusted */
+mbuf->data_off = mbuf->data_off + dst_ofs;
+}
+#endif
+
 /* Shifts all of the data within the allocated space in 'b' by 'delta' bytes.
  * For example, a 'delta' of 1 would cause each byte of data to move one byte
  * forward (from address 'p' to 'p+1'), and a 'delta' of -1 would cause each
@@ -306,6 +400,12 @@ dp_packet_shift(struct dp_packet *b, int delta)
: true);
 
 if (delta != 0) {
+#ifdef DPDK_NETDEV
+if (b->source == DPBUF_DPDK) {
+dp_packet_mbuf_shift(b, delta);
+return;
+}
+#endif
 char *dst = (char *) dp_packet_data(b) + delta;
 memmove(dst, dp_packet_data(b), dp_packet_size(b));
 dp_packet_set_data(b, dst);
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index 259b470..aab8f62 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -80,6 +80,11 @

[ovs-dev] [PATCH v10 06/14] dp-packet: Handle multi-seg mbufs in helper funcs.

2018-09-28 Thread Tiago Lam
Most helper functions in dp-packet assume that the data held by a
dp_packet is contiguous, and perform operations such as pointer
arithmetic under that assumption. However, with the introduction of
multi-segment mbufs, where data is non-contiguous, such assumptions are
no longer possible. Some examples of such helper functions are
dp_packet_tail(), dp_packet_tailroom(), dp_packet_end(),
dp_packet_get_allocated() and dp_packet_at().

Thus, instead of assuming contiguous data in dp_packet, they now
iterate over the (non-contiguous) data in mbufs to perform their
calculations.

Finally, dp_packet_use__() has also been modified to perform the
initialisation of the packet (and setting the source) before continuing
to set its size and data length, which now depends on the type of
packet.

Co-authored-by: Mark Kavanagh 

Signed-off-by: Mark Kavanagh 
Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 lib/dp-packet.c |   4 +-
 lib/dp-packet.h | 162 ++--
 2 files changed, 148 insertions(+), 18 deletions(-)

diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index 782e7c2..2aaeaae 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -41,11 +41,11 @@ static void
 dp_packet_use__(struct dp_packet *b, void *base, size_t allocated,
  enum dp_packet_source source)
 {
+dp_packet_init__(b, allocated, source);
+
 dp_packet_set_base(b, base);
 dp_packet_set_data(b, base);
 dp_packet_set_size(b, 0);
-
-dp_packet_init__(b, allocated, source);
 }
 
 /* Initializes 'b' as an empty dp_packet that contains the 'allocated' bytes of
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index 4545041..259b470 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -185,9 +185,25 @@ dp_packet_delete(struct dp_packet *b)
 static inline void *
 dp_packet_at(const struct dp_packet *b, size_t offset, size_t size)
 {
-return offset + size <= dp_packet_size(b)
-   ? (char *) dp_packet_data(b) + offset
-   : NULL;
+if (offset + size > dp_packet_size(b)) {
+return NULL;
+}
+
+#ifdef DPDK_NETDEV
+if (b->source == DPBUF_DPDK) {
+struct rte_mbuf *buf = CONST_CAST(struct rte_mbuf *, &b->mbuf);
+
+while (buf && offset > buf->data_len) {
+offset -= buf->data_len;
+
+buf = buf->next;
+}
+
+return buf ? rte_pktmbuf_mtod_offset(buf, char *, offset) : NULL;
+}
+#endif
+
+return (char *) dp_packet_data(b) + offset;
 }
 
 /* Returns a pointer to byte 'offset' in 'b', which must contain at least
@@ -196,13 +212,23 @@ static inline void *
 dp_packet_at_assert(const struct dp_packet *b, size_t offset, size_t size)
 {
 ovs_assert(offset + size <= dp_packet_size(b));
-return ((char *) dp_packet_data(b)) + offset;
+return dp_packet_at(b, offset, size);
 }
 
 /* Returns a pointer to byte following the last byte of data in use in 'b'. */
 static inline void *
 dp_packet_tail(const struct dp_packet *b)
 {
+#ifdef DPDK_NETDEV
+if (b->source == DPBUF_DPDK) {
+struct rte_mbuf *buf = CONST_CAST(struct rte_mbuf *, &b->mbuf);
+/* Find the last segment, where the data ends, meaning the tail of the
+ * chained mbufs must be there. */
+buf = rte_pktmbuf_lastseg(buf);
+
+return rte_pktmbuf_mtod_offset(buf, void *, buf->data_len);
+}
+#endif
 return (char *) dp_packet_data(b) + dp_packet_size(b);
 }
 
@@ -211,6 +237,15 @@ dp_packet_tail(const struct dp_packet *b)
 static inline void *
 dp_packet_end(const struct dp_packet *b)
 {
+#ifdef DPDK_NETDEV
+if (b->source == DPBUF_DPDK) {
+struct rte_mbuf *buf = CONST_CAST(struct rte_mbuf *, &(b->mbuf));
+
+buf = rte_pktmbuf_lastseg(buf);
+
+return (char *) buf->buf_addr + buf->buf_len;
+}
+#endif
 return (char *) dp_packet_base(b) + dp_packet_get_allocated(b);
 }
 
@@ -236,6 +271,15 @@ dp_packet_tailroom(const struct dp_packet *b)
 static inline void
 dp_packet_clear(struct dp_packet *b)
 {
+#ifdef DPDK_NETDEV
+if (b->source == DPBUF_DPDK) {
+/* Sets pkt_len and data_len to zero and frees unused mbufs. */
+dp_packet_set_size(b, 0);
+rte_pktmbuf_reset(&b->mbuf);
+
+return;
+}
+#endif
 dp_packet_set_data(b, dp_packet_base(b));
 dp_packet_set_size(b, 0);
 }
@@ -248,28 +292,47 @@ dp_packet_pull(struct dp_packet *b, size_t size)
 void *data = dp_packet_data(b);
 ovs_assert(dp_packet_size(b) - dp_packet_l2_pad_size(b) >= size);
 dp_packet_set_data(b, (char *) dp_packet_data(b) + size);
-dp_packet_set_size(b, dp_packet_size(b) - size);
+#ifdef DPDK_NETDEV
+b->mbuf.pkt_len -= size;
+#else
+b->size_ -= size;
+#endif
+
 return data;
 }
 
+#ifdef DPDK_NETDEV
+/* Similar to dp_packet_try_pull() but doesn't actually pull any data, only
+ * checks if it could and returns true or false accordingly.
+ *
+ * Valid for dp_packets carryi

[ovs-dev] [PATCH v10 05/14] dp-packet: Fix data_len handling multi-seg mbufs.

2018-09-28 Thread Tiago Lam
When a dp_packet is from a DPDK source, and it contains multi-segment
mbufs, the data_len is not equal to the packet size, pkt_len. Instead,
the data_len of each mbuf in the chain should be considered while
distributing the new (provided) size.

To account for the above dp_packet_set_size() has been changed so that,
in the multi-segment mbufs case, only the data_len on the last mbuf of
the chain and the total size of the packet, pkt_len, are changed. The
data_len on the intermediate mbufs preceding the last mbuf is not
changed by dp_packet_set_size(). Furthermore, in some cases
dp_packet_set_size() may be used to set a smaller size than the current
packet size, thus effectively trimming the end of the packet. In the
multi-segment mbufs case this may lead to lingering mbufs that may need
freeing.

__dp_packet_set_data() now also updates an mbuf's data_len after setting
the data offset. This is so that both fields are always in sync for each
mbuf in a chain.

Co-authored-by: Michael Qiu 
Co-authored-by: Mark Kavanagh 
Co-authored-by: Przemyslaw Lal 
Co-authored-by: Marcin Ksiadz 
Co-authored-by: Yuanhan Liu 

Signed-off-by: Michael Qiu 
Signed-off-by: Mark Kavanagh 
Signed-off-by: Przemyslaw Lal 
Signed-off-by: Marcin Ksiadz 
Signed-off-by: Yuanhan Liu 
Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 lib/dp-packet.h | 84 -
 1 file changed, 72 insertions(+), 12 deletions(-)

diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index 6376039..4545041 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -426,20 +426,60 @@ dp_packet_size(const struct dp_packet *b)
 return b->mbuf.pkt_len;
 }
 
+/* Sets the size of the packet 'b' to 'v'. For non-DPDK packets this only means
+ * setting b->size_, but if used in a DPDK packet it means adjusting the first
+ * mbuf pkt_len and last mbuf data_len, to reflect the real size, which can
+ * lead to freeing tail mbufs that are no longer used.
+ *
+ * This function should be used for setting the size only, under the
+ * assumption that it is the tail end of 'b' that gets trimmed. For adjusting
+ * the head end of 'b', dp_packet_pull() should be used instead. */
 static inline void
 dp_packet_set_size(struct dp_packet *b, uint32_t v)
 {
-/* netdev-dpdk does not currently support segmentation; consequently, for
- * all intents and purposes, 'data_len' (16 bit) and 'pkt_len' (32 bit) may
- * be used interchangably.
- *
- * On the datapath, it is expected that the size of packets
- * (and thus 'v') will always be <= UINT16_MAX; this means that there is no
- * loss of accuracy in assigning 'v' to 'data_len'.
- */
-b->mbuf.data_len = (uint16_t)v;  /* Current seg length. */
-b->mbuf.pkt_len = v; /* Total length of all segments linked to
-  * this segment. */
+if (b->source == DPBUF_DPDK) {
+struct rte_mbuf *mbuf = &b->mbuf;
+uint16_t new_len = v;
+uint16_t data_len;
+uint16_t nb_segs = 0;
+uint16_t pkt_len = 0;
+
+/* Trim the packet down to 'v' bytes of data, freeing any buffers that
+ * may be left floating. */
+while (mbuf) {
+data_len = MIN(new_len, mbuf->data_len);
+mbuf->data_len = data_len;
+
+if (new_len - data_len <= 0) {
+/* Free the rest of chained mbufs */
+free_dpdk_buf(CONTAINER_OF(mbuf->next, struct dp_packet,
+   mbuf));
+mbuf->next = NULL;
+} else if (!mbuf->next) {
+/* Don't assign more than what we have available */
+mbuf->data_len = MIN(new_len,
+ mbuf->buf_len - mbuf->data_off);
+}
+
+new_len -= data_len;
+nb_segs += 1;
+pkt_len += mbuf->data_len;
+mbuf = mbuf->next;
+}
+
+/* pkt_len != v would effectively mean that pkt_len < 'v' (as being
+ * bigger is logically impossible). Being smaller than 'v' would mean the
+ * 'v' provided was bigger than the available room; it is the caller's
+ * responsibility to make sure there is enough room. */
+ovs_assert(pkt_len == v);
+
+b->mbuf.nb_segs = nb_segs;
+b->mbuf.pkt_len = pkt_len;
+} else {
+b->mbuf.data_len = v;
+/* Total length of all segments linked to this segment. */
+b->mbuf.pkt_len = v;
+}
 }
 
 static inline uint16_t
@@ -451,7 +491,27 @@ __packet_data(const struct dp_packet *b)
 static inline void
 __packet_set_data(struct dp_packet *b, uint16_t v)
 {
-b->mbuf.data_off = v;
+if (b->source == DPBUF_DPDK) {
+/* Moving data_off away from the first mbuf in the chain is not
+ * possible with DPBUF_DPDK dp_packets. */
+  

[ovs-dev] [PATCH v10 04/14] netdev-dpdk: Serialise non-pmds mbufs' alloc/free.

2018-09-28 Thread Tiago Lam
A new mutex, 'nonpmd_mp_mutex', has been introduced to serialise
allocation and free operations by non-pmd threads on a given mempool.

free_dpdk_buf() has been modified to make use of the introduced mutex.

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 lib/netdev-dpdk.c | 33 ++---
 1 file changed, 30 insertions(+), 3 deletions(-)

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index d786be3..3b11546 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -322,6 +322,16 @@ static struct ovs_mutex dpdk_mp_mutex OVS_ACQ_AFTER(dpdk_mutex)
 static struct ovs_list dpdk_mp_list OVS_GUARDED_BY(dpdk_mp_mutex)
 = OVS_LIST_INITIALIZER(&dpdk_mp_list);
 
+/* This mutex must be used by non-pmd threads when allocating or freeing
+ * mbufs through mempools, when outside of the `non_pmd_mutex` mutex, in struct
+ * dp_netdev.
+ * The reason, as pointed out in the "Known Issues" section of DPDK's EAL
+ * docs, is that the implementation the mempool is based on is not
+ * preemptible. Since non-pmd threads may end up not being pinned, one such
+ * thread could be preempted while another operates on the same mempool,
+ * which could lead to memory corruption. */
+static struct ovs_mutex nonpmd_mp_mutex = OVS_MUTEX_INITIALIZER;
+
 struct dpdk_mp {
  struct rte_mempool *mp;
  int mtu;
@@ -492,6 +502,8 @@ struct netdev_rxq_dpdk {
 dpdk_port_t port_id;
 };
 
+static bool dpdk_thread_is_pmd(void);
+
 static void netdev_dpdk_destruct(struct netdev *netdev);
 static void netdev_dpdk_vhost_destruct(struct netdev *netdev);
 
@@ -525,6 +537,12 @@ dpdk_buf_size(int mtu)
  NETDEV_DPDK_MBUF_ALIGN);
 }
 
+static bool
+dpdk_thread_is_pmd(void)
+{
+ return rte_lcore_id() != NON_PMD_CORE_ID;
+}
+
 /* Allocates an area of 'sz' bytes from DPDK.  The memory is zero'ed.
  *
  * Unlike xmalloc(), this function can return NULL on failure. */
@@ -535,11 +553,20 @@ dpdk_rte_mzalloc(size_t sz)
 }
 
 void
-free_dpdk_buf(struct dp_packet *p)
+free_dpdk_buf(struct dp_packet *packet)
 {
-struct rte_mbuf *pkt = (struct rte_mbuf *) p;
+/* If non-pmd, we need to serialise on nonpmd_mp_mutex. */
+if (!dpdk_thread_is_pmd()) {
+ovs_mutex_lock(&nonpmd_mp_mutex);
+
+rte_pktmbuf_free(&packet->mbuf);
+
+ovs_mutex_unlock(&nonpmd_mp_mutex);
+
+return;
+}

-rte_pktmbuf_free(pkt);
+rte_pktmbuf_free(&packet->mbuf);
 }
 
 static void
-- 
2.7.4



[ovs-dev] [PATCH v10 03/14] dp-packet: Fix allocated size on DPDK init.

2018-09-28 Thread Tiago Lam
When built with DPDK, OvS deals with two types of packets: the ones
coming from the mempool and the ones locally created by OvS - which are
copied to mempool mbufs before output. In the latter, the space is
allocated from the system, while in the former the mbufs are allocated
from a mempool, which takes care of initialising them appropriately.

In the current implementation, during mempool's initialisation of mbufs,
dp_packet_set_allocated() is called from dp_packet_init_dpdk() without
considering that the allocated space, in the case of multi-segment
mbufs, might be greater than a single mbuf.  Furthermore, given that
dp_packet_init_dpdk() is on the code path that's called upon mempool's
initialisation, a call to dp_packet_set_allocated() is redundant, since
mempool takes care of initialising it.

To fix this, dp_packet_set_allocated() is no longer called after
initialisation of a mempool, only in dp_packet_init__(), which is still
called by OvS when initialising locally created packets.

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 lib/dp-packet.c   | 3 +--
 lib/dp-packet.h   | 2 +-
 lib/netdev-dpdk.c | 2 +-
 3 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index 443c225..782e7c2 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -99,9 +99,8 @@ dp_packet_use_const(struct dp_packet *b, const void *data, 
size_t size)
  * buffer.  Here, non-transient ovs dp-packet fields are initialized for
  * packets that are part of a DPDK memory pool. */
 void
-dp_packet_init_dpdk(struct dp_packet *b, size_t allocated)
+dp_packet_init_dpdk(struct dp_packet *b)
 {
-dp_packet_set_allocated(b, allocated);
 b->source = DPBUF_DPDK;
 }
 
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index b948fe1..6376039 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -114,7 +114,7 @@ void dp_packet_use(struct dp_packet *, void *, size_t);
 void dp_packet_use_stub(struct dp_packet *, void *, size_t);
 void dp_packet_use_const(struct dp_packet *, const void *, size_t);
 
-void dp_packet_init_dpdk(struct dp_packet *, size_t allocated);
+void dp_packet_init_dpdk(struct dp_packet *);
 
 void dp_packet_init(struct dp_packet *, size_t);
 void dp_packet_uninit(struct dp_packet *);
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 1e19622..d786be3 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -550,7 +550,7 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp OVS_UNUSED,
 {
 struct rte_mbuf *pkt = _p;
 
-dp_packet_init_dpdk((struct dp_packet *) pkt, pkt->buf_len);
+dp_packet_init_dpdk((struct dp_packet *) pkt);
 }
 
 static int
-- 
2.7.4



[ovs-dev] [PATCH v10 02/14] dp-packet: Init specific mbuf fields.

2018-09-28 Thread Tiago Lam
From: Mark Kavanagh 

dp_packets are created using xmalloc(); in the case of OvS-DPDK, it's
possible that the resultant mbuf portion of the dp_packet contains
random data. For some mbuf fields, specifically those related to
multi-segment mbufs and/or offload features, random values may cause
unexpected behaviour, should the dp_packet's contents be later copied
to a DPDK mbuf. It is therefore critical that these fields be
initialized to 0.

This patch ensures that the following mbuf fields are initialized to
appropriate values on creation of a new dp_packet:
   - ol_flags=0
   - nb_segs=1
   - tx_offload=0
   - packet_type=0
   - next=NULL

Adapted from an idea by Michael Qiu :
https://patchwork.ozlabs.org/patch/777570/

Co-authored-by: Tiago Lam 

Signed-off-by: Mark Kavanagh 
Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 lib/dp-packet.h | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index ba91e58..b948fe1 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -625,14 +625,15 @@ dp_packet_mbuf_rss_flag_reset(struct dp_packet *p 
OVS_UNUSED)
 }
 
 /* This initialization is needed for packets that do not come
- * from DPDK interfaces, when vswitchd is built with --with-dpdk.
- * The DPDK rte library will still otherwise manage the mbuf.
- * We only need to initialize the mbuf ol_flags. */
+ * from DPDK interfaces, when vswitchd is built with --with-dpdk. */
 static inline void
 dp_packet_mbuf_init(struct dp_packet *p OVS_UNUSED)
 {
 #ifdef DPDK_NETDEV
-p->mbuf.ol_flags = 0;
+struct rte_mbuf *mbuf = &(p->mbuf);
+mbuf->ol_flags = mbuf->tx_offload = mbuf->packet_type = 0;
+mbuf->nb_segs = 1;
+mbuf->next = NULL;
 #endif
 }
 
-- 
2.7.4



[ovs-dev] [PATCH v10 01/14] netdev-dpdk: fix mbuf sizing

2018-09-28 Thread Tiago Lam
From: Mark Kavanagh 

There are numerous factors that must be considered when calculating
the size of an mbuf:
- the data portion of the mbuf must be sized in accordance with Rx
  buffer alignment (typically 1024B). So, for example, in order to
  successfully receive and capture a 1500B packet, mbufs with a
  data portion of size 2048B must be used.
- in OvS, the elements that comprise an mbuf are:
  * the dp_packet, which includes a struct rte_mbuf (704B)
  * RTE_PKTMBUF_HEADROOM (128B)
  * packet data (aligned to 1k, as previously described)
  * RTE_PKTMBUF_TAILROOM (typically 0)

Some PMDs require that the total mbuf size (i.e. the total sum of all
of the above-listed components' lengths) is cache-aligned. To satisfy
this requirement, it may be necessary to round up the total mbuf size
with respect to cacheline size. In doing so, it's possible that the
dp_packet's data portion is inadvertently increased in size, such that
it no longer adheres to Rx buffer alignment. Consequently, the
following property of the mbuf no longer holds true:

mbuf.data_len == mbuf.buf_len - mbuf.data_off

This creates a problem in the case of multi-segment mbufs, where that
property is assumed to hold for all but the final segment in an
mbuf chain. Resolve this issue by adjusting the size of the mbuf's
private data portion, as opposed to the packet data portion when
aligning mbuf size to cachelines.

Fixes: 4be4d22 ("netdev-dpdk: clean up mbuf initialization")
Fixes: 31b88c9 ("netdev-dpdk: round up mbuf_size to cache_line_size")
CC: Santosh Shukla 
Signed-off-by: Mark Kavanagh 
Acked-by: Santosh Shukla 
Acked-by: Eelco Chaudron 
---
 lib/netdev-dpdk.c | 56 +--
 1 file changed, 38 insertions(+), 18 deletions(-)

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index f91aa27..1e19622 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -88,10 +88,6 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 
20);
 #define MTU_TO_MAX_FRAME_LEN(mtu)   ((mtu) + ETHER_HDR_MAX_LEN)
 #define FRAME_LEN_TO_MTU(frame_len) ((frame_len)\
  - ETHER_HDR_LEN - ETHER_CRC_LEN)
-#define MBUF_SIZE(mtu)  ROUND_UP((MTU_TO_MAX_FRAME_LEN(mtu) \
- + sizeof(struct dp_packet) \
- + RTE_PKTMBUF_HEADROOM),   \
- RTE_CACHE_LINE_SIZE)
 #define NETDEV_DPDK_MBUF_ALIGN  1024
 #define NETDEV_DPDK_MAX_PKT_LEN 9728
 
@@ -637,7 +633,11 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool 
per_port_mp)
 char mp_name[RTE_MEMPOOL_NAMESIZE];
 const char *netdev_name = netdev_get_name(&dev->up);
 int socket_id = dev->requested_socket_id;
-uint32_t n_mbufs;
+uint32_t n_mbufs = 0;
+uint32_t mbuf_size = 0;
+uint32_t aligned_mbuf_size = 0;
+uint32_t mbuf_priv_data_len = 0;
+uint32_t pkt_size = 0;
 uint32_t hash = hash_string(netdev_name, 0);
 struct dpdk_mp *dmp = NULL;
 int ret;
@@ -650,6 +650,9 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool 
per_port_mp)
 dmp->mtu = mtu;
 dmp->refcount = 1;
 
+/* Get the size of each mbuf, based on the MTU */
+mbuf_size = dpdk_buf_size(dev->requested_mtu);
+
 n_mbufs = dpdk_calculate_mbufs(dev, mtu, per_port_mp);
 
 do {
@@ -661,8 +664,8 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool 
per_port_mp)
  * so this is not an issue for tasks such as debugging.
  */
 ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
-   "ovs%08x%02d%05d%07u",
-   hash, socket_id, mtu, n_mbufs);
+   "ovs%08x%02d%05d%07u",
+hash, socket_id, mtu, n_mbufs);
 if (ret < 0 || ret >= RTE_MEMPOOL_NAMESIZE) {
 VLOG_DBG("snprintf returned %d. "
  "Failed to generate a mempool name for \"%s\". "
@@ -671,17 +674,34 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool 
per_port_mp)
 break;
 }
 
-VLOG_DBG("Port %s: Requesting a mempool of %u mbufs "
-  "on socket %d for %d Rx and %d Tx queues.",
-  netdev_name, n_mbufs, socket_id,
-  dev->requested_n_rxq, dev->requested_n_txq);
-
-dmp->mp = rte_pktmbuf_pool_create(mp_name, n_mbufs,
-  MP_CACHE_SZ,
-  sizeof (struct dp_packet)
-  - sizeof (struct rte_mbuf),
-  MBUF_SIZE(mtu)
-  - sizeof(struct dp_packet),
+VLOG_DBG("Port %s: Requesting a mempool of %u mbufs of size %u "
+  "on socket %d for %d Rx and %d Tx queues, "
+  "cache line size of %u",
+  netdev_name, n_mbufs, 

[ovs-dev] [PATCH v10 00/14] Support multi-segment mbufs

2018-09-28 Thread Tiago Lam
h a single or
   multiple mbufs, from an instantiated mempool and performing several
   operations on those, verifying if the data at the end matches what's
   expected;
 - Fix bugs found by the new tests:
- dp_packet_shift() wasn't taking into account left shifts;
- dp_packet_resize__() was misusing and miscalculating the tailrooms
  and headrooms, ending up calculating the wrong number of segments
  that needed allocation;
- An mbuf's end was being miscalculated in dp_packet_tail(),
  dp_packet_mbuf_tail() and dp_packet_end();
- dp_packet_set_size() was not updating the number of chained segments
  'nb_segs';
 - Add support for multi-seg mbufs in dp_packet_clear().

v6:  - Rebase on master 7c0cb29 ("conntrack-tcp: Handle tcp session
   reuse.");
 - Further improve dp_packet_put_uninit() and dp_packet_shift() to
   support multi-seg mbufs;
 - Add support for multi-seg mbufs in dp_packet_l4_size() and
   improve other helper funcs, such as dp_packet_tail() and
   dp_packet_tailroom().
 - Add support for multi-seg mbufs in dp_packet_put(),
   dp_packet_put_zeros(), as well as dp_packet_resize__() - allocating new
   mbufs and linking them together;
 Restructured patchset:
 - Squash patch 5 into patch 6, since they were both related to
   copying data while handling multi-seg mbufs;
 - Split patch 4 into two separate patches - one that introduces the
   changes in helper functions to deal with multi-seg mbufs and
   two others for the shift() and put_uninit() functionality;
 - Move patch 4 to before patch 3, so that the helper functions come
   before the functionality improvements that rely on them.

v5: - Rebased on master e5e22dc ("datapath-windows: Prevent ct-counters
  from getting redundantly incremented");
- Sugesh's comments have been addressed:
  - Changed dp_packet_set_data() and dp_packet_set_size() logic to
make them independent of each other;
  - Dropped patch 3 now that dp_packet_set_data() and
    dp_packet_set_size() are independent;
  - dp_packet_clone_with_headroom() now has split functions for
handling DPDK sourced packets and non-DPDK packets;
- Modified various functions in dp-packet.h to account for multi-seg
  mbufs - dp_packet_put_uninit(), dp_packet_tail(), dp_packet_tailroom()
  and dp_packet_at();
- Added support for shifting packet data in multi-seg mbufs, using
  dp_packet_shift();
- Fixed some minor inconsistencies.

Note that some of the changes in v5 have been contributed by Mark
Kavanagh as well.

v4: - restructure patchset
- account for 128B ARM cacheline when sizing mbufs

Mark Kavanagh (4):
  netdev-dpdk: fix mbuf sizing
  dp-packet: Init specific mbuf fields.
  netdev-dpdk: copy large packet to multi-seg. mbufs
  netdev-dpdk: support multi-segment jumbo frames

Michael Qiu (1):
  dp-packet: copy data from multi-seg. DPDK mbuf

Tiago Lam (9):
  dp-packet: Fix allocated size on DPDK init.
  netdev-dpdk: Serialise non-pmds mbufs' alloc/free.
  dp-packet: Fix data_len handling multi-seg mbufs.
  dp-packet: Handle multi-seg mbufs in helper funcs.
  dp-packet: Handle multi-seg mbufs in shift() func.
  dp-packet: Add support for data "linearization".
  dpdk-tests: Add unit-tests for multi-seg mbufs.
  dpdk-tests: Accept other configs in OVS_DPDK_START
  dpdk-tests: End-to-end tests for multi-seg mbufs.

 Documentation/topics/dpdk/jumbo-frames.rst |  73 
 Documentation/topics/dpdk/memory.rst   |  36 ++
 NEWS   |   1 +
 lib/bfd.c  |   3 +-
 lib/cfm.c  |   5 +-
 lib/conntrack-icmp.c   |   4 +-
 lib/conntrack-private.h|   4 +-
 lib/conntrack-tcp.c|   6 +-
 lib/conntrack.c| 109 +++--
 lib/dp-packet.c| 192 -
 lib/dp-packet.h| 407 +++
 lib/dpdk.c |   8 +
 lib/dpif-netdev.c  |   5 +
 lib/dpif-netlink.c |   5 +
 lib/dpif.c |   9 +
 lib/flow.c |  29 +-
 lib/lacp.c |   3 +-
 lib/mcast-snooping.c   |   8 +-
 lib/netdev-bsd.c   |   5 +
 lib/netdev-dpdk.c  | 243 +--
 lib/netdev-dpdk.h  |   1 +
 lib/netdev-dummy.c |  13 +-
 lib/netdev-linux.c |  13 +-
 lib/netdev-native-tnl.c|  39 +-
 lib/odp-execute.c  |  28 +-
 lib/ofp-print.c|  10 +-
 lib/ovs-lldp.c |

[ovs-dev] [PATCH] system-traffic: Fix conntrack per zone limit test.

2018-09-06 Thread Tiago Lam
Commit 3f1087c added a per zone limit test which relied on the
CHECK_CT_DPIF_FLUSH_BY_CT_TUPLE m4 macro to skip the test when executing
in a userspace datapath (since the per zone limit feature is not yet
implemented in userspace). That macro, however, has been removed in
commit 271e48a ("conntrack: Support conntrack flush by ct 5-tuple")
which was causing the test to fail when executing in userspace.

Instead, a new m4 macro, CHECK_CT_DPIF_PER_ZONE_LIMIT, is introduced to
make the same differentiation, for as long as the userspace datapath
does not support the per zone limit.

CC: Yi-Hung Wei 
Fixes: 3f1087c ("system-traffic: Add conntrack per zone limit test case")
Signed-off-by: Tiago Lam 
---
 tests/system-kmod-macros.at  | 7 +++
 tests/system-traffic.at  | 2 +-
 tests/system-userspace-macros.at | 9 +
 3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/tests/system-kmod-macros.at b/tests/system-kmod-macros.at
index 770703b..3296d64 100644
--- a/tests/system-kmod-macros.at
+++ b/tests/system-kmod-macros.at
@@ -114,6 +114,13 @@ m4_define([CHECK_CONNTRACK_FRAG_IPV6_MULT_EXTEN],
 #
 m4_define([CHECK_CONNTRACK_NAT])
 
+# CHECK_CT_DPIF_PER_ZONE_LIMIT()
+#
+# Perform requirements checks for running ovs-dpctl ct-[set|get|del]-limits per
+# zone. The kernel datapath does support this feature. This check will be
+# removed once both the kernel and userspace datapaths support it.
+m4_define([CHECK_CT_DPIF_PER_ZONE_LIMIT])
+
 # CHECK_CT_DPIF_SET_GET_MAXCONNS()
 #
 # Perform requirements checks for running ovs-dpctl ct-set-maxconns or
diff --git a/tests/system-traffic.at b/tests/system-traffic.at
index 2a33e9f..19cf89a 100644
--- a/tests/system-traffic.at
+++ b/tests/system-traffic.at
@@ -3189,7 +3189,7 @@ AT_CLEANUP
 
 AT_SETUP([conntrack - limit by zone])
 CHECK_CONNTRACK()
-CHECK_CT_DPIF_FLUSH_BY_CT_TUPLE()
+CHECK_CT_DPIF_PER_ZONE_LIMIT()
 OVS_TRAFFIC_VSWITCHD_START()
 
 ADD_NAMESPACES(at_ns0, at_ns1)
diff --git a/tests/system-userspace-macros.at b/tests/system-userspace-macros.at
index 968a95e..27bde8b 100644
--- a/tests/system-userspace-macros.at
+++ b/tests/system-userspace-macros.at
@@ -116,6 +116,15 @@ m4_define([CHECK_CONNTRACK_FRAG_IPV6_MULT_EXTEN],
 #
 m4_define([CHECK_CONNTRACK_NAT])
 
+# CHECK_CT_DPIF_PER_ZONE_LIMIT()
+#
+# Perform requirements checks for running ovs-dpctl ct-[set|get|del]-limits per
+# zone. The userspace datapath does not support this feature yet.
+m4_define([CHECK_CT_DPIF_PER_ZONE_LIMIT],
+[
+AT_SKIP_IF([:])
+])
+
 # CHECK_CT_DPIF_SET_GET_MAXCONNS()
 #
 # Perform requirements checks for running ovs-dpctl ct-set-maxconns or
-- 
2.7.4



[ovs-dev] [PATCH v9 14/14] dpdk-tests: End-to-end tests for multi-seg mbufs.

2018-08-24 Thread Tiago Lam
The following tests are added to the DPDK testsuite to add some
coverage for the multi-segment mbufs:
- Check that multi-segment mbufs are disabled by default;
- Check that providing `other_config:dpdk-multi-seg-mbufs=true` indeed
  enables multi-segment mbufs;
- Using a DPDK port, send a random packet out and check that `ofctl
  dump-flows` shows the correct amount of packets and bytes sent.

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 tests/system-dpdk.at | 65 
 1 file changed, 65 insertions(+)

diff --git a/tests/system-dpdk.at b/tests/system-dpdk.at
index 3d21b01..af8de8c 100644
--- a/tests/system-dpdk.at
+++ b/tests/system-dpdk.at
@@ -71,3 +71,68 @@ OVS_VSWITCHD_STOP("/does not exist. The Open vSwitch kernel 
module is probably n
 ")
 AT_CLEANUP
 dnl --
+
+AT_SETUP([Jumbo frames - Multi-segment disabled by default])
+OVS_DPDK_START()
+
+AT_CHECK([grep "multi-segment mbufs enabled" ovs-vswitchd.log], [1], [])
+OVS_VSWITCHD_STOP("/Global register is changed during/d
+/EAL: No free hugepages reported in hugepages-1048576kB/d
+")
+AT_CLEANUP
+
+AT_SETUP([Jumbo frames - Multi-segment enabled])
+OVS_DPDK_START([dpdk-multi-seg-mbufs=true])
+AT_CHECK([grep "multi-segment mbufs enabled" ovs-vswitchd.log], [], [stdout])
+OVS_VSWITCHD_STOP("/Global register is changed during/d
+/EAL: No free hugepages reported in hugepages-1048576kB/d
+")
+AT_CLEANUP
+
+AT_SETUP([Jumbo frames - Multi-segment mbufs Tx])
+OVS_DPDK_PRE_CHECK()
+OVS_DPDK_START([per-port-memory=true dpdk-multi-seg-mbufs=true])
+
+dnl Add userspace bridge and attach it to OVS
+AT_CHECK([ovs-vsctl add-br br10 -- set bridge br10 datapath_type=netdev])
+AT_CHECK([ovs-vsctl add-port br10 dpdk0 \
+-- set Interface dpdk0 type=dpdk options:dpdk-devargs=$(cat PCI_ADDR) \
+-- set Interface dpdk0 mtu_request=9000], [], [stdout], [stderr])
+
+AT_CHECK([ovs-vsctl show], [], [stdout])
+
+dnl Add flows to send packets out from the 'dpdk0' port
+AT_CHECK([
+ovs-ofctl del-flows br10
+ovs-ofctl add-flow br10 in_port=LOCAL,actions=output:dpdk0
+], [], [stdout])
+
+AT_CHECK([ovs-ofctl dump-flows br10], [], [stdout])
+
+dnl Send packet out, of the 'dpdk0' port
+AT_CHECK([
+ARP_HEADER="09000B0009000A000806000108000604000100010A\
+0100020A02"
+dnl Build a random hex string to append to the ARP_HEADER
+RANDOM_BODY=$(printf '0102030405%.0s' {1..1750})
+dnl 8792B ARP packet
+RANDOM_ARP="$ARP_HEADER$RANDOM_BODY"
+
+ovs-ofctl packet-out br10 "packet=$RANDOM_ARP,action=resubmit:LOCAL"
+], [], [stdout])
+
+AT_CHECK([ovs-ofctl dump-flows br10], [0], [stdout])
+
+dnl Confirm the single packet has been sent with the correct size
+AT_CHECK([ovs-ofctl dump-flows br10 | ofctl_strip | grep in_port], [0], [dnl
+ n_packets=1, n_bytes=8792, in_port=LOCAL actions=output:1
+])
+
+dnl Clean up
+OVS_VSWITCHD_STOP("/does not exist. The Open vSwitch kernel module is probably 
not loaded./d
+/Failed to enable flow control/d
+/failed to connect to \/tmp\/dpdkvhostclient0: No such file or directory/d
+/Global register is changed during/d
+/EAL: No free hugepages reported in hugepages-1048576kB/d
+")
+AT_CLEANUP
-- 
2.7.4



[ovs-dev] [PATCH v9 13/14] dpdk-tests: Accept other configs in OVS_DPDK_START

2018-08-24 Thread Tiago Lam
As it stands, OVS_DPDK_START() won't allow other configs to be set
before starting the ovs-vswitchd daemon. This is a problem since some
configs, such as the "dpdk-multi-seg-mbufs=true" for enabling the
multi-segment mbufs, need to be set prior to starting OvS.

To support other options, OVS_DPDK_START() has been modified to accept
extra configs in the form "$config_name=$config_value". It then uses
ovs-vsctl to set the configs.

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 tests/system-dpdk-macros.at | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/tests/system-dpdk-macros.at b/tests/system-dpdk-macros.at
index 0762ee0..7c65834 100644
--- a/tests/system-dpdk-macros.at
+++ b/tests/system-dpdk-macros.at
@@ -21,7 +21,7 @@ m4_define([OVS_DPDK_PRE_CHECK],
 ])
 
 
-# OVS_DPDK_START()
+# OVS_DPDK_START([other-conf-args])
 #
 # Create an empty database and start ovsdb-server. Add special configuration
 # dpdk-init to enable DPDK functionality. Start ovs-vswitchd connected to that
@@ -48,6 +48,10 @@ m4_define([OVS_DPDK_START],
AT_CHECK([lscpu], [], [stdout])
AT_CHECK([cat stdout | grep "NUMA node(s)" | awk '{c=1; while (c++<$(3)) 
{printf "1024,"}; print "1024"}' > SOCKET_MEM])
AT_CHECK([ovs-vsctl --no-wait set Open_vSwitch . 
other_config:dpdk-socket-mem="$(cat SOCKET_MEM)"])
+   dnl Iterate through $other-conf-args list and include them
+   m4_foreach_w(opt, $1, [
+   AT_CHECK([ovs-vsctl --no-wait set Open_vSwitch . other_config:opt])
+   ])
 
dnl Start ovs-vswitchd.
AT_CHECK([ovs-vswitchd --detach --no-chdir --pidfile --log-file -vvconn 
-vofproto_dpif -vunixctl], [0], [stdout], [stderr])
-- 
2.7.4



[ovs-dev] [PATCH v9 12/14] dpdk-tests: Add unit-tests for multi-seg mbufs.

2018-08-24 Thread Tiago Lam
In order to create a minimal environment that allows the tests to get
mbufs from an existing mempool, the following approach is taken:
- EAL is initialised (by using the main dpdk_init()) and a (very) small
  mempool is instantiated (mimicking the logic in dpdk_mp_create()).
  This mempool instance is global and used by all the tests;
- Packets are then allocated from the instantiated mempool, and tested
  on, by running some operations on them and manipulating data.

The tests introduced focus on testing DPDK dp_packets (where
source=DPBUF_DPDK), linked with a single or multiple mbufs, across
several operations, such as:
- dp_packet_put();
- dp_packet_shift();
- dp_packet_reserve();
- dp_packet_push_uninit();
- dp_packet_clear();
- dp_packet_equal();
- dp_packet_linear_data();
- And as a consequence of some of these, dp_packet_put_uninit() and
  dp_packet_resize__().

Finally, this has also been integrated with the new DPDK testsuite.
Thus, when running `$sudo make check-dpdk` one will also be running
these tests.

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 tests/automake.mk  |  10 +-
 tests/dpdk-packet-mbufs.at |   7 +
 tests/system-dpdk-testsuite.at |   1 +
 tests/test-dpdk-mbufs.c| 619 +
 4 files changed, 636 insertions(+), 1 deletion(-)
 create mode 100644 tests/dpdk-packet-mbufs.at
 create mode 100644 tests/test-dpdk-mbufs.c

diff --git a/tests/automake.mk b/tests/automake.mk
index 49ceb41..f484f69 100644
--- a/tests/automake.mk
+++ b/tests/automake.mk
@@ -135,7 +135,8 @@ SYSTEM_DPDK_TESTSUITE_AT = \
tests/system-common-macros.at \
tests/system-dpdk-macros.at \
tests/system-dpdk-testsuite.at \
-   tests/system-dpdk.at
+   tests/system-dpdk.at \
+   tests/dpdk-packet-mbufs.at
 
 check_SCRIPTS += tests/atlocal
 
@@ -392,6 +393,10 @@ tests_ovstest_SOURCES = \
tests/test-vconn.c \
tests/test-aa.c \
tests/test-stopwatch.c
+if DPDK_NETDEV
+tests_ovstest_SOURCES += \
+   tests/test-dpdk-mbufs.c
+endif
 
 if !WIN32
 tests_ovstest_SOURCES += \
@@ -404,6 +409,9 @@ tests_ovstest_SOURCES += \
 endif
 
 tests_ovstest_LDADD = lib/libopenvswitch.la ovn/lib/libovn.la
+if DPDK_NETDEV
+tests_ovstest_LDFLAGS = $(AM_LDFLAGS) $(DPDK_vswitchd_LDFLAGS)
+endif
 
 noinst_PROGRAMS += tests/test-strtok_r
 tests_test_strtok_r_SOURCES = tests/test-strtok_r.c
diff --git a/tests/dpdk-packet-mbufs.at b/tests/dpdk-packet-mbufs.at
new file mode 100644
index 000..f28e4fc
--- /dev/null
+++ b/tests/dpdk-packet-mbufs.at
@@ -0,0 +1,7 @@
+AT_BANNER([OVS-DPDK dp_packet unit tests])
+
+AT_SETUP([OVS-DPDK dp_packet - mbufs allocation])
+AT_KEYWORDS([dp_packet, multi-seg, mbufs])
+AT_CHECK(ovstest test-dpdk-packet, [], [ignore], [ignore])
+
+AT_CLEANUP
diff --git a/tests/system-dpdk-testsuite.at b/tests/system-dpdk-testsuite.at
index 382f09e..f5edf58 100644
--- a/tests/system-dpdk-testsuite.at
+++ b/tests/system-dpdk-testsuite.at
@@ -23,3 +23,4 @@ m4_include([tests/system-common-macros.at])
 m4_include([tests/system-dpdk-macros.at])
 
 m4_include([tests/system-dpdk.at])
+m4_include([tests/dpdk-packet-mbufs.at])
diff --git a/tests/test-dpdk-mbufs.c b/tests/test-dpdk-mbufs.c
new file mode 100644
index 000..19081a3
--- /dev/null
+++ b/tests/test-dpdk-mbufs.c
@@ -0,0 +1,619 @@
+/*
+ * Copyright (c) 2018 Intel Corporation
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "dp-packet.h"
+#include "ovstest.h"
+#include "dpdk.h"
+#include "smap.h"
+
+#define N_MBUFS 1024
+#define MBUF_DATA_LEN 2048
+
+static int num_tests = 0;
+
+/* Global var to hold a mempool instance, "test-mp", used in all of the tests
+ * below. This instance is instantiated in dpdk_setup_eal_with_mp(). */
+static struct rte_mempool *mp;
+
+/* Test data used to fill the packets with data. Note that this isn't a string
+ * that repsents a valid packet, by any means. The pattern is generated in set_
+ * testing_pattern_str() and the sole purpose is to verify the data remains the
+ * same after inserting and operating on multi-segment mbufs. */
+static char *test_str;
+
+/* Asserts a dp_packet that holds a single mbuf, where:
+ * - nb_segs must be 1;
+ * - pkt_len must be equal to data_len which in turn must equal the provided
+ *   'pkt_len';
+ * - d

[ovs-dev] [PATCH v9 11/14] netdev-dpdk: support multi-segment jumbo frames

2018-08-24 Thread Tiago Lam
From: Mark Kavanagh 

Currently, jumbo frame support for OvS-DPDK is implemented by
increasing the size of mbufs within a mempool, such that each mbuf
within the pool is large enough to contain an entire jumbo frame of
a user-defined size. Typically, for each user-defined MTU,
'requested_mtu', a new mempool is created, containing mbufs of size
~requested_mtu.

With the multi-segment approach, a port uses a single mempool,
(containing standard/default-sized mbufs of ~2k bytes), irrespective
of the user-requested MTU value. To accommodate jumbo frames, mbufs
are chained together, where each mbuf in the chain stores a portion of
the jumbo frame. Each mbuf in the chain is termed a segment, hence the
name.

== Enabling multi-segment mbufs ==
Multi-segment and single-segment mbufs are mutually exclusive, and the
user must decide on which approach to adopt on init. The introduction
of a new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this. This
is a global boolean value, which determines how jumbo frames are
represented across all DPDK ports. In the absence of a user-supplied
value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment
mbufs must be explicitly enabled / single-segment mbufs remain the
default.

Setting the field is identical to setting existing DPDK-specific OVSDB
fields:

ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true

Co-authored-by: Tiago Lam 

Signed-off-by: Mark Kavanagh 
Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 Documentation/topics/dpdk/jumbo-frames.rst | 67 ++
 Documentation/topics/dpdk/memory.rst   | 36 
 NEWS   |  1 +
 lib/dpdk.c |  8 
 lib/netdev-dpdk.c  | 66 +
 lib/netdev-dpdk.h  |  2 +
 vswitchd/vswitch.xml   | 22 ++
 7 files changed, 194 insertions(+), 8 deletions(-)

diff --git a/Documentation/topics/dpdk/jumbo-frames.rst 
b/Documentation/topics/dpdk/jumbo-frames.rst
index 00360b4..07bf3ca 100644
--- a/Documentation/topics/dpdk/jumbo-frames.rst
+++ b/Documentation/topics/dpdk/jumbo-frames.rst
@@ -71,3 +71,70 @@ Jumbo frame support has been validated against 9728B frames, 
which is the
 largest frame size supported by Fortville NIC using the DPDK i40e driver, but
 larger frames and other DPDK NIC drivers may be supported. These cases are
 common for use cases involving East-West traffic only.
+
+-------------------
+Multi-segment mbufs
+-------------------
+
+Instead of increasing the size of mbufs within a mempool, such that each mbuf
+within the pool is large enough to contain an entire jumbo frame of a
+user-defined size, mbufs can be chained together. In this approach each
+mbuf in the chain stores a portion of the jumbo frame, by default ~2K bytes,
+irrespective of the user-requested MTU value. Since each mbuf in the chain is
+termed a segment, this approach is named "multi-segment mbufs".
+
+This approach may bring more flexibility in use cases where the maximum packet
+length may be hard to guess. For example, in cases where packets originate from
+sources marked for offload (such as TSO), each packet may be larger than the
+MTU, and as such, when forwarding it to a DPDK port a single mbuf may not be
+enough to hold all of the packet's data.
+
+Multi-segment and single-segment mbufs are mutually exclusive, and the user
+must decide on which approach to adopt on initialisation. If multi-segment
+mbufs is to be enabled, it can be done so with the following command::
+
+$ ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
+
+Single-segment mbufs still remain the default when using OvS-DPDK, and the
+above option `dpdk-multi-seg-mbufs` must be explicitly set to `true` if
+multi-segment mbufs are to be used.
+
+~~~~~~~~~~~~~~~~~
+Performance notes
+~~~~~~~~~~~~~~~~~
+
+When using multi-segment mbufs some PMDs may not support vectorized Tx
+functions, due to the non-contiguous nature of the data. As a result this can
+hurt performance for smaller packet sizes. For example, on a setup sending 64B
+packets at line rate, a decrease of ~20% has been observed. The performance
+impact stops being noticeable for larger packet sizes, although the exact size
+will vary between PMDs, and depend on the architecture one's using.
+
+Tests performed with the i40e PMD driver only showed this limitation for 64B
+packets, and the same rate was observed when comparing multi-segment mbufs and
+single-segment mbufs for 128B packets. In other words, the 20% drop in
+performance was not observed for packets >= 128B during this test case.
+
+Because of this, multi-segment mbufs are not advised to be used with

[ovs-dev] [PATCH v9 10/14] netdev-dpdk: copy large packet to multi-seg. mbufs

2018-08-24 Thread Tiago Lam
From: Mark Kavanagh 

Currently, packets are only copied to a single segment in the function
dpdk_do_tx_copy(). This could be an issue in the case of jumbo frames,
particularly when multi-segment mbufs are involved.

This patch calculates the number of segments needed by a packet and
copies the data to each segment.

A new function, dpdk_buf_alloc(), has also been introduced as a wrapper
around the nonpmd_mp_mutex to serialise allocations from a non-pmd
context.

Co-authored-by: Michael Qiu 
Co-authored-by: Tiago Lam 

Signed-off-by: Mark Kavanagh 
Signed-off-by: Michael Qiu 
Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 lib/netdev-dpdk.c | 84 +--
 1 file changed, 75 insertions(+), 9 deletions(-)

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index e005d00..5ab1af3 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -552,6 +552,27 @@ dpdk_rte_mzalloc(size_t sz)
 return rte_zmalloc(OVS_VPORT_DPDK, sz, OVS_CACHE_LINE_SIZE);
 }
 
+static struct rte_mbuf *
+dpdk_buf_alloc(struct rte_mempool *mp)
+{
+struct rte_mbuf *mbuf = NULL;
+
+/* If non-pmd, we need to serialise on nonpmd_mp_mutex. */
+if (!dpdk_thread_is_pmd()) {
+ovs_mutex_lock(&nonpmd_mp_mutex);
+
+mbuf = rte_pktmbuf_alloc(mp);
+
+ovs_mutex_unlock(&nonpmd_mp_mutex);
+
+return mbuf;
+}
+
+mbuf = rte_pktmbuf_alloc(mp);
+
+return mbuf;
+}
+
 void
 free_dpdk_buf(struct dp_packet *packet)
 {
@@ -2316,6 +2337,49 @@ out:
 }
 }
 
+static int
+dpdk_copy_dp_packet_to_mbuf(struct dp_packet *packet, struct rte_mbuf **head,
+struct rte_mempool *mp)
+{
+struct rte_mbuf *mbuf, *fmbuf;
+uint32_t size = dp_packet_size(packet);
+uint16_t max_data_len;
+uint32_t nb_segs = 0;
+
+/* Allocate first mbuf to know the size of data available */
+fmbuf = mbuf = *head = dpdk_buf_alloc(mp);
+if (OVS_UNLIKELY(!mbuf)) {
+return ENOMEM;
+}
+
+/* All newly allocated mbufs have the same max data length. */
+max_data_len = mbuf->buf_len - mbuf->data_off;
+
+/* Calculate # of output mbufs. */
+nb_segs = size / max_data_len;
+if (size % max_data_len) {
+nb_segs = nb_segs + 1;
+}
+
+/* Allocate additional mbufs, less the one already allocated above */
+for (int i = 1; i < nb_segs; i++) {
+mbuf->next = dpdk_buf_alloc(mp);
+if (!mbuf->next) {
+free_dpdk_buf(CONTAINER_OF(fmbuf, struct dp_packet, mbuf));
+fmbuf = NULL;
+return ENOMEM;
+}
+mbuf = mbuf->next;
+}
+
+fmbuf->nb_segs = nb_segs;
+fmbuf->pkt_len = size;
+
+dp_packet_mbuf_write(fmbuf, 0, size, dp_packet_linear_data(packet));
+
+return 0;
+}
+
 /* Tx function. Transmit packets indefinitely */
 static void
 dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
@@ -2332,6 +2396,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct 
dp_packet_batch *batch)
 struct rte_mbuf *pkts[PKT_ARRAY_SIZE];
 uint32_t cnt = batch_cnt;
 uint32_t dropped = 0;
+uint32_t i;
 
 if (dev->type != DPDK_DEV_VHOST) {
 /* Check if QoS has been configured for this netdev. */
@@ -2342,28 +2407,29 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct 
dp_packet_batch *batch)
 
 uint32_t txcnt = 0;
 
-for (uint32_t i = 0; i < cnt; i++) {
+for (i = 0; i < cnt; i++) {
 struct dp_packet *packet = batch->packets[i];
 uint32_t size = dp_packet_size(packet);
+int err = 0;
 
 if (OVS_UNLIKELY(size > dev->max_packet_len)) {
 VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d",
  size, dev->max_packet_len);
-
 dropped++;
 continue;
 }
 
-pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
-if (OVS_UNLIKELY(!pkts[txcnt])) {
+err = dpdk_copy_dp_packet_to_mbuf(packet, &pkts[txcnt],
+  dev->dpdk_mp->mp);
+if (err != 0) {
+if (err == ENOMEM) {
+VLOG_ERR_RL(&rl, "Failed to alloc mbufs! %u packets dropped",
+cnt - i);
+}
+
 dropped += cnt - i;
 break;
 }
-
-/* We have to do a copy for now */
-memcpy(rte_pktmbuf_mtod(pkts[txcnt], void *),
-   dp_packet_data(packet), size);
-dp_packet_set_size((struct dp_packet *)pkts[txcnt], size);
 dp_packet_copy_mbuf_flags((struct dp_packet *)pkts[txcnt], packet);
 
 txcnt++;
-- 
2.7.4



[ovs-dev] [PATCH v9 09/14] dp-packet: Add support for data "linearization".

2018-08-24 Thread Tiago Lam
Previous commits have added support to the dp_packet API to handle
multi-segmented packets, where data is not stored contiguously in
memory. However, in some cases, it is inevitable and data must be
provided contiguously. Examples of such cases are when performing csums
over the entire packet data, or when write()'ing to a file descriptor
(for a tap interface, for example). For such cases, the dp_packet API
has been extended to provide a way to transform a multi-segmented
DPBUF_DPDK packet into a DPBUF_MALLOC system packet (at the expense of a
copy of memory). If the packet's data is already stored in memory
contigously then there's no need to convert the packet.

Additionally, the main use cases that were assuming that a dp_packet's
data is always held contiguously in memory were changed to make use of
the new "linear functions" in the dp_packet API when there's a need to
traverse the entire's packet data. Per the example above, when the
packet's data needs to be write() to the tap's file descriptor, or when
the conntrack module needs to verify a packet's checksum, the data is
now linearized.

Signed-off-by: Tiago Lam 
---
 lib/bfd.c |  3 +-
 lib/conntrack.c   | 17 +
 lib/dp-packet.c   | 18 +
 lib/dp-packet.h   | 89 +++
 lib/dpif-netlink.c|  2 +-
 lib/dpif.c|  2 +-
 lib/netdev-bsd.c  |  2 +-
 lib/netdev-dummy.c|  5 ++-
 lib/netdev-linux.c|  5 ++-
 lib/netdev-native-tnl.c   | 10 -
 lib/odp-execute.c |  2 +-
 lib/ofp-print.c   |  2 +-
 lib/ovs-lldp.c|  3 +-
 lib/packets.c |  3 +-
 ofproto/ofproto-dpif-sflow.c  |  2 +-
 ofproto/ofproto-dpif-upcall.c |  2 +-
 ofproto/ofproto-dpif-xlate.c  | 12 --
 17 files changed, 145 insertions(+), 34 deletions(-)

diff --git a/lib/bfd.c b/lib/bfd.c
index 5308262..d50d2da 100644
--- a/lib/bfd.c
+++ b/lib/bfd.c
@@ -722,7 +722,8 @@ bfd_process_packet(struct bfd *bfd, const struct flow *flow,
 if (!msg) {
 VLOG_INFO_RL(&rl, "%s: Received too-short BFD control message (only "
  "%"PRIdPTR" bytes long, at least %d required).",
- bfd->name, (uint8_t *) dp_packet_tail(p) - l7,
+ bfd->name, dp_packet_size(p) -
+ (l7 - (uint8_t *) dp_packet_data(p)),
  BFD_PACKET_LEN);
 goto out;
 }
diff --git a/lib/conntrack.c b/lib/conntrack.c
index 974f985..15d1ed2 100644
--- a/lib/conntrack.c
+++ b/lib/conntrack.c
@@ -636,6 +636,8 @@ reverse_pat_packet(struct dp_packet *pkt, const struct conn 
*conn)
 static void
 reverse_nat_packet(struct dp_packet *pkt, const struct conn *conn)
 {
+void *l3 = dp_packet_linear_ofs(pkt, pkt->l3_ofs);
+void *l4 = dp_packet_linear_ofs(pkt, pkt->l4_ofs);
 char *tail = dp_packet_tail(pkt);
 char pad = dp_packet_l2_pad_size(pkt);
 struct conn_key inner_key;
@@ -644,8 +646,8 @@ reverse_nat_packet(struct dp_packet *pkt, const struct conn 
*conn)
 uint16_t orig_l4_ofs = pkt->l4_ofs;
 
 if (conn->key.dl_type == htons(ETH_TYPE_IP)) {
-struct ip_header *nh = dp_packet_l3(pkt);
-struct icmp_header *icmp = dp_packet_l4(pkt);
+struct ip_header *nh = l3;
+struct icmp_header *icmp = l4;
 struct ip_header *inner_l3 = (struct ip_header *) (icmp + 1);
 extract_l3_ipv4(&inner_key, inner_l3, tail - ((char *)inner_l3) - pad,
 &inner_l4, false);
@@ -664,8 +666,8 @@ reverse_nat_packet(struct dp_packet *pkt, const struct conn 
*conn)
 icmp->icmp_csum = 0;
 icmp->icmp_csum = csum(icmp, tail - (char *) icmp - pad);
 } else {
-struct ovs_16aligned_ip6_hdr *nh6 = dp_packet_l3(pkt);
-struct icmp6_error_header *icmp6 = dp_packet_l4(pkt);
+struct ovs_16aligned_ip6_hdr *nh6 = l3;
+struct icmp6_error_header *icmp6 = l4;
 struct ovs_16aligned_ip6_hdr *inner_l3_6 =
 (struct ovs_16aligned_ip6_hdr *) (icmp6 + 1);
 extract_l3_ipv6(&inner_key, inner_l3_6,
@@ -1320,6 +1322,7 @@ conntrack_execute(struct conntrack *ct, struct 
dp_packet_batch *pkt_batch,
 write_ct_md(packet, zone, NULL, NULL, NULL);
 continue;
 }
+
 process_one(ct, packet, &ctx, zone, force, commit, now, setmark,
 setlabel, nat_action_info, tp_src, tp_dst, helper);
 }
@@ -1902,8 +1905,8 @@ conn_key_extract(struct conntrack *ct, struct dp_packet 
*pkt, ovs_be16 dl_type,
  struct conn_lookup_ctx *ctx, uint16_t zone)
 {
 const struct eth_header *l2 = dp_packet_eth(pkt);
-const struct ip_header *l3 = dp_packet_l3(pkt);
-const char *l4 = dp_packet_l4(pkt);
+const struct ip_header *l3 = dp_packet_linear_ofs(pkt, pkt->l3_ofs);
+const char *l4 = dp_pac

[ovs-dev] [PATCH v9 08/14] dp-packet: copy data from multi-seg. DPDK mbuf

2018-08-24 Thread Tiago Lam
From: Michael Qiu 

When cloning a packet whose source is a DPDK driver, multi-segment mbufs
must be considered, and each segment's data copied one by one.

Also, a lot of the DPDK mbuf's metadata is lost during a copy, like the
packet type, ol_flags, etc. That information is very important for DPDK
packet processing.

Co-authored-by: Mark Kavanagh 
Co-authored-by: Tiago Lam 

Signed-off-by: Michael Qiu 
Signed-off-by: Mark Kavanagh 
Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 lib/dp-packet.c   | 69 ++-
 lib/dp-packet.h   |  3 +++
 lib/netdev-dpdk.c |  1 +
 3 files changed, 62 insertions(+), 11 deletions(-)

diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index 167bf43..806640b 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -48,6 +48,22 @@ dp_packet_use__(struct dp_packet *b, void *base, size_t 
allocated,
 dp_packet_set_size(b, 0);
 }
 
+#ifdef DPDK_NETDEV
+void
+dp_packet_copy_mbuf_flags(struct dp_packet *dst, const struct dp_packet *src)
+{
+ovs_assert(dst != NULL && src != NULL);
+struct rte_mbuf *buf_dst = &(dst->mbuf);
+struct rte_mbuf buf_src = src->mbuf;
+
+buf_dst->ol_flags = buf_src.ol_flags;
+buf_dst->packet_type = buf_src.packet_type;
+buf_dst->tx_offload = buf_src.tx_offload;
+}
+#else
+#define dp_packet_copy_mbuf_flags(arg1, arg2)
+#endif
+
 /* Initializes 'b' as an empty dp_packet that contains the 'allocated' bytes of
  * memory starting at 'base'.  'base' should be the first byte of a region
  * obtained from malloc().  It will be freed (with free()) if 'b' is resized or
@@ -158,6 +174,44 @@ dp_packet_clone(const struct dp_packet *buffer)
 return dp_packet_clone_with_headroom(buffer, 0);
 }
 
+#ifdef DPDK_NETDEV
+struct dp_packet *
+dp_packet_clone_with_headroom(const struct dp_packet *b, size_t headroom) {
+struct dp_packet *new_buffer;
+uint32_t pkt_len = dp_packet_size(b);
+
+/* copy multi-seg data */
+if (b->source == DPBUF_DPDK && !rte_pktmbuf_is_contiguous(&b->mbuf)) {
+void *dst = NULL;
+struct rte_mbuf *mbuf = CONST_CAST(struct rte_mbuf *, &b->mbuf);
+
+new_buffer = dp_packet_new_with_headroom(pkt_len, headroom);
+dst = dp_packet_data(new_buffer);
+dp_packet_set_size(new_buffer, pkt_len);
+
+if (!rte_pktmbuf_read(mbuf, 0, pkt_len, dst)) {
+return NULL;
+}
+} else {
+new_buffer = dp_packet_clone_data_with_headroom(dp_packet_data(b),
+dp_packet_size(b),
+headroom);
+}
+
+/* Copy the following fields into the returned buffer: l2_pad_size,
+ * l2_5_ofs, l3_ofs, l4_ofs, cutlen, packet_type and md. */
+memcpy(&new_buffer->l2_pad_size, &b->l2_pad_size,
+   sizeof(struct dp_packet) -
+   offsetof(struct dp_packet, l2_pad_size));
+
+dp_packet_copy_mbuf_flags(new_buffer, b);
+if (dp_packet_rss_valid(new_buffer)) {
+new_buffer->mbuf.hash.rss = b->mbuf.hash.rss;
+}
+
+return new_buffer;
+}
+#else
 /* Creates and returns a new dp_packet whose data are copied from 'buffer'.
  * The returned dp_packet will additionally have 'headroom' bytes of
  * headroom. */
@@ -165,32 +219,25 @@ struct dp_packet *
 dp_packet_clone_with_headroom(const struct dp_packet *buffer, size_t headroom)
 {
 struct dp_packet *new_buffer;
+uint32_t pkt_len = dp_packet_size(buffer);
 
 new_buffer = dp_packet_clone_data_with_headroom(dp_packet_data(buffer),
- dp_packet_size(buffer),
- headroom);
+ pkt_len, headroom);
+
 /* Copy the following fields into the returned buffer: l2_pad_size,
  * l2_5_ofs, l3_ofs, l4_ofs, cutlen, packet_type and md. */
 memcpy(&new_buffer->l2_pad_size, &buffer->l2_pad_size,
 sizeof(struct dp_packet) -
 offsetof(struct dp_packet, l2_pad_size));
 
-#ifdef DPDK_NETDEV
-new_buffer->mbuf.ol_flags = buffer->mbuf.ol_flags;
-#else
 new_buffer->rss_hash_valid = buffer->rss_hash_valid;
-#endif
-
 if (dp_packet_rss_valid(new_buffer)) {
-#ifdef DPDK_NETDEV
-new_buffer->mbuf.hash.rss = buffer->mbuf.hash.rss;
-#else
 new_buffer->rss_hash = buffer->rss_hash;
-#endif
 }
 
 return new_buffer;
 }
+#endif
 
 /* Creates and returns a new dp_packet that initially contains a copy of the
  * 'size' bytes of data starting at 'data' with no headroom or tailroom. */
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index 3a99044..022e420 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -124,6 +124,9 @@ void dp_packet_init_dpdk(struct dp_packet *);
 void dp_packet_init(struct dp_packet *, size_t);
 void dp_packet_uninit(struct dp_packet *);
 
+void dp_packet_copy_mbuf_fl

[ovs-dev] [PATCH v9 07/14] dp-packet: Handle multi-seg mbufs in shift() func.

2018-08-24 Thread Tiago Lam
In its current implementation dp_packet_shift() is also unaware of
multi-seg mbufs (which hold data non-contiguously in memory) and assumes
that data exists contiguously in memory, memmove'ing data to perform the
shift.

To add support for multi-seg mbufs, a new set of functions was
introduced, dp_packet_mbuf_shift() and dp_packet_mbuf_write(). These
functions are used by dp_packet_shift(), when handling multi-seg mbufs,
to shift and write data within a chain of mbufs.

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 lib/dp-packet.c | 100 
 lib/dp-packet.h |  10 ++
 2 files changed, 110 insertions(+)

diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index 2aaeaae..167bf43 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -294,6 +294,100 @@ dp_packet_prealloc_headroom(struct dp_packet *b, size_t 
size)
 }
 }
 
+#ifdef DPDK_NETDEV
+/* Write 'len' data bytes to a chain of mbufs at the specified offset.
+ *
+ * 'mbuf', pointer to the destination mbuf where 'ofs' lies, i.e. the mbuf
+ * where the data will first be written.
+ * 'ofs', the offset within the provided 'mbuf' where 'data' is to be written.
+ * 'len', the size of the to be written 'data'.
+ * 'data', pointer to the to be written bytes.
+ *
+ * XXX: This function is the counterpart of the `rte_pktmbuf_read()` function
+ * available with DPDK, in the rte_mbuf.h */
+void
+dp_packet_mbuf_write(struct rte_mbuf *mbuf, int16_t ofs, uint32_t len,
+ const void *data)
+{
+char *dst_addr;
+uint16_t data_len;
+int len_copy;
+while (mbuf) {
+if (len == 0) {
+break;
+}
+
+dst_addr = rte_pktmbuf_mtod_offset(mbuf, char *, ofs);
+data_len = MBUF_BUF_END(mbuf->buf_addr, mbuf->buf_len) - dst_addr;
+
+len_copy = MIN(len, data_len);
+/* We don't know if 'data' is the result of a rte_pktmbuf_read() call,
+ * in which case we may end up writing to the same region of memory we
+ * are reading from and overlapping. Hence the use of memmove() here */
+memmove(dst_addr, data, len_copy);
+
+data = ((char *) data) + len_copy;
+len -= len_copy;
+ofs = 0;
+
+mbuf->data_len = len_copy;
+mbuf = mbuf->next;
+}
+}
+
+static void
+dp_packet_mbuf_shift_(struct rte_mbuf *dbuf, int16_t dst_ofs,
+  const struct rte_mbuf *sbuf, uint16_t src_ofs, int len)
+{
+char *rd = xmalloc(sizeof(*rd) * len);
+const char *wd = rte_pktmbuf_read(sbuf, src_ofs, len, rd);
+
+ovs_assert(wd);
+
+dp_packet_mbuf_write(dbuf, dst_ofs, len, wd);
+
+free(rd);
+}
+
+/* Similarly to dp_packet_shift(), shifts the data within the mbufs of a
+ * dp_packet of DPBUF_DPDK source by 'delta' bytes.
+ * Caller must make sure of the following conditions:
+ * - When shifting left, delta can't be bigger than the data_len available in
+ *   the last mbuf;
+ * - When shifting right, delta can't be bigger than the space available in the
+ *   first mbuf (buf_len - data_off).
+ * Both these conditions guarantee that a shift operation doesn't fall outside
+ * the bounds of the existing mbufs, so that the first and last mbufs (when
+ * using multi-segment mbufs), remain the same. */
+static void
+dp_packet_mbuf_shift(struct dp_packet *b, int delta)
+{
+uint16_t src_ofs;
+int16_t dst_ofs;
+
+struct rte_mbuf *mbuf = CONST_CAST(struct rte_mbuf *, &b->mbuf);
+struct rte_mbuf *tmbuf = rte_pktmbuf_lastseg(mbuf);
+
+if (delta < 0) {
+ovs_assert(-delta <= tmbuf->data_len);
+} else {
+ovs_assert(delta < (mbuf->buf_len - mbuf->data_off));
+}
+
+/* Set the destination and source offsets for the copy. */
+dst_ofs = delta;
+src_ofs = 0;
+
+/* Shift data from src mbuf and offset to dst mbuf and offset */
+dp_packet_mbuf_shift_(mbuf, dst_ofs, mbuf, src_ofs,
+  rte_pktmbuf_pkt_len(mbuf));
+
+/* Update mbufs' properties, and if using multi-segment mbufs, first and
+ * last mbuf's data_len also needs to be adjusted */
+mbuf->data_off = mbuf->data_off + dst_ofs;
+}
+#endif
+
 /* Shifts all of the data within the allocated space in 'b' by 'delta' bytes.
  * For example, a 'delta' of 1 would cause each byte of data to move one byte
  * forward (from address 'p' to 'p+1'), and a 'delta' of -1 would cause each
@@ -306,6 +400,12 @@ dp_packet_shift(struct dp_packet *b, int delta)
: true);
 
 if (delta != 0) {
+#ifdef DPDK_NETDEV
+if (b->source == DPBUF_DPDK) {
+dp_packet_mbuf_shift(b, delta);
+return;
+}
+#endif
 char *dst = (char *) dp_packet_data(b) + delta;
 memmove(dst, dp_packet_data(b), dp_packet_size(b));
 dp_packet_set_data(b, dst);
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index 48be19b..3a99044 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -80,6 +80,11 @@

[ovs-dev] [PATCH v9 06/14] dp-packet: Handle multi-seg mbufs in helper funcs.

2018-08-24 Thread Tiago Lam
Most helper functions in dp-packet assume that the data held by a
dp_packet is contiguous, and perform operations such as pointer
arithmetic under that assumption. However, with the introduction of
multi-segment mbufs, where data is non-contiguous, such assumptions are
no longer possible. Some examples of such helper functions are
dp_packet_tail(), dp_packet_tailroom(), dp_packet_end(),
dp_packet_get_allocated() and dp_packet_at().

Thus, instead of assuming contiguous data in dp_packet, they now
iterate over the (non-contiguous) data in mbufs to perform their
calculations.

Finally, dp_packet_use__() has also been modified to perform the
initialisation of the packet (and setting the source) before continuing
to set its size and data length, which now depends on the type of
packet.
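
As a standalone illustration of the traversal these helpers now perform
(a toy model in plain C, not the OVS code): finding the byte at a given
offset means walking the chain and discounting each segment's length.

    /* Toy model of offset lookup across a segment chain (not OVS code). */
    #include <assert.h>
    #include <stddef.h>

    struct seg { const char *data; size_t len; struct seg *next; };

    static const char *
    byte_at(const struct seg *s, size_t ofs)
    {
        while (s && ofs >= s->len) {  /* skip segments wholly before 'ofs' */
            ofs -= s->len;
            s = s->next;
        }
        return s ? s->data + ofs : NULL;
    }

    int
    main(void)
    {
        struct seg b = { "world", 5, NULL };
        struct seg a = { "hello", 5, &b };

        assert(*byte_at(&a, 4) == 'o');   /* last byte, first segment */
        assert(*byte_at(&a, 5) == 'w');   /* first byte, second segment */
        assert(byte_at(&a, 10) == NULL);  /* out of range */
        return 0;
    }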

Co-authored-by: Mark Kavanagh 

Signed-off-by: Mark Kavanagh 
Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 lib/dp-packet.c |   4 +-
 lib/dp-packet.h | 150 +++-
 2 files changed, 140 insertions(+), 14 deletions(-)

diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index 782e7c2..2aaeaae 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -41,11 +41,11 @@ static void
 dp_packet_use__(struct dp_packet *b, void *base, size_t allocated,
  enum dp_packet_source source)
 {
+dp_packet_init__(b, allocated, source);
+
 dp_packet_set_base(b, base);
 dp_packet_set_data(b, base);
 dp_packet_set_size(b, 0);
-
-dp_packet_init__(b, allocated, source);
 }
 
 /* Initializes 'b' as an empty dp_packet that contains the 'allocated' bytes of
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index d2803af..48be19b 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -185,9 +185,25 @@ dp_packet_delete(struct dp_packet *b)
 static inline void *
 dp_packet_at(const struct dp_packet *b, size_t offset, size_t size)
 {
-return offset + size <= dp_packet_size(b)
-   ? (char *) dp_packet_data(b) + offset
-   : NULL;
+if (offset + size > dp_packet_size(b)) {
+return NULL;
+}
+
+#ifdef DPDK_NETDEV
+if (b->source == DPBUF_DPDK) {
+struct rte_mbuf *buf = CONST_CAST(struct rte_mbuf *, &b->mbuf);
+
+while (buf && offset >= buf->data_len) {
+offset -= buf->data_len;
+
+buf = buf->next;
+}
+
+return buf ? rte_pktmbuf_mtod_offset(buf, char *, offset) : NULL;
+}
+#endif
+
+return (char *) dp_packet_data(b) + offset;
 }
 
 /* Returns a pointer to byte 'offset' in 'b', which must contain at least
@@ -196,13 +212,23 @@ static inline void *
 dp_packet_at_assert(const struct dp_packet *b, size_t offset, size_t size)
 {
 ovs_assert(offset + size <= dp_packet_size(b));
-return ((char *) dp_packet_data(b)) + offset;
+return dp_packet_at(b, offset, size);
 }
 
 /* Returns a pointer to byte following the last byte of data in use in 'b'. */
 static inline void *
 dp_packet_tail(const struct dp_packet *b)
 {
+#ifdef DPDK_NETDEV
+if (b->source == DPBUF_DPDK) {
+struct rte_mbuf *buf = CONST_CAST(struct rte_mbuf *, &b->mbuf);
+/* Find the last segment, where the data ends; the tail of the chained
+ * mbufs must be there. */
+buf = rte_pktmbuf_lastseg(buf);
+
+return rte_pktmbuf_mtod_offset(buf, void *, buf->data_len);
+}
+#endif
 return (char *) dp_packet_data(b) + dp_packet_size(b);
 }
 
@@ -211,6 +237,15 @@ dp_packet_tail(const struct dp_packet *b)
 static inline void *
 dp_packet_end(const struct dp_packet *b)
 {
+#ifdef DPDK_NETDEV
+if (b->source == DPBUF_DPDK) {
+struct rte_mbuf *buf = CONST_CAST(struct rte_mbuf *, &(b->mbuf));
+
+buf = rte_pktmbuf_lastseg(buf);
+
+return (char *) buf->buf_addr + buf->buf_len;
+}
+#endif
 return (char *) dp_packet_base(b) + dp_packet_get_allocated(b);
 }
 
@@ -236,6 +271,15 @@ dp_packet_tailroom(const struct dp_packet *b)
 static inline void
 dp_packet_clear(struct dp_packet *b)
 {
+#ifdef DPDK_NETDEV
+if (b->source == DPBUF_DPDK) {
+/* Sets pkt_len and data_len to zero and frees unused mbufs. */
+dp_packet_set_size(b, 0);
+rte_pktmbuf_reset(&b->mbuf);
+
+return;
+}
+#endif
 dp_packet_set_data(b, dp_packet_base(b));
 dp_packet_set_size(b, 0);
 }
@@ -252,24 +296,38 @@ dp_packet_pull(struct dp_packet *b, size_t size)
 return data;
 }
 
+#ifdef DPDK_NETDEV
+/* Similar to dp_packet_try_pull() but doesn't actually pull any data, only
+ * checks if it could and returns true or false accordingly.
+ *
+ * Valid for dp_packets carrying mbufs only. */
+static inline bool
+dp_packet_mbuf_may_pull(const struct dp_packet *b, size_t size)
+{
+return size <= b->mbuf.data_len;
+}
+#endif
+
 /* If 'b' has at least 'size' bytes of data, removes that many bytes from the
 * head end of 'b' and returns the first byte removed.  Otherwise, returns a
 * null pointer without modifying 'b'. */

[ovs-dev] [PATCH v9 05/14] dp-packet: Fix data_len handling multi-seg mbufs.

2018-08-24 Thread Tiago Lam
When a dp_packet is from a DPDK source, and it contains multi-segment
mbufs, the data_len is not equal to the packet size, pkt_len. Instead,
the data_len of each mbuf in the chain should be considered while
distributing the new (provided) size.

To account for the above dp_packet_set_size() has been changed so that,
in the multi-segment mbufs case, only the data_len on the last mbuf of
the chain and the total size of the packet, pkt_len, are changed. The
data_len on the intermediate mbufs preceding the last mbuf is not
changed by dp_packet_set_size(). Furthermore, in some cases
dp_packet_set_size() may be used to set a smaller size than the current
packet size, thus effectively trimming the end of the packet. In the
multi-segment mbufs case this may lead to lingering mbufs that may need
freeing.

__dp_packet_set_data() now also updates an mbuf's data_len after setting
the data offset. This is so that both fields are always in sync for each
mbuf in a chain.
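
As a worked example of the trim logic (a standalone model; 2048B data
rooms assumed): setting a 6000B packet to 3000B leaves the first
segment untouched, trims the second, and zeroes (in the real code,
frees) the third.

    /* Toy model of the trim in dp_packet_set_size() (not OVS code). */
    #include <assert.h>
    #include <stdint.h>

    #define N_SEGS 3

    int
    main(void)
    {
        uint16_t data_len[N_SEGS] = { 2048, 2048, 1904 }; /* pkt_len 6000 */
        uint32_t new_len = 3000;                          /* target size */
        uint32_t pkt_len = 0;

        for (int i = 0; i < N_SEGS; i++) {
            uint16_t n = new_len < data_len[i] ? new_len : data_len[i];
            data_len[i] = n;
            new_len -= n;
            pkt_len += n;
        }

        assert(data_len[0] == 2048);  /* untouched */
        assert(data_len[1] == 952);   /* trimmed */
        assert(data_len[2] == 0);     /* freed in the real code */
        assert(pkt_len == 3000);
        return 0;
    }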

Co-authored-by: Michael Qiu 
Co-authored-by: Mark Kavanagh 
Co-authored-by: Przemyslaw Lal 
Co-authored-by: Marcin Ksiadz 
Co-authored-by: Yuanhan Liu 

Signed-off-by: Michael Qiu 
Signed-off-by: Mark Kavanagh 
Signed-off-by: Przemyslaw Lal 
Signed-off-by: Marcin Ksiadz 
Signed-off-by: Yuanhan Liu 
Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 lib/dp-packet.h | 76 -
 1 file changed, 64 insertions(+), 12 deletions(-)

diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index 6376039..d2803af 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -429,17 +429,49 @@ dp_packet_size(const struct dp_packet *b)
 static inline void
 dp_packet_set_size(struct dp_packet *b, uint32_t v)
 {
-/* netdev-dpdk does not currently support segmentation; consequently, for
- * all intents and purposes, 'data_len' (16 bit) and 'pkt_len' (32 bit) may
- * be used interchangably.
- *
- * On the datapath, it is expected that the size of packets
- * (and thus 'v') will always be <= UINT16_MAX; this means that there is no
- * loss of accuracy in assigning 'v' to 'data_len'.
- */
-b->mbuf.data_len = (uint16_t)v;  /* Current seg length. */
-b->mbuf.pkt_len = v; /* Total length of all segments linked to
-  * this segment. */
+if (b->source == DPBUF_DPDK) {
+struct rte_mbuf *mbuf = &b->mbuf;
+uint16_t new_len = v;
+uint16_t data_len;
+uint16_t nb_segs = 0;
+uint16_t pkt_len = 0;
+
+/* Distribute the new size 'v' over the chained buffers, trimming from
+ * the end and freeing any buffers that may be left floating. */
+while (mbuf) {
+data_len = MIN(new_len, mbuf->data_len);
+mbuf->data_len = data_len;
+
+if (new_len - data_len <= 0) {
+/* Free the rest of chained mbufs */
+free_dpdk_buf(CONTAINER_OF(mbuf->next, struct dp_packet,
+   mbuf));
+mbuf->next = NULL;
+} else if (!mbuf->next) {
+/* Don't assign more than what we have available */
+mbuf->data_len = MIN(new_len,
+ mbuf->buf_len - mbuf->data_off);
+}
+
+new_len -= data_len;
+nb_segs += 1;
+pkt_len += mbuf->data_len;
+mbuf = mbuf->next;
+}
+
+/* pkt_len != v would effectively mean that pkt_len < 'v' (being
+ * bigger is logically impossible). pkt_len < 'v' would mean the
+ * provided 'v' was bigger than the available room; it is the
+ * caller's responsibility to make sure there is enough room. */
+ovs_assert(pkt_len == v);
+
+b->mbuf.nb_segs = nb_segs;
+b->mbuf.pkt_len = pkt_len;
+} else {
+b->mbuf.data_len = v;
+/* Total length of all segments linked to this segment. */
+b->mbuf.pkt_len = v;
+}
 }
 
 static inline uint16_t
@@ -451,7 +483,27 @@ __packet_data(const struct dp_packet *b)
 static inline void
 __packet_set_data(struct dp_packet *b, uint16_t v)
 {
-b->mbuf.data_off = v;
+if (b->source == DPBUF_DPDK) {
+/* Moving data_off away from the first mbuf in the chain is not
+ * possible with DPBUF_DPDK dp_packets. */
+ovs_assert(v == UINT16_MAX || v <= b->mbuf.buf_len);
+
+uint16_t prev_ofs = b->mbuf.data_off;
+b->mbuf.data_off = v;
+int16_t ofs_diff = prev_ofs - b->mbuf.data_off;
+
+/* When dealing with DPDK mbufs, keep data_off and data_len in sync.
+ * Thus, update data_len if the length changes with the move of
+ * data_off. However, if data_len is 0, there's no data to move and
+ * data_len should remain 0. */
+
+if (b->mbuf.data_len != 0) {
+b->mbuf.data_len += ofs_diff;
+}

[ovs-dev] [PATCH v9 04/14] netdev-dpdk: Serialise non-pmds mbufs' alloc/free.

2018-08-24 Thread Tiago Lam
A new mutex, 'nonpmd_mp_mutex', has been introduced to serialise
allocation and free operations by non-pmd threads on a given mempool.

free_dpdk_buf() has been modified to make use of the introduced mutex.
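
The pattern, sketched with plain pthreads for illustration (the patch
itself uses OVS's ovs_mutex and rte_pktmbuf_free(); the names below are
stand-ins):

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>

    static pthread_mutex_t nonpmd_mp_mutex = PTHREAD_MUTEX_INITIALIZER;

    /* Stand-ins for a mempool alloc/free operation and for
     * rte_lcore_id() != NON_PMD_CORE_ID, respectively. */
    static void mempool_op(void *obj) { (void) obj; }
    static bool thread_is_pmd(void) { return false; }

    static void
    guarded_mempool_op(void *obj)
    {
        if (!thread_is_pmd()) {
            /* Unpinned threads may preempt each other mid-operation and
             * the mempool implementation is not preemption-safe, so
             * serialise. */
            pthread_mutex_lock(&nonpmd_mp_mutex);
            mempool_op(obj);
            pthread_mutex_unlock(&nonpmd_mp_mutex);
        } else {
            mempool_op(obj);  /* pmd threads are pinned, one per core */
        }
    }

    int
    main(void)
    {
        int dummy;
        guarded_mempool_op(&dummy);
        printf("ok\n");
        return 0;
    }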

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 lib/netdev-dpdk.c | 33 ++---
 1 file changed, 30 insertions(+), 3 deletions(-)

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index ebd55e9..aee8e20 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -322,6 +322,16 @@ static struct ovs_mutex dpdk_mp_mutex OVS_ACQ_AFTER(dpdk_mutex)
 static struct ovs_list dpdk_mp_list OVS_GUARDED_BY(dpdk_mp_mutex)
 = OVS_LIST_INITIALIZER(&dpdk_mp_list);
 
+/* This mutex must be used by non-pmd threads when allocating or freeing
+ * mbufs through mempools, when outside of the `non_pmd_mutex` mutex in struct
+ * dp_netdev.
+ * The reason, as pointed out in the "Known Issues" section of DPDK's EAL docs,
+ * is that the implementation the mempool is based on is non-preemptible.
+ * Since non-pmd threads may not be pinned, one could be preempted while
+ * another is operating on the same mempool, which could lead to memory
+ * corruption. */
+static struct ovs_mutex nonpmd_mp_mutex = OVS_MUTEX_INITIALIZER;
+
 struct dpdk_mp {
  struct rte_mempool *mp;
  int mtu;
@@ -492,6 +502,8 @@ struct netdev_rxq_dpdk {
 dpdk_port_t port_id;
 };
 
+static bool dpdk_thread_is_pmd(void);
+
 static void netdev_dpdk_destruct(struct netdev *netdev);
 static void netdev_dpdk_vhost_destruct(struct netdev *netdev);
 
@@ -525,6 +537,12 @@ dpdk_buf_size(int mtu)
  NETDEV_DPDK_MBUF_ALIGN);
 }
 
+static bool
+dpdk_thread_is_pmd(void)
+{
+ return rte_lcore_id() != NON_PMD_CORE_ID;
+}
+
 /* Allocates an area of 'sz' bytes from DPDK.  The memory is zero'ed.
  *
  * Unlike xmalloc(), this function can return NULL on failure. */
@@ -535,11 +553,20 @@ dpdk_rte_mzalloc(size_t sz)
 }
 
 void
-free_dpdk_buf(struct dp_packet *p)
+free_dpdk_buf(struct dp_packet *packet)
 {
-struct rte_mbuf *pkt = (struct rte_mbuf *) p;
+/* If non-pmd, we need to serialise on 'nonpmd_mp_mutex'. */
+if (!dpdk_thread_is_pmd()) {
+ovs_mutex_lock(&nonpmd_mp_mutex);
+
+rte_pktmbuf_free(&packet->mbuf);
+
+ovs_mutex_unlock(&nonpmd_mp_mutex);
+
+return;
+}

-rte_pktmbuf_free(pkt);
+rte_pktmbuf_free(&packet->mbuf);
 }
 
 static void
-- 
2.7.4



[ovs-dev] [PATCH v9 03/14] dp-packet: Fix allocated size on DPDK init.

2018-08-24 Thread Tiago Lam
When built with DPDK, OvS deals with two types of packets: the ones
coming from the mempool and the ones locally created by OvS - which are
copied to mempool mbufs before output. In the latter, the space is
allocated from the system, while in the former the mbufs are allocated
from a mempool, which takes care of initialising them appropriately.

In the current implementation, during mempool's initialisation of mbufs,
dp_packet_set_allocated() is called from dp_packet_init_dpdk() without
considering that the allocated space, in the case of multi-segment
mbufs, might be greater than a single mbuf.  Furthermore, given that
dp_packet_init_dpdk() is on the code path that's called upon mempool's
initialisation, a call to dp_packet_set_allocated() is redundant, since
mempool takes care of initialising it.

To fix this, dp_packet_set_allocated() is no longer called after
initialisation of a mempool, only in dp_packet_init__(), which is still
called by OvS when initialising locally created packets.
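
A toy sketch of the resulting split of responsibilities (illustrative
names, not the OVS structs): the mempool object-init path only tags the
packet as DPDK-sourced, while only locally created packets set
'allocated'.

    #include <assert.h>
    #include <stddef.h>

    enum source { BUF_DPDK, BUF_MALLOC };
    struct pkt { enum source source; size_t allocated; };

    static void
    init_dpdk(struct pkt *p)           /* mempool obj-init path */
    {
        p->source = BUF_DPDK;          /* no allocated-size write here */
    }

    static void
    init_local(struct pkt *p, size_t allocated)  /* OVS-local path */
    {
        p->source = BUF_MALLOC;
        p->allocated = allocated;      /* only local packets set it */
    }

    int
    main(void)
    {
        struct pkt a = { BUF_MALLOC, 0 }, b = { BUF_MALLOC, 0 };
        init_dpdk(&a);
        init_local(&b, 2048);
        assert(a.source == BUF_DPDK && b.allocated == 2048);
        return 0;
    }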

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 lib/dp-packet.c   | 3 +--
 lib/dp-packet.h   | 2 +-
 lib/netdev-dpdk.c | 2 +-
 3 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index 443c225..782e7c2 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -99,9 +99,8 @@ dp_packet_use_const(struct dp_packet *b, const void *data, size_t size)
  * buffer.  Here, non-transient ovs dp-packet fields are initialized for
  * packets that are part of a DPDK memory pool. */
 void
-dp_packet_init_dpdk(struct dp_packet *b, size_t allocated)
+dp_packet_init_dpdk(struct dp_packet *b)
 {
-dp_packet_set_allocated(b, allocated);
 b->source = DPBUF_DPDK;
 }
 
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index b948fe1..6376039 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -114,7 +114,7 @@ void dp_packet_use(struct dp_packet *, void *, size_t);
 void dp_packet_use_stub(struct dp_packet *, void *, size_t);
 void dp_packet_use_const(struct dp_packet *, const void *, size_t);
 
-void dp_packet_init_dpdk(struct dp_packet *, size_t allocated);
+void dp_packet_init_dpdk(struct dp_packet *);
 
 void dp_packet_init(struct dp_packet *, size_t);
 void dp_packet_uninit(struct dp_packet *);
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 0cd9ff6..ebd55e9 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -550,7 +550,7 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp OVS_UNUSED,
 {
 struct rte_mbuf *pkt = _p;
 
-dp_packet_init_dpdk((struct dp_packet *) pkt, pkt->buf_len);
+dp_packet_init_dpdk((struct dp_packet *) pkt);
 }
 
 static int
-- 
2.7.4



[ovs-dev] [PATCH v9 01/14] netdev-dpdk: fix mbuf sizing

2018-08-24 Thread Tiago Lam
From: Mark Kavanagh 

There are numerous factors that must be considered when calculating
the size of an mbuf:
- the data portion of the mbuf must be sized in accordance with Rx
  buffer alignment (typically 1024B). So, for example, in order to
  successfully receive and capture a 1500B packet, mbufs with a
  data portion of size 2048B must be used.
- in OvS, the elements that comprise an mbuf are:
  * the dp packet, which includes a struct rte mbuf (704B)
  * RTE_PKTMBUF_HEADROOM (128B)
  * packet data (aligned to 1k, as previously described)
  * RTE_PKTMBUF_TAILROOM (typically 0)

Some PMDs require that the total mbuf size (i.e. the total sum of all
of the above-listed components' lengths) is cache-aligned. To satisfy
this requirement, it may be necessary to round up the total mbuf size
with respect to cacheline size. In doing so, it's possible that the
dp_packet's data portion is inadvertently increased in size, such that
it no longer adheres to Rx buffer alignment. Consequently, the
following property of the mbuf no longer holds true:

mbuf.data_len == mbuf.buf_len - mbuf.data_off

This creates a problem in the case of multi-segment mbufs, where that
property is assumed to hold for all but the final segment in an
mbuf chain. Resolve this issue by adjusting the size of the mbuf's
private data portion, as opposed to the packet data portion when
aligning mbuf size to cachelines.
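
A worked example of the arithmetic (standalone; the 704B and 128B
figures are from the list above, while an 18B Ethernet overhead and a
128B cacheline, as on the ARM platforms mentioned elsewhere in this
series, are assumptions):

    #include <assert.h>
    #include <stdio.h>

    #define ROUND_UP(x, y) (((x) + (y) - 1) / (y) * (y))

    int
    main(void)
    {
        unsigned dp_packet_sz = 704, headroom = 128, cacheline = 128;
        unsigned data_room = ROUND_UP(1500 + 18, 1024);  /* 2048B */

        unsigned total = dp_packet_sz + headroom + data_room;  /* 2880 */
        unsigned aligned = ROUND_UP(total, cacheline);         /* 2944 */
        unsigned priv_pad = aligned - total;                   /* 64 */

        /* The 64B of padding go to the private area, so the data room
         * keeps its Rx buffer alignment. */
        assert(data_room % 1024 == 0);
        printf("data_room=%u priv_pad=%u\n", data_room, priv_pad);
        return 0;
    }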

Fixes: 4be4d22 ("netdev-dpdk: clean up mbuf initialization")
Fixes: 31b88c9 ("netdev-dpdk: round up mbuf_size to cache_line_size")
CC: Santosh Shukla 
Signed-off-by: Mark Kavanagh 
Acked-by: Santosh Shukla 
Acked-by: Eelco Chaudron 
---
 lib/netdev-dpdk.c | 56 +--
 1 file changed, 38 insertions(+), 18 deletions(-)

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index ac02a09..0cd9ff6 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -88,10 +88,6 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
 #define MTU_TO_MAX_FRAME_LEN(mtu)   ((mtu) + ETHER_HDR_MAX_LEN)
 #define FRAME_LEN_TO_MTU(frame_len) ((frame_len)\
  - ETHER_HDR_LEN - ETHER_CRC_LEN)
-#define MBUF_SIZE(mtu)  ROUND_UP((MTU_TO_MAX_FRAME_LEN(mtu) \
- + sizeof(struct dp_packet) \
- + RTE_PKTMBUF_HEADROOM),   \
- RTE_CACHE_LINE_SIZE)
 #define NETDEV_DPDK_MBUF_ALIGN  1024
 #define NETDEV_DPDK_MAX_PKT_LEN 9728
 
@@ -637,7 +633,11 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
 char mp_name[RTE_MEMPOOL_NAMESIZE];
 const char *netdev_name = netdev_get_name(>up);
 int socket_id = dev->requested_socket_id;
-uint32_t n_mbufs;
+uint32_t n_mbufs = 0;
+uint32_t mbuf_size = 0;
+uint32_t aligned_mbuf_size = 0;
+uint32_t mbuf_priv_data_len = 0;
+uint32_t pkt_size = 0;
 uint32_t hash = hash_string(netdev_name, 0);
 struct dpdk_mp *dmp = NULL;
 int ret;
@@ -650,6 +650,9 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
 dmp->mtu = mtu;
 dmp->refcount = 1;
 
+/* Get the size of each mbuf, based on the MTU */
+mbuf_size = dpdk_buf_size(dev->requested_mtu);
+
 n_mbufs = dpdk_calculate_mbufs(dev, mtu, per_port_mp);
 
 do {
@@ -661,8 +664,8 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
  * so this is not an issue for tasks such as debugging.
  */
 ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
-   "ovs%08x%02d%05d%07u",
-   hash, socket_id, mtu, n_mbufs);
+   "ovs%08x%02d%05d%07u",
+hash, socket_id, mtu, n_mbufs);
 if (ret < 0 || ret >= RTE_MEMPOOL_NAMESIZE) {
 VLOG_DBG("snprintf returned %d. "
  "Failed to generate a mempool name for \"%s\". "
@@ -671,17 +674,34 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
 break;
 }
 
-VLOG_DBG("Port %s: Requesting a mempool of %u mbufs "
-  "on socket %d for %d Rx and %d Tx queues.",
-  netdev_name, n_mbufs, socket_id,
-  dev->requested_n_rxq, dev->requested_n_txq);
-
-dmp->mp = rte_pktmbuf_pool_create(mp_name, n_mbufs,
-  MP_CACHE_SZ,
-  sizeof (struct dp_packet)
-  - sizeof (struct rte_mbuf),
-  MBUF_SIZE(mtu)
-  - sizeof(struct dp_packet),
+VLOG_DBG("Port %s: Requesting a mempool of %u mbufs of size %u "
+  "on socket %d for %d Rx and %d Tx queues, "
+  "cache line size of %u",
+  netdev_name, n_mbufs, mbuf_size, socket_id,
+  dev->requested_n_rxq, dev->requested_n_txq,
+  RTE_CACHE_LINE_SIZE);

[ovs-dev] [PATCH v9 00/14] Support multi-segment mbufs

2018-08-24 Thread Tiago Lam
size() and
   improve other helper funcs, such as dp_packet_tail() and
   dp_packet_tailroom().
 - Add support for multi-seg mbufs in dp_packet_put(),
   dp_packet_put_zeros(), as well as dp_packet_resize__() - allocating
   new mbufs and linking them together;
 Restructured patchset:
 - Squash patch 5 into patch 6, since they were both related to
   copying data while handling multi-seg mbufs;
 - Split patch 4 into two separate patches - one that introduces the
   changes in helper functions to deal with multi-seg mbufs and
   two others for the shift() and put_uninit() functionality;
 - Move patch 4 to before patch 3, so that helper functions come
   before the functionality improvements that rely on those helpers.

v5: - Rebased on master e5e22dc ("datapath-windows: Prevent ct-counters
  from getting redundantly incremented");
- Sugesh's comments have been addressed:
  - Changed dp_packet_set_data() and dp_packet_set_size() logic to
make them independent of each other;
  - Dropped patch 3 now that dp_packet_set_data() and
dp_packet_set_size() are independent;
  - dp_packet_clone_with_headroom() now has split functions for
handling DPDK sourced packets and non-DPDK packets;
- Modified various functions in dp-packet.h to account for multi-seg
  mbufs - dp_packet_put_uninit(), dp_packet_tail(), dp_packet_tailroom()
  and dp_packet_at();
- Added support for shifting packet data in multi-seg mbufs, using
  dp_packet_shift();
- Fixed some minor inconsistencies.

Note that some of the changes in v5 have been contributed by Mark
Kavanagh as well.

v4: - restructure patchset
- account for 128B ARM cacheline when sizing mbufs

Mark Kavanagh (4):
  netdev-dpdk: fix mbuf sizing
  dp-packet: Init specific mbuf fields.
  netdev-dpdk: copy large packet to multi-seg. mbufs
  netdev-dpdk: support multi-segment jumbo frames

Michael Qiu (1):
  dp-packet: copy data from multi-seg. DPDK mbuf

Tiago Lam (9):
  dp-packet: Fix allocated size on DPDK init.
  netdev-dpdk: Serialise non-pmds mbufs' alloc/free.
  dp-packet: Fix data_len handling multi-seg mbufs.
  dp-packet: Handle multi-seg mbufs in helper funcs.
  dp-packet: Handle multi-seg mbufs in shift() func.
  dp-packet: Add support for data "linearization".
  dpdk-tests: Add unit-tests for multi-seg mbufs.
  dpdk-tests: Accept other configs in OVS_DPDK_START
  dpdk-tests: End-to-end tests for multi-seg mbufs.

 Documentation/topics/dpdk/jumbo-frames.rst |  67 
 Documentation/topics/dpdk/memory.rst   |  36 ++
 NEWS   |   1 +
 lib/bfd.c  |   3 +-
 lib/conntrack.c|  17 +-
 lib/dp-packet.c| 192 -
 lib/dp-packet.h| 339 ++--
 lib/dpdk.c |   8 +
 lib/dpif-netlink.c |   2 +-
 lib/dpif.c |   2 +-
 lib/netdev-bsd.c   |   2 +-
 lib/netdev-dpdk.c  | 242 +--
 lib/netdev-dpdk.h  |   2 +
 lib/netdev-dummy.c |   5 +-
 lib/netdev-linux.c |   5 +-
 lib/netdev-native-tnl.c|  10 +-
 lib/odp-execute.c  |   2 +-
 lib/ofp-print.c|   2 +-
 lib/ovs-lldp.c |   3 +-
 lib/packets.c  |   3 +-
 ofproto/ofproto-dpif-sflow.c   |   2 +-
 ofproto/ofproto-dpif-upcall.c  |   2 +-
 ofproto/ofproto-dpif-xlate.c   |  12 +-
 tests/automake.mk  |  10 +-
 tests/dpdk-packet-mbufs.at |   7 +
 tests/system-dpdk-macros.at|   6 +-
 tests/system-dpdk-testsuite.at |   1 +
 tests/system-dpdk.at   |  65 +++
 tests/test-dpdk-mbufs.c| 619 +
 vswitchd/vswitch.xml   |  22 +
 30 files changed, 1571 insertions(+), 118 deletions(-)
 create mode 100644 tests/dpdk-packet-mbufs.at
 create mode 100644 tests/test-dpdk-mbufs.c

-- 
2.7.4



[ovs-dev] [PATCH v8 12/14] dpdk-tests: Add unit-tests for multi-seg mbufs.

2018-08-20 Thread Tiago Lam
In order to create a minimal environment that allows the tests to get
mbufs from an existing mempool, the following approach is taken:
- EAL is initialised (by using the main dpdk_init()) and a (very) small
  mempool is instantiated (mimicking the logic in dpdk_mp_create()).
  This mempool instance is global and used by all the tests;
- Packets are then allocated from the instantiated mempool and tested
  by running some operations on them and manipulating data.

The tests introduced focus on testing DPDK dp_packets (where
source=DPBUF_DPDK), linked with a single or multiple mbufs, across
several operations, such as:
- dp_packet_put();
- dp_packet_shift();
- dp_packet_reserve();
- dp_packet_push_uninit();
- dp_packet_clear();
- dp_packet_equal();
- dp_packet_linear_data();
- And as a consequence of some of these, dp_packet_put_uninit() and
  dp_packet_resize__().

Finally, this has also been integrated with the new DPDK testsuite.
Thus, when running `$sudo make check-dpdk` one will also be running
these tests.
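
A minimal standalone harness in the same spirit (assumes a DPDK build
environment; the real tests go through dpdk_init() and the dp_packet
API rather than raw DPDK calls):

    #include <rte_eal.h>
    #include <rte_mbuf.h>
    #include <stdio.h>

    int
    main(int argc, char *argv[])
    {
        if (rte_eal_init(argc, argv) < 0) {
            return 1;
        }

        /* A (very) small pool, mirroring the "test-mp" idea above. */
        struct rte_mempool *mp =
            rte_pktmbuf_pool_create("test-mp", 256, 0, 0,
                                    2048 + RTE_PKTMBUF_HEADROOM,
                                    SOCKET_ID_ANY);
        if (!mp) {
            return 1;
        }

        struct rte_mbuf *m = rte_pktmbuf_alloc(mp);
        if (m) {
            rte_pktmbuf_free(m);
            printf("mempool ok\n");
        }
        return 0;
    }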

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 tests/automake.mk  |  10 +-
 tests/dpdk-packet-mbufs.at |   7 +
 tests/system-dpdk-testsuite.at |   1 +
 tests/test-dpdk-mbufs.c| 619 +
 4 files changed, 636 insertions(+), 1 deletion(-)
 create mode 100644 tests/dpdk-packet-mbufs.at
 create mode 100644 tests/test-dpdk-mbufs.c

diff --git a/tests/automake.mk b/tests/automake.mk
index 49ceb41..f484f69 100644
--- a/tests/automake.mk
+++ b/tests/automake.mk
@@ -135,7 +135,8 @@ SYSTEM_DPDK_TESTSUITE_AT = \
tests/system-common-macros.at \
tests/system-dpdk-macros.at \
tests/system-dpdk-testsuite.at \
-   tests/system-dpdk.at
+   tests/system-dpdk.at \
+   tests/dpdk-packet-mbufs.at
 
 check_SCRIPTS += tests/atlocal
 
@@ -392,6 +393,10 @@ tests_ovstest_SOURCES = \
tests/test-vconn.c \
tests/test-aa.c \
tests/test-stopwatch.c
+if DPDK_NETDEV
+tests_ovstest_SOURCES += \
+   tests/test-dpdk-mbufs.c
+endif
 
 if !WIN32
 tests_ovstest_SOURCES += \
@@ -404,6 +409,9 @@ tests_ovstest_SOURCES += \
 endif
 
 tests_ovstest_LDADD = lib/libopenvswitch.la ovn/lib/libovn.la
+if DPDK_NETDEV
+tests_ovstest_LDFLAGS = $(AM_LDFLAGS) $(DPDK_vswitchd_LDFLAGS)
+endif
 
 noinst_PROGRAMS += tests/test-strtok_r
 tests_test_strtok_r_SOURCES = tests/test-strtok_r.c
diff --git a/tests/dpdk-packet-mbufs.at b/tests/dpdk-packet-mbufs.at
new file mode 100644
index 000..f28e4fc
--- /dev/null
+++ b/tests/dpdk-packet-mbufs.at
@@ -0,0 +1,7 @@
+AT_BANNER([OVS-DPDK dp_packet unit tests])
+
+AT_SETUP([OVS-DPDK dp_packet - mbufs allocation])
+AT_KEYWORDS([dp_packet, multi-seg, mbufs])
+AT_CHECK(ovstest test-dpdk-packet, [], [ignore], [ignore])
+
+AT_CLEANUP
diff --git a/tests/system-dpdk-testsuite.at b/tests/system-dpdk-testsuite.at
index 382f09e..f5edf58 100644
--- a/tests/system-dpdk-testsuite.at
+++ b/tests/system-dpdk-testsuite.at
@@ -23,3 +23,4 @@ m4_include([tests/system-common-macros.at])
 m4_include([tests/system-dpdk-macros.at])
 
 m4_include([tests/system-dpdk.at])
+m4_include([tests/dpdk-packet-mbufs.at])
diff --git a/tests/test-dpdk-mbufs.c b/tests/test-dpdk-mbufs.c
new file mode 100644
index 000..19081a3
--- /dev/null
+++ b/tests/test-dpdk-mbufs.c
@@ -0,0 +1,619 @@
+/*
+ * Copyright (c) 2018 Intel Corporation
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "dp-packet.h"
+#include "ovstest.h"
+#include "dpdk.h"
+#include "smap.h"
+
+#define N_MBUFS 1024
+#define MBUF_DATA_LEN 2048
+
+static int num_tests = 0;
+
+/* Global var to hold a mempool instance, "test-mp", used in all of the tests
+ * below. This instance is instantiated in dpdk_setup_eal_with_mp(). */
+static struct rte_mempool *mp;
+
+/* Test data used to fill the packets with data. Note that this isn't a string
+ * that represents a valid packet, by any means. The pattern is generated in
+ * set_testing_pattern_str() and the sole purpose is to verify the data remains
+ * the same after inserting and operating on multi-segment mbufs. */
+static char *test_str;
+
+/* Asserts a dp_packet that holds a single mbuf, where:
+ * - nb_segs must be 1;
+ * - pkt_len must be equal to data_len which in turn must equal the provided
+ *   'pkt_len';
+ * - d

[ovs-dev] [PATCH v8 09/14] dp-packet: Add support for data "linearization".

2018-08-20 Thread Tiago Lam
Previous commits have added support to the dp_packet API to handle
multi-segmented packets, where data is not stored contiguously in
memory. However, in some cases, it is inevitable and data must be
provided contiguously. Examples of such cases are when performing csums
over the entire packet data, or when write()'ing to a file descriptor
(for a tap interface, for example). For such cases, the dp_packet API
has been extended to provide a way to transform a multi-segmented
DPBUF_DPDK packet into a DPBUF_MALLOC system packet (at the expense of a
copy of memory). If the packet's data is already stored in memory
contiguously then there's no need to convert the packet.

Additionally, the main use cases that were assuming that a dp_packet's
data is always held contiguously in memory were changed to make use of
the new "linear functions" in the dp_packet API when there's a need to
traverse the packet's entire data. Per the example above, when the
packet's data needs to be write() to the tap's file descriptor, or when
the conntrack module needs to verify a packet's checksum, the data is
now linearized.

Two new functions have also been added to the packets module to perform
the checksum over a dp_packet's data (using the already existing csum API).
Initially, this is just a way to abstract the data's linearization, but
in the future this could be optimized to perform the checksum over the
multi-segmented packets, without the need to copy.
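
A standalone model of the pattern (a toy in plain C, not the new
dp_packet API): if the payload spans several segments, copy it into one
flat buffer before an operation, here a toy checksum, that needs
contiguous data.

    #include <assert.h>
    #include <stdlib.h>
    #include <string.h>

    struct seg { const char *data; size_t len; struct seg *next; };

    /* Stand-in for csum(): any operation needing contiguous bytes. */
    static unsigned
    toy_csum(const unsigned char *p, size_t n)
    {
        unsigned s = 0;
        while (n--) {
            s += *p++;
        }
        return s;
    }

    static unsigned
    csum_packet(const struct seg *s, size_t pkt_len)
    {
        if (!s->next) {               /* already contiguous: no copy */
            return toy_csum((const unsigned char *) s->data, s->len);
        }

        unsigned char *flat = malloc(pkt_len);  /* linearise, at a cost */
        size_t ofs = 0;

        assert(flat);
        for (; s; s = s->next) {
            memcpy(flat + ofs, s->data, s->len);
            ofs += s->len;
        }

        unsigned c = toy_csum(flat, pkt_len);
        free(flat);
        return c;
    }

    int
    main(void)
    {
        struct seg b = { "def", 3, NULL };
        struct seg a = { "abc", 3, &b };
        struct seg one = { "abcdef", 6, NULL };

        assert(csum_packet(&a, 6) == csum_packet(&one, 6));
        return 0;
    }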

Signed-off-by: Tiago Lam 
---
 lib/bfd.c |  3 +-
 lib/conntrack.c   | 17 +
 lib/dp-packet.c   | 18 +
 lib/dp-packet.h   | 89 +++
 lib/dpif-netlink.c|  2 +-
 lib/dpif.c|  2 +-
 lib/netdev-bsd.c  |  2 +-
 lib/netdev-dummy.c|  5 ++-
 lib/netdev-linux.c|  5 ++-
 lib/netdev-native-tnl.c   | 24 ++--
 lib/odp-execute.c |  2 +-
 lib/ofp-print.c   |  2 +-
 lib/ovs-lldp.c|  3 +-
 lib/packets.c | 20 +-
 lib/packets.h |  3 ++
 ofproto/ofproto-dpif-sflow.c  |  2 +-
 ofproto/ofproto-dpif-upcall.c |  2 +-
 ofproto/ofproto-dpif-xlate.c  | 12 --
 18 files changed, 168 insertions(+), 45 deletions(-)

diff --git a/lib/bfd.c b/lib/bfd.c
index 5308262..d50d2da 100644
--- a/lib/bfd.c
+++ b/lib/bfd.c
@@ -722,7 +722,8 @@ bfd_process_packet(struct bfd *bfd, const struct flow *flow,
 if (!msg) {
VLOG_INFO_RL(&rl, "%s: Received too-short BFD control message (only "
  "%"PRIdPTR" bytes long, at least %d required).",
- bfd->name, (uint8_t *) dp_packet_tail(p) - l7,
+ bfd->name, dp_packet_size(p) -
+ (l7 - (uint8_t *) dp_packet_data(p)),
  BFD_PACKET_LEN);
 goto out;
 }
diff --git a/lib/conntrack.c b/lib/conntrack.c
index 974f985..15d1ed2 100644
--- a/lib/conntrack.c
+++ b/lib/conntrack.c
@@ -636,6 +636,8 @@ reverse_pat_packet(struct dp_packet *pkt, const struct conn *conn)
 static void
 reverse_nat_packet(struct dp_packet *pkt, const struct conn *conn)
 {
+void *l3 = dp_packet_linear_ofs(pkt, pkt->l3_ofs);
+void *l4 = dp_packet_linear_ofs(pkt, pkt->l4_ofs);
 char *tail = dp_packet_tail(pkt);
 char pad = dp_packet_l2_pad_size(pkt);
 struct conn_key inner_key;
@@ -644,8 +646,8 @@ reverse_nat_packet(struct dp_packet *pkt, const struct conn *conn)
 uint16_t orig_l4_ofs = pkt->l4_ofs;
 
 if (conn->key.dl_type == htons(ETH_TYPE_IP)) {
-struct ip_header *nh = dp_packet_l3(pkt);
-struct icmp_header *icmp = dp_packet_l4(pkt);
+struct ip_header *nh = l3;
+struct icmp_header *icmp = l4;
 struct ip_header *inner_l3 = (struct ip_header *) (icmp + 1);
extract_l3_ipv4(&inner_key, inner_l3, tail - ((char *)inner_l3) - pad,
&inner_l4, false);
@@ -664,8 +666,8 @@ reverse_nat_packet(struct dp_packet *pkt, const struct conn *conn)
 icmp->icmp_csum = 0;
 icmp->icmp_csum = csum(icmp, tail - (char *) icmp - pad);
 } else {
-struct ovs_16aligned_ip6_hdr *nh6 = dp_packet_l3(pkt);
-struct icmp6_error_header *icmp6 = dp_packet_l4(pkt);
+struct ovs_16aligned_ip6_hdr *nh6 = l3;
+struct icmp6_error_header *icmp6 = l4;
 struct ovs_16aligned_ip6_hdr *inner_l3_6 =
 (struct ovs_16aligned_ip6_hdr *) (icmp6 + 1);
extract_l3_ipv6(&inner_key, inner_l3_6,
@@ -1320,6 +1322,7 @@ conntrack_execute(struct conntrack *ct, struct dp_packet_batch *pkt_batch,
 write_ct_md(packet, zone, NULL, NULL, NULL);
 continue;
 }
+
process_one(ct, packet, &ctx, zone, force, commit, now, setmark,
 setlabel, nat_action_info, tp_src, tp_dst, helper);
 }
@@ -1902,8 +1905,8 @@ conn_key_extract

[ovs-dev] [PATCH v8 11/14] netdev-dpdk: support multi-segment jumbo frames

2018-08-20 Thread Tiago Lam
From: Mark Kavanagh 

Currently, jumbo frame support for OvS-DPDK is implemented by
increasing the size of mbufs within a mempool, such that each mbuf
within the pool is large enough to contain an entire jumbo frame of
a user-defined size. Typically, for each user-defined MTU,
'requested_mtu', a new mempool is created, containing mbufs of size
~requested_mtu.

With the multi-segment approach, a port uses a single mempool,
(containing standard/default-sized mbufs of ~2k bytes), irrespective
of the user-requested MTU value. To accommodate jumbo frames, mbufs
are chained together, where each mbuf in the chain stores a portion of
the jumbo frame. Each mbuf in the chain is termed a segment, hence the
name.
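
As a worked example of the resulting chains (assuming the ~2K default
data room):

    #include <assert.h>

    #define DIV_ROUND_UP(a, b) (((a) + (b) - 1) / (b))

    int
    main(void)
    {
        unsigned data_room = 2048;  /* default-sized mbuf */

        assert(DIV_ROUND_UP(9000, data_room) == 5);  /* jumbo: 5 segments */
        assert(DIV_ROUND_UP(1500, data_room) == 1);  /* normal MTU: 1 */
        return 0;
    }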

== Enabling multi-segment mbufs ==
Multi-segment and single-segment mbufs are mutually exclusive, and the
user must decide on which approach to adopt on init. The introduction
of a new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this. This
is a global boolean value, which determines how jumbo frames are
represented across all DPDK ports. In the absence of a user-supplied
value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment
mbufs must be explicitly enabled / single-segment mbufs remain the
default.

Setting the field is identical to setting existing DPDK-specific OVSDB
fields:

ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true

Co-authored-by: Tiago Lam 

Signed-off-by: Mark Kavanagh 
Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 Documentation/topics/dpdk/jumbo-frames.rst | 67 ++
 Documentation/topics/dpdk/memory.rst   | 36 
 NEWS   |  1 +
 lib/dpdk.c |  8 
 lib/netdev-dpdk.c  | 66 +
 lib/netdev-dpdk.h  |  2 +
 vswitchd/vswitch.xml   | 22 ++
 7 files changed, 194 insertions(+), 8 deletions(-)

diff --git a/Documentation/topics/dpdk/jumbo-frames.rst b/Documentation/topics/dpdk/jumbo-frames.rst
index 00360b4..07bf3ca 100644
--- a/Documentation/topics/dpdk/jumbo-frames.rst
+++ b/Documentation/topics/dpdk/jumbo-frames.rst
@@ -71,3 +71,70 @@ Jumbo frame support has been validated against 9728B frames, which is the
 largest frame size supported by Fortville NIC using the DPDK i40e driver, but
 larger frames and other DPDK NIC drivers may be supported. These cases are
 common for use cases involving East-West traffic only.
+
+---
+Multi-segment mbufs
+---
+
+Instead of increasing the size of mbufs within a mempool, such that each mbuf
+within the pool is large enough to contain an entire jumbo frame of a
+user-defined size, mbufs can be chained together. In this approach each mbuf
+in the chain stores a portion of the jumbo frame, by default ~2K bytes,
+irrespective of the user-requested MTU value. Since each mbuf in the chain is
+termed a segment, this approach is named "multi-segment mbufs".
+
+This approach may bring more flexibility in use cases where the maximum packet
+length may be hard to guess. For example, in cases where packets originate from
+sources marked for offload (such as TSO), each packet may be larger than the
+MTU, and as such, when forwarding it to a DPDK port a single mbuf may not be
+enough to hold all of the packet's data.
+
+Multi-segment and single-segment mbufs are mutually exclusive, and the user
+must decide on which approach to adopt on initialisation. If multi-segment
+mbufs is to be enabled, it can be done so with the following command::
+
+$ ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
+
+Single-segment mbufs still remain the default when using OvS-DPDK, and the
+above option `dpdk-multi-seg-mbufs` must be explicitly set to `true` if
+multi-segment mbufs are to be used.
+
+~
+Performance notes
+~
+
+When using multi-segment mbufs some PMDs may not support vectorized Tx
+functions, due to its non-contiguous nature. As a result this can hit
+performance for smaller packet sizes. For example, on a setup sending 64B
+packets at line rate, a decrease of ~20% has been observed. The performance
+impact stops being noticeable for larger packet sizes, although the exact size
+will vary between PMDs and depends on the architecture in use.
+
+Tests performed with the i40e PMD driver only showed this limitation for 64B
+packets, and the same rate was observed when comparing multi-segment mbufs and
+single-segment mbuf for 128B packets. In other words, the 20% drop in
+performance was not observed for packets >= 128B during this test case.
+
+Because of this, the use of multi-segment mbufs is not advised with

[ovs-dev] [PATCH v8 14/14] dpdk-tests: End-to-end tests for multi-seg mbufs.

2018-08-20 Thread Tiago Lam
The following tests are added to the DPDK testsuite to add some
coverage for the multi-segment mbufs:
- Check that multi-segment mbufs are disabled by default;
- Check that providing `other_config:dpdk-multi-seg-mbufs=true` indeed
  enables multi-segment mbufs;
- Using a DPDK port, send a random packet out and check that `ofctl
  dump-flows` shows the correct number of packets and bytes sent.

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 tests/system-dpdk.at | 65 
 1 file changed, 65 insertions(+)

diff --git a/tests/system-dpdk.at b/tests/system-dpdk.at
index 3d21b01..af8de8c 100644
--- a/tests/system-dpdk.at
+++ b/tests/system-dpdk.at
@@ -71,3 +71,68 @@ OVS_VSWITCHD_STOP("/does not exist. The Open vSwitch kernel module is probably n
 ")
 AT_CLEANUP
 dnl --
+
+AT_SETUP([Jumbo frames - Multi-segment disabled by default])
+OVS_DPDK_START()
+
+AT_CHECK([grep "multi-segment mbufs enabled" ovs-vswitchd.log], [1], [])
+OVS_VSWITCHD_STOP("/Global register is changed during/d
+/EAL: No free hugepages reported in hugepages-1048576kB/d
+")
+AT_CLEANUP
+
+AT_SETUP([Jumbo frames - Multi-segment enabled])
+OVS_DPDK_START([dpdk-multi-seg-mbufs=true])
+AT_CHECK([grep "multi-segment mbufs enabled" ovs-vswitchd.log], [], [stdout])
+OVS_VSWITCHD_STOP("/Global register is changed during/d
+/EAL: No free hugepages reported in hugepages-1048576kB/d
+")
+AT_CLEANUP
+
+AT_SETUP([Jumbo frames - Multi-segment mbufs Tx])
+OVS_DPDK_PRE_CHECK()
+OVS_DPDK_START([per-port-memory=true dpdk-multi-seg-mbufs=true])
+
+dnl Add userspace bridge and attach it to OVS
+AT_CHECK([ovs-vsctl add-br br10 -- set bridge br10 datapath_type=netdev])
+AT_CHECK([ovs-vsctl add-port br10 dpdk0 \
+-- set Interface dpdk0 type=dpdk options:dpdk-devargs=$(cat PCI_ADDR) \
+-- set Interface dpdk0 mtu_request=9000], [], [stdout], [stderr])
+
+AT_CHECK([ovs-vsctl show], [], [stdout])
+
+dnl Add flows to send packets out from the 'dpdk0' port
+AT_CHECK([
+ovs-ofctl del-flows br10
+ovs-ofctl add-flow br10 in_port=LOCAL,actions=output:dpdk0
+], [], [stdout])
+
+AT_CHECK([ovs-ofctl dump-flows br10], [], [stdout])
+
+dnl Send packet out, of the 'dpdk0' port
+AT_CHECK([
+ARP_HEADER="09000B0009000A000806000108000604000100010A\
+0100020A02"
+dnl Build a random hex string to append to the ARP_HEADER
+RANDOM_BODY=$(printf '0102030405%.0s' {1..1750})
+dnl 8792B ARP packet
+RANDOM_ARP="$ARP_HEADER$RANDOM_BODY"
+
+ovs-ofctl packet-out br10 "packet=$RANDOM_ARP,action=resubmit:LOCAL"
+], [], [stdout])
+
+AT_CHECK([ovs-ofctl dump-flows br10], [0], [stdout])
+
+dnl Confirm the single packet has been sent with the correct size
+AT_CHECK([ovs-ofctl dump-flows br10 | ofctl_strip | grep in_port], [0], [dnl
+ n_packets=1, n_bytes=8792, in_port=LOCAL actions=output:1
+])
+
+dnl Clean up
+OVS_VSWITCHD_STOP("/does not exist. The Open vSwitch kernel module is probably not loaded./d
+/Failed to enable flow control/d
+/failed to connect to \/tmp\/dpdkvhostclient0: No such file or directory/d
+/Global register is changed during/d
+/EAL: No free hugepages reported in hugepages-1048576kB/d
+")
+AT_CLEANUP
-- 
2.7.4



[ovs-dev] [PATCH v8 08/14] dp-packet: copy data from multi-seg. DPDK mbuf

2018-08-20 Thread Tiago Lam
From: Michael Qiu 

When cloning a packet whose source is the DPDK driver, multi-segment
mbufs must be considered, and each segment's data copied one by one.

Also, a lot of the DPDK mbuf's info is lost during a copy, like packet
type, ol_flags, etc. That information is very important for DPDK to do
packet processing.
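
A standalone model of the two halves of the copy (a toy; in the patch
the data copy goes through rte_pktmbuf_read() and the metadata through
dp_packet_copy_mbuf_flags()):

    #include <assert.h>
    #include <stddef.h>
    #include <string.h>

    struct seg { const char *data; size_t len; struct seg *next; };
    struct pkt { struct seg *segs; unsigned flags; };

    static void
    clone_pkt(char *dst, unsigned *dst_flags, const struct pkt *src)
    {
        size_t ofs = 0;
        for (struct seg *s = src->segs; s; s = s->next) {
            memcpy(dst + ofs, s->data, s->len);  /* data, seg by seg */
            ofs += s->len;
        }
        *dst_flags = src->flags;                 /* metadata too */
    }

    int
    main(void)
    {
        struct seg b = { "def", 3, NULL };
        struct seg a = { "abc", 3, &b };
        struct pkt p = { &a, 0x5 };

        char out[7] = "";
        unsigned flags = 0;

        clone_pkt(out, &flags, &p);
        assert(!memcmp(out, "abcdef", 6) && flags == 0x5);
        return 0;
    }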

Co-authored-by: Mark Kavanagh 
Co-authored-by: Tiago Lam 

Signed-off-by: Michael Qiu 
Signed-off-by: Mark Kavanagh 
Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 lib/dp-packet.c   | 69 ++-
 lib/dp-packet.h   |  3 +++
 lib/netdev-dpdk.c |  1 +
 3 files changed, 62 insertions(+), 11 deletions(-)

diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index 167bf43..806640b 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -48,6 +48,22 @@ dp_packet_use__(struct dp_packet *b, void *base, size_t allocated,
 dp_packet_set_size(b, 0);
 }
 
+#ifdef DPDK_NETDEV
+void
+dp_packet_copy_mbuf_flags(struct dp_packet *dst, const struct dp_packet *src)
+{
+ovs_assert(dst != NULL && src != NULL);
+struct rte_mbuf *buf_dst = &(dst->mbuf);
+const struct rte_mbuf *buf_src = &(src->mbuf);
+
+buf_dst->ol_flags = buf_src->ol_flags;
+buf_dst->packet_type = buf_src->packet_type;
+buf_dst->tx_offload = buf_src->tx_offload;
+}
+#else
+#define dp_packet_copy_mbuf_flags(arg1, arg2)
+#endif
+
 /* Initializes 'b' as an empty dp_packet that contains the 'allocated' bytes of
  * memory starting at 'base'.  'base' should be the first byte of a region
  * obtained from malloc().  It will be freed (with free()) if 'b' is resized or
@@ -158,6 +174,44 @@ dp_packet_clone(const struct dp_packet *buffer)
 return dp_packet_clone_with_headroom(buffer, 0);
 }
 
+#ifdef DPDK_NETDEV
+struct dp_packet *
+dp_packet_clone_with_headroom(const struct dp_packet *b, size_t headroom)
+{
+struct dp_packet *new_buffer;
+uint32_t pkt_len = dp_packet_size(b);
+
+/* Copy multi-seg data. */
+if (b->source == DPBUF_DPDK && !rte_pktmbuf_is_contiguous(&b->mbuf)) {
+void *dst = NULL;
+struct rte_mbuf *mbuf = CONST_CAST(struct rte_mbuf *, &b->mbuf);
+
+new_buffer = dp_packet_new_with_headroom(pkt_len, headroom);
+dst = dp_packet_data(new_buffer);
+dp_packet_set_size(new_buffer, pkt_len);
+
+if (!rte_pktmbuf_read(mbuf, 0, pkt_len, dst)) {
+dp_packet_delete(new_buffer);
+return NULL;
+}
+} else {
+new_buffer = dp_packet_clone_data_with_headroom(dp_packet_data(b),
+dp_packet_size(b),
+headroom);
+}
+
+/* Copy the following fields into the returned buffer: l2_pad_size,
+ * l2_5_ofs, l3_ofs, l4_ofs, cutlen, packet_type and md. */
+memcpy(&new_buffer->l2_pad_size, &b->l2_pad_size,
+   sizeof(struct dp_packet) -
+   offsetof(struct dp_packet, l2_pad_size));
+
+dp_packet_copy_mbuf_flags(new_buffer, b);
+if (dp_packet_rss_valid(new_buffer)) {
+new_buffer->mbuf.hash.rss = b->mbuf.hash.rss;
+}
+
+return new_buffer;
+}
+#else
 /* Creates and returns a new dp_packet whose data are copied from 'buffer'.
  * The returned dp_packet will additionally have 'headroom' bytes of
  * headroom. */
@@ -165,32 +219,25 @@ struct dp_packet *
 dp_packet_clone_with_headroom(const struct dp_packet *buffer, size_t headroom)
 {
 struct dp_packet *new_buffer;
+uint32_t pkt_len = dp_packet_size(buffer);
 
 new_buffer = dp_packet_clone_data_with_headroom(dp_packet_data(buffer),
- dp_packet_size(buffer),
- headroom);
+ pkt_len, headroom);
+
 /* Copy the following fields into the returned buffer: l2_pad_size,
  * l2_5_ofs, l3_ofs, l4_ofs, cutlen, packet_type and md. */
 memcpy(_buffer->l2_pad_size, >l2_pad_size,
 sizeof(struct dp_packet) -
 offsetof(struct dp_packet, l2_pad_size));
 
-#ifdef DPDK_NETDEV
-new_buffer->mbuf.ol_flags = buffer->mbuf.ol_flags;
-#else
 new_buffer->rss_hash_valid = buffer->rss_hash_valid;
-#endif
-
 if (dp_packet_rss_valid(new_buffer)) {
-#ifdef DPDK_NETDEV
-new_buffer->mbuf.hash.rss = buffer->mbuf.hash.rss;
-#else
 new_buffer->rss_hash = buffer->rss_hash;
-#endif
 }
 
 return new_buffer;
 }
+#endif
 
 /* Creates and returns a new dp_packet that initially contains a copy of the
  * 'size' bytes of data starting at 'data' with no headroom or tailroom. */
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index 3a99044..022e420 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -124,6 +124,9 @@ void dp_packet_init_dpdk(struct dp_packet *);
 void dp_packet_init(struct dp_packet *, size_t);
 void dp_packet_uninit(struct dp_packet *);
 
+void dp_packet_copy_mbuf_flags(struct dp_packet *dst, const struct dp_packet *src);
