Re: [ovs-dev] [PATCH v5] dpif-netdev: dpcls per in_port with sorted subtables

2016-08-12 Thread Daniele Di Proietto
I have added the ack that Antonio gave to the previous version.

I fixed a couple of minor sparse warnings about conversions between odp_port_t
and uint32_t.
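
For context, a rough sketch of the kind of conversion involved. The wrapper
type and helper names below are made up for illustration and are not the OVS
definitions: the idea is only that odp_port_t behaves as a distinct type under
sparse, so moving between it and a plain uint32_t has to go through explicit
conversion helpers to stay warning-free.

#include <stdint.h>

/* Illustrative stand-in for a "strong typedef" port type. */
typedef struct { uint32_t v; } odp_port_sketch_t;

static inline uint32_t
port_to_u32(odp_port_sketch_t port)
{
    return port.v;              /* explicit, checker-visible conversion */
}

static inline odp_port_sketch_t
u32_to_port(uint32_t v)
{
    odp_port_sketch_t port = { v };
    return port;
}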

I removed an 'inline' attribute from a function to silence a sparse
warning.  I think that's a sparse bug, I'm going to report it.

Not a big deal, but the 'signoff' email doesn't match the 'from' email.  I
didn't change it, but it'd be nice to have the same address next time.

I pushed this to master.

Thanks,

Daniele

[ovs-dev] [PATCH v5] dpif-netdev: dpcls per in_port with sorted subtables

2016-08-11 Thread Jan Scheurich
The user-space datapath (dpif-netdev) consists of a first level "exact match
cache" (EMC) matching on 5-tuples and the normal megaflow classifier. With
many parallel packet flows (e.g. TCP connections) the EMC becomes inefficient
and the OVS forwarding performance is determined by the megaflow classifier.

The megaflow classifier (dpcls) consists of a variable number of hash tables
(aka subtables), each containing megaflow entries with the same mask of
packet header and metadata fields to match upon. A dpcls lookup matches a
given packet against all subtables in sequence until it hits a match. As
megaflow cache entries are by construction non-overlapping, the first match
is the only match.
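
To make the lookup cost concrete, here is a minimal, self-contained C sketch
of the sequential subtable walk described above. The types and the
subtable_lookup() helper are simplified stand-ins, not the actual dpif-netdev
structures; the point is only that the search stops at the first hit, so its
cost is the number of subtables probed before that hit.

#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the real OVS types (names are illustrative). */
struct netdev_flow_key { uint64_t hash; };
struct dpcls_rule { int id; };
struct dpcls_subtable { int id; /* mask + hash map of megaflows (omitted) */ };

/* Hypothetical helper: probe one subtable's hash table. */
static struct dpcls_rule *
subtable_lookup(struct dpcls_subtable *sub, const struct netdev_flow_key *key)
{
    (void) sub;
    (void) key;
    return NULL;                /* placeholder body */
}

/* Walk the subtables in order.  Megaflow entries are non-overlapping by
 * construction, so the first hit is the only possible hit and the search
 * can stop there; a miss costs a probe of every subtable. */
static struct dpcls_rule *
dpcls_lookup_sketch(struct dpcls_subtable **subtables, size_t n,
                    const struct netdev_flow_key *key)
{
    for (size_t i = 0; i < n; i++) {
        struct dpcls_rule *rule = subtable_lookup(subtables[i], key);
        if (rule) {
            return rule;        /* hit after i + 1 subtable lookups */
        }
    }
    return NULL;
}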

Today the order of the subtables in the dpcls is essentially random so that
on average a dpcls lookup has to visit N/2 subtables for a hit, where N is the
total number of subtables. Even though every single hash-table lookup is
fast, the performance of the current dpcls degrades when there are many
subtables.

How does the patch address this issue:

In reality there is often a strong correlation between the ingress port and a
small subset of subtables that have hits. The entire megaflow cache typically
decomposes nicely into partitions that are hit only by packets entering from
a range of similar ports (e.g. traffic from Phy -> VM vs. traffic from VM ->
Phy).

Therefore, maintaining a separate dpcls instance per ingress port, with its
subtable vector sorted by frequency of hits, reduces the average number of
subtable lookups in the dpcls to a minimum, even if the total number of
subtables gets large. This is possible because megaflows always have an exact
match on in_port, so every megaflow belongs to a unique dpcls instance.
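
The following is a rough, self-contained sketch (not the patch's actual code)
of the per-port bookkeeping this describes: one classifier per in_port, whose
subtables carry hit counters and are periodically re-ranked so that the most
frequently hit subtable is probed first. The real patch keeps the per-port
classifiers in a cmap keyed by in_port and re-sorts a pvector; the plain array
and qsort() below are simplifications.

#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct subtable_entry {
    void *subtable;             /* the actual subtable (opaque here) */
    uint64_t hit_cnt;           /* hits since the last re-sort */
};

struct per_port_dpcls {
    uint32_t in_port;           /* every megaflow matches exactly on in_port */
    struct subtable_entry *entries;
    size_t n_entries;
};

static int
compare_hits_desc(const void *a_, const void *b_)
{
    const struct subtable_entry *a = a_, *b = b_;
    return (a->hit_cnt < b->hit_cnt) - (a->hit_cnt > b->hit_cnt);
}

/* Re-rank the subtables so the most frequently hit ones are probed first,
 * then reset the counters to start a fresh measurement window. */
static void
sort_subtables_by_hits(struct per_port_dpcls *cls)
{
    qsort(cls->entries, cls->n_entries, sizeof cls->entries[0],
          compare_hits_desc);
    for (size_t i = 0; i < cls->n_entries; i++) {
        cls->entries[i].hit_cnt = 0;
    }
}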

For thread safety, the PMD thread needs to block out revalidators during the
periodic optimization. We use ovs_mutex_trylock() to avoid blocking the PMD.
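
A minimal sketch of that trylock pattern, using plain pthreads rather than
OVS's ovs_mutex wrappers and a hypothetical optimize() callback standing in
for the subtable re-sorting: if a revalidator currently holds the flow mutex,
the PMD skips this optimization round instead of blocking on the lock.

#include <pthread.h>

static void
try_optimize(pthread_mutex_t *flow_mutex, void (*optimize)(void *), void *cls)
{
    if (pthread_mutex_trylock(flow_mutex) == 0) {
        optimize(cls);                   /* re-sort the subtable vectors */
        pthread_mutex_unlock(flow_mutex);
    } else {
        /* A revalidator holds the lock: skip this round and retry at the
         * next optimization interval rather than stalling the PMD. */
    }
}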

To monitor the effectiveness of the patch we have enhanced the ovs-appctl
dpif-netdev/pmd-stats-show command with an extra line "avg. subtable lookups
per hit" that reports the average number of subtable lookups needed for a
megaflow match. Ideally this should be close to 1 and in almost all cases much
smaller than N/2.
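
The reported figure is simply the ratio of subtable lookups to megaflow
(masked) hits over the measurement interval. The counter names below are
hypothetical rather than the actual dpif-netdev statistics; the example
reproduces the 2.5 lookups-per-hit baseline quoted further down.

#include <stdint.h>
#include <stdio.h>

static double
avg_subtable_lookups_per_hit(uint64_t n_subtable_lookups,
                             uint64_t n_masked_hits)
{
    return n_masked_hits
           ? (double) n_subtable_lookups / n_masked_hits
           : 0.0;
}

int
main(void)
{
    /* E.g. 2.5M subtable probes for 1M megaflow hits -> 2.5 lookups/hit. */
    printf("avg. subtable lookups per hit: %.1f\n",
           avg_subtable_lookups_per_hit(2500000, 1000000));
    return 0;
}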

The PMD tests have been adjusted to account for the additional line in pmd-stats-show.

We have benchmarked an L3-VPN pipeline on top of a VXLAN overlay mesh.
With pure L3 tenant traffic between VMs on different nodes the resulting
netdev dpcls contains N=4 subtables. Each packet traversing the OVS
datapath is subject to dpcls lookup twice due to the tunnel termination.

With the EMC disabled, we have measured a baseline performance (in+out) of ~1.45
Mpps (64-byte packets, 10K L4 packet flows). The average number of subtable lookups
per dpcls match is 2.5. With the patch the average number of subtable lookups
per dpcls match is reduced to 1 and the forwarding performance grows by ~50%
to 2.13 Mpps.

Even with EMC enabled, the patch improves the performance by 9% (for 1000 L4
flows) and 34% (for 50K+ L4 flows).

As the number of subtables will often be higher in real deployments, we can
assume that this is at the lower end of the speed-up one can expect from this
optimization. Just running a parallel ping between the VXLAN tunnel endpoints
increases the number of subtables and hence the average number of subtable
lookups from 2.5 to 3.5 on master with a corresponding decrease of throughput
to 1.2 Mpps. With the patch the parallel ping has no impact on the average
number of subtable lookups or on performance. The performance gain is then ~75%.

Signed-off-by: Jan Scheurich 


---

Changes in v5:
- Rebased to master (commit dd52de45b719)
- Implemented review comments by Daniele

Changes in v4:
- Renamed cpvector back to pvector after Jarno's revert patch 
  http://patchwork.ozlabs.org/patch/657508

Changes in v3:
- Rebased to master (commit 6ef5fa92eb70)
- Updated performance benchmark figures
- Adapted to renamed cpvector API
- Reverted dpcls to using cpvector due to a threading issue during flow removal
- Implemented v2 comments by Antonio Fischetti

Changes in v2:
- Rebased to master (commit 3041e1fc9638)
- Take the pmd->flow_mutex during optimization to block out revalidators
  Use trylock in order to not block the PMD thread
- Made in_port an explicit input parameter to fast_path_processing()
- Fixed coding style issues



 lib/dpif-netdev.c | 202 +++---
 tests/pmd.at  |   6 +-
 2 files changed, 181 insertions(+), 27 deletions(-)


diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index fe19b38..038f0a5 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -161,7 +161,12 @@ struct emc_cache {
 
 /* Simple non-wildcarding single-priority classifier. */
 
+/* Time in ms between successive optimizations of the dpcls subtable vector */
+#define DPCLS_OPTIMIZATION_INTERVAL 1000
+
 struct dpcls {
+struct cmap_node node;  /* Within dp_netdev_pmd_thread.classifier */
+odp_port_t