Re: [ovs-dev] [PATCH v5] dpif-netdev: dpcls per in_port with sorted subtables
I have added the ack from Antonio given to the previous version. I fixed a couple of minor sparse warnings about conversions between odp_port_t and uint32_t, and I removed an 'inline' attribute from a function to silence a sparse warning. I think that's a sparse bug; I'm going to report it.

Not a big deal, but the 'Signed-off-by' email doesn't match the 'From' email. I didn't change it, but it would be nice to have the same address next time.

I pushed this to master.

Thanks,
Daniele

2016-08-11 3:02 GMT-07:00 Jan Scheurich:
[ovs-dev] [PATCH v5] dpif-netdev: dpcls per in_port with sorted subtables
The user-space datapath (dpif-netdev) consists of a first-level "exact match cache" (EMC) matching on 5-tuples and the normal megaflow classifier. With many parallel packet flows (e.g. TCP connections) the EMC becomes inefficient and the OVS forwarding performance is determined by the megaflow classifier.

The megaflow classifier (dpcls) consists of a variable number of hash tables (aka subtables), each containing megaflow entries with the same mask of packet header and metadata fields to match upon. A dpcls lookup matches a given packet against all subtables in sequence until it hits a match. As megaflow cache entries are by construction non-overlapping, the first match is the only match.

Today the order of the subtables in the dpcls is essentially random, so that on average a dpcls lookup has to visit N/2 subtables for a hit, where N is the total number of subtables. Even though every single hash-table lookup is fast, the performance of the current dpcls degrades when there are many subtables.

How does the patch address this issue:

In reality there is often a strong correlation between the ingress port and a small subset of subtables that have hits. The entire megaflow cache typically decomposes nicely into partitions that are hit only by packets entering from a range of similar ports (e.g. traffic from Phy -> VM vs. traffic from VM -> Phy).

Therefore, maintaining a separate dpcls instance per ingress port with its subtable vector sorted by frequency of hits reduces the average number of subtable lookups in the dpcls to a minimum, even if the total number of subtables gets large. This is possible because megaflows always have an exact match on in_port, so every megaflow belongs to a unique dpcls instance.

For thread safety, the PMD thread needs to block out revalidators during the periodic optimization. We use ovs_mutex_trylock() to avoid blocking the PMD.
To monitor the effectiveness of the patch we have enhanced the ovs-appctl dpif-netdev/pmd-stats-show command with an extra line "avg. subtable lookups per hit" to report the average number of subtable lookups needed for a megaflow match. Ideally this should be close to 1 and in almost all cases much smaller than N/2.

The PMD tests have been adjusted to the additional line in pmd-stats-show.

We have benchmarked an L3-VPN pipeline on top of a VXLAN overlay mesh. With pure L3 tenant traffic between VMs on different nodes, the resulting netdev dpcls contains N=4 subtables. Each packet traversing the OVS datapath is subject to dpcls lookup twice due to the tunnel termination.

With the EMC disabled, we have measured a baseline performance (in+out) of ~1.45 Mpps (64 bytes, 10K L4 packet flows). The average number of subtable lookups per dpcls match is 2.5. With the patch the average number of subtable lookups per dpcls match is reduced to 1 and the forwarding performance grows by ~50% to 2.13 Mpps.

Even with the EMC enabled, the patch improves the performance by 9% (for 1000 L4 flows) and 34% (for 50K+ L4 flows).

As the actual number of subtables will often be higher in reality, we can assume that this is at the lower end of the speed-up one can expect from this optimization. Just running a parallel ping between the VXLAN tunnel endpoints increases the number of subtables and hence the average number of subtable lookups from 2.5 to 3.5 on master, with a corresponding decrease of throughput to 1.2 Mpps. With the patch the parallel ping has no impact on the average number of subtable lookups or on performance. The performance gain is then ~75%.
Signed-off-by: Jan Scheurich
---
Changes in v5:
- Rebased to master (commit dd52de45b719)
- Implemented review comments by Daniele

Changes in v4:
- Renamed cpvector back to pvector after Jarno's revert patch http://patchwork.ozlabs.org/patch/657508

Changes in v3:
- Rebased to master (commit 6ef5fa92eb70)
- Updated performance benchmark figures
- Adapted to renamed cpvector API
- Reverted dpcls to using cpvector due to threading issue during flow removal
- Implemented v2 comments by Antonio Fischetti

Changes in v2:
- Rebased to master (commit 3041e1fc9638)
- Take the pmd->flow_mutex during optimization to block out revalidators; use trylock in order to not block the PMD thread
- Made in_port an explicit input parameter to fast_path_processing()
- Fixed coding style issues

 lib/dpif-netdev.c | 202 +++---
 tests/pmd.at      |   6 +-
 2 files changed, 181 insertions(+), 27 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index fe19b38..038f0a5 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -161,7 +161,12 @@ struct emc_cache {
 
 /* Simple non-wildcarding single-priority classifier. */
 
+/* Time in ms between successive optimizations of the dpcls subtable vector */
+#define DPCLS_OPTIMIZATION_INTERVAL 1000
+
 struct dpcls {
+    struct cmap_node node;      /* Within dp_netdev_pmd_thread.classifier */
+    odp_port_t