Re: [ovs-dev] [PATCH v2 0/5] dpif-netdev: Cuckoo-Distributor implementation

2017-11-29 Thread Jan Scheurich
Hi Yipeng,

The benchmark I used is the same one we applied last year when developing the
first series of DPDK datapath optimizations. The test setup is described in
fair detail in the linked PDF:
https://drive.google.com/open?id=0ByBuumQUR_NYMTlmZFFuQmJUYmM

The OVS bridge setup and the OpenFlow pipelines are modelled after the Netvirt
service in OpenDaylight acting as SDN backend to OpenStack Neutron.

We are working on a more up-to-date variant of the OpenFlow test pipeline that
includes newer features, such as stateful Security Groups using conntrack, to
get even closer to realistic deployments. This should give us a slightly
increased average number of subtable lookups; the current tests consistently
require only 1 subtable lookup per EMC/DFC miss.

The expected increase is due to two factors:
1. An exact match on a UDP port in a security-rule flow entry is translated
into a family of up to 16 datapath flow entries with prefix matches on the UDP
port for prefix lengths 1 through 16, with a decreasing share of hits: the
"/1" port mask takes 50%, "/2" 25%, "/3" 12.5%, and so on. On average this
requires about 2 subtable lookups per packet, given the sorting of subtables
(see the sketch below).
2. The conntrack recirculation itself, which matches on megaflows with their
own mask.

Still, that would be far from your tests with ~20 equally visited subtables.
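
As a quick sanity check on the 2-lookup figure, here is a back-of-the-envelope
sketch in Python (my own, not from any patch): prefix /k attracts a 2^-k share
of the hits and costs k lookups when the subtables are sorted by decreasing
hit rate, so the expectation converges to ~2.

# Expected subtable lookups for the UDP port-prefix expansion described above.
expected = sum(k * 2.0 ** -k for k in range(1, 17))
print("avg subtable lookups per packet: %.4f" % expected)  # ~1.9997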

BR, Jan

> -----Original Message-----
> From: Wang, Yipeng1 [mailto:yipeng1.w...@intel.com]
> Sent: Tuesday, 21 November, 2017 23:17
> To: Jan Scheurich ; d...@openvswitch.org
> Cc: Gobriel, Sameh ; Fischetti, Antonio 
> ; Ben Pfaff (b...@ovn.org) 
> Subject: RE: [PATCH v2 0/5] dpif-netdev: Cuckoo-Distributor implementation
> 
> Hi Jan,
> 
> Thanks for sharing the patch; it was nice talking with you at the conference.
> Could you tell us more about the test case you used, such as the rule set
> and traffic pattern? Since we have some other work at hand that needs to be
> finished soon, we will definitely test the patch and get back to you in a
> week or two.
> 
> Thanks,
> Yipeng
> 
> > [earlier quoted messages snipped; the 2017-11-20 message and the patch
> > cover letter appear in full below]

Re: [ovs-dev] [PATCH v2 0/5] dpif-netdev: Cuckoo-Distributor implementation

2017-11-21 Thread Wang, Yipeng1
Hi Jan,

Thanks for sharing the patch; it was nice talking with you at the conference.
Could you tell us more about the test case you used, such as the rule set and
traffic pattern? Since we have some other work at hand that needs to be
finished soon, we will definitely test the patch and get back to you in a week
or two.

Thanks,
Yipeng

> -----Original Message-----
> From: Jan Scheurich [mailto:jan.scheur...@ericsson.com]
> Sent: Monday, November 20, 2017 9:39 AM
> To: Wang, Yipeng1 ; d...@openvswitch.org
> Cc: Gobriel, Sameh ; Fischetti, Antonio
> ; db...@vmware.com; Ben Pfaff (b...@ovn.org)
> Subject: RE: [PATCH v2 0/5] dpif-netdev: Cuckoo-Distributor implementation
> 
> Hi Yipeng and Sameh,
> 
> As discussed at the OVS conference, I have just sent my Datapath Flow
> Cache patch to the ML:
> https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/341066.html
> http://patchwork.ozlabs.org/patch/839645/
> 
> I would be happy if you could test it in your setup and compare its
> performance with your CD implementation. I don't have a test bed at hand
> where I could easily test the speedup for many visited subtables.
> 
> Thanks, Jan
> 
> > -----Original Message-----
> > From: Yipeng Wang [mailto:yipeng1.w...@intel.com]
> > Sent: Wednesday, 01 November, 2017 00:40
> > Subject: [PATCH v2 0/5] dpif-netdev: Cuckoo-Distributor implementation
> >
> > [patch cover letter snipped; it appears in full in the 2017-10-31 message
> > below]

Re: [ovs-dev] [PATCH v2 0/5] dpif-netdev: Cuckoo-Distributor implementation

2017-11-20 Thread Jan Scheurich
Hi Yipeng and Sameh,

As discussed at the OVS conference, I have just sent my Datapath Flow Cache
patch to the ML:
https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/341066.html
http://patchwork.ozlabs.org/patch/839645/

I would be happy if you could test it in your setup and compare its
performance with your CD implementation. I don't have a test bed at hand where
I could easily test the speedup for many visited subtables.

Thanks, Jan

> -----Original Message-----
> From: Yipeng Wang [mailto:yipeng1.w...@intel.com]
> Sent: Wednesday, 01 November, 2017 00:40
> To: d...@openvswitch.org
> Subject: [PATCH v2 0/5] dpif-netdev: Cuckoo-Distributor implementation
> 
> [patch cover letter snipped; it appears in full in the 2017-10-31 message
> below]

Re: [ovs-dev] [PATCH v2 0/5] dpif-netdev: Cuckoo-Distributor implementation

2017-11-07 Thread Ben Pfaff
That's what I mean.

It could go in the 'tests' directory.

On Tue, Nov 07, 2017 at 05:01:04PM +, Wang, Yipeng1 wrote:
> Thanks Ben,
> 
> Do you mean to include TRex script into the repo? Could you suggest more 
> details like where would be a suitable place to put such kind of test scripts?
> 
> Thanks
> 
> > -----Original Message-----
> > From: Ben Pfaff [mailto:b...@ovn.org]
> > Sent: Friday, November 3, 2017 10:59 AM
> > To: Wang, Yipeng1 
> > Cc: d...@openvswitch.org
> > Subject: Re: [ovs-dev] [PATCH v2 0/5] dpif-netdev: Cuckoo-Distributor
> > implementation
> > 
> > Is this something that should be included in the repo?
> > 
> > On Fri, Nov 03, 2017 at 04:14:56PM +, Wang, Yipeng1 wrote:
> > > [TRex profile message snipped; it appears in full in the 2017-11-03
> > > message below]

Re: [ovs-dev] [PATCH v2 0/5] dpif-netdev: Cuckoo-Distributor implementation

2017-11-07 Thread Wang, Yipeng1
We are waiting for more comments before reworking this patchset. Please feel
free to post any comments.

Thanks!

> -----Original Message-----
> From: Wang, Yipeng1
> Sent: Tuesday, October 31, 2017 4:40 PM
> To: d...@openvswitch.org
> Subject: [PATCH v2 0/5] dpif-netdev: Cuckoo-Distributor implementation
> 
> [patch cover letter snipped; it appears in full in the 2017-10-31 message
> below]

Re: [ovs-dev] [PATCH v2 0/5] dpif-netdev: Cuckoo-Distributor implementation

2017-11-07 Thread Wang, Yipeng1
Thanks Ben,

Do you mean to include the TRex script in the repo? Could you suggest a
suitable place to put such test scripts?

Thanks

> -----Original Message-----
> From: Ben Pfaff [mailto:b...@ovn.org]
> Sent: Friday, November 3, 2017 10:59 AM
> To: Wang, Yipeng1 
> Cc: d...@openvswitch.org
> Subject: Re: [ovs-dev] [PATCH v2 0/5] dpif-netdev: Cuckoo-Distributor
> implementation
> 
> Is this something that should be included in the repo?
> 
> On Fri, Nov 03, 2017 at 04:14:56PM +, Wang, Yipeng1 wrote:
> > [TRex profile message snipped; it appears in full in the 2017-11-03
> > message below]

Re: [ovs-dev] [PATCH v2 0/5] dpif-netdev: Cuckoo-Distributor implementation

2017-11-03 Thread Ben Pfaff
Is this something that should be included in the repo?

On Fri, Nov 03, 2017 at 04:14:56PM +, Wang, Yipeng1 wrote:
> [quoted message snipped; it appears in full as the 2017-11-03 message below]

Re: [ovs-dev] [PATCH v2 0/5] dpif-netdev: Cuckoo-Distributor implementation

2017-11-03 Thread Wang, Yipeng1
To make it easier for the code reviewers to build and test the patchset, a
TRex profile that presents a very simple synthetic test case (random traffic
with 20 different src IPs and 50K different dst IPs) is attached below. It can
be used together with the rule set we mentioned in the cover letter to
generate a uniform distribution of hits among the 20 subtables. This synthetic
traffic pattern represents the worst-case scenario for the current subtable
ranking method. We observe about a 2x speedup vs. the original OvS in this
case. Note that the patchset automatically detects whether it is beneficial to
turn CD on or off to accommodate any traffic pattern, so when subtable ranking
works perfectly, CD will not be enabled and will not harm performance.

One can change the dstip and srcip_cnt variables to generate different numbers
of flows and subtable counts.

 
import locale, sys, time
from signal import *

import stl_path
from trex_stl_lib.api import *

[tx_port, rx_port] = my_ports = [0, 1]
tx_ports = [tx_port]
rx_ports = [rx_port]

global c

# dst IP varying from 0.0.0.0 to 0.0.195.255 is about 50k flows.
# src IP varying from 1.0.0.0 to 20.0.0.0 is 20 flows.
# 50k * 20 is about 1M total flows.
dstip = "0.0.195.255"
srcip_cnt = 20
size = 64

# Create the stream blocks. Each stream has one src IP with varying dst IPs;
# there are in total 20 different src IPs.
def make_streams():
    streams = [
        {"base_pkt": Ether()/IP(src="{}.0.0.0".format(i), tos=0x20)/UDP(),
         "vm": [
             STLVmFlowVar(name="ip_dst", min_value="0.0.0.0",
                          max_value=dstip, size=4, op="random"),
             STLVmWrFlowVar(fv_name="ip_dst", pkt_offset="IP.dst"),
         ]}
        for i in range(1, srcip_cnt + 1)
    ]
    return streams

if __name__ == "__main__":

    c = STLClient(verbose_level = LoggerApi.VERBOSE_QUIET)
    c.connect()

    c.reset(ports = my_ports)
    new_streams = make_streams()

    for s in new_streams:
        # Pad to the target frame size; 64 - 4 accounts for the FCS.
        pad = max(0, size - 4 - len(s["base_pkt"])) * 'x'
        s["base_pkt"] = s["base_pkt"]/pad

    pkts = [STLPktBuilder(pkt = s["base_pkt"], vm = s["vm"])
            for s in new_streams]

    # Generate continuous traffic; each stream gets an equal bandwidth share.
    final_streams = [STLStream(packet = pkt,
                               mode = STLTXCont(percentage = 100.0/len(pkts)))
                     for pkt in pkts]
    c.add_streams(final_streams, ports=[tx_port])
    c.set_port_attr(my_ports, promiscuous = True)

    # Start the traffic.
    c.start(ports = tx_ports)
    # Wait for 20 seconds.
    time.sleep(20)
    # Write the rx pps to stdout.
    sys.stdout.write("RX PPS: " +
                     str(int(c.get_stats(my_ports)[1]["rx_pps"])) + "\n")
    # Stop the traffic.
    c.stop(ports=my_ports)
    c.disconnect()
    c = None
 


> -----Original Message-----
> From: Wang, Yipeng1
> Sent: Tuesday, October 31, 2017 4:40 PM
> To: d...@openvswitch.org
> Subject: [PATCH v2 0/5] dpif-netdev: Cuckoo-Distributor implementation
> 
> [patch cover letter snipped; it appears in full in the 2017-10-31 message
> below]

[ovs-dev] [PATCH v2 0/5] dpif-netdev: Cuckoo-Distributor implementation

2017-10-31 Thread Yipeng Wang
The Datapath Classifier uses tuple space search for flow classification.
The rules are arranged into a set of tuples/subtables (each with a distinct
mask). Each subtable is implemented as a hash table, and lookup is done with
flow keys formed by selecting the bits of the packet header according to each
subtable's mask. Tuple space search sequentially searches each subtable until
a match is found. With a large number of subtables, this sequential search can
consume a lot of CPU cycles. In a testbench with a uniform traffic pattern
equally distributed across 20 subtables, we measured that up to 65% of total
execution time is attributable to the megaflow cache lookup.
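
To make the mechanism concrete, here is a toy Python sketch of tuple space
search (illustrative only, not the OVS implementation; the field names and the
dict-based subtable layout are invented for the example):

# Each subtable is (mask, rules): 'mask' maps field name -> bitmask, and
# 'rules' maps the masked flow key -> the rule to execute.
def probe(subtable, pkt):
    mask, rules = subtable
    key = tuple(sorted((f, pkt[f] & m) for f, m in mask.items()))
    return rules.get(key)

def tss_lookup(subtables, pkt):
    lookups = 0
    for st in subtables:            # sequential search: the cost grows
        lookups += 1                # with the number of traversed subtables
        rule = probe(st, pkt)
        if rule is not None:
            return rule, lookups
    return None, lookups

# Two subtables with distinct masks on nw_src (/8 and /16):
st1 = ({"nw_src": 0xFF000000}, {(("nw_src", 0x01000000),): "rule-A"})
st2 = ({"nw_src": 0xFFFF0000}, {(("nw_src", 0x02030000),): "rule-B"})
print(tss_lookup([st1, st2], {"nw_src": 0x02030405}))  # ('rule-B', 2)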

This patch presents the idea of a two-layer hierarchical lookup: a
low-overhead first level of indirection, which we call the cuckoo distributor
(CD), is accessed first. If a flow key has been inserted in the flow table,
the first level indicates with high probability which subtable to look into.
A lookup is then performed on the second level (the target subtable) to
retrieve the result. If the key does not have a match there, we revert to the
sequential search of subtables. The patch is partially inspired by earlier
concepts proposed in "simTable"[1] and "Cuckoo Filter"[2], and by DPDK's
Cuckoo Hash implementation.

This patch can improve on the already existing Subtable Ranking when the
traffic has high entropy. Subtable Ranking helps minimize the number of
traversed subtables when most of the traffic hits the same subtable. However,
with high-entropy traffic, such as traffic coming from a physical port,
multiple subtables can be hit with similar frequency, and the average number
of subtable lookups per hit is then much greater than 1. In addition, CD
adaptively turns itself off when it finds that the traffic mostly hits one
subtable, so CD adds no overhead when Subtable Ranking works well.

Scheme:
CD sits in front of the subtables. On a CD hit, packets are directed to the
corresponding subtable instead of searching each subtable sequentially.

 -------
|  CD   |
 -------
    \
     \
 -----  -----     -----
|sub  ||sub  |...|sub  |
|table||table|   |table|
 -----  -----     -----
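
A simplified sketch of that lookup path in Python (my own illustration: the
patch's CD is a cuckoo hash with multiple signature-tagged entries per bucket,
while this toy version is direct-mapped):

CD_SLOTS = 1 << 16  # illustrative first-level table size

def probe(subtable, pkt):  # same masked-key probe as in the earlier sketch
    mask, rules = subtable
    key = tuple(sorted((f, pkt[f] & m) for f, m in mask.items()))
    return rules.get(key)

def cd_lookup(cd, subtables, pkt):
    h = hash(tuple(sorted(pkt.items())))
    slot, sig = h & (CD_SLOTS - 1), (h >> 16) & 0xFFFF
    entry = cd.get(slot)                    # (signature, subtable index)
    if entry is not None and entry[0] == sig:
        rule = probe(subtables[entry[1]], pkt)
        if rule is not None:
            return rule                     # CD hit: one subtable probed
    for i, st in enumerate(subtables):      # CD miss or stale hint:
        rule = probe(st, pkt)               # fall back to sequential search
        if rule is not None:
            cd[slot] = (sig, i)             # remember where this key matched
            return rule
    return None

A false positive in the first level (a hash collision) simply falls through to
the sequential search, so correctness is unaffected; only the fast-path hit
rate suffers.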

Evaluation:
-----------
We create a set of rules matching on various src IPs and feed traffic
containing various numbers of flows with different src and dst IPs. All the
flows hit 10/20/30 rules, creating 10/20/30 subtables. We explain the
rule/traffic setup in detail below.

The tables below show the preliminary continuous-testing results (full
line-speed test; the numbers are throughput in packets per second) collected
with a uni-directional phy-to-phy setup. OvS runs with 1 PMD. We use Spirent
as the hardware traffic generator.

Before v2 rebase:
-----------------
AVX2 data:

20k flows:
no. subtables:       10      20      30
cd-ovs          4267332 3478251 3126763
orig-ovs        3260883 2174551 1689981
speedup           1.31x   1.60x   1.85x

100k flows:
no. subtables:       10      20      30
cd-ovs          4015783 3276100 2970645
orig-ovs        2692882 1711955 1302321
speedup           1.49x   1.91x   2.28x

1M flows:
no. subtables:       10      20      30
cd-ovs          3895961 3170530 2968555
orig-ovs        2683455 1646227 1240501
speedup           1.45x   1.92x   2.39x

Scalar data:

1M flows:
no. subtables:       10      20      30
cd-ovs          3658328 3028111 2863329
orig-ovs        2683455 1646227 1240501
speedup           1.36x   1.84x   2.31x

After v2 rebase:
----------------
After the rebase for v2, we re-tested the 1M-flow, 20-subtable case; the
results still hold.

1M flows:
no. subtables:       20
cd-ovs          3066483
orig-ovs        1588049
speedup           1.93x


Test rules/traffic setup:
-------------------------
To set up a test case with 20 subtables, the rule set we use is as follows:
tcp,nw_src=1.0.0.0/8, actions=output:1
udp,nw_src=2.0.0.0/9, actions=output:1
udp,nw_src=3.0.0.0/10,actions=output:1
udp,nw_src=4.0.0.0/11,actions=output:1
...
udp,nw_src=18.0.0.0/25,actions=output:1
udp,nw_src=19.0.0.0/26,actions=output:1
udp,nw_src=20.0.0.0/27,actions=output:1

Then, on the traffic generator, we generate the corresponding traffic with
src_ip varying from 1.0.0.0 to 20.0.0.0. For each src_ip we vary dst_ip over
roughly 50k different values (cf. the TRex profile earlier in the thread),
which effectively generates 1M different flows hitting the 20 rules we
created. Because of the different prefix lengths on nw_src, the 20 rules fall
into 20 distinct subtables. We use 64-byte packets across all tests.
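
For convenience, here is a small Python helper (mine, not part of the patch;
"br0" is an assumed bridge name) that emits this rule set as ovs-ofctl
commands; rule i matches nw_src i.0.0.0 with prefix length i+7, so each rule
lands in its own subtable:

# Print the 20-rule set used above as ovs-ofctl add-flow commands.
for i in range(1, 21):
    proto = "tcp" if i == 1 else "udp"
    print("ovs-ofctl add-flow br0 '%s,nw_src=%d.0.0.0/%d,actions=output:1'"
          % (proto, i, i + 7))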

How to check if CD works or not for your use case:
--------------------------------------------------
CD cannot improve throughput in all use cases. It targets use cases where
multiple subtables exist and the top-ranked subtable is not hit by the vast
majority of the traffic.

One can use the $OVS_DIR/utilities/ovs-appctl dpif-netdev/pmd-stats-show
command to check the CD hit/miss statistics. Another statistic shown there is
"avg. subtable lookups per hit". In our test case, the original OvS shows an
average subtable lookups value of about 10, because the 20 subtables are hit
equally and on average half of them must be searched before a match.