Re: [ovs-dev] [PATCH v1 2/6] dpif-netdev: add tunnel_valid flag to skip ip/ipv6 address comparison

2020-07-10 Thread Yanqin Wei
Hi Ilya,

> >
> >>> ---
> >>
> >> Hi.
> >> First of all, thanks for working on performance improvements!
> > Thanks, I saw some slides where OVS was used to compare flow scalability
> with other projects. It inspired me to optimize this code.
> >
> >>
> >> However, this doesn't look like a clean patch.
> > There are some trade-offs for legacy code.
> 
> What trade-offs?
In some functions, the parameter is 'struct flow_tnl' instead of 'pkt_metadata' or
'flow'. A function that only receives 'struct flow_tnl' cannot see the flag, so
tunnel_valid cannot be used for the validity check there, and some function
signatures need to be modified.
> 
> >>
> >> Why do we need both pkt_metadata_datapath_init() and pkt_metadata_init()?
> >> Why can't we just not initialize ip_dst and use the tunnel_valid flag
> >> everywhere?
> >
> > This patch wants to reduce the scope of modification (only the fast path),
> > because performance is not critical for the slow path. So the tunnel dst
> > address is set before leaving the fast path (upcall).
> > Another reason is that the 'flow_tnl' member is defined in both 'pkt_metadata'
> > and 'flow'.  If the tunnel_valid flag is introduced into 'flow', the layout
> > and the legacy flow API also need to be modified.
> 
> I understand that you didn't want to touch anything besides the performance
> critical parts.  However, the dp_packet_/pkt_ API is already heavily overloaded,
> and having a few very similar functions that can or cannot be used in some
> contexts makes things even more complicated.  It's hard to read and maintain,
> and it's prone to errors in case someone tries to modify the datapath code.
> I'd prefer not to have different initialization functions and only have one
> variant.
> This will also solve the issue that every other part of the code uses tunneling
> metadata without checking the 'tunnel_valid' flag.  This is actually a logical
> mistake.
OK, it makes sense. I'll check all the places where flow_tnl is used and update
v2 with tunnel_valid checking.
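
As a rough illustration (not the actual v2 change), the guard at each use site
could be wrapped in a small helper; the helper name here is hypothetical:

/* Sketch: gate every read of tunnel metadata on 'tunnel_valid', since
 * md->tunnel is left uninitialized on the fast path. */
static inline bool
pkt_metadata_tnl_dst_valid(const struct pkt_metadata *md)
{
    return md->tunnel_valid && flow_tnl_dst_is_set(&md->tunnel);
}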

> And yes, 'tunnel_valid' flag really needs a comment inside the structure
> definition.
OK, will add a comment in v2.
> 
> >
> >>
> >> The current version complicates the code, making it less readable and
> >> prone to errors.
> > Do you prefer to use tunnel_valid in both the fast path and the slow path?
> > I could send v2 with this modification.
> >
> >> Best regards, Ilya Maximets.


Re: [ovs-dev] [PATCH v1 0/6] Memory access optimization for flow scalability of userspace datapath.

2020-07-08 Thread Yanqin Wei
Hi Harry,

> >
> > OVS userspace datapath is a program with heavy memory access. It needs
> > to load/store a large amount of memory, including packet headers,
> > metadata, EMC/SMC/DPCLS tables and so on. This causes a lot of cache
> > line misses and refills, which has a great impact on flow
> > scalability. And in some cases, EMC has a negative impact on the
> > overall performance. It is difficult for users to dynamically manage
> > enabling of EMC.
> >
> > This series of patches improve memory access of userspace datapath as
> > follows:
> > 1. Reduce the number of metadata cache lines accessed by non-tunnel traffic.
> > 2. Decrease unnecessary memory load/store for batch/flow.
> > 3. Modify the layout of the EMC data struct: centralize the storage of
> > hash values.
> >
> > In the NIC2NIC traffic tests, the overall performance improvement is
> > observed, especially in multi-flow cases.
> > Flows   delta
> > 1-1K flows  5-10%
> > 10K flows   20%
> > 100K flows  40%
> > EMC disable 10%
> 
> Hi Yanqin,
> 
> A quick simple test here with EMC disabled shows similar performance results
> to your data above, nice work. I think the optimizations here make sense, to
> not touch extra cache lines until required (e.g. tunnel metadata), particularly
> for outer packet parsing.
Many thanks for taking the time to test and review the patch.
> 
> I hope to enable more optimizations around dpif-netdev in 2.15, so if you are
> also planning to do more work in this area, it would be good to sync to avoid
> excessive rebasing in future?
That is great to hear. If we have new work planned for 2.15, we will discuss it
with you and the community.
> 
> Regards, -Harry
> 
> 


Re: [ovs-dev] 10-25 packet drops every few (10-50) seconds TCP (iperf3)

2020-07-06 Thread Yanqin Wei
The 2nd one is another periodic task for dpcls ranking.

From: Yanqin Wei
Sent: Tuesday, July 7, 2020 1:19 PM
To: Shahaji Bhosle
Cc: Flavio Leitner; ovs-dev@openvswitch.org; nd; Ilya Maximets; Lee Reed; Vinay Gupta; Alex Barba
Subject: RE: [ovs-dev] 10-25 packet drops every few (10-50) seconds TCP (iperf3)

Hi Shahaji,

Yes, OVS updates some counters every 10 seconds for PMD load balancing and PMD
info collection.  I am not aware of a way to disable them from the outside.
You could try to modify the following number and observe packet loss.

/* Time in microseconds of the interval in which rxq processing cycles used
 * in rxq to pmd assignments is measured and stored. */
#define PMD_RXQ_INTERVAL_LEN 10000000LL

/* Time in microseconds between successive optimizations of the dpcls
 * subtable vector */
#define DPCLS_OPTIMIZATION_INTERVAL 1000000LL
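
(Both are compile-time constants in lib/dpif-netdev.c, so changing them
requires rebuilding OVS.)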

Best Regards,
Wei Yanqin

From: Shahaji Bhosle <shahaji.bho...@broadcom.com>
Sent: Tuesday, July 7, 2020 12:23 PM
To: Yanqin Wei <yanqin@arm.com>
Cc: Flavio Leitner <f...@sysclose.org>; ovs-dev@openvswitch.org; nd <n...@arm.com>; Ilya Maximets <i.maxim...@samsung.com>; Lee Reed <lee.r...@broadcom.com>; Vinay Gupta <vinay.gu...@broadcom.com>; Alex Barba <alex.ba...@broadcom.com>
Subject: Re: [ovs-dev] 10-25 packet drops every few (10-50) seconds TCP (iperf3)

Thanks Yanqin,
What does this define mean? Every 10 seconds some kind of bookkeeping of the
packet processing cycles? Are you saying to make this even bigger in time,
1000 seconds or something? If I want to disable it, what do I do?
Thanks, Shahaji

On Mon, Jul 6, 2020 at 10:30 PM Yanqin Wei <yanqin@arm.com> wrote:
Hi Shahaji,

It seems to be caused by some periodic task.  In the PMD thread, PMD auto load
balancing is done periodically.
/* Time in microseconds of the interval in which rxq processing cycles used
 * in rxq to pmd assignments is measured and stored. */
#define PMD_RXQ_INTERVAL_LEN 10000000LL

Would you like to disable it if it is not necessary?
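
(In recent OVS releases the PMD auto load balance feature itself is gated by
other_config:pmd-auto-lb, which defaults to false, but the interval
bookkeeping above runs regardless.)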

Best Regards,
Wei Yanqin

From: Shahaji Bhosle <shahaji.bho...@broadcom.com>
Sent: Monday, July 6, 2020 8:24 PM
To: Yanqin Wei <yanqin@arm.com>
Cc: Flavio Leitner <f...@sysclose.org>; ovs-dev@openvswitch.org; nd <n...@arm.com>; Ilya Maximets <i.maxim...@samsung.com>; Lee Reed <lee.r...@broadcom.com>; Vinay Gupta <vinay.gu...@broadcom.com>; Alex Barba <alex.ba...@broadcom.com>
Subject: Re: [ovs-dev] 10-25 packet drops every few (10-50) seconds TCP (iperf3)

Hi Yanqin,
The drops happen at random intervals; sometimes I can run for minutes without
drops. The case is very borderline, when CPUs are close to 99% and with around
1000 flows. We see the drops once every 10-15 seconds and they are random in
nature. If I use one ring per core the drops go away; if I enable EMC the
drops go away, etc.
Thanks, Shahaji

On Mon, Jul 6, 2020 at 5:27 AM Yanqin Wei <yanqin@arm.com> wrote:
Hi Shahaji,

I have not measured context switch overhead, but I feel it should be
acceptable, because 10 Mpps throughput with zero packet drop (20 s) could be
achieved on some Arm servers.  Maybe you could do performance profiling on
your test bench to find out the root cause of the performance degradation with
multiple rings.

Best Regards,
Wei Yanqin

From: Shahaji Bhosle <shahaji.bho...@broadcom.com>
Sent: Thursday, July 2, 2020 9:27 PM
To: Yanqin Wei <yanqin@arm.com>
Cc: Flavio Leitner <f...@sysclose.org>; ovs-dev@openvswitch.org; nd <n...@arm.com>; Ilya Maximets <i.maxim...@samsung.com>; Lee Reed <lee.r...@broadcom.com>; Vinay Gupta <vinay.gu...@broadcom.com>; Alex Barba <alex.ba...@broadcom.com>
Subject: Re: [ovs-dev] 10-25 packet drops every few (10-50) seconds TCP (iperf3)

Thanks Yanqin,
I am not seeing any context switches beyond 40 usec in our do-nothing loop
test. But when OvS polls multiple rings (queues) on the same CPU and the
number of packets it batches grows (MAX_BURST_SIZE), the loops take more time,
and I can see the rings getting filled up. And then it's a feedback loop: CPUs
are running close to 100%, and any disturbance at that point I think is too
much. Do you have any data that you use to monitor OvS? I am doing all the
above experiments without OvS.
Thanks, Shahaji

On Thu, Jul 2, 2020 at 4:43 AM Yanqin Wei <yanqin@arm.com> wrote:
Hi Shahaji,

IIUC, the 1 Hz time tick cannot be disabled even with full dynticks, right? But
I have no idea why it would cause packet loss, because it should be only a
small overhead when rcu_nocbs is enabled.

Best Regards,
Wei Yanqin

Re: [ovs-dev] [PATCH v1 2/6] dpif-netdev: add tunnel_valid flag to skip ip/ipv6 address comparison

2020-07-06 Thread Yanqin Wei
Hi Ilya,

> > ---
> 
> Hi.
> First of all, thanks for working on performance improvements!
Thanks, I saw some slides where OVS was used to compare flow scalability with 
other projects. It inspired me to optimize this code.

> 
> However, this doesn't look like a clean patch.
There are some trade-offs for legacy code.
> 
> Why do we need both pkt_metadata_datapath_init() and pkt_metadata_init()?
> Why can't we just not initialize ip_dst and use the tunnel_valid flag everywhere?

This patch wants to reduce the scope of modification (only the fast path),
because performance is not critical for the slow path. So the tunnel dst
address is set before leaving the fast path (upcall).
Another reason is that the 'flow_tnl' member is defined in both 'pkt_metadata'
and 'flow'.  If the tunnel_valid flag is introduced into 'flow', the layout and
the legacy flow API also need to be modified.

> 
> The current version complicates the code, making it less readable and prone to errors.
Do you prefer to use tunnel_valid in both the fast path and the slow path? I
could send v2 with this modification.
 
> Best regards, Ilya Maximets.
> 
> >  lib/dpif-netdev.c | 14 +++---
> >  lib/flow.c|  2 +-
> >  lib/packets.h | 46 --
> >  3 files changed, 52 insertions(+), 10 deletions(-)
> >
> > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
> > 51c888501..c94d5e8c7 100644
> > --- a/lib/dpif-netdev.c
> > +++ b/lib/dpif-netdev.c
> > @@ -6625,12 +6625,16 @@ dfc_processing(struct dp_netdev_pmd_thread
> *pmd,
> >  if (i != cnt - 1) {
> >  struct dp_packet **packets = packets_->packets;
> >  /* Prefetch next packet data and metadata. */
> > -OVS_PREFETCH(dp_packet_data(packets[i+1]));
> > -pkt_metadata_prefetch_init(&packets[i+1]->md);
> > +OVS_PREFETCH(dp_packet_data(packets[i + 1]));
> > +if (md_is_valid) {
> > +    pkt_metadata_prefetch(&packets[i + 1]->md);
> > +} else {
> > +    pkt_metadata_prefetch_init(&packets[i + 1]->md);
> > +}
> >  }
> >
> >  if (!md_is_valid) {
> > -pkt_metadata_init(&packet->md, port_no);
> > +pkt_metadata_datapath_init(&packet->md, port_no);
> >  }
> >
> >  if ((*recirc_depth_get() == 0) && @@ -6730,6 +6734,10 @@
> > handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
> >  miniflow_expand(&key->mf, &match.flow);
> >  memset(&match.wc, 0, sizeof match.wc);
> >
> > +if (!packet->md.tunnel_valid) {
> > +    pkt_metadata_tnl_dst_init(&packet->md);
> > +}
> > +
> >  ofpbuf_clear(actions);
> >  ofpbuf_clear(put_actions);
> >
> > diff --git a/lib/flow.c b/lib/flow.c
> > index cc1b3f2db..1f0b3d4dc 100644
> > --- a/lib/flow.c
> > +++ b/lib/flow.c
> > @@ -747,7 +747,7 @@ miniflow_extract(struct dp_packet *packet, struct
> miniflow *dst)
> >  ovs_be16 ct_tp_src = 0, ct_tp_dst = 0;
> >
> >  /* Metadata. */
> > -if (flow_tnl_dst_is_set(&md->tunnel)) {
> > +if (md->tunnel_valid && flow_tnl_dst_is_set(&md->tunnel)) {
> >  miniflow_push_words(mf, tunnel, &md->tunnel,
> >  offsetof(struct flow_tnl, metadata) /
> >  sizeof(uint64_t)); diff --git
> > a/lib/packets.h b/lib/packets.h index 447e6f6fa..3b507d2a3 100644
> > --- a/lib/packets.h
> > +++ b/lib/packets.h
> > @@ -103,15 +103,16 @@
> PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline0,
> > action. */
> >  uint32_t skb_priority;  /* Packet priority for QoS. */
> >  uint32_t pkt_mark;  /* Packet mark. */
> > +struct conn *conn;  /* Cached conntrack connection. */
> >  uint8_t  ct_state;  /* Connection state. */
> >  bool ct_orig_tuple_ipv6;
> >  uint16_t ct_zone;   /* Connection zone. */
> >  uint32_t ct_mark;   /* Connection mark. */
> >  ovs_u128 ct_label;  /* Connection label. */
> >  union flow_in_port in_port; /* Input port. */
> > -struct conn *conn;  /* Cached conntrack connection. */
> >  bool reply; /* True if reply direction. */
> >  bool icmp_related;  /* True if ICMP related. */
> > +bool tunnel_valid;
> >  );
> >
> >  PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1,
> @@
> > -141,6 +142,7 @@ pkt_metadata_init_tnl(struct pkt_metadata *md)
> >   * are before this and as long as they are empty, the options won't
> >   * be looked at. */
> >  memset(md, 0, offsetof(struct pkt_metadata,
> > tunnel.metadata.opts));
> > +md->tunnel_valid = true;
> >  }
> >
> >  static inline void
> > @@ -151,6 +153,25 @@ pkt_metadata_init_conn(struct pkt_metadata *md)
> >
> >  static inline void
> >  pkt_metadata_init(struct pkt_metadata *md, odp_port_t port)
> > +{
> > +/* Initialize only till ct_state. Once the ct_state is zeroed out rest
> > + * of ct fields will not be looked at unless ct_state != 0.
> > + */
> > +memset(md, 0, 

Re: [ovs-dev] [PATCH v1 0/6] Memory access optimization for flow scalability of userspace datapath.

2020-07-06 Thread Yanqin Wei
Hi William,
> 
> On Tue, Jun 2, 2020 at 12:10 AM Yanqin Wei  wrote:
> >
> > OVS userspace datapath is a program with heavy memory access. It needs
> > to load/store a large number of memory, including packet header,
> > metadata, EMC/SMC/DPCLS tables and so on. It causes a lot of cache
> > line missing and refilling, which has a great impact on flow
> > scalability. And in some cases, EMC has a negative impact on the
> > overall performance. It is difficult for user to dynamically manage the
> enabling of EMC.
> 
> I'm just curious.
> Did you do some micro performance benchmark to find out these cache line
> issues?
Yes, we did some micro-benchmarking for packet parsing, EMC and DPCLS.  But the
end-to-end test is more important, because in the fast path different data
accesses affect each other. For example, a large EMC table will also impact
the access efficiency of metadata or packet headers.

> If so, what kind of tool do you use?
"perf stat -e" could record many kinds of PMU event. We could use "perf list" 
to list all events, some of them can be used to measure memory access 
efficiency (cache miss/refill/evict).
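
For example, something like "perf stat -e cache-references,cache-misses -p
<ovs-vswitchd pid> -- sleep 10" gives a quick cache-miss ratio for a running
switch; the exact PMU event names vary by CPU, so check "perf list" first.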

> Or do you do it by inspecting the code?
Code analysis is also important. We need to analyze the main data accessed in 
the fast path and their layout. 

> 
> Thanks
> William


Re: [ovs-dev] [PATCH v1 0/6] Memory access optimization for flow scalability of userspace datapath.

2020-07-06 Thread Yanqin Wei
Hi William,

Many thanks for your time testing these patches. The numbers were achieved on
an Arm server, but x86 shows a similar improvement.
And CPU cache size will slightly impact the performance data, because the
larger the cache, the lower the probability of cache refills/evictions.

Best Regards,
Wei Yanqin 
> 
> On Tue, Jun 30, 2020 at 2:26 AM Yanqin Wei  wrote:
> >
> > Hi, every contributor
> >
> > These patches could significantly improve the multi-flow throughput of the
> > userspace datapath.  If you feel it will take too much time to review all
> > the patches, I suggest looking at the 2nd/3rd first, which contain the
> > major improvements in this series:
> > [ovs-dev][PATCH v1 2/6] dpif-netdev: add tunnel_valid flag to skip
> > ip/ipv6 address comparison
> > [ovs-dev][PATCH v1 3/6] dpif-netdev: improve emc lookup performance by
> > contiguous storage of hash value.
> >
> > Any comments from anyone are appreciated.
> >
> > Best Regards,
> > Wei Yanqin
> >
> > > -Original Message-
> > > From: Yanqin Wei 
> > > Sent: Tuesday, June 2, 2020 3:10 PM
> > > To: d...@openvswitch.org
> > > Cc: nd ; i.maxim...@ovn.org; u9012...@gmail.com;
> Malvika
> > > Gupta ; Lijian Zhang ;
> > > Ruifeng Wang ; Lance Yang
> > > ; Yanqin Wei 
> > > Subject: [ovs-dev][PATCH v1 0/6] Memory access optimization for flow
> > > scalability of userspace datapath.
> > >
> > > OVS userspace datapath is a program with heavy memory access. It
> > > needs to load/store a large amount of memory, including packet
> > > headers, metadata, EMC/SMC/DPCLS tables and so on. This causes a lot of
> > > cache line misses and refills, which has a great impact on flow
> > > scalability. And in some cases, EMC has a negative impact on the
> > > overall performance. It is difficult for users to dynamically manage
> > > enabling of EMC.
> > >
> > > This series of patches improve memory access of userspace datapath
> > > as
> > > follows:
> > > 1. Reduce the number of metadata cache lines accessed by non-tunnel
> > > traffic.
> > > 2. Decrease unnecessary memory load/store for batch/flow.
> > > 3. Modify the layout of the EMC data struct: centralize the storage of
> > > hash values.
> > >
> > > In the NIC2NIC traffic tests, the overall performance improvement is
> > > observed, especially in multi-flow cases.
> > > Flows   delta
> > > 1-1K flows  5-10%
> > > 10K flows   20%
> > > 100K flows  40%
> > > EMC disable 10%
> 
> Thanks for submitting the patch series. I applied the series and I do see the
> performance improvement you describe above.
> btw, are your numbers from an ARM server or x86?

> Below are my numbers using a single flow and drop action on an Intel(R)
> Xeon(R) CPU @ 2.00GHz.
> In summary I see around a 10% improvement using 1 flow.
> 
> === master ===
> root@instance-3:~/ovs# ovs-appctl dpif-netdev/pmd-stats-show pmd thread
> numa_id 0 core_id 0:
>   packets received: 96269888
>   packet recirculations: 0
>   avg. datapath passes per packet: 1.00
>   emc hits: 87513839
>   smc hits: 0
>   megaflow hits: 8755584
>   avg. subtable lookups per megaflow hit: 1.00
>   miss with success upcall: 1
>   miss with failed upcall: 432
>   avg. packets per output batch: 0.00
>   idle cycles: 0 (0.00%)
>   processing cycles: 20083008856 (100.00%)
>   avg cycles per packet: 208.61 (20083008856/96269888)
>   avg processing cycles per packet: 208.61 (20083008856/96269888)
> 
> === master without EMC ===
> pmd thread numa_id 0 core_id 1:
>   packets received: 90775936
>   packet recirculations: 0
>   avg. datapath passes per packet: 1.00
>   emc hits: 0
>   smc hits: 0
>   megaflow hits: 90775424
>   avg. subtable lookups per megaflow hit: 1.00
>   miss with success upcall: 1
>   miss with failed upcall: 479
>   avg. packets per output batch: 0.00
>   idle cycles: 0 (0.00%)
>   processing cycles: 21239087946 (100.00%)
>   avg cycles per packet: 233.97 (21239087946/90775936)
>   avg processing cycles per packet: 233.97 (21239087946/90775936)
> 
> === yanqin v1: ===
> pmd thread numa_id 0 core_id 1:
>   packets received: 156582112
>   packet recirculations: 0
>   avg. datapath passes per packet: 1.00
>   emc hits: 142344109
>   smc hits: 0
>   megaflow hits: 14237554
>   avg. subtable lookups per megaflow hit: 1.00
>   miss with success upcall: 1
>   miss with failed upcall: 448
>   avg. packets per output batch: 0.00
>   idle cycles: 4320112 (0.01%)
>   processing cycles: 30503055968 (99.99%)
>   avg

Re: [ovs-dev] 10-25 packet drops every few (10-50) seconds TCP (iperf3)

2020-07-06 Thread Yanqin Wei
Hi Shahaji,

I have not measured context switch overhead, but I feel it should be
acceptable, because 10 Mpps throughput with zero packet drop (20 s) could be
achieved on some Arm servers.  Maybe you could do performance profiling on
your test bench to find out the root cause of the performance degradation with
multiple rings.

Best Regards,
Wei Yanqin

From: Shahaji Bhosle
Sent: Thursday, July 2, 2020 9:27 PM
To: Yanqin Wei
Cc: Flavio Leitner; ovs-dev@openvswitch.org; nd; Ilya Maximets; Lee Reed; Vinay Gupta; Alex Barba
Subject: Re: [ovs-dev] 10-25 packet drops every few (10-50) seconds TCP (iperf3)

Thanks Yanqin,
I am not seeing any context switches beyond 40 usec in our do-nothing loop
test. But when OvS polls multiple rings (queues) on the same CPU and the
number of packets it batches grows (MAX_BURST_SIZE), the loops take more time,
and I can see the rings getting filled up. And then it's a feedback loop: CPUs
are running close to 100%, and any disturbance at that point I think is too
much. Do you have any data that you use to monitor OvS? I am doing all the
above experiments without OvS.
Thanks, Shahaji

On Thu, Jul 2, 2020 at 4:43 AM Yanqin Wei <yanqin@arm.com> wrote:
Hi Shahaji,

IIUC, the 1 Hz time tick cannot be disabled even with full dynticks, right? But
I have no idea why it would cause packet loss, because it should be only a
small overhead when rcu_nocbs is enabled.

Best Regards,
Wei Yanqin

===

From: Shahaji Bhosle <shahaji.bho...@broadcom.com>
Sent: Thursday, July 2, 2020 6:11 AM
To: Yanqin Wei <yanqin@arm.com>
Cc: Flavio Leitner <f...@sysclose.org>; ovs-dev@openvswitch.org; nd <n...@arm.com>; Ilya Maximets <i.maxim...@samsung.com>; Lee Reed <lee.r...@broadcom.com>; Vinay Gupta <vinay.gu...@broadcom.com>; Alex Barba <alex.ba...@broadcom.com>
Subject: Re: [ovs-dev] 10-25 packet drops every few (10-50) seconds TCP (iperf3)

Hi Yanqin,
I added the patch you gave me to my script which runs a do-nothing for loop.
You can see the spikes in the plot below. 976/1000 times we are perfect, but
around every 1 second you can see something going wrong. I don't see anything
wrong in the trace-cmd world.
Thanks, Shahaji

root@bcm958802a8046c:~/vinay_rx/dynticks-testing# ./run_isb_rdtsc
+ TARGET=2
+ MASK=4
+ NUM_ITER=1000
+ NUM_MS=100
+ N=3750
+ LOGFILE=loop_1000iter_100ms.log
+ tee loop_1000iter_100ms.log
+ trace-cmd record -p function_graph -e all -M 4 -o trace_1000iter_100ms.dat 
taskset -c 2 /home/root/arm_stb_user_loop_isb_rdtsc 1000 3750
  plugin 'function_graph'
Cycles/Second (Hz) = 30
Nano-seconds per cycle = 0.

Using ISB() before rte_rdtsc()
num_iter: 1000
do_nothing_loop for (N)=3750
Running 1000 iterations of do_nothing_loop for (N)=3750

Average =  100282.193430333 u-secs
Max =  124777.48867 u-secs
Min =  10.01767 u-secs
σ  =1931.352376508 u-secs

Average =  300846580.29 cycles
Max =  374332466.00 cycles
Min =  30053.00 cycles
σ  =5794057.13 cycles

#σ = events
 0 = 976
 1 = 3
 2 = 4
 3 = 3
 4 = 3
 5 = 2
 6 = 2
 7 = 2
 8 = 1
 9 = 1
10 = 1
12 = 2




On Wed, Jul 1, 2020 at 3:57 AM Yanqin Wei <yanqin@arm.com> wrote:
Hi Shahaji,

Adding an isb instruction helps make rdtsc precise, by syncing the system
counter to cntvct_el0. There is a patch in DPDK: https://patchwork.dpdk.org/patch/66561/
So it may not be related to the intermittent drops you observed.

Best Regards,
Wei Yanqin

> -Original Message-
> From: dev <ovs-dev-boun...@openvswitch.org> On Behalf Of Shahaji Bhosle via dev
> Sent: Wednesday, July 1, 2020 6:05 AM
> To: Flavio Leitner <f...@sysclose.org>
> Cc: ovs-dev@openvswitch.org; Ilya Maximets <i.maxim...@samsung.com>;
> Lee Reed <lee.r...@broadcom.com>; Vinay Gupta <vinay.gu...@broadcom.com>;
> Alex Barba <alex.ba...@broadcom.com>
> Subject: Re: [ovs-dev] 10-25 packet drops every few (10-50) seconds TCP 
> (iperf3)
>
> Hi Flavio,
> I still see intermittent drops with rcu_nocbs. So I wrote that do_nothing()
> loop to avoid all the other distractions, to see if Linux is messing with the
> OVS loop. The interesting thing: in the case in *BOLD* below, where I use an
> ISB() instruction, my STD deviation is well within Both the

Re: [ovs-dev] 10-25 packet drops every few (10-50) seconds TCP (iperf3)

2020-07-02 Thread Yanqin Wei
Hi Shahaji,

IIUC, the 1 Hz time tick cannot be disabled even with full dynticks, right? But
I have no idea why it would cause packet loss, because it should be only a
small overhead when rcu_nocbs is enabled.

Best Regards,
Wei Yanqin

===

From: Shahaji Bhosle
Sent: Thursday, July 2, 2020 6:11 AM
To: Yanqin Wei
Cc: Flavio Leitner; ovs-dev@openvswitch.org; nd; Ilya Maximets; Lee Reed; Vinay Gupta; Alex Barba
Subject: Re: [ovs-dev] 10-25 packet drops every few (10-50) seconds TCP (iperf3)

Hi Yanqin, 
I added the patch you gave me to my script which runs a do-nothing for loop.
You can see the spikes in the plot below. 976/1000 times we are perfect, but
around every 1 second you can see something going wrong. I don't see anything
wrong in the trace-cmd world.
Thanks, Shahaji

root@bcm958802a8046c:~/vinay_rx/dynticks-testing# ./run_isb_rdtsc 
+ TARGET=2
+ MASK=4
+ NUM_ITER=1000
+ NUM_MS=100
+ N=3750
+ LOGFILE=loop_1000iter_100ms.log
+ tee loop_1000iter_100ms.log
+ trace-cmd record -p function_graph -e all -M 4 -o trace_1000iter_100ms.dat 
taskset -c 2 /home/root/arm_stb_user_loop_isb_rdtsc 1000 3750
  plugin 'function_graph'
Cycles/Second (Hz) = 30
Nano-seconds per cycle = 0.

Using ISB() before rte_rdtsc()
num_iter: 1000
do_nothing_loop for (N)=3750 
Running 1000 iterations of do_nothing_loop for (N)=3750

Average =          100282.193430333 u-secs
Max =          124777.48867 u-secs
Min =          10.01767 u-secs
σ  =            1931.352376508 u-secs

Average =              300846580.29 cycles
Max =              374332466.00 cycles
Min =              30053.00 cycles
σ  =                5794057.13 cycles

#σ = events
 0 = 976
 1 = 3
 2 = 4
 3 = 3
 4 = 3
 5 = 2
 6 = 2
 7 = 2
 8 = 1
 9 = 1
10 = 1
12 = 2




On Wed, Jul 1, 2020 at 3:57 AM Yanqin Wei <yanqin@arm.com> wrote:
Hi Shahaji,

Adding an isb instruction helps make rdtsc precise, by syncing the system
counter to cntvct_el0. There is a patch in DPDK: https://patchwork.dpdk.org/patch/66561/
So it may not be related to the intermittent drops you observed.

Best Regards,
Wei Yanqin

> -Original Message-
> From: dev <mailto:ovs-dev-boun...@openvswitch.org> On Behalf Of Shahaji Bhosle
> via dev
> Sent: Wednesday, July 1, 2020 6:05 AM
> To: Flavio Leitner <mailto:f...@sysclose.org>
> Cc: mailto:ovs-dev@openvswitch.org; Ilya Maximets 
> <mailto:i.maxim...@samsung.com>;
> Lee Reed <mailto:lee.r...@broadcom.com>; Vinay Gupta
> <mailto:vinay.gu...@broadcom.com>; Alex Barba <mailto:alex.ba...@broadcom.com>
> Subject: Re: [ovs-dev] 10-25 packet drops every few (10-50) seconds TCP 
> (iperf3)
>
> Hi Flavio,
> I still see intermittent drops with rcu_nocbs. So I wrote that do_nothing()
> loop to avoid all the other distractions, to see if Linux is messing with the
> OVS loop. The interesting thing is the case in *BOLD* below, where I use an
> ISB() instruction: my STD deviation is well within bounds. Both
> results are basically DO NOTHING FOR 100 msec and see what happens to
> time :) Thanks, Shahaji
>
> static inline uint64_t
> rte_get_tsc_cycles(void)
> {
>     uint64_t tsc;
> #ifdef USE_ISB
>     asm volatile("isb; mrs %0, pmccntr_el0" : "=r"(tsc));
> #else
>     asm volatile("mrs %0, pmccntr_el0" : "=r"(tsc));
> #endif
>     return tsc;
> }
> #endif /* RTE_ARM_EAL_RDTSC_USE_PMU */
>
> ==
> usleep(100);
> for (volatile int i=0; i < num_iter; i++) {
>     volatile uint64_t tsc_start = rte_get_tsc_cycles();
>     /* do nothing for 1 us */
> #ifdef USE_ISB
>     for (volatile int j=0; j < num_us; j++);   /* <<<<<<<<<<<< THIS IS MESSED
>        UP, 100msec do nothing, I am getting 2033 usec STD DEVIATION */
> #else
>     for (volatile int j=0; j < num_us; j++);   /* <<<<<<<<<<<< THIS LOOP HAS
>        VERY LOW STD DEVIATION */
>     rte_isb();
> #endif
>     volatile uint64_t tsc_end = rte_get_tsc_cycles();
>     cycles[i] = tsc_end - tsc_start;
> }
> usleep(100); calc_avg_var_stddev(num_iter, &cycles[0]);
> ===
> *#ifdef USE_ISB*
> root@bcm958802a8046c:~/vinay_rx/dynticks-testing# ./run_isb_rdtsc
> + TARGET=2
> + MASK=4
> + NUM_ITER=1000
> + NUM_MS=100
> + N=3750
> + LOGFILE=loop_1000iter_100ms.log
> + tee loop_1000iter_100ms.log
> + trace-cmd record -p function_graph -e all -M 4 -o
> trace_1000iter_100ms.dat taskset -c 2
> /home/root/arm_stb_user_loop_isb_rdtsc 1000 3750
>   plugin 'function_graph'
> Cycles/Second (Hz) = 30
> Nano-seconds per cycle = 0.
>
> Using ISB() before rte_rdtsc()
> num_iter: 1000
> do_nothing_loop for (N)=3750
> Running 1000 iterations of do_nothing_loop for (N)=3750
>
> Average =    

Re: [ovs-dev] 10-25 packet drops every few (10-50) seconds TCP (iperf3)

2020-07-01 Thread Yanqin Wei
Hi Shahaji,

Adding an isb instruction helps make rdtsc precise, by syncing the system
counter to cntvct_el0. There is a patch in DPDK: https://patchwork.dpdk.org/patch/66561/
So it may not be related to the intermittent drops you observed.

Best Regards,
Wei Yanqin

> -Original Message-
> From: dev On Behalf Of Shahaji Bhosle via dev
> Sent: Wednesday, July 1, 2020 6:05 AM
> To: Flavio Leitner
> Cc: ovs-dev@openvswitch.org; Ilya Maximets; Lee Reed; Vinay Gupta; Alex Barba
> Subject: Re: [ovs-dev] 10-25 packet drops every few (10-50) seconds TCP 
> (iperf3)
>
> Hi Flavio,
> I still see intermittent drops with rcu_nocbs. So I wrote that do_nothing()
> loop to avoid all the other distractions, to see if Linux is messing with the
> OVS loop. The interesting thing is the case in *BOLD* below, where I use an
> ISB() instruction: my STD deviation is well within bounds. Both
> results are basically DO NOTHING FOR 100 msec and see what happens to
> time :) Thanks, Shahaji
>
> static inline uint64_t
> rte_get_tsc_cycles(void)
> {
>     uint64_t tsc;
> #ifdef USE_ISB
>     asm volatile("isb; mrs %0, pmccntr_el0" : "=r"(tsc));
> #else
>     asm volatile("mrs %0, pmccntr_el0" : "=r"(tsc));
> #endif
>     return tsc;
> }
> #endif /* RTE_ARM_EAL_RDTSC_USE_PMU */
>
> ==
> usleep(100);
> for (volatile int i=0; i < num_iter; i++) {
>     volatile uint64_t tsc_start = rte_get_tsc_cycles();
>     /* do nothing for 1 us */
> #ifdef USE_ISB
>     for (volatile int j=0; j < num_us; j++);   /* <<<<<<<<<<<< THIS IS MESSED
>        UP, 100msec do nothing, I am getting 2033 usec STD DEVIATION */
> #else
>     for (volatile int j=0; j < num_us; j++);   /* <<<<<<<<<<<< THIS LOOP HAS
>        VERY LOW STD DEVIATION */
>     rte_isb();
> #endif
>     volatile uint64_t tsc_end = rte_get_tsc_cycles();
>     cycles[i] = tsc_end - tsc_start;
> }
> usleep(100); calc_avg_var_stddev(num_iter, &cycles[0]);
> ===
> *#ifdef USE_ISB*
> root@bcm958802a8046c:~/vinay_rx/dynticks-testing# ./run_isb_rdtsc
> + TARGET=2
> + MASK=4
> + NUM_ITER=1000
> + NUM_MS=100
> + N=3750
> + LOGFILE=loop_1000iter_100ms.log
> + tee loop_1000iter_100ms.log
> + trace-cmd record -p function_graph -e all -M 4 -o
> trace_1000iter_100ms.dat taskset -c 2
> /home/root/arm_stb_user_loop_isb_rdtsc 1000 3750
>   plugin 'function_graph'
> Cycles/Second (Hz) = 30
> Nano-seconds per cycle = 0.
>
> Using ISB() before rte_rdtsc()
> num_iter: 1000
> do_nothing_loop for (N)=3750
> Running 1000 iterations of do_nothing_loop for (N)=3750
>
> Average =  100328.158561667 u-secs
> Max =  123024.79533 u-secs
> Min =  10.01767 u-secs
> σ  =2033.118969489 u-secs
>
> Average =  300984475.69 cycles
> Max =  369074386.00 cycles
> Min =  30053.00 cycles
> σ  =6099356.91 cycles
>
> #σ = events
>  0 = 968
>  1 = 8
>  2 = 5
>  3 = 3
>  4 = 3
>  5 = 3
>  6 = 3
>  8 = 3
> 10 = 3
> 11 = 1
>
> *#ELSE*
> root@bcm958802a8046c:~/vinay_rx/dynticks-testing# ./run_isb_loop
> + TARGET=2
> + MASK=4
> + NUM_ITER=1000
> + NUM_MS=100
> + N=7316912
> + LOGFILE=loop_1000iter_100ms.log
> + tee loop_1000iter_100ms.log
> + trace-cmd record -p function_graph -e all -M 4 -o
> trace_1000iter_100ms.dat taskset -c 2
> /home/root/arm_stb_user_loop_isb_loop
> 1000 7316912
>   plugin 'function_graph'
> Cycles/Second (Hz) = 30
> Nano-seconds per cycle = 0.
>
> NO ISB() before rte_rdtsc()
> num_iter: 1000
> do_nothing_loop for (N)=7316912
> Running 1000 iterations of do_nothing_loop for (N)=7316912
>
> Average =   9.863256333 u-secs
> Max =  100052.79033 u-secs
> Min =   7.80733 u-secs
> σ =   6.497043982 u-secs
>
> Average =  29589.77 cycles
> Max =  300158371.00 cycles
> Min =  23422.00 cycles
> σ =  19491.13 cycles
>
> #σ = events
>  0 = 900
>  2 = 79
>  4 = 17
>  5 = 3
>  8 = 1
>
>
> On Tue, Jun 30, 2020 at 4:42 PM Flavio Leitner  wrote:
>
> >
> >
> > Hi Shahaji,
> >
> > Did it help with the rcu_nocbs?
> >
> > fbl
> >
> > On Tue, Jun 30, 2020 at 12:56:27PM -0400, Shahaji Bhosle wrote:
> > > Thanks Flavio,
> > > Are there any special requirements for RCU on ARM vs x86?
> > >
> > > I am following what the above document is saying...Do you think I
> > > need to do something more than the below?
> > > Thanks again and appreciate the help. Shahaji
> > >
> > > 1. Isolate the CPU cores:
> > >    isolcpus=1,2,3,4,5,6,7 nohz_full=1-7 rcu_nocbs=1-7
> > > 2. Setting CONFIG_NO_HZ_FULL=y:
> > > root@bcm958802a8046c:~/vinay_rx/dynticks-testing# zcat
> > > /proc/config.gz
> > > |grep HZ
> > > CONFIG_NO_HZ_COMMON=y
> > > # CONFIG_HZ_PERIODIC is not set
> > > # CONFIG_NO_HZ_IDLE is not set
> > > *CONFIG_NO_HZ_FULL*=y
> > > # CONFIG_NO_HZ_FULL_ALL is not set
> > > # CONFIG_NO_HZ is not set
> > > # CONFIG_HZ_100 is not set
> > > CONFIG_HZ_250=y
> > > # CONFIG_HZ_300 is not set

Re: [ovs-dev] [PATCH v1 0/6] Memory access optimization for flow scalability of userspace datapath.

2020-06-30 Thread Yanqin Wei
Hi, every contributor

These patches could significantly improve the multi-flow throughput of the
userspace datapath.  If you feel it will take too much time to review all the
patches, I suggest looking at the 2nd/3rd first, which contain the major
improvements in this series:
[ovs-dev][PATCH v1 2/6] dpif-netdev: add tunnel_valid flag to skip ip/ipv6 
address comparison
[ovs-dev][PATCH v1 3/6] dpif-netdev: improve emc lookup performance by 
contiguous storage of hash value.

Any comments from anyone are appreciated.

Best Regards,
Wei Yanqin

> -Original Message-
> From: Yanqin Wei 
> Sent: Tuesday, June 2, 2020 3:10 PM
> To: d...@openvswitch.org
> Cc: nd ; i.maxim...@ovn.org; u9012...@gmail.com; Malvika
> Gupta ; Lijian Zhang ;
> Ruifeng Wang ; Lance Yang
> ; Yanqin Wei 
> Subject: [ovs-dev][PATCH v1 0/6] Memory access optimization for flow
> scalability of userspace datapath.
> 
> OVS userspace datapath is a program with heavy memory access. It needs to
> load/store a large amount of memory, including packet headers, metadata,
> EMC/SMC/DPCLS tables and so on. This causes a lot of cache line misses and
> refills, which has a great impact on flow scalability. And in some cases,
> EMC has a negative impact on the overall performance. It is difficult for
> users to dynamically manage enabling of EMC.
> 
> This series of patches improve memory access of userspace datapath as
> follows:
> 1. Reduce the number of metadata cache lines accessed by non-tunnel traffic.
> 2. Decrease unnecessary memory load/store for batch/flow.
> 3. Modify the layout of the EMC data struct: centralize the storage of hash values.
> 
> In the NIC2NIC traffic tests, the overall performance improvement is observed,
> especially in multi-flow cases.
> Flows   delta
> 1-1K flows  5-10%
> 10K flows   20%
> 100K flows  40%
> EMC disable 10%
> 
> Malvika Gupta (1):
>   [ovs-dev] dpif-netdev: Modify dfc_processing function to void function
> 
> Yanqin Wei (5):
>   netdev: avoid unnecessary packet batch refilling in netdev feature
> check
>   dpif-netdev: add tunnel_valid flag to skip ip/ipv6 address comparison
>   dpif-netdev: improve emc lookup performance by contiguous storage of
> hash value.
>   dpif-netdev: skip flow hash calculation in case of smc disabled
>   dpif-netdev: remove unnecessary key length calculation in fast path
> 
>  lib/dp-packet.h   |  12 +++--
>  lib/dpif-netdev.c | 115 --
>  lib/flow.c|   2 +-
>  lib/netdev.c  |  13 --
>  lib/packets.h |  46 ---
>  5 files changed, 120 insertions(+), 68 deletions(-)
> 
> --
> 2.17.1
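
As a rough sketch of the idea in patch 3/6 (illustrative names and sizes, not
the actual patch): keeping all EMC hash values in one contiguous array means a
lookup scans a hot cache line of hashes and only dereferences a full entry on
a hash match.

#include <stddef.h>
#include <stdint.h>

struct dp_netdev_flow;                  /* opaque here */

#define EMC_ENTRIES 8192                /* illustrative, power of two */

struct emc_entry_sketch {
    struct dp_netdev_flow *flow;        /* large, cold data behind it */
};

struct emc_cache_sketch {
    uint32_t hashes[EMC_ENTRIES];       /* hot: scanned on every lookup */
    struct emc_entry_sketch entries[EMC_ENTRIES];
};

static inline struct emc_entry_sketch *
emc_lookup_sketch(struct emc_cache_sketch *emc, uint32_t hash)
{
    uint32_t idx = hash & (EMC_ENTRIES - 1);

    /* A miss touches only the hash array's cache line, not the entry. */
    return emc->hashes[idx] == hash ? &emc->entries[idx] : NULL;
}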



Re: [ovs-dev] [PATCH v2] ovs rcu: update rcu pointer first

2020-06-03 Thread Yanqin Wei
Hi Ben,

I read the comments and code about ovs-rcu. Each thread is added into
ovsrcu_threads in the thread start routine.

And most threads quiesce in poll_block / xsleep / ovsrcu_quiesce /
ovsrcu_quiesce_start + ovsrcu_quiesce_end.
Only one thread, "afsync_thread", may not be in the ovsrcu_threads list,
because it invokes ovsrcu_quiesce_start but does not call ovsrcu_quiesce_end.
But this thread does not update/free RCU memory, so no writer is in the
quiescent state.

I am sorry for making an incorrect comment on Haifeng's patch.  He has a
correct description in the original patch.  Please help to correct it.
https://patchwork.ozlabs.org/project/openvswitch/patch/4099de2e54afad489356c6c9161d53339735a...@dggeml522-mbs.china.huawei.com/

Best Regards,
Wei Yanqin

> -Original Message-
> From: Ben Pfaff 
> Sent: Wednesday, June 3, 2020 8:48 AM
> To: Yanqin Wei 
> Cc: Linhaifeng ; d...@openvswitch.org; nd
> ; Lilijun (Jerry) ; chenchanghu
> ; Lichunhe 
> Subject: Re: [ovs-dev] [PATCH v2] ovs rcu: update rcu pointer first
> 
> Oh, I apologize that I made a mistake about the author.
> 
> I appreciate feedback from anyone.
> 
> On Wed, Jun 03, 2020 at 12:37:27AM +, Yanqin Wei wrote:
> > Hi Ben,
> >
> > This patch is from Linhai, but I have the same concern about this.  I will 
> > read
> ovs-rcu comments and feedback.
> > Thanks for your time.
> >
> > Best Regards,
> > Wei Yanqin
> >
> > > -Original Message-
> > > From: Ben Pfaff 
> > > Sent: Wednesday, June 3, 2020 8:35 AM
> > > To: Yanqin Wei 
> > > Cc: Linhaifeng ; d...@openvswitch.org; nd
> > > ; Lilijun (Jerry) ;
> > > chenchanghu ; Lichunhe
> 
> > > Subject: Re: [ovs-dev] [PATCH v2] ovs rcu: update rcu pointer first
> > >
> > > This is not how RCU works in OVS.  Every thread is by default considered
> active.
> > > They rarely quiesce except implicitly inside poll_block().
> > > Please read the large comment at the top of ovs-rcu.h.
> > >
> > > Is your patch based on actual bugs that you have found, or is it
> > > just some kind of precaution?  If it is the latter, then it is not needed.
> > >
> > > On Tue, Jun 02, 2020 at 11:22:57PM +, Yanqin Wei wrote:
> > > > Hi Ben,
> > > >
> > > > If my understanding is correct, the writer could not be an RCU thread
> > > > because it does not need to report holding or not holding pointers.
> > > > So old memory will be freed after all RCU threads report quiescence.
> > > >
> > > > Best Regards,
> > > > Wei Yanqin
> > > >
> > > > > -Original Message-
> > > > > From: Ben Pfaff 
> > > > > Sent: Wednesday, June 3, 2020 1:28 AM
> > > > > To: Linhaifeng 
> > > > > Cc: Yanqin Wei ; d...@openvswitch.org; nd
> > > > > ; Lilijun (Jerry) ;
> > > > > chenchanghu ; Lichunhe
> > > 
> > > > > Subject: Re: [ovs-dev] [PATCH v2] ovs rcu: update rcu pointer
> > > > > first
> > > > >
> > > > > On Tue, Jun 02, 2020 at 07:27:59AM +, Linhaifeng wrote:
> > > > > > We should update the RCU pointer first, then use ovsrcu_postpone
> > > > > > to free; otherwise it may cause a use-after-free.
> > > > > > E.g., the reader indicates a momentary quiescent state and accesses
> > > > > > the old pointer after the writer postpones freeing the old pointer
> > > > > > and before it sets the new pointer.
> > > > > >
> > > > > > Signed-off-by: Linhaifeng 
> > > > >
> > > > > I don't see how that's possible, since the writer hasn't quiesced.


Re: [ovs-dev] [PATCH v2] ovs rcu: update rcu pointer first

2020-06-02 Thread Yanqin Wei
Hi Ben,

This patch is from Linhai, but I have the same concern about it.  I will read
the ovs-rcu comments and give feedback.
Thanks for your time.

Best Regards,
Wei Yanqin

> -Original Message-
> From: Ben Pfaff 
> Sent: Wednesday, June 3, 2020 8:35 AM
> To: Yanqin Wei 
> Cc: Linhaifeng ; d...@openvswitch.org; nd
> ; Lilijun (Jerry) ; chenchanghu
> ; Lichunhe 
> Subject: Re: [ovs-dev] [PATCH v2] ovs rcu: update rcu pointer first
> 
> This is not how RCU works in OVS.  Every thread is by default considered 
> active.
> They rarely quiesce except implicitly inside poll_block().
> Please read the large comment at the top of ovs-rcu.h.
> 
> Is your patch based on actual bugs that you have found, or is it just some
> kind of precaution?  If it is the latter, then it is not needed.
> 
> On Tue, Jun 02, 2020 at 11:22:57PM +, Yanqin Wei wrote:
> > Hi Ben,
> >
> > If my understanding is correct, the writer could not be an RCU thread
> > because it does not need to report holding or not holding pointers.
> > So old memory will be freed after all RCU threads report quiescence.
> >
> > Best Regards,
> > Wei Yanqin
> >
> > > -----Original Message-
> > > From: Ben Pfaff 
> > > Sent: Wednesday, June 3, 2020 1:28 AM
> > > To: Linhaifeng 
> > > Cc: Yanqin Wei ; d...@openvswitch.org; nd
> > > ; Lilijun (Jerry) ;
> > > chenchanghu ; Lichunhe
> 
> > > Subject: Re: [ovs-dev] [PATCH v2] ovs rcu: update rcu pointer first
> > >
> > > On Tue, Jun 02, 2020 at 07:27:59AM +, Linhaifeng wrote:
> > > > We should update rcu pointer first then use ovsrcu_postpone to
> > > > free otherwise maybe cause use-after-free.
> > > > e.g.,reader indicates momentary quiescent and access old pointer
> > > > after writer postpone free old pointer and before setting new pointer.
> > > >
> > > > Signed-off-by: Linhaifeng 
> > >
> > > I don't see how that's possible, since the writer hasn't quiesced.


Re: [ovs-dev] [PATCH v2] ovs rcu: update rcu pointer first

2020-06-02 Thread Yanqin Wei
Hi Ben,

If my understanding is correct, the writer could not be an RCU thread because
it does not need to report holding or not holding pointers.
So old memory will be freed after all RCU threads report quiescence.

Best Regards,
Wei Yanqin

> -Original Message-
> From: Ben Pfaff 
> Sent: Wednesday, June 3, 2020 1:28 AM
> To: Linhaifeng 
> Cc: Yanqin Wei ; d...@openvswitch.org; nd
> ; Lilijun (Jerry) ; chenchanghu
> ; Lichunhe 
> Subject: Re: [ovs-dev] [PATCH v2] ovs rcu: update rcu pointer first
> 
> On Tue, Jun 02, 2020 at 07:27:59AM +, Linhaifeng wrote:
> > We should update rcu pointer first then use ovsrcu_postpone to free
> > otherwise maybe cause use-after-free.
> > e.g.,reader indicates momentary quiescent and access old pointer after
> > writer postpone free old pointer and before setting new pointer.
> >
> > Signed-off-by: Linhaifeng 
> 
> I don't see how that's possible, since the writer hasn't quiesced.


Re: [ovs-dev] [PATCH] ovs rcu: update rcu pointer first

2020-06-02 Thread Yanqin Wei
Hi Haifeng,

One more comment. Since this is a bug fix, it is possible that the maintainers
will backport it to previous branches. Therefore, it is recommended to split it
into several patches and add a Fixes tag.
http://docs.openvswitch.org/en/latest/internals/contributing/submitting-patches/?highlight=submit
e.g.
"Fixes: 63bc9fb1c69f ("packets: Reorder CS_* flags to remove gap.")
If you would like to record which commit introduced a bug being fixed, you may 
do that with a "Fixes" header. This assists in determining which OVS releases 
have the bug, so the patch can be applied to all affected versions. The easiest 
way to generate the header in the proper format is with this git command. This 
command also CCs the author of the commit being fixed, which makes sense unless 
the author also made the fix or is already named in another tag:
$ git log -1 --pretty=format:"CC: %an <%ae>%nFixes: %h (\"%s\")" \
  --abbrev=12 COMMIT_REF"

Best Regards,
Wei Yanqin

> -Original Message-
> From: Linhaifeng 
> Sent: Tuesday, June 2, 2020 3:13 PM
> To: Yanqin Wei ; d...@openvswitch.org
> Cc: nd 
> Subject: RE: [PATCH] ovs rcu: update rcu pointer first
> 
> Hi Yanqin,
> 
> Thank you for your suggestions. I will send a new patch.
> 
> -Original Message-
> From: Yanqin Wei [mailto:yanqin@arm.com]
> Sent: Tuesday, June 2, 2020 11:51 AM
> To: Linhaifeng ; d...@openvswitch.org
> Cc: nd 
> Subject: RE: [PATCH] ovs rcu: update rcu pointer first
> 
> Hi Haifeng,
> 
It indeed looks like a risk when using ovs-rcu. A few comments inline.
> 
> Best Regards,
> Wei Yanqin
> 
> > -Original Message-
> > From: dev  On Behalf Of Linhaifeng
> > Sent: Monday, June 1, 2020 11:13 AM
> > To: d...@openvswitch.org
> > Subject: [ovs-dev] [PATCH] ovs rcu: update rcu pointer first
> >
> > We should update the RCU pointer first, then use ovsrcu_postpone to free;
> > otherwise it may cause a use-after-free.
> >
> > E.g., there are two threads A and B:
> >
> > 1. thread A calls ovsrcu_postpone and flushes the cbset; at this time it
> > has not called ovsrcu_quiesce
> >
> > 2. the rcu thread waits for all threads to call ovsrcu_quiesce
> >
> > 3. thread B calls ovsrcu_quiesce
> >
> > 4. thread B gets the old pointer in the next round
> >
> > 5. thread A calls ovsrcu_quiesce; now all threads have called
> > ovsrcu_quiesce
> [Yanqin] Thread A is a writer and does not have to call ovsrcu_quiesce. I
> think this scenario can be simplified as follows: the reader indicates a
> momentary quiescent state and accesses the old pointer after the writer
> postpones freeing the old pointer and before it sets the new pointer.
> 
> >
> > 6. the rcu thread frees the old pointer
> >
> > 7. thread B hits a use-after-free
> >
> > Signed-off-by: Linhaifeng 
> > ---
> >  lib/classifier.c  |  4 ++--
> >  lib/ovs-rcu.h |  2 +-
> >  lib/pvector.c | 15 ---
> >  ofproto/ofproto-dpif-mirror.c |  4 ++-- ofproto/ofproto-dpif-upcall.c
> > |  3 +--
> >  5 files changed, 14 insertions(+), 14 deletions(-)
> >
> > diff --git a/lib/classifier.c b/lib/classifier.c index
> > f2c3497c2..6bff76e07 100644
> > --- a/lib/classifier.c
> > +++ b/lib/classifier.c
> > @@ -249,11 +249,11 @@ cls_rule_set_conjunctions(struct cls_rule *cr,
> >  unsigned int old_n = old ? old->n : 0;
> >
> >  if (old_n != n || (n && memcmp(old_conj, conj, n * sizeof
> > *conj))) {
> > +ovsrcu_set(>conj_set,
> > +   cls_conjunction_set_alloc(match, conj, n));
> >  if (old) {
> >  ovsrcu_postpone(free, old);
> >  }
> > -ovsrcu_set(>conj_set,
> > -   cls_conjunction_set_alloc(match, conj, n));
> >  }
> >  }
> >
> > diff --git a/lib/ovs-rcu.h b/lib/ovs-rcu.h index ecc4c9201..a66d868ea
> > 100644
> > --- a/lib/ovs-rcu.h
> > +++ b/lib/ovs-rcu.h
> > @@ -119,9 +119,9 @@
> >   * change_flow(struct flow *new_flow)
> >   * {
> >   *     ovs_mutex_lock(&mutex);
> > + *     ovsrcu_set(&flowp, new_flow);
> >   *     ovsrcu_postpone(free,
> >   *                     ovsrcu_get_protected(struct flow *, &flowp));
> [Yanqin] flowp has been set to the new flow pointer here. Maybe a new
> variable is needed to store the old pointer.
> > - *     ovsrcu_set(&flowp, new_flow);
> >   *     ovs_mutex_unlock(&mutex);
> >   * }
> >   *
> > diff --git a/lib/pvector.c b/lib/pvector.c
> > index cc527fdc4..aa8c6cb24 100644
> > --- a/lib/pvector.c
> > +++ b/lib/pvector.c
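
Spelling out the fix suggested in the inline comment above, as a minimal C
sketch that reuses the 'flowp' and 'mutex' names from the lib/ovs-rcu.h
example rather than any real declarations:

void
change_flow(struct flow *new_flow)
{
    ovs_mutex_lock(&mutex);
    /* Save the old pointer before publishing the new one. */
    struct flow *old_flow = ovsrcu_get_protected(struct flow *, &flowp);

    ovsrcu_set(&flowp, new_flow);
    ovsrcu_postpone(free, old_flow);
    ovs_mutex_unlock(&mutex);
}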

[ovs-dev] [PATCH v1 1/6] netdev: avoid unnecessary packet batch refilling in netdev feature check

2020-06-02 Thread Yanqin Wei
Before sending packets on a netdev, feature compatibility is always checked
and incompatible traffic should be dropped. But the packet batch is refilled
even when no packet needs to be dropped. This patch improves it by keeping
the original batch untouched if no packet has to be dropped (a condensed
sketch of the pattern follows the patch below).

Reviewed-by: Lijian Zhang 
Reviewed-by: Malvika Gupta 
Signed-off-by: Yanqin Wei 
---
 lib/dp-packet.h | 12 
 lib/netdev.c| 13 ++---
 2 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index 0430cca8e..1345f46e7 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -762,12 +762,12 @@ dp_packet_batch_size(const struct dp_packet_batch *batch)
 return batch->count;
 }
 
-/* Clear 'batch' for refill. Use dp_packet_batch_refill() to add
+/* Clear 'batch' from 'offset' for refill. Use dp_packet_batch_refill() to add
  * packets back into the 'batch'. */
 static inline void
-dp_packet_batch_refill_init(struct dp_packet_batch *batch)
+dp_packet_batch_refill_prepare(struct dp_packet_batch *batch, size_t offset)
 {
-batch->count = 0;
+batch->count = offset;
 };
 
 static inline void
@@ -801,6 +801,10 @@ dp_packet_batch_is_full(const struct dp_packet_batch *batch)
 for (size_t IDX = 0; IDX < dp_packet_batch_size(BATCH); IDX++)  \
 if (PACKET = BATCH->packets[IDX], true)
 
+#define DP_PACKET_BATCH_FOR_EACH_WITH_SIZE(IDX, SIZE, PACKET, BATCH) \
+for (size_t IDX = 0; IDX < SIZE; IDX++) \
+if (PACKET = BATCH->packets[IDX], true)
+
 /* Use this macro for cases where some packets in the 'BATCH' may be
  * dropped after going through each packet in the 'BATCH'.
  *
@@ -813,7 +817,7 @@ dp_packet_batch_is_full(const struct dp_packet_batch *batch)
  * the 'const' modifier since it should not be modified by
  * the iterator.  */
 #define DP_PACKET_BATCH_REFILL_FOR_EACH(IDX, SIZE, PACKET, BATCH)   \
-for (dp_packet_batch_refill_init(BATCH), IDX=0; IDX < SIZE; IDX++)  \
+for (dp_packet_batch_refill_prepare(BATCH, 0), IDX=0; IDX < SIZE; IDX++)  \
  if (PACKET = BATCH->packets[IDX], true)
 
 static inline void
diff --git a/lib/netdev.c b/lib/netdev.c
index 90962eec6..7934a00d4 100644
--- a/lib/netdev.c
+++ b/lib/netdev.c
@@ -838,15 +838,22 @@ netdev_send_prepare_batch(const struct netdev *netdev,
   struct dp_packet_batch *batch)
 {
 struct dp_packet *packet;
-size_t i, size = dp_packet_batch_size(batch);
+size_t batch_cnt = dp_packet_batch_size(batch);
+bool refill = false;
 
-DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) {
+DP_PACKET_BATCH_FOR_EACH_WITH_SIZE (i, batch_cnt, packet, batch) {
 char *errormsg = NULL;
 
if (netdev_send_prepare_packet(netdev->ol_flags, packet, &errormsg)) {
+if (OVS_UNLIKELY(refill)) {
+dp_packet_batch_refill(batch, packet, i);
+}
 } else {
 dp_packet_delete(packet);
+if (!refill) {
+dp_packet_batch_refill_prepare(batch, i);
+refill = true;
+}
 COVERAGE_INC(netdev_send_prepare_drops);
 VLOG_WARN_RL(&rl, "%s: Packet dropped: %s",
  netdev_get_name(netdev), errormsg);
-- 
2.17.1
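
The refill logic above condenses to the following pattern, shown as a
minimal C sketch rather than the patch itself; keep() and drop() are
placeholders for netdev_send_prepare_packet() and dp_packet_delete():

bool refill = false;

for (size_t i = 0; i < n; i++) {
    struct dp_packet *pkt = batch->packets[i];

    if (keep(pkt)) {
        if (refill) {
            /* Compact only once a packet has been dropped. */
            batch->packets[batch->count++] = pkt;
        }
    } else {
        if (!refill) {
            /* The prefix [0, i) is already in place; keep it as-is. */
            batch->count = i;
            refill = true;
        }
        drop(pkt);
    }
}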



[ovs-dev] [PATCH v1 3/6] dpif-netdev: improve emc lookup performance by contiguous storage of hash value.

2020-06-02 Thread Yanqin Wei
In the EMC lookup function, hash/flow/key are checked to find a matching
entry. Each entry comparison loads several cache lines into the CPU, so in
the multi-flow case the processor stalls waiting for data to be fetched
from a lower-level cache or main memory after a cache miss.
This patch modifies the EMC table layout to store the hash items
contiguously. It reduces the number of cache misses for fetching the hash
value in the EMC lookup miss case, and the EMC lookup hit case also
benefits because the processor can load and match hash and key in different
cache lines in parallel (a back-of-the-envelope sketch follows the patch
below).

Reviewed-by: Lijian Zhang 
Reviewed-by: Malvika Gupta 
Reviewed-by: Ruifeng Wang 
Signed-off-by: Yanqin Wei 
---
 lib/dpif-netdev.c | 55 +++
 1 file changed, 32 insertions(+), 23 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index c94d5e8c7..3994f41e4 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -192,13 +192,19 @@ static struct odp_support dp_netdev_support = {
 #define DEFAULT_EM_FLOW_INSERT_INV_PROB 100
 #define DEFAULT_EM_FLOW_INSERT_MIN (UINT32_MAX / \
 DEFAULT_EM_FLOW_INSERT_INV_PROB)
+struct emc_key {
+uint32_t len;/* Length of the following miniflow (incl. map). */
+struct miniflow mf;
+uint64_t buf[FLOW_MAX_PACKET_U64S];
+};
 
 struct emc_entry {
 struct dp_netdev_flow *flow;
-struct netdev_flow_key key;   /* key.hash used for emc hash value. */
+struct emc_key key;
 };
 
 struct emc_cache {
+uint32_t hash[EM_FLOW_HASH_ENTRIES];
 struct emc_entry entries[EM_FLOW_HASH_ENTRIES];
 int sweep_idx;/* For emc_cache_slow_sweep(). */
 };
@@ -220,9 +226,9 @@ struct dfc_cache {
 
 /* Iterate in the exact match cache through every entry that might contain a
  * miniflow with hash 'HASH'. */
-#define EMC_FOR_EACH_POS_WITH_HASH(EMC, CURRENT_ENTRY, HASH) \
+#define EMC_FOR_EACH_POS_WITH_HASH(ID, HASH) \
 for (uint32_t i__ = 0, srch_hash__ = (HASH); \
- (CURRENT_ENTRY) = &(EMC)->entries[srch_hash__ & EM_FLOW_HASH_MASK], \
+ (ID) = srch_hash__ & EM_FLOW_HASH_MASK, \
  i__ < EM_FLOW_HASH_SEGS;\
  i__++, srch_hash__ >>= EM_FLOW_HASH_SHIFT)
 
@@ -877,8 +883,8 @@ emc_cache_init(struct emc_cache *flow_cache)
 
 flow_cache->sweep_idx = 0;
 for (i = 0; i < ARRAY_SIZE(flow_cache->entries); i++) {
+flow_cache->hash[i] = 0;
 flow_cache->entries[i].flow = NULL;
-flow_cache->entries[i].key.hash = 0;
 flow_cache->entries[i].key.len = sizeof(struct miniflow);
flowmap_init(&flow_cache->entries[i].key.mf.map);
 }
@@ -2733,12 +2739,12 @@ netdev_flow_key_equal(const struct netdev_flow_key *a,
return a->hash == b->hash && !memcmp(&a->mf, &b->mf, a->len);
 }
 
-/* Used to compare 'netdev_flow_key' in the exact match cache to a miniflow.
+/* Used to compare 'emc_key' in the exact match cache to a miniflow.
  * The maps are compared bitwise, so both 'key->mf' and 'mf' must have been
  * generated by miniflow_extract. */
 static inline bool
-netdev_flow_key_equal_mf(const struct netdev_flow_key *key,
- const struct miniflow *mf)
+emc_key_equal_mf(const struct emc_key *key,
+ const struct miniflow *mf)
 {
return !memcmp(&key->mf, mf, key->len);
 }
@@ -2840,7 +2846,8 @@ emc_change_entry(struct emc_entry *ce, struct dp_netdev_flow *flow,
 }
 }
 if (key) {
-netdev_flow_key_clone(&ce->key, key);
+ce->key.len = key->len;
+memcpy(&ce->key.mf, &key->mf, key->len);
 }
 }
 
@@ -2849,12 +2856,14 @@ emc_insert(struct emc_cache *cache, const struct netdev_flow_key *key,
struct dp_netdev_flow *flow)
 {
 struct emc_entry *to_be_replaced = NULL;
-struct emc_entry *current_entry;
+uint32_t *to_be_replaced_hash = NULL;
+uint32_t id;
 
-EMC_FOR_EACH_POS_WITH_HASH(cache, current_entry, key->hash) {
-if (netdev_flow_key_equal(&current_entry->key, key)) {
+EMC_FOR_EACH_POS_WITH_HASH (id, key->hash) {
+if (key->hash == cache->hash[id]
+&& emc_key_equal_mf(&cache->entries[id].key, &key->mf)) {
 /* We found the entry with the 'mf' miniflow */
-emc_change_entry(current_entry, flow, NULL);
+emc_change_entry(&cache->entries[id], flow, NULL);
 return;
 }
 
@@ -2862,14 +2871,15 @@ emc_insert(struct emc_cache *cache, const struct netdev_flow_key *key,
  * in the first entry where it can be */
 if (!to_be_replaced
 || (emc_entry_alive(to_be_replaced)
-&& !emc_entry_alive(current_entry))
-|| current_entry->key.hash < to_be_replaced->key.hash) {
-to_be_replaced = current_entry;
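
Why the split layout pays off, as a back-of-the-envelope sketch (assuming
64-byte cache lines and the 4-byte hashes shown above):

/* Old layout: each probe position reads a whole emc_entry just to compare
 * the hash, and an entry spans several cache lines because its miniflow
 * buffer alone holds FLOW_MAX_PACKET_U64S 64-bit words.
 *
 * New layout: hash[] is a dense uint32_t array, so 64 / 4 = 16 candidate
 * hashes share a single cache line.  A lookup miss now usually touches one
 * hash-array line instead of pulling in one or more full entries per probe
 * position. */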

[ovs-dev] [PATCH v1 5/6] dpif-netdev: remove unnecessary key length calculation in fast path

2020-06-02 Thread Yanqin Wei
Key length is only useful for the EMC table. This patch moves the key length
calculation into emc_change_entry to improve fast-path performance.

Reviewed-by: Lijian Zhang 
Reviewed-by: Malvika Gupta 
Reviewed-by: Ruifeng Wang 
Signed-off-by: Yanqin Wei 
---
 lib/dpif-netdev.c | 10 ++
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index d575edefd..6ff8194ab 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -2846,8 +2846,8 @@ emc_change_entry(struct emc_entry *ce, struct 
dp_netdev_flow *flow,
 }
 }
 if (key) {
-ce->key.len = key->len;
-memcpy(&ce->key.mf, &key->mf, key->len);
+ce->key.len = netdev_flow_key_size(miniflow_n_values(&key->mf));
+memcpy(&ce->key.mf, &key->mf, ce->key.len);
 }
 }
 
@@ -6541,8 +6541,6 @@ smc_lookup_batch(struct dp_netdev_pmd_thread *pmd,
tcp_flags = miniflow_get_tcp_flags(&keys[i].mf);
 
 /* SMC hit and emc miss, we insert into EMC */
-keys[i].len =
-netdev_flow_key_size(miniflow_n_values(&keys[i].mf));
 emc_probabilistic_insert(pmd, [i], flow);
 /* Add these packets into the flow map in the same order
  * as received.
@@ -6831,10 +6829,6 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
 int lookup_cnt = 0, add_lookup_cnt;
 bool any_miss;
 
-for (size_t i = 0; i < cnt; i++) {
-/* Key length is needed in all the cases, hash computed on demand. */
-keys[i]->len = netdev_flow_key_size(miniflow_n_values(&keys[i]->mf));
-}
 /* Get the classifier for the in_port */
 cls = dp_netdev_pmd_lookup_dpcls(pmd, in_port);
 if (OVS_LIKELY(cls)) {
-- 
2.17.1



[ovs-dev] [PATCH v1 2/6] dpif-netdev: add tunnel_valid flag to skip ip/ipv6 address comparison

2020-06-02 Thread Yanqin Wei
miniflow_extract checks the validity of tunnel metadata by comparing the
tunnel destination address, including the 16-byte IPv6 address.
This patch introduces a 'tunnel_valid' flag. If it is false,
md->cacheline2 will not be touched. This improves miniflow_extract
performance for all kinds of traffic (a condensed sketch of the resulting
invariant follows the patch below).

Reviewed-by: Lijian Zhang 
Reviewed-by: Malvika Gupta 
Signed-off-by: Yanqin Wei 
---
 lib/dpif-netdev.c | 14 +++---
 lib/flow.c|  2 +-
 lib/packets.h | 46 --
 3 files changed, 52 insertions(+), 10 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 51c888501..c94d5e8c7 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -6625,12 +6625,16 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
 if (i != cnt - 1) {
 struct dp_packet **packets = packets_->packets;
 /* Prefetch next packet data and metadata. */
-OVS_PREFETCH(dp_packet_data(packets[i+1]));
-pkt_metadata_prefetch_init(&packets[i+1]->md);
+OVS_PREFETCH(dp_packet_data(packets[i + 1]));
+if (md_is_valid) {
+pkt_metadata_prefetch(&packets[i + 1]->md);
+} else {
+pkt_metadata_prefetch_init(&packets[i + 1]->md);
+}
 }
 
 if (!md_is_valid) {
-pkt_metadata_init(&packet->md, port_no);
+pkt_metadata_datapath_init(&packet->md, port_no);
 }
 
 if ((*recirc_depth_get() == 0) &&
@@ -6730,6 +6734,10 @@ handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
miniflow_expand(&key->mf, &match.flow);
 memset(&match.wc, 0, sizeof match.wc);
 
+if (!packet->md.tunnel_valid) {
+pkt_metadata_tnl_dst_init(&packet->md);
+}
+
 ofpbuf_clear(actions);
 ofpbuf_clear(put_actions);
 
diff --git a/lib/flow.c b/lib/flow.c
index cc1b3f2db..1f0b3d4dc 100644
--- a/lib/flow.c
+++ b/lib/flow.c
@@ -747,7 +747,7 @@ miniflow_extract(struct dp_packet *packet, struct miniflow *dst)
 ovs_be16 ct_tp_src = 0, ct_tp_dst = 0;
 
 /* Metadata. */
-if (flow_tnl_dst_is_set(&md->tunnel)) {
+if (md->tunnel_valid && flow_tnl_dst_is_set(&md->tunnel)) {
 miniflow_push_words(mf, tunnel, &md->tunnel,
 offsetof(struct flow_tnl, metadata) /
 sizeof(uint64_t));
diff --git a/lib/packets.h b/lib/packets.h
index 447e6f6fa..3b507d2a3 100644
--- a/lib/packets.h
+++ b/lib/packets.h
@@ -103,15 +103,16 @@ PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline0,
action. */
 uint32_t skb_priority;  /* Packet priority for QoS. */
 uint32_t pkt_mark;  /* Packet mark. */
+struct conn *conn;  /* Cached conntrack connection. */
 uint8_t  ct_state;  /* Connection state. */
 bool ct_orig_tuple_ipv6;
 uint16_t ct_zone;   /* Connection zone. */
 uint32_t ct_mark;   /* Connection mark. */
 ovs_u128 ct_label;  /* Connection label. */
 union flow_in_port in_port; /* Input port. */
-struct conn *conn;  /* Cached conntrack connection. */
 bool reply; /* True if reply direction. */
 bool icmp_related;  /* True if ICMP related. */
+bool tunnel_valid;
 );
 
 PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1,
@@ -141,6 +142,7 @@ pkt_metadata_init_tnl(struct pkt_metadata *md)
  * are before this and as long as they are empty, the options won't
  * be looked at. */
 memset(md, 0, offsetof(struct pkt_metadata, tunnel.metadata.opts));
+md->tunnel_valid = true;
 }
 
 static inline void
@@ -151,6 +153,25 @@ pkt_metadata_init_conn(struct pkt_metadata *md)
 
 static inline void
 pkt_metadata_init(struct pkt_metadata *md, odp_port_t port)
+{
+/* Initialize only till ct_state. Once the ct_state is zeroed out rest
+ * of ct fields will not be looked at unless ct_state != 0.
+ */
+memset(md, 0, offsetof(struct pkt_metadata, ct_orig_tuple_ipv6));
+
+/* It can be expensive to zero out all of the tunnel metadata. However,
+ * we can just zero out ip_dst and the rest of the data will never be
+ * looked at. */
+md->tunnel_valid = true;
+md->tunnel.ip_dst = 0;
+md->tunnel.ipv6_dst = in6addr_any;
+
+md->in_port.odp_port = port;
+}
+
+/* This function initializes those members used by userspace datapath */
+static inline void
+pkt_metadata_datapath_init(struct pkt_metadata *md, odp_port_t port)
 {
 /* This is called for every packet in userspace datapath and affects
  * performance if all the metadata is initialized. Hence, fields should
@@ -162,12 +183,19 @@ pkt_metadata_init(struct pkt_metadata *md, odp_port_t port)
 memset(md, 0, offsetof(struct pkt_metadata, ct_orig_tuple_ipv6));
 
 /* It can be expensive to zero out all of the tunnel metadata. However,
- * we can just zero out ip_dst and the rest of the data will never be
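
The invariant the patch establishes, condensed into one place as a sketch
assembled from the hunks above (pkt_metadata_tnl_dst_init() is added by
this patch; its body is not shown in this excerpt):

/* Fast path: pkt_metadata_datapath_init() avoids touching cacheline2, the
 * tunnel metadata; md.tunnel_valid says whether that cache line holds
 * meaningful data. */
if (!packet->md.tunnel_valid) {
    /* Whoever needs the tunnel fields makes them valid first. */
    pkt_metadata_tnl_dst_init(&packet->md);
}

/* Readers gate on the flag before comparing tunnel addresses, so
 * cacheline2 is only read when it is actually valid: */
if (packet->md.tunnel_valid && flow_tnl_dst_is_set(&packet->md.tunnel)) {
    /* tunnel processing */
}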

[ovs-dev] [PATCH v1 0/6] Memory access optimization for flow scalability of userspace datapath.

2020-06-02 Thread Yanqin Wei
OVS userspace datapath is a program with heavy memory access. It needs to
load/store a large number of memory, including packet header, metadata,
EMC/SMC/DPCLS tables and so on. It causes a lot of cache line missing and
refilling, which has a great impact on flow scalability. And in some cases,
EMC has a negative impact on the overall performance. It is difficult for
user to dynamically manage the enabling of EMC. 

This series of patches improve memory access of userspace datapath as
follows:
1. Reduce the number of metadata cache line accessed by non-tunnel traffic. 
2. Decrease unnecessary memory load/store for batch/flow. 
3. Modify the layout of EMC data struct. Centralize the storage of hash
value. 

In the NIC2NIC traffic tests, the overall performance improvement is
observed, especially in multi-flow cases. 
Flows   delta
1-1K flows  5-10%
10K flows   20%
100K flows  40%
EMC disable 10%

Malvika Gupta (1):
  [ovs-dev] dpif-netdev: Modify dfc_processing function to void function

Yanqin Wei (5):
  netdev: avoid unnecessary packet batch refilling in netdev feature
check
  dpif-netdev: add tunnel_valid flag to skip ip/ipv6 address comparison
  dpif-netdev: improve emc lookup performance by contiguous storage of
hash value.
  dpif-netdev: skip flow hash calculation in case of smc disabled
  dpif-netdev: remove unnecessary key length calculation in fast path

 lib/dp-packet.h   |  12 +++--
 lib/dpif-netdev.c | 115 --
 lib/flow.c|   2 +-
 lib/netdev.c  |  13 --
 lib/packets.h |  46 ---
 5 files changed, 120 insertions(+), 68 deletions(-)

-- 
2.17.1



[ovs-dev] [PATCH v1 4/6] dpif-netdev: skip flow hash calculation in case of smc disabled

2020-06-02 Thread Yanqin Wei
With 10k+ flows, an EMC lookup will usually miss, and the flow hash value is
always calculated in this case whether SMC is enabled or not. This patch
moves the calculation from the smc_insert function into the
fast_path_processing and handle_packet_upcall functions to avoid unnecessary
hash calculation and memory access (flow->ufid) when SMC is disabled.

Reviewed-by: Lijian Zhang 
Reviewed-by: Malvika Gupta 
Reviewed-by: Lance Yang 
Signed-off-by: Yanqin Wei 
---
 lib/dpif-netdev.c | 23 +--
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 3994f41e4..d575edefd 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -2957,14 +2957,8 @@ smc_insert(struct dp_netdev_pmd_thread *pmd,
struct smc_bucket *bucket = &smc_cache->buckets[key->hash & SMC_MASK];
 uint16_t index;
 uint32_t cmap_index;
-bool smc_enable_db;
 int i;
 
-atomic_read_relaxed(&pmd->dp->smc_enable_db, &smc_enable_db);
-if (!smc_enable_db) {
-return;
-}
-
cmap_index = cmap_find_index(&pmd->flow_table, hash);
 index = (cmap_index >= UINT16_MAX) ? UINT16_MAX : (uint16_t)cmap_index;
 
@@ -6794,8 +6788,13 @@ handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
  add_actions->size);
 }
ovs_mutex_unlock(&pmd->flow_mutex);
-uint32_t hash = dp_netdev_flow_hash(&netdev_flow->ufid);
-smc_insert(pmd, key, hash);
+
+bool smc_enable_db;
+atomic_read_relaxed(&pmd->dp->smc_enable_db, &smc_enable_db);
+if (smc_enable_db) {
+uint32_t hash = dp_netdev_flow_hash(&netdev_flow->ufid);
+smc_insert(pmd, key, hash);
+}
 emc_probabilistic_insert(pmd, key, netdev_flow);
 }
 if (pmd_perf_metrics_enabled(pmd)) {
@@ -6904,9 +6903,13 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
 }
 
 flow = dp_netdev_flow_cast(rules[i]);
-uint32_t hash = dp_netdev_flow_hash(&flow->ufid);
-smc_insert(pmd, keys[i], hash);
 
+bool smc_enable_db;
+atomic_read_relaxed(&pmd->dp->smc_enable_db, &smc_enable_db);
+if (smc_enable_db) {
+uint32_t hash = dp_netdev_flow_hash(&flow->ufid);
+smc_insert(pmd, keys[i], hash);
+}
 emc_probabilistic_insert(pmd, keys[i], flow);
 /* Add these packets into the flow map in the same order
  * as received.
-- 
2.17.1



[ovs-dev] [PATCH v1 6/6] dpif-netdev: Modify dfc_processing function to void function

2020-06-02 Thread Yanqin Wei
From: Malvika Gupta 

The dfc_processing function returns the number of packets left to be
processed in the 'packets' array via dp_packet_batch_size(). dfc_processing
is called only from dp_netdev_input__, and its return value is never checked
after the call. Moreover, dp_packet_batch_is_empty(), which is called after
dfc_processing returns, itself calls dp_packet_batch_size() to check whether
the 'packets' array is empty. This patch turns dfc_processing into a void
function to remove this redundancy and clean up the code (a generic sketch
of the redundant shape follows the patch below).

Reviewed-by: Yanqin Wei 
Signed-off-by: Malvika Gupta 
---
 lib/dpif-netdev.c | 17 +
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 6ff8194ab..b3750017b 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -6577,16 +6577,13 @@ smc_lookup_batch(struct dp_netdev_pmd_thread *pmd,
  * beginning of the 'packets' array. The pointers of missed keys are put in the
  * missed_keys pointer array for future processing.
  *
- * The function returns the number of packets that needs to be processed in the
- * 'packets' array (they have been moved to the beginning of the vector).
- *
  * For performance reasons a caller may choose not to initialize the metadata
  * in 'packets_'.  If 'md_is_valid' is false, the metadata in 'packets'
  * is not valid and must be initialized by this function using 'port_no'.
  * If 'md_is_valid' is true, the metadata is already valid and 'port_no'
  * will be ignored.
  */
-static inline size_t
+static inline void
 dfc_processing(struct dp_netdev_pmd_thread *pmd,
struct dp_packet_batch *packets_,
struct netdev_flow_key *keys,
@@ -6707,15 +6704,11 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
 
pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_EXACT_HIT, n_emc_hit);
 
-if (!smc_enable_db) {
-return dp_packet_batch_size(packets_);
+if (smc_enable_db) {
+/* Packets that miss the EMC do a batch lookup in the SMC. */
+smc_lookup_batch(pmd, keys, missed_keys, packets_,
+  n_missed, flow_map, index_map);
 }
-
-/* Packets miss EMC will do a batch lookup in SMC if enabled */
-smc_lookup_batch(pmd, keys, missed_keys, packets_,
- n_missed, flow_map, index_map);
-
-return dp_packet_batch_size(packets_);
 }
 
 static inline int
-- 
2.17.1
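
The redundancy being removed, reduced to its bare shape in a generic C
sketch (not OVS code; 'process' stands for dfc_processing and 'count' for
what dp_packet_batch_size() reads):

struct batch { size_t count; };

/* Before: the remaining size is returned... */
static size_t process(struct batch *b) { return b->count; }

/* ...but the caller re-derives it from the batch anyway, so the return
 * value is pure redundancy and process() can become void. */
static void caller(struct batch *b)
{
    process(b);
    if (b->count > 0) {
        /* continue with the remaining packets */
    }
}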



Re: [ovs-dev] [PATCH] ovs rcu: update rcu pointer first

2020-06-01 Thread Yanqin Wei
Hi Haifeng,

It looks indeed a risk for using ovs-rcu. A few comments inline.

Best Regards,
Wei Yanqin

> -Original Message-
> From: dev  On Behalf Of Linhaifeng
> Sent: Monday, June 1, 2020 11:13 AM
> To: d...@openvswitch.org
> Subject: [ovs-dev] [PATCH] ovs rcu: update rcu pointer first
> 
> We should update the rcu pointer first and then use ovsrcu_postpone to
> free; otherwise we may cause a use-after-free.
> 
> e.g., there are two threads A and B:
> 
> 1. thread A calls ovsrcu_postpone and flushes its cbset; at this point it
> has not called ovsrcu_quiesce
> 
> 2. the rcu thread waits for all threads to call ovsrcu_quiesce
> 
> 3. thread B calls ovsrcu_quiesce
> 
> 4. thread B gets the old pointer in the next round
> 
> 5. thread A calls ovsrcu_quiesce; now all threads have called ovsrcu_quiesce
> [Yanqin]  Thread A is a writer and does not have to call ovsrcu_quiesce. I
> think this scenario can be simplified as follows: the reader indicates a
> momentary quiescent state and accesses the old pointer after the writer
> postpones freeing the old pointer and before it sets the new pointer (see
> the timeline sketch at the end of this message).
> 
> 6. the rcu thread frees the old pointer
> 
> 7. thread B hits a use-after-free
> 
> Signed-off-by: Linhaifeng 
> ---
>  lib/classifier.c  |  4 ++--
>  lib/ovs-rcu.h |  2 +-
>  lib/pvector.c | 15 ---
>  ofproto/ofproto-dpif-mirror.c |  4 ++--
>  ofproto/ofproto-dpif-upcall.c |  3 +--
>  5 files changed, 14 insertions(+), 14 deletions(-)
> 
> diff --git a/lib/classifier.c b/lib/classifier.c
> index f2c3497c2..6bff76e07 100644
> --- a/lib/classifier.c
> +++ b/lib/classifier.c
> @@ -249,11 +249,11 @@ cls_rule_set_conjunctions(struct cls_rule *cr,
>  unsigned int old_n = old ? old->n : 0;
> 
>  if (old_n != n || (n && memcmp(old_conj, conj, n * sizeof *conj))) {
> +ovsrcu_set(&cr->conj_set,
> +   cls_conjunction_set_alloc(match, conj, n));
>  if (old) {
>  ovsrcu_postpone(free, old);
>  }
> -ovsrcu_set(&cr->conj_set,
> -   cls_conjunction_set_alloc(match, conj, n));
>  }
>  }
> 
> diff --git a/lib/ovs-rcu.h b/lib/ovs-rcu.h
> index ecc4c9201..a66d868ea 100644
> --- a/lib/ovs-rcu.h
> +++ b/lib/ovs-rcu.h
> @@ -119,9 +119,9 @@
>   * change_flow(struct flow *new_flow)
>   * {
>   * ovs_mutex_lock(&mutex);
> + * ovsrcu_set(&flowp, new_flow);
>   * ovsrcu_postpone(free,
>   * ovsrcu_get_protected(struct flow *, &flowp));
[Yanqin] flowp has been set to the new flow pointer here. Maybe a new
variable is needed to store the old pointer.
> - * ovsrcu_set(&flowp, new_flow);
>   * ovs_mutex_unlock(&mutex);
>   * }
>   *
> diff --git a/lib/pvector.c b/lib/pvector.c
> index cc527fdc4..aa8c6cb24 100644
> --- a/lib/pvector.c
> +++ b/lib/pvector.c
> @@ -67,10 +67,11 @@ pvector_init(struct pvector *pvec)
>  void
>  pvector_destroy(struct pvector *pvec)
>  {
> +struct pvector_impl *old = pvector_impl_get(pvec);
>  free(pvec->temp);
>  pvec->temp = NULL;
> -ovsrcu_postpone(free, pvector_impl_get(pvec));
>  ovsrcu_set(&pvec->impl, NULL); /* Poison. */
> +ovsrcu_postpone(free, old);
>  }
> 
>  /* Iterators for callers that need the 'index' afterward. */
> @@ -205,11 +206,11 @@ pvector_change_priority(struct pvector *pvec, void *ptr, int priority)
>  /* Make the modified pvector available for iteration. */
>  void
>  pvector_publish__(struct pvector *pvec)
>  {
> -struct pvector_impl *temp = pvec->temp;
> -
> +struct pvector_impl *new = pvec->temp;
> +struct pvector_impl *old = ovsrcu_get_protected(struct pvector_impl *,
> +   &pvec->impl);
>  pvec->temp = NULL;
> -pvector_impl_sort(temp); /* Also removes gaps. */
> -ovsrcu_postpone(free, ovsrcu_get_protected(struct pvector_impl *,
> -   &pvec->impl));
> -ovsrcu_set(&pvec->impl, temp);
> +pvector_impl_sort(new); /* Also removes gaps. */
> +ovsrcu_set(&pvec->impl, new);
> +ovsrcu_postpone(free, old);
>  }
> diff --git a/ofproto/ofproto-dpif-mirror.c b/ofproto/ofproto-dpif-mirror.c
> index 343b75f0e..343100c08 100644
> --- a/ofproto/ofproto-dpif-mirror.c
> +++ b/ofproto/ofproto-dpif-mirror.c
> @@ -276,9 +276,9 @@ mirror_set(struct mbridge *mbridge, void *aux, const char *name,
>  hmapx_destroy(&dsts_map);
> 
>  if (vlans || src_vlans) {
> +unsigned long *new_vlans = vlan_bitmap_clone(src_vlans);
> +ovsrcu_set(&mirror->vlans, new_vlans);
>  ovsrcu_postpone(free, vlans);
> -vlans = vlan_bitmap_clone(src_vlans);
> -ovsrcu_set(&mirror->vlans, vlans);
>  }
> 
>  mirror->out = out;
> diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c
> index 5e08ef10d..be6dafb78 100644
> --- a/ofproto/ofproto-dpif-upcall.c
> +++ b/ofproto/ofproto-dpif-upcall.c
> @@ -1658,11 +1658,10 @@ ukey_set_actions(struct udpif_key *ukey, const struct ofpbuf *actions)
>  struct ofpbuf *old_actions = ovsrcu_get_protected(struct ofpbuf *,
>  
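
The interleaving described in the commit message and simplified in the
inline comment, laid out as a timeline (a sketch: the function names are
from lib/ovs-rcu.h, the schedule itself is the assumption):

/*
 * Writer (thread A)                  Reader (thread B)
 * -----------------                  -----------------
 * ovsrcu_postpone(free, old);
 *                                    ovsrcu_quiesce();
 *                                    p = ovsrcu_get(...);   <- still old
 * (grace period ends, old is freed)
 *                                    use(p);                <- use-after-free
 * ovsrcu_set(&flowp, new);           <- published too late
 *
 * Publishing with ovsrcu_set() before calling ovsrcu_postpone() closes the
 * window: once the free is deferred, no reader can still fetch the old
 * pointer.
 */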

Re: [ovs-dev] [PATCH 1/2] Make ByteQ safe for simultaneous producer/consumer

2020-05-12 Thread Yanqin Wei
Hi Anton,

I am curious what the concurrency use case for the ByteQ lib is, because we
currently see some unit test failures on CPUs with a weak memory model, like
Arm and PowerPC.

Besides this question, I have a few comments inline.

Best Regards,
Wei Yanqin

> -Original Message-
> From: dev  On Behalf Of
> anton.iva...@cambridgegreys.com
> Sent: Tuesday, May 12, 2020 3:41 PM
> To: d...@openvswitch.org
> Cc: Anton Ivanov 
> Subject: [ovs-dev] [PATCH 1/2] Make ByteQ safe for simultaneous
> producer/consumer
>
> From: Anton Ivanov 
>
> A ByteQ with unlocked head and tail is unsafe for simultaneous
> consume/produce.
>
> If simultaneous use is desired, these either need to be locked or there needs 
> to
> be a third atomic or lock guarded variable "used".

[Yanqin] Is the requirement only single-consumer/single-producer? It would
be better to make the use case clear.
>
> An atomic "used" allows the producer to enqueue safely because it "owns" the
> head, and even if the consumer changes the tail it will only increase the
> space available versus the value in "used".
>
> Once the data has been written and the enqueue should be made visible, it is
> fenced and "used" is updated.
>
> Similarly for the consumer: it can safely consume now as it "owns" the tail
> and never reads beyond tail + used (wrapping around as needed).
>
> Signed-off-by: Anton Ivanov 
> ---
>  lib/byteq.c | 17 -
>  lib/byteq.h |  2 ++
>  2 files changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/lib/byteq.c b/lib/byteq.c
> index 3f865cf9e..da40c2530 100644
> --- a/lib/byteq.c
> +++ b/lib/byteq.c
> @@ -19,6 +19,7 @@
>  #include 
>  #include 
>  #include "util.h"
> +#include "ovs-atomic.h"
>
>  /* Initializes 'q' as an empty byteq that uses the 'size' bytes of 'buffer' 
> to
>   * store data.  'size' must be a power of 2.
> @@ -32,13 +33,16 @@ byteq_init(struct byteq *q, uint8_t *buffer, size_t size)
>  q->buffer = buffer;
>  q->size = size;
>  q->head = q->tail = 0;
> +q->used = ATOMIC_VAR_INIT(0);
>  }
>
>  /* Returns the number of bytes currently queued in 'q'. */
>  int
>  byteq_used(const struct byteq *q)
>  {
> -return q->head - q->tail;
> +int retval;
> +atomic_read_relaxed(&q->used, &retval);
[Yanqin] "acquire ordering" is required so that the dequeue/enqueue is not
reordered before it.
> +return retval;
>  }
>
>  /* Returns the number of bytes that can be added to 'q' without overflow. */
> @@ -68,9 +72,11 @@ byteq_is_full(const struct byteq *q)
>  void
>  byteq_put(struct byteq *q, uint8_t c)
>  {
> +int discard;
>  ovs_assert(!byteq_is_full(q));
>  *byteq_head(q) = c;
>  q->head++;
> +atomic_add(&q->used, 1, &discard);
[Yanqin] atomic_add uses seq_cst memory ordering. I think release ordering is
enough to ensure the enqueue is not reordered after the "used" update.
>  }
>
>  /* Adds the 'n' bytes in 'p' at the head of 'q', which must have at least 'n'
> @@ -79,6 +85,7 @@ void
>  byteq_putn(struct byteq *q, const void *p_, size_t n)
>  {
>  const uint8_t *p = p_;
> +int discard;
>  ovs_assert(byteq_avail(q) >= n);
>  while (n > 0) {
>  size_t chunk = MIN(n, byteq_headroom(q));
> @@ -86,6 +93,7 @@ byteq_putn(struct byteq *q, const void *p_, size_t n)
>  byteq_advance_head(q, chunk);
>  p += chunk;
>  n -= chunk;
> +atomic_add(&q->used, chunk, &discard);

[Yanqin] Ditto
>  }
>  }
>
> @@ -103,9 +111,11 @@ uint8_t
>  byteq_get(struct byteq *q)
>  {
>  uint8_t c;
> +int discard;
>  ovs_assert(!byteq_is_empty(q));
>  c = *byteq_tail(q);
>  q->tail++;
> +atomic_sub(&q->used, 1, &discard);
[Yanqin] Ditto
>  return c;
>  }
>
> @@ -168,8 +178,10 @@ byteq_tail(const struct byteq *q)
>  void
>  byteq_advance_tail(struct byteq *q, unsigned int n)
>  {
> +int discard;
>  ovs_assert(byteq_tailroom(q) >= n);
>  q->tail += n;
> +atomic_sub_relaxed(&q->used, n, &discard);
[Yanqin] Why relaxed ordering here? The tail update may be reordered after it.
>  }
>
>  /* Returns the byte after the last in-use byte of 'q', the point at which new
> @@ -195,6 +207,9 @@ byteq_headroom(const struct byteq *q)
>  void
>  byteq_advance_head(struct byteq *q, unsigned int n)
>  {
> +int discard;
>  ovs_assert(byteq_headroom(q) >= n);
>  q->head += n;
> +atomic_thread_fence(memory_order_release);
> +atomic_add_relaxed(&q->used, n, &discard);
[Yanqin] Suggest using a one-way barrier here.
>  }
> diff --git a/lib/byteq.h b/lib/byteq.h
> index d73e3684e..e829efab0 100644
> --- a/lib/byteq.h
> +++ b/lib/byteq.h
> @@ -19,6 +19,7 @@
>  #include 
>  #include 
>  #include 
> +#include "ovs-atomic.h"
>
>  /* General-purpose circular queue of bytes. */
>  struct byteq {
> @@ -26,6 +27,7 @@ struct byteq {
>  unsigned int size;  /* Number of bytes allocated for 'buffer'. */
>  unsigned int head;  /* Head of queue. */
>  unsigned int tail;  /* Chases the head. */
> +atomic_int used;
>  };
>
>  void byteq_init(struct byteq *, uint8_t *buffer, size_t size);
> --
> 
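
The orderings suggested in the inline comments, put together as a minimal
single-producer/single-consumer sketch. It uses the explicit-ordering
variants from lib/ovs-atomic.h, whose signatures are assumed to mirror the
relaxed ones used in the patch ('n' and the buffer accesses are
placeholders):

int used, orig;

/* Producer: write the payload bytes first, then publish them; this
 * release pairs with the consumer's acquire below. */
/* ... write n bytes at byteq_head(q) and advance q->head ... */
atomic_add_explicit(&q->used, n, &orig, memory_order_release);

/* Consumer: the acquire guarantees that payload written before the
 * producer's update of 'used' is visible here. */
atomic_read_explicit(&q->used, &used, memory_order_acquire);
/* ... read up to 'used' bytes at byteq_tail(q) and advance q->tail ... */
atomic_sub_explicit(&q->used, used, &orig, memory_order_release);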

Re: [ovs-dev] [PATCH v2 3/3] travis: Enable OvS Travis CI for arm

2020-03-19 Thread Yanqin Wei
Hi Ilya,

The patch has been updated to reduce the Arm CI running time. Do you think
the time increase is acceptable?
https://patchwork.ozlabs.org/patch/1258100/

Best Regards,
Wei Yanqin

> -Original Message-
> From: Yanqin Wei
> Sent: Thursday, March 12, 2020 5:39 PM
> To: Ilya Maximets ; ovs-dev@openvswitch.org;
> b...@ovn.org; Lance Yang 
> Cc: dwil...@us.ibm.com; Gavin Hu ; Ruifeng Wang
> ; Jieqiang Wang ;
> Malvika Gupta ; nd 
> Subject: RE: [ovs-dev] [PATCH v2 3/3] travis: Enable OvS Travis CI for arm
> 
> Thanks for the feedback. Replied in line.
> 
> > -Original Message-
> > From: Ilya Maximets 
> > Sent: Thursday, March 12, 2020 12:20 AM
> > To: Yanqin Wei ; Ilya Maximets
> > ; ovs-dev@openvswitch.org; b...@ovn.org; Lance
> Yang
> > 
> > Cc: dwil...@us.ibm.com; Gavin Hu ; Ruifeng Wang
> > ; Jieqiang Wang ;
> Malvika
> > Gupta ; nd 
> > Subject: Re: [ovs-dev] [PATCH v2 3/3] travis: Enable OvS Travis CI for
> > arm
> >
> > On 3/11/20 5:57 AM, Yanqin Wei wrote:
> > > Hi Ilya,
> > >
> > > This patch has been in the review pipeline for some time. It runs
> > > stable on
> > our internal repo more than two months.
> > > Could you give us some suggestion about the next action I can take
> > > to speed
> > up the merge of this patch?
> >
> > Hi.  Sorry for things taking so long.
> > I have this patch in my backlog for this or next week.
> >
> > The main concern right know is possible significant increase of the
> > checking time.  Are you sure that we need all the listed jobs?
> > Do you expect some arm64 specific issues on the linking stage?
> > I mean, maybe we could reduce number of different combinations of
> "shared"
> > flags.  I had no chance to run this, so I don't know how much time
> > these jobs really takes and what is the total time difference.
> 
> [Yanqin]  This is the latest build report for x86 and Arm.  Sparse is
> disabled here because it has a compile issue for both x86 and Arm today.
> x86 and Arm:
> https://travis-ci.com/github/MarcoDWei/ovs/builds/152933388?utm_medium=notification&utm_source=email
> Ran for 58 min 3 sec / Total time 4 hrs 12 min 24 sec
> x86 only:
> https://travis-ci.com/github/MarcoDWei/ovs/builds/152942934?utm_medium=notification&utm_source=email
> Ran for 38 min 40 sec / Total time 2 hrs 55 min 4 sec
> 
> The total time increases by around 1 hr 17 min for the SIX new Arm jobs, and
> the running time increases by around 20 min. The kernel datapath jobs look
> like the most time-consuming ones:
> OPTS="--disable-ssl"                        4 min 31 sec
> KERNEL_LIST="5.0 4.20 4.19 4.18 4.17 4.16" 22 min 39 sec
> KERNEL_LIST="4.15 4.14 4.9 4.4 3.16"       17 min 34 sec
> DPDK=1 OPTS="--enable-shared"              11 min 17 sec
> DPDK_SHARED=1                              10 min 35 sec
> DPDK_SHARED=1 OPTS="--enable-shared"       11 min 21 sec
> 
> I agree with removing some "shared" flag combinations because they have a
> low risk of CPU-specific issues at the linking stage.
> Moreover, we could choose a subset of kernel versions for the Arm jobs,
> which would significantly reduce the total and running time.  The running
> time is then expected to increase by around 10 minutes:
> OPTS="--disable-ssl"
> KERNEL_LIST="5.0 4.20 4.16 4.9 3.16"
> DPDK=1 OPTS="--enable-shared"
> DPDK_SHARED=1
> 
> >
> > Best regards, Ilya Maximets.
> >
> >
> > >
> > > Best Regards,
> > > Wei Yanqin
> > >
> > >> -Original Message-
> > >> From: Lance Yang 
> > >> Sent: Tuesday, January 21, 2020 9:06 AM
> > >> To: Ilya Maximets ; ovs-dev@openvswitch.org
> > >> Cc: b...@ovn.org; Yanqin Wei ;
> > dwil...@us.ibm.com;
> > >> Gavin Hu ; Ruifeng Wang
> ;
> > >> Jieqiang Wang ; Malvika Gupta
> > >> ; nd 
> > >> Subject: RE: [ovs-dev] [PATCH v2 3/3] travis: Enable OvS Travis CI
> > >> for arm
> > >>
> > >>
> > >>> -Original Message-
> > >>> From: Ilya Maximets 
> > >>> Sent: Saturday, December 7, 2019 12:39 AM
> > >>> To: Lance Yang (Arm Technology China) ; ovs-
> > >>> d...@openvswitch.org
> > >>> Cc: i.maxim...@ovn.org; b...@ovn.org; Yanqin Wei (Arm Technology
> > >>> China) ; dwil...@us.ibm.com; Gavin Hu (Arm
> > >>> Technology
> > >>> China) ; Ruifeng Wang (Arm Technology China)
> > >

Re: [ovs-dev] [PATCH v2 1/2] dpif-netdev: Expand the meter capacity using cmap

2020-03-15 Thread Yanqin Wei
Hi Xiangxia,

The meter id is allocated by id_pool, which always returns the lowest
available id.
There is a lightweight, scalable data structure that supports direct-address
lookup. It can achieve several times the lookup performance of cmap_find,
and it does not copy memory when expanding:
https://patchwork.ozlabs.org/patch/1253447/

I suggest using this lightweight data structure for the >65535 meter
instances case. Would you like to take a look at it?
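
For comparison, a direct-address lookup reduces to plain index arithmetic.
This sketch is distilled from the sda-table series linked above, with BASE
and LOG2 standing for SDA_TABLE_BASE_SIZE and SDA_TABLE_BASE_SIZE_LOG2:

uint32_t l1 = leftmost_1bit_idx(id);
uint32_t array_id = id < BASE ? 0 : l1 - LOG2 + 1;
uint32_t offset = id < BASE ? id : id - (1u << l1);
node = &array[array_id][offset];    /* no hashing, no probing */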

Best Regards,
Wei Yanqin

> -Original Message-
> From: dev  On Behalf Of
> xiangxia.m@gmail.com
> Sent: Sunday, March 15, 2020 2:43 PM
> To: d...@openvswitch.org
> Cc: Ilya Maximets 
> Subject: [ovs-dev] [PATCH v2 1/2] dpif-netdev: Expand the meter capacity
> using cmap
>
> From: Tonghao Zhang 
>
> For now, ovs-vswitchd use the array of the dp_meter struct to store meter's
> data, and at most, there are only 65536 (defined by MAX_METERS) meters
> that can be used. But in some case, for example, in the edge gateway, we
> should use 200,000+, at least, meters for IP address bandwidth limitation.
> Every one IP address will use two meters for its rx and tx path[1]. In other 
> way,
> ovs-vswitchd should support meter-offload (rte_mtr_xxx api introduced by
> dpdk.), but there are more than
> 65536 meters in the hardware, such as Mellanox ConnectX-6.
>
> This patch use cmap to manage the meter, instead of the array.
>
> * Insertion performance, ovs-ofctl add-meter 1000+ meters,
>   the cmap takes abount 4000ms, as same as previous implementation.
> * Lookup performance in datapath, we add 1000+ meter which rate is
>   10G (the NIC cards are 10Gb, so netdev-datapath will not drop the
>   packets.), and a flow which only forwarding the packets from p0
>   to p1, with meter action[2]. On other server, the pktgen-dpdk
>   will generate 64B packets to p0.
>   The forwarding performance is 4,814,400 pps. Without this path,
>   4,935,584 pps. There are about 1% performance loss. For addressing
>   this issue, next patch add a meter cache.
>
> [1].
> $ in_port=p0,ip,ip_dst=1.1.1.x action=meter:n,output:p1
> $ in_port=p1,ip,ip_src=1.1.1.x action=meter:m,output:p0
>
> [2].
> $ in_port=p0 action=meter:100,output:p1
>
> Cc: Ben Pfaff 
> Cc: Jarno Rajahalme 
> Cc: Ilya Maximets 
> Cc: Andy Zhou 
> Signed-off-by: Tonghao Zhang 
> ---
>  lib/dpif-netdev.c | 199 +++-
> --
>  1 file changed, 130 insertions(+), 69 deletions(-)
>
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index d393aab..5474d52 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -98,9 +98,11 @@ DEFINE_STATIC_PER_THREAD_DATA(uint32_t, recirc_depth, 0)
>
>  /* Configuration parameters. */
>  enum { MAX_FLOWS = 65536 }; /* Maximum number of flows in flow table. */
> -enum { MAX_METERS = 65536 };/* Maximum number of meters. */
> -enum { MAX_BANDS = 8 }; /* Maximum number of bands / meter. */
> -enum { N_METER_LOCKS = 64 };/* Maximum number of meters. */
> +
> +/* Maximum number of meters in the table. */ #define METER_ENTRY_MAX
> (1
> +<< 19)
> +/* Maximum number of bands / meter. */
> +#define METER_BAND_MAX (8)
>
>  COVERAGE_DEFINE(datapath_drop_meter);
>  COVERAGE_DEFINE(datapath_drop_upcall_error);
> @@ -280,6 +282,9 @@ struct dp_meter_band {
>  };
>
>  struct dp_meter {
> +struct cmap_node node;
> +struct ovs_mutex lock;
> +uint32_t id;
>  uint16_t flags;
>  uint16_t n_bands;
>  uint32_t max_delta_t;
> @@ -289,6 +294,12 @@ struct dp_meter {
>  struct dp_meter_band bands[];
>  };
>
> +struct dp_netdev_meter {
> +struct cmap table OVS_GUARDED;
> +struct ovs_mutex lock;  /* Used for meter table. */
> +uint32_t hash_basis;
> +};
> +
>  struct pmd_auto_lb {
>  bool auto_lb_requested; /* Auto load balancing requested by user. */
>  bool is_enabled;/* Current status of Auto load balancing. */
> @@ -329,8 +340,7 @@ struct dp_netdev {
>  atomic_uint32_t tx_flush_interval;
>
>  /* Meters. */
> -struct ovs_mutex meter_locks[N_METER_LOCKS];
> -struct dp_meter *meters[MAX_METERS]; /* Meter bands. */
> +struct dp_netdev_meter *meter;
>
>  /* Probability of EMC insertions is a factor of 'emc_insert_min'.*/
>  OVS_ALIGNED_VAR(CACHE_LINE_SIZE) atomic_uint32_t emc_insert_min;
> @@ -378,19 +388,6 @@ struct dp_netdev {
>  struct pmd_auto_lb pmd_alb;
>  };
>
> -static void meter_lock(const struct dp_netdev *dp, uint32_t meter_id)
> -OVS_ACQUIRES(dp->meter_locks[meter_id % N_METER_LOCKS])
> -{
> -ovs_mutex_lock(&dp->meter_locks[meter_id % N_METER_LOCKS]);
> -}
> -
> -static void meter_unlock(const struct dp_netdev *dp, uint32_t meter_id)
> -OVS_RELEASES(dp->meter_locks[meter_id % N_METER_LOCKS])
> -{
> -ovs_mutex_unlock(&dp->meter_locks[meter_id % N_METER_LOCKS]);
> -}
> -
> -
>  static struct dp_netdev_port *dp_netdev_lookup_port(const struct dp_netdev *dp,
>  odp_port_t)
>  

Re: [ovs-dev] [PATCH v2 3/3] travis: Enable OvS Travis CI for arm

2020-03-12 Thread Yanqin Wei
Thanks for the feedback. I have replied inline.

> -Original Message-
> From: Ilya Maximets 
> Sent: Thursday, March 12, 2020 12:20 AM
> To: Yanqin Wei ; Ilya Maximets
> ; ovs-dev@openvswitch.org; b...@ovn.org; Lance Yang
> 
> Cc: dwil...@us.ibm.com; Gavin Hu ; Ruifeng Wang
> ; Jieqiang Wang ;
> Malvika Gupta ; nd 
> Subject: Re: [ovs-dev] [PATCH v2 3/3] travis: Enable OvS Travis CI for arm
> 
> On 3/11/20 5:57 AM, Yanqin Wei wrote:
> > Hi Ilya,
> >
> > This patch has been in the review pipeline for some time. It runs stable on
> our internal repo more than two months.
> > Could you give us some suggestion about the next action I can take to speed
> up the merge of this patch?
> 
> Hi.  Sorry for things taking so long.
> I have this patch in my backlog for this or next week.
> 
> The main concern right know is possible significant increase of the checking
> time.  Are you sure that we need all the listed jobs?
> Do you expect some arm64 specific issues on the linking stage?
> I mean, maybe we could reduce number of different combinations of "shared"
> flags.  I had no chance to run this, so I don't know how much time these jobs
> really takes and what is the total time difference.

[Yanqin]  This is the latest build report for x86 and Arm.  Sparse is
disabled here because it has a compile issue for both x86 and Arm today.
x86 and Arm:
https://travis-ci.com/github/MarcoDWei/ovs/builds/152933388?utm_medium=notification&utm_source=email
Ran for 58 min 3 sec / Total time 4 hrs 12 min 24 sec
x86 only:
https://travis-ci.com/github/MarcoDWei/ovs/builds/152942934?utm_medium=notification&utm_source=email
Ran for 38 min 40 sec / Total time 2 hrs 55 min 4 sec

The total time increases by around 1 hr 17 min for the SIX new Arm jobs, and
the running time increases by around 20 min. The kernel datapath jobs look
like the most time-consuming ones:
OPTS="--disable-ssl"                        4 min 31 sec
KERNEL_LIST="5.0 4.20 4.19 4.18 4.17 4.16" 22 min 39 sec
KERNEL_LIST="4.15 4.14 4.9 4.4 3.16"       17 min 34 sec
DPDK=1 OPTS="--enable-shared"              11 min 17 sec
DPDK_SHARED=1                              10 min 35 sec
DPDK_SHARED=1 OPTS="--enable-shared"       11 min 21 sec

I agree with removing some "shared" flag combinations because they have a
low risk of CPU-specific issues at the linking stage.
Moreover, we could choose a subset of kernel versions for the Arm jobs,
which would significantly reduce the total and running time.  The running
time is then expected to increase by around 10 minutes:
OPTS="--disable-ssl"
KERNEL_LIST="5.0 4.20 4.16 4.9 3.16"
DPDK=1 OPTS="--enable-shared"
DPDK_SHARED=1

> 
> Best regards, Ilya Maximets.
> 
> 
> >
> > Best Regards,
> > Wei Yanqin
> >
> >> -Original Message-
> >> From: Lance Yang 
> >> Sent: Tuesday, January 21, 2020 9:06 AM
> >> To: Ilya Maximets ; ovs-dev@openvswitch.org
> >> Cc: b...@ovn.org; Yanqin Wei ;
> dwil...@us.ibm.com;
> >> Gavin Hu ; Ruifeng Wang ;
> >> Jieqiang Wang ; Malvika Gupta
> >> ; nd 
> >> Subject: RE: [ovs-dev] [PATCH v2 3/3] travis: Enable OvS Travis CI
> >> for arm
> >>
> >>
> >>> -Original Message-
> >>> From: Ilya Maximets 
> >>> Sent: Saturday, December 7, 2019 12:39 AM
> >>> To: Lance Yang (Arm Technology China) ; ovs-
> >>> d...@openvswitch.org
> >>> Cc: i.maxim...@ovn.org; b...@ovn.org; Yanqin Wei (Arm Technology
> >>> China) ; dwil...@us.ibm.com; Gavin Hu (Arm
> >>> Technology
> >>> China) ; Ruifeng Wang (Arm Technology China)
> >>> ; Jieqiang Wang (Arm Technology China)
> >>> ; Malvika Gupta ;
> nd
> >>> 
> >>> Subject: Re: [ovs-dev] [PATCH v2 3/3] travis: Enable OvS Travis CI
> >>> for arm
> >>>
> >>> On 06.12.2019 04:26, Lance Yang wrote:
> >>>> Enable part of travis jobs with gcc compiler for arm64 architecture
> >>>>
> >>>> 1. Add arm jobs into the matrix in .travis.yml configuration file 2.
> >>>> To enable OVS-DPDK jobs, set the build target according to
> >>>> different CPU architectures 3. Temporarily disable sparse checker
> >>>> because of static code checking failure on arm64
> >>>>
> >>>> Successful travis build jobs report:
> >>>> https://travis-ci.org/yzyuestc/ovs/builds/621037339
> >>>>
> >>>> Reviewed-by: Ya

[ovs-dev] [PATCH v1 3/3] tests/test-sda-table: add check test for sda-table.

2020-03-12 Thread Yanqin Wei
Add check tests for sda-table lookup, insertion and deletion.

Reviewed-by: Gavin Hu 
Reviewed-by: Malvika Gupta 
Signed-off-by: Yanqin Wei 
---
 tests/automake.mk  |   3 +-
 tests/test-sda-table.c | 197 +
 2 files changed, 199 insertions(+), 1 deletion(-)
 create mode 100644 tests/test-sda-table.c

diff --git a/tests/automake.mk b/tests/automake.mk
index 9c7ebdce9..9581ffb8e 100644
--- a/tests/automake.mk
+++ b/tests/automake.mk
@@ -462,7 +462,8 @@ tests_ovstest_SOURCES = \
tests/test-bitmap.c \
tests/test-vconn.c \
tests/test-aa.c \
-   tests/test-stopwatch.c
+   tests/test-stopwatch.c \
+   tests/test-sda-table.c
 
 if !WIN32
 tests_ovstest_SOURCES += \
diff --git a/tests/test-sda-table.c b/tests/test-sda-table.c
new file mode 100644
index 0..6bcd4cefb
--- /dev/null
+++ b/tests/test-sda-table.c
@@ -0,0 +1,197 @@
+/*
+ * Copyright (c) 2020 Arm Limited.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/* A functional test for some of the functions and macros declared in
+ * sda-table.h. */
+
+#include 
+#undef NDEBUG
+#include "sda-table.h"
+#include "id-pool.h"
+#include 
+#include 
+#include 
+#include "bitmap.h"
+#include "command-line.h"
+#include "ovstest.h"
+#include "ovs-thread.h"
+#include "util.h"
+
+struct element {
+struct sda_table_node node;
+};
+
+/* Tests basic sda table insertion and deletion for single node chain. */
+static void
+test_sda_table_add_del_singlenode_chain(void)
+{
+enum { N_ELEMS = 1 };
+struct element elements[N_ELEMS];
+uint32_t id[N_ELEMS];
+
+size_t i;
+struct id_pool *pool = id_pool_create(0, UINT32_MAX - 1);
+struct sda_table sda = SDA_TABLE_INITIALIZER;
+const struct sda_table_node *node;
+bool ret;
+
+for (i = 0; i < N_ELEMS; i++) {
+ret = id_pool_alloc_id(pool, &id[i]);
+ovs_assert(ret == true);
+
+ret = sda_table_insert_node(&sda, id[i], &elements[id[i]].node);
+ovs_assert(ret == true);
+
+node = sda_table_find_node(&sda, id[i]);
+ovs_assert(node == &elements[id[i]].node);
+
+ret = sda_table_remove_node(&sda, id[i], &elements[id[i]].node);
+ovs_assert(ret == true);
+
+node = sda_table_find_node(&sda, id[i]);
+ovs_assert(node == NULL);
+
+id_pool_free_id(pool, id[i]);
+}
+
+sda_table_destroy(&sda);
+id_pool_destroy(pool);
+}
+
+
+static void
+test_sda_table_add_del_multinode_chain(void)
+{
+enum { N_ELEMS = 1, N_NODES = 10 };
+struct element elements[N_ELEMS][N_NODES];
+uint32_t id[N_ELEMS];
+
+struct element *elm;
+size_t i, j;
+struct id_pool *pool = id_pool_create(0, UINT32_MAX - 1);
+struct sda_table sda = SDA_TABLE_INITIALIZER;
+bool ret;
+
+for (i = 0; i < N_ELEMS; i++) {
+ret = id_pool_alloc_id(pool, &id[i]);
+ovs_assert(ret == true);
+
+for (j = 0; j < N_NODES; j++) {
+ret = sda_table_insert_node(&sda, id[i], &elements[id[i]][j].node);
+ovs_assert(ret == true);
+}
+
+SDA_TABLE_FOR_EACH_WITH_ID (elm, node, id[i], &sda) {
+for (j = 0; j < N_NODES; j++) {
+if (elm == &elements[id[i]][j]) {
+break;
+}
+}
+ovs_assert(elm == &elements[id[i]][j]);
+}
+
+for (j = N_NODES / 2; j < N_NODES; j++) {
+ret = sda_table_remove_node(&sda, id[i], &elements[id[i]][j].node);
+ovs_assert(ret == true);
+}
+
+SDA_TABLE_FOR_EACH_WITH_ID (elm, node, id[i], &sda) {
+for (j = 0; j < N_NODES / 2; j++) {
+if (elm == &elements[id[i]][j]) {
+break;
+}
+}
+ovs_assert(elm == &elements[id[i]][j]);
+}
+}
+
+for (i = N_ELEMS / 2; i < N_ELEMS; i++) {
+for (j = 0; j < N_NODES; j++) {
+ret = sda_table_remove_node(&sda, id[i], &elements[id[i]][j].node);
+}
+id_pool_free_id(pool, id[i]);
+
+SDA_TABLE_FOR_EACH_WITH_ID (elm, node, id[i], &sda) {
+ovs_assert(elm == NULL);
+}
+}
+
+for (i = 0; i < N_ELEMS / 2; i++) {
+SDA_TABLE_FOR_EACH_WITH_ID (elm, node, id[i], &sda) {
+for (j = 0; j < N_NODES; j++) {
+if (elm == &elements[id[i]][j]) {
+break;
+}
+}
+ovs_assert(elm == &elements[id[i]][j]);

[ovs-dev] [PATCH v1 1/3] lib: implement scalable direct address table for fast lookup

2020-03-12 Thread Yanqin Wei
In the partial flow offloading path, the mark-to-flow table lookup is a hot
spot: "cmap_find" takes more than 20% of CPU cycles in the datapath. A hash
map is too heavy for this case, and a lighter direct-address table can be
used for faster lookup.

This patch implements a scalable direct-address table. It is composed of a
series of arrays: the table starts with no arrays at all, and it can expand
without copying memory.
An element of an array is a chain header whose address can be calculated
from the index. The table supports single-writer, multi-reader concurrent
access.
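
The geometry this implies, sketched from the SDA_ARRAY_SIZE macro and
sda_table_find_node_header below (BASE stands for SDA_TABLE_BASE_SIZE and
LOG2 for SDA_TABLE_BASE_SIZE_LOG2):

/* array 0          : ids [0, BASE)                   BASE entries
 * array k (k >= 1) : ids [2^(LOG2+k-1), 2^(LOG2+k))  2^(LOG2+k-1) entries
 *
 * Each additional array doubles the total capacity, and existing arrays
 * never move, so expansion copies nothing and concurrent readers are
 * undisturbed. */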

Reviewed-by: Gavin Hu 
Reviewed-by: Malvika Gupta 
Signed-off-by: Yanqin Wei 
---
 lib/automake.mk |   2 +
 lib/sda-table.c | 166 
 lib/sda-table.h | 127 
 3 files changed, 295 insertions(+)
 create mode 100644 lib/sda-table.c
 create mode 100644 lib/sda-table.h

diff --git a/lib/automake.mk b/lib/automake.mk
index 95925b57c..aff21d82a 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -238,6 +238,8 @@ lib_libopenvswitch_la_SOURCES = \
lib/pcap-file.h \
lib/perf-counter.h \
lib/perf-counter.c \
+   lib/sda-table.h \
+   lib/sda-table.c \
lib/stopwatch.h \
lib/stopwatch.c \
lib/poll-loop.c \
diff --git a/lib/sda-table.c b/lib/sda-table.c
new file mode 100644
index 0..d83f9e5d0
--- /dev/null
+++ b/lib/sda-table.c
@@ -0,0 +1,166 @@
+/*
+ * Copyright (c) 2020 Arm Limited.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include 
+#include "sda-table.h"
+#include "util.h"
+
+#define SDA_ARRAY_SIZE(ARRAYID) (ARRAYID == 0 ? SDA_TABLE_BASE_SIZE : \
+1 << (SDA_TABLE_BASE_SIZE_LOG2 + ARRAYID - 1))
+
+static bool
+sda_table_find_node_header(struct sda_table *sda, uint32_t id,
+struct sda_table_node **header, bool create_array)
+{
+struct sda_table_node *p_array;
+uint32_t array_id, offset;
+uint32_t l1 = leftmost_1bit_idx(id);
+
+array_id = id < SDA_TABLE_BASE_SIZE ?
+ 0 : l1 - SDA_TABLE_BASE_SIZE_LOG2 + 1;
+
+p_array = ovsrcu_get(struct sda_table_node *, &sda->array[array_id]);
+if (p_array == NULL) {
+if (create_array) {
+p_array = xzalloc_cacheline(sizeof(struct sda_table_node) *
+SDA_ARRAY_SIZE(array_id));
+ovsrcu_set(&sda->array[array_id], p_array);
+} else {
+return false;
+}
+}
+
+offset = id < SDA_TABLE_BASE_SIZE ?
+ id : id - (1 << l1);
+*header = p_array + offset;
+
+return true;
+}
+
+bool
+sda_table_insert_node(struct sda_table *sda, uint32_t id,
+struct sda_table_node *new)
+{
+struct sda_table_node *header = NULL;
+
+if (sda_table_find_node_header(sda, id, &header, true)) {
+struct sda_table_node *node = sda_table_node_next_protected(header);
+ovsrcu_set_hidden(&new->next, node);
+ovsrcu_set(&header->next, new);
+return true;
+} else {
+return false;
+}
+}
+
+bool
+sda_table_remove_node(struct sda_table *sda, uint32_t id,
+struct sda_table_node *node)
+{
+struct sda_table_node *iter = NULL;
+
+if (sda_table_find_node_header(sda, id, &iter, false)) {
+for (;;) {
+struct sda_table_node *next = sda_table_node_next_protected(iter);
+
+if (next == node) {
+ovsrcu_set(&iter->next, sda_table_node_next_protected(node));
+return true;
+} else if (next == NULL) {
+return false;
+}
+iter = next;
+}
+}
+
+return false;
+}
+
+const struct sda_table_node *
+sda_table_find_node(struct sda_table *sda, uint32_t id)
+{
+struct sda_table_node *header = NULL;
+
+if (sda_table_find_node_header(sda, id, &header, false) && header) {
+return sda_table_node_next(header);
+} else {
+return NULL;
+}
+}
+
+void
+sda_table_destroy(struct sda_table *sda)
+{
+if (sda) {
+for (uint32_t i = 0; i < SDA_TABLE_ARRAY_NUM; i++) {
+const struct sda_table_node *b =
+ovsrcu_get(struct sda_table_node *, &sda->array[i]);
+if (b) {
+/* Free the fetched array, not the slot that points at it. */
+ovsrcu_postpone(free, CONST_CAST(struct sda_table_node *, b));
+ovsrcu_set(&sda->array[i], NULL);
+}
+}

[ovs-dev] [PATCH v1 0/3] improve mark2flow lookup for partial offloading datapath

2020-03-12 Thread Yanqin Wei
In the partial flow offloading datapath, the mark-to-flow table lookup is a
hot spot: "cmap_find" takes more than 20% of CPU cycles in the datapath. A
hash map is too heavy for this case, and a lighter data structure can be
used for faster lookup. This series applies a direct-address table to the
mark-to-flow lookup. The throughput uplift is more than 10% in the
single-flow case and 20% with more than 1000 megaflows.

Yanqin Wei (3):
  lib: implement scalable direct address table for fast lookup
  dpif-netdev: improve partial offloading datapath by sda-table lookup
  tests/test-sda-table: add check test for sda-table.

 lib/automake.mk|   2 +
 lib/dpif-netdev.c  |  30 ---
 lib/sda-table.c| 166 ++
 lib/sda-table.h| 126 ++
 tests/automake.mk  |   3 +-
 tests/test-sda-table.c | 197 +
 6 files changed, 510 insertions(+), 14 deletions(-)
 create mode 100644 lib/sda-table.c
 create mode 100644 lib/sda-table.h
 create mode 100644 tests/test-sda-table.c

-- 
2.17.1



[ovs-dev] [PATCH v1 2/3] dpif-netdev: improve partial offloading datapath by sda-table lookup

2020-03-12 Thread Yanqin Wei
cmap_find is a hot spot in the partial offloading datapath. This patch
applies the sda-table to the mark-to-flow lookup. The throughput uplift is
more than 10% in the single-flow case and 20% with 1000 megaflows.

Reviewed-by: Gavin Hu 
Reviewed-by: Malvika Gupta 
Signed-off-by: Yanqin Wei 
---
 lib/dpif-netdev.c | 30 +-
 1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index d393aab5e..47eacdc51 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -48,6 +48,7 @@
 #include "flow.h"
 #include "hmapx.h"
 #include "id-pool.h"
+#include "sda-table.h"
 #include "ipf.h"
 #include "netdev.h"
 #include "netdev-offload.h"
@@ -523,7 +524,8 @@ struct dp_netdev_flow {
 /* Hash table index by unmasked flow. */
 const struct cmap_node node; /* In owning dp_netdev_pmd_thread's */
  /* 'flow_table'. */
-const struct cmap_node mark_node; /* In owning flow_mark's mark_to_flow */
+const struct sda_table_node mark_node; /* In owning flow_mark's */
+ /* mark_to_flow */
 const ovs_u128 ufid; /* Unique flow identifier. */
 const ovs_u128 mega_ufid;/* Unique mega flow identifier. */
 const unsigned pmd_id;   /* The 'core_id' of pmd thread owning this */
@@ -2159,13 +2161,13 @@ struct megaflow_to_mark_data {
 
 struct flow_mark {
 struct cmap megaflow_to_mark;
-struct cmap mark_to_flow;
+struct sda_table mark_to_flow;
 struct id_pool *pool;
 };
 
 static struct flow_mark flow_mark = {
 .megaflow_to_mark = CMAP_INITIALIZER,
-.mark_to_flow = CMAP_INITIALIZER,
+.mark_to_flow = SDA_TABLE_INITIALIZER,
 };
 
 static uint32_t
@@ -2248,9 +2250,10 @@ mark_to_flow_associate(const uint32_t mark, struct dp_netdev_flow *flow)
 {
 dp_netdev_flow_ref(flow);
 
-cmap_insert(&flow_mark.mark_to_flow,
-CONST_CAST(struct cmap_node *, &flow->mark_node),
-hash_int(mark, 0));
+sda_table_insert_node(&flow_mark.mark_to_flow,
+mark,
+CONST_CAST(struct sda_table_node *, &flow->mark_node));
+
 flow->mark = mark;
 
 VLOG_DBG("Associated dp_netdev flow %p with mark %u\n", flow, mark);
@@ -2261,8 +2264,8 @@ flow_mark_has_no_ref(uint32_t mark)
 {
 struct dp_netdev_flow *flow;
 
-CMAP_FOR_EACH_WITH_HASH (flow, mark_node, hash_int(mark, 0),
- &flow_mark.mark_to_flow) {
+SDA_TABLE_FOR_EACH_WITH_ID (flow, mark_node, mark,
+   &flow_mark.mark_to_flow) {
 if (flow->mark == mark) {
 return false;
 }
@@ -2277,10 +2280,11 @@ mark_to_flow_disassociate(struct dp_netdev_pmd_thread *pmd,
 {
 int ret = 0;
 uint32_t mark = flow->mark;
-struct cmap_node *mark_node = CONST_CAST(struct cmap_node *,
+struct sda_table_node *mark_node = CONST_CAST(struct sda_table_node *,
 &flow->mark_node);
 
-cmap_remove(&flow_mark.mark_to_flow, mark_node, hash_int(mark, 0));
+ovs_assert(sda_table_remove_node(&flow_mark.mark_to_flow,
+ mark, mark_node));
 flow->mark = INVALID_FLOW_MARK;
 
 /*
@@ -2316,7 +2320,7 @@ flow_mark_flush(struct dp_netdev_pmd_thread *pmd)
 {
 struct dp_netdev_flow *flow;
 
-CMAP_FOR_EACH (flow, mark_node, &flow_mark.mark_to_flow) {
+SDA_TABLE_FOR_EACH (flow, mark_node, &flow_mark.mark_to_flow) {
 if (flow->pmd_id == pmd->core_id) {
 queue_netdev_flow_del(pmd, flow);
 }
@@ -2329,8 +2333,8 @@ mark_to_flow_find(const struct dp_netdev_pmd_thread *pmd,
 {
 struct dp_netdev_flow *flow;
 
-CMAP_FOR_EACH_WITH_HASH (flow, mark_node, hash_int(mark, 0),
- &flow_mark.mark_to_flow) {
+SDA_TABLE_FOR_EACH_WITH_ID (flow, mark_node, mark,
+&flow_mark.mark_to_flow) {
 if (flow->mark == mark && flow->pmd_id == pmd->core_id &&
 flow->dead == false) {
 return flow;
-- 
2.17.1



Re: [ovs-dev] [PATCH v2 3/3] travis: Enable OvS Travis CI for arm

2020-03-11 Thread Yanqin Wei
Hi Ilya,

This patch has been in the review pipeline for some time. It has been
running stably on our internal repo for more than two months.
Could you give us some suggestions on what to do next to speed up the merge
of this patch?

Best Regards,
Wei Yanqin

> -Original Message-
> From: Lance Yang 
> Sent: Tuesday, January 21, 2020 9:06 AM
> To: Ilya Maximets ; ovs-dev@openvswitch.org
> Cc: b...@ovn.org; Yanqin Wei ; dwil...@us.ibm.com;
> Gavin Hu ; Ruifeng Wang ;
> Jieqiang Wang ; Malvika Gupta
> ; nd 
> Subject: RE: [ovs-dev] [PATCH v2 3/3] travis: Enable OvS Travis CI for arm
> 
> 
> > -Original Message-
> > From: Ilya Maximets 
> > Sent: Saturday, December 7, 2019 12:39 AM
> > To: Lance Yang (Arm Technology China) ; ovs-
> > d...@openvswitch.org
> > Cc: i.maxim...@ovn.org; b...@ovn.org; Yanqin Wei (Arm Technology China)
> > ; dwil...@us.ibm.com; Gavin Hu (Arm Technology
> > China) ; Ruifeng Wang (Arm Technology China)
> > ; Jieqiang Wang (Arm Technology China)
> > ; Malvika Gupta ; nd
> > 
> > Subject: Re: [ovs-dev] [PATCH v2 3/3] travis: Enable OvS Travis CI for
> > arm
> >
> > On 06.12.2019 04:26, Lance Yang wrote:
> > > Enable part of travis jobs with gcc compiler for arm64 architecture
> > >
> > > 1. Add arm jobs into the matrix in .travis.yml configuration file 2.
> > > To enable OVS-DPDK jobs, set the build target according to different
> > > CPU architectures 3. Temporarily disable sparse checker because of
> > > static code checking failure on arm64
> > >
> > > Successful travis build jobs report:
> > > https://travis-ci.org/yzyuestc/ovs/builds/621037339
> > >
> > > Reviewed-by: Yanqin Wei 
> > > Reviewed-by: Jieqiang Wang 
> > > Reviewed-by: Gavin Hu 
> > > Signed-off-by: Lance Yang 
> > > ---
> >
> > Compiler crashed while building DPDK:
> >
> > /home/travis/build/ovsrobot/ovs/dpdk-dir/drivers/net/ixgbe/ixgbe_pf.c:
> > In function
> > ‘ixgbe_pf_host_configure’:
> > /home/travis/build/ovsrobot/ovs/dpdk-dir/drivers/net/ixgbe/ixgbe_pf.c:
> > 297:1: internal compiler error: Segmentation fault
> >
> > https://travis-ci.org/ovsrobot/ovs/jobs/621434216#L1999
> >
> > This is not good.
> > Need to check how frequently this happens.
> >
> > Best regards, Ilya Maximets.
> [Lance]
> Hi Ilya,
> 
> After you give us the feedback about the segmentation fault issue, we keep
> running the travis CI to observe the frequency. We run those arm jobs on a
> regular basis and clear cache on every single build. We have run the build at
> least 2 times a day for more than a month.
> 
> The good thing is that we haven't reproduce the segfault issue. Job reports 
> are
> available at: https://travis-ci.org/yzyuestc/ovs/builds/639064033
> 
> Travis CI community also made some adjustment on their side, you can see
> their reply here: https://travis-ci.community/t/segfaults-in-arm64-
> environment/5617/13
> 
> I think the segfault issue may occur by chance. At least, it is not a 
> frequent one,
> which is unlikely to cause issues to OVS developers. Could you please check it
> again?
> 
> Best Regards,
> Lance



Re: [ovs-dev] [PATCH v3] lib: use acquire-release semantics for pvector size

2020-03-03 Thread Yanqin Wei
Hi Ilya,

Thanks for your comments. V4 has been updated, could you please check it again?

https://patchwork.ozlabs.org/patch/1248440/

Best Regards,
Wei Yanqin

> -Original Message-
> From: Ilya Maximets 
> Sent: Friday, February 28, 2020 9:40 PM
> To: Yanqin Wei ; d...@openvswitch.org
> Cc: nd ; Lijian Zhang ; Gavin Hu
> ; i.maxim...@ovn.org
> Subject: Re: [ovs-dev] [PATCH v3] lib: use acquire-release semantics for 
> pvector
> size
> 
> On 2/27/20 5:12 PM, Yanqin Wei wrote:
> > Read/write concurrency of pvector library is implemented by a temp
> > vector and RCU protection. Considering performance reason, insertion
> > does not follow this scheme.
> > In insertion function, a thread fence ensures size incrementation is
> > done after new entry is stored. But there is no barrier in the
> > iteration fuction(pvector_cursor_init). Entry point access may be
> > reorderd before
> 
> s/reorderd/reordered/
> 
> > loading vector size, so the invalid entry point may be loaded when
> > vector iteration.
> > This patch fixes it by acquire-release pair. It can guarantee new size
> > is observed by reader after new entry stored by writer. And this is
> > implemented by one-way barrier instead of two-way memory fence.
> >
> 
> I believe we need a 'Fixes' tag here.
> 
> > Reviewed-by: Gavin Hu 
> > Reviewed-by: Lijian Zhang 
> > Signed-off-by: Yanqin Wei 
> > ---
> >  lib/pvector.c | 18 +++++++++++-------
> >  lib/pvector.h | 12 ++++++++----
> > 
> >  2 files changed, 19 insertions(+), 11 deletions(-)
> >
> > diff --git a/lib/pvector.c b/lib/pvector.c
> > index aaeee9214..f557b0559 100644
> > --- a/lib/pvector.c
> > +++ b/lib/pvector.c
> > @@ -33,7 +33,7 @@ pvector_impl_alloc(size_t size)
> >  struct pvector_impl *impl;
> >
> >  impl = xmalloc(sizeof *impl + size * sizeof impl->vector[0]);
> > -impl->size = 0;
> > +atomic_init(&impl->size, 0);
> >  impl->allocated = size;
> >
> >  return impl;
> > @@ -117,18 +117,22 @@ pvector_insert(struct pvector *pvec, void *ptr, int priority)
> >  {
> >  struct pvector_impl *temp = pvec->temp;
> >  struct pvector_impl *old = pvector_impl_get(pvec);
> > +size_t size;
> >
> >  ovs_assert(ptr != NULL);
> >
> > +/* There is no possible concurrent writer. Insertions must be protected
> > + * by mutex or be always executed from the same thread */
> 
> Please, add a period to the end of the comment.
> 
> > +atomic_read_relaxed(&old->size, &size);
> > +
> >  /* Check if can add to the end without reallocation. */
> > -if (!temp && old->allocated > old->size &&
> > -(!old->size || priority <= old->vector[old->size - 1].priority)) {
> > -old->vector[old->size].ptr = ptr;
> > -old->vector[old->size].priority = priority;
> > +if (!temp && old->allocated > size &&
> > +(!size || priority <= old->vector[size - 1].priority)) {
> > +old->vector[size].ptr = ptr;
> > +old->vector[size].priority = priority;
> >  /* Size increment must not be visible to the readers before the new
> >   * entry is stored. */
> > -atomic_thread_fence(memory_order_release);
> > -++old->size;
> > +atomic_store_explicit(&old->size, size + 1,
> > + memory_order_release);
> >  } else {
> >  if (!temp) {
> >  temp = pvector_impl_dup(old);
> > diff --git a/lib/pvector.h b/lib/pvector.h
> > index b990ed9d5..55b725ba7 100644
> > --- a/lib/pvector.h
> > +++ b/lib/pvector.h
> > @@ -69,8 +69,8 @@ struct pvector_entry {  };
> >
> >  struct pvector_impl {
> > -size_t size;   /* Number of entries in the vector. */
> > -size_t allocated;  /* Number of allocated entries. */
> > +atomic_size_t size;   /* Number of entries in the vector. */
> > +size_t allocated; /* Number of allocated entries. */
> >  struct pvector_entry vector[];
> >  };
> >
> > @@ -181,12 +181,16 @@ pvector_cursor_init(const struct pvector *pvec,
> > {
> >  const struct pvector_impl *impl;
> >  struct pvector_cursor cursor;
> > +size_t size;
> >
> >  impl = ovsrcu_get(struct pvector_impl *, &pvec->impl);
> >
> > -ovs_prefetch_range(impl->vector, impl->size * sizeof impl->vector[0]);
> > +/* Use memory_order_acquire to ensure entry access can not be
> > + * reordered to 

[ovs-dev] [PATCH v4] lib: use acquire-release semantics for pvector size

2020-03-03 Thread Yanqin Wei
Read/write concurrency of the pvector library is implemented with a temp
vector and RCU protection. For performance reasons, insertion does not
follow this scheme.
In the insertion function, a thread fence ensures the size increment is
done after the new entry is stored. But there is no barrier in the
iteration function (pvector_cursor_init), so entry access may be
reordered before the vector size is loaded and an invalid entry may be
loaded during vector iteration.
This patch fixes it with an acquire-release pair, which guarantees that
the new size is observed by the reader only after the new entry has been
stored by the writer, implemented with one-way barriers instead of a
two-way memory fence.

Fixes: fe7cfa5c3f19 ("lib/pvector: Non-intrusive RCU priority vector.")
Reviewed-by: Gavin Hu 
Reviewed-by: Lijian Zhang 
Signed-off-by: Yanqin Wei 
---
 lib/pvector.c | 18 +++++++++++-------
 lib/pvector.h | 13 +++++++++----
 2 files changed, 20 insertions(+), 11 deletions(-)

diff --git a/lib/pvector.c b/lib/pvector.c
index aaeee9214..cc527fdc4 100644
--- a/lib/pvector.c
+++ b/lib/pvector.c
@@ -33,7 +33,7 @@ pvector_impl_alloc(size_t size)
 struct pvector_impl *impl;
 
 impl = xmalloc(sizeof *impl + size * sizeof impl->vector[0]);
-impl->size = 0;
+atomic_init(&impl->size, 0);
 impl->allocated = size;
 
 return impl;
@@ -117,18 +117,22 @@ pvector_insert(struct pvector *pvec, void *ptr, int priority)
 {
 struct pvector_impl *temp = pvec->temp;
 struct pvector_impl *old = pvector_impl_get(pvec);
+size_t size;
 
 ovs_assert(ptr != NULL);
 
+/* There is no possible concurrent writer. Insertions must be protected
+ * by mutex or be always executed from the same thread. */
+atomic_read_relaxed(&old->size, &size);
+
 /* Check if can add to the end without reallocation. */
-if (!temp && old->allocated > old->size &&
-(!old->size || priority <= old->vector[old->size - 1].priority)) {
-old->vector[old->size].ptr = ptr;
-old->vector[old->size].priority = priority;
+if (!temp && old->allocated > size &&
+(!size || priority <= old->vector[size - 1].priority)) {
+old->vector[size].ptr = ptr;
+old->vector[size].priority = priority;
 /* Size increment must not be visible to the readers before the new
  * entry is stored. */
-atomic_thread_fence(memory_order_release);
-++old->size;
+atomic_store_explicit(&old->size, size + 1, memory_order_release);
 } else {
 if (!temp) {
 temp = pvector_impl_dup(old);
diff --git a/lib/pvector.h b/lib/pvector.h
index b990ed9d5..c5024487f 100644
--- a/lib/pvector.h
+++ b/lib/pvector.h
@@ -69,8 +69,8 @@ struct pvector_entry {
 };
 
 struct pvector_impl {
-size_t size;   /* Number of entries in the vector. */
-size_t allocated;  /* Number of allocated entries. */
+atomic_size_t size;   /* Number of entries in the vector. */
+size_t allocated; /* Number of allocated entries. */
 struct pvector_entry vector[];
 };
 
@@ -181,12 +181,17 @@ pvector_cursor_init(const struct pvector *pvec,
 {
 const struct pvector_impl *impl;
 struct pvector_cursor cursor;
+size_t size;
 
 impl = ovsrcu_get(struct pvector_impl *, &pvec->impl);
 
-ovs_prefetch_range(impl->vector, impl->size * sizeof impl->vector[0]);
+/* Use memory_order_acquire to ensure entry access can not be
+ * reordered to happen before size read. */
+atomic_read_explicit(&CONST_CAST(struct pvector_impl *, impl)->size,
+&size, memory_order_acquire);
+ovs_prefetch_range(impl->vector, size * sizeof impl->vector[0]);
 
-cursor.size = impl->size;
+cursor.size = size;
 cursor.vector = impl->vector;
 cursor.entry_idx = -1;
 
-- 
2.17.1
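
For readers less familiar with C11 ordering, the guarantee this patch
relies on can be reproduced with plain <stdatomic.h>. This is a minimal
sketch (not the OVS ovs-atomic wrappers), and it assumes a single writer,
as the comment in pvector_insert() states; bounds checks are omitted:

    #include <stdatomic.h>
    #include <stddef.h>

    #define CAPACITY 8

    struct tiny_pvector {
        atomic_size_t size;       /* Release-stored by the writer.     */
        void *entries[CAPACITY];  /* Plain stores, ordered via 'size'. */
    };

    /* Writer: store the entry first, then publish the new size with
     * release semantics (mirrors pvector_insert()). */
    static void
    tiny_insert(struct tiny_pvector *v, void *ptr)
    {
        /* Relaxed read is fine: 'size' has only one writer. */
        size_t n = atomic_load_explicit(&v->size, memory_order_relaxed);

        v->entries[n] = ptr;
        atomic_store_explicit(&v->size, n + 1, memory_order_release);
    }

    /* Reader: acquire-load the size; reads of entries[0..n-1] made
     * after this call cannot be hoisted above it (mirrors
     * pvector_cursor_init()). */
    static size_t
    tiny_size(struct tiny_pvector *v)
    {
        return atomic_load_explicit(&v->size, memory_order_acquire);
    }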



Re: [ovs-dev] [PATCH v2] lib: use acquire-release semantics for pvector size

2020-02-27 Thread Yanqin Wei
Hi Ilya,

V3 has been updated based on your comments. Thanks.

Best Regards,
Wei Yanqin

> -Original Message-
> From: Ilya Maximets 
> Sent: Wednesday, February 26, 2020 9:41 PM
> To: Yanqin Wei ; d...@openvswitch.org
> Cc: nd ; Lijian Zhang ; Gavin Hu
> ; i.maxim...@ovn.org
> Subject: Re: [ovs-dev] [PATCH v2] lib: use acquire-release semantics for 
> pvector
> size
> 
> On 2/26/20 1:33 AM, Yanqin Wei wrote:
> > Read/write concurrency of pvector library is implemented by a temp
> > vector and RCU protection. Considering performance reason, insertion
> > does not follow this scheme.
> > In insertion function, a thread fence ensures size incrementation is
> > done after new entry is stored. But there is no barrier in the
> > iteration fuction(pvector_cursor_init). Entry point access may be
> > reorderd before loading vector size, so the invalid entry point may be
> > loaded when vector iteration.
> > This patch fixes it by acquire-release pair. It can guarantee new size
> > is observed by reader after new entry stored by writer. And this is
> > implemented by one-way barrier instead of two-way memory fence.
> >
> > Reviewed-by: Gavin Hu 
> > Reviewed-by: Lijian Zhang 
> > Signed-off-by: Yanqin Wei 
> > ---
> >  lib/pvector.c | 16 +++++++++-------
> >  lib/pvector.h | 10 ++++++----
> >  2 files changed, 15 insertions(+), 11 deletions(-)
> >
> > diff --git a/lib/pvector.c b/lib/pvector.c
> > index aaeee9214..d367079fd 100644
> > --- a/lib/pvector.c
> > +++ b/lib/pvector.c
> > @@ -33,7 +33,7 @@ pvector_impl_alloc(size_t size)
> >  struct pvector_impl *impl;
> >
> >  impl = xmalloc(sizeof *impl + size * sizeof impl->vector[0]);
> > -impl->size = 0;
> > +atomic_init(&impl->size, 0);
> >  impl->allocated = size;
> >
> >  return impl;
> > @@ -117,18 +117,20 @@ pvector_insert(struct pvector *pvec, void *ptr, int priority)
> >  {
> >  struct pvector_impl *temp = pvec->temp;
> >  struct pvector_impl *old = pvector_impl_get(pvec);
> > +size_t size;
> >
> >  ovs_assert(ptr != NULL);
> >
> > +atomic_read_relaxed(&old->size, &size);
> 
> I still think that we need a comment here why this read is relaxed.
> 
> > +
> >  /* Check if can add to the end without reallocation. */
> > -if (!temp && old->allocated > old->size &&
> > -(!old->size || priority <= old->vector[old->size - 1].priority)) {
> > -old->vector[old->size].ptr = ptr;
> > -old->vector[old->size].priority = priority;
> > +if (!temp && old->allocated > size &&
> > +(!size || priority <= old->vector[size - 1].priority)) {
> > +old->vector[size].ptr = ptr;
> > +old->vector[size].priority = priority;
> >  /* Size increment must not be visible to the readers before the new
> >   * entry is stored. */
> > -atomic_thread_fence(memory_order_release);
> > -++old->size;
> > +atomic_store_explicit(&old->size, size + 1,
> > + memory_order_release);
> >  } else {
> >  if (!temp) {
> >  temp = pvector_impl_dup(old);
> > diff --git a/lib/pvector.h b/lib/pvector.h
> > index b990ed9d5..c65276638 100644
> > --- a/lib/pvector.h
> > +++ b/lib/pvector.h
> > @@ -69,8 +69,8 @@ struct pvector_entry {  };
> >
> >  struct pvector_impl {
> > -size_t size;   /* Number of entries in the vector. */
> > -size_t allocated;  /* Number of allocated entries. */
> > +atomic_size_t size;  /* Number of entries in the vector. */
> > +size_t allocated; /* Number of allocated entries. */
> 
> Comment alignment is off.
> 
> >  struct pvector_entry vector[];
> >  };
> >
> > @@ -181,12 +181,14 @@ pvector_cursor_init(const struct pvector *pvec,
> > {
> >  const struct pvector_impl *impl;
> >  struct pvector_cursor cursor;
> > +size_t size;
> >
> >  impl = ovsrcu_get(struct pvector_impl *, &pvec->impl);
> >
> > -ovs_prefetch_range(impl->vector, impl->size * sizeof impl->vector[0]);
> > +atomic_read_explicit(&impl->size, &size, memory_order_acquire);
> > +ovs_prefetch_range(impl->vector, size * sizeof impl->vector[0]);
> >
> > -cursor.size = impl->size;
> > +cursor.size = size;
> >  cursor.vector = impl->vector;
> >  cursor.entry_idx = -1;
> >
> >



[ovs-dev] [PATCH v3] lib: use acquire-release semantics for pvector size

2020-02-27 Thread Yanqin Wei
Read/write concurrency of pvector library is implemented by a temp vector
and RCU protection. Considering performance reason, insertion does not
follow this scheme.
In insertion function, a thread fence ensures size incrementation is done
after new entry is stored. But there is no barrier in the iteration
function (pvector_cursor_init). Entry point access may be reorderd before
loading vector size, so the invalid entry point may be loaded when vector
iteration.
This patch fixes it by acquire-release pair. It can guarantee new size is
observed by reader after new entry stored by writer. And this is
implemented by one-way barrier instead of two-way memory fence.

Reviewed-by: Gavin Hu 
Reviewed-by: Lijian Zhang 
Signed-off-by: Yanqin Wei 
---
 lib/pvector.c | 18 +++++++++++-------
 lib/pvector.h | 12 ++++++++----
 2 files changed, 19 insertions(+), 11 deletions(-)

diff --git a/lib/pvector.c b/lib/pvector.c
index aaeee9214..f557b0559 100644
--- a/lib/pvector.c
+++ b/lib/pvector.c
@@ -33,7 +33,7 @@ pvector_impl_alloc(size_t size)
 struct pvector_impl *impl;
 
 impl = xmalloc(sizeof *impl + size * sizeof impl->vector[0]);
-impl->size = 0;
+atomic_init(&impl->size, 0);
 impl->allocated = size;
 
 return impl;
@@ -117,18 +117,22 @@ pvector_insert(struct pvector *pvec, void *ptr, int priority)
 {
 struct pvector_impl *temp = pvec->temp;
 struct pvector_impl *old = pvector_impl_get(pvec);
+size_t size;
 
 ovs_assert(ptr != NULL);
 
+/* There is no possible concurrent writer. Insertions must be protected
+ * by mutex or be always executed from the same thread */
+atomic_read_relaxed(&old->size, &size);
+
 /* Check if can add to the end without reallocation. */
-if (!temp && old->allocated > old->size &&
-(!old->size || priority <= old->vector[old->size - 1].priority)) {
-old->vector[old->size].ptr = ptr;
-old->vector[old->size].priority = priority;
+if (!temp && old->allocated > size &&
+(!size || priority <= old->vector[size - 1].priority)) {
+old->vector[size].ptr = ptr;
+old->vector[size].priority = priority;
 /* Size increment must not be visible to the readers before the new
  * entry is stored. */
-atomic_thread_fence(memory_order_release);
-++old->size;
+atomic_store_explicit(&old->size, size + 1, memory_order_release);
 } else {
 if (!temp) {
 temp = pvector_impl_dup(old);
diff --git a/lib/pvector.h b/lib/pvector.h
index b990ed9d5..55b725ba7 100644
--- a/lib/pvector.h
+++ b/lib/pvector.h
@@ -69,8 +69,8 @@ struct pvector_entry {
 };
 
 struct pvector_impl {
-size_t size;   /* Number of entries in the vector. */
-size_t allocated;  /* Number of allocated entries. */
+atomic_size_t size;   /* Number of entries in the vector. */
+size_t allocated; /* Number of allocated entries. */
 struct pvector_entry vector[];
 };
 
@@ -181,12 +181,16 @@ pvector_cursor_init(const struct pvector *pvec,
 {
 const struct pvector_impl *impl;
 struct pvector_cursor cursor;
+size_t size;
 
 impl = ovsrcu_get(struct pvector_impl *, &pvec->impl);
 
-ovs_prefetch_range(impl->vector, impl->size * sizeof impl->vector[0]);
+/* Use memory_order_acquire to ensure entry access can not be
+ * reordered to happen before size read */
+atomic_read_explicit(&impl->size, &size, memory_order_acquire);
+ovs_prefetch_range(impl->vector, size * sizeof impl->vector[0]);
 
-cursor.size = impl->size;
+cursor.size = size;
 cursor.vector = impl->vector;
 cursor.entry_idx = -1;
 
-- 
2.17.1



[ovs-dev] [PATCH v2] netdev: fix partial offloading test cases failure

2020-02-25 Thread Yanqin Wei
Some partial offloading test cases are failing inconsistently. The root
cause is that dummy netdev is assigned with "linux_tc" offloading API.
dpif-netdev - partial hw offload - dummy
dpif-netdev - partial hw offload - dummy-pmd
dpif-netdev - partial hw offload with packet modifications - dummy
dpif-netdev - partial hw offload with packet modifications - dummy-pmd

This patch fixes the issue by changing 'options:ifindex=1' to a large
value. It is a workaround that makes "linux_tc" flow API initialization
fail. All of the above cases pass consistently after applying this patch.

Suggested-by: Ilya Maximets 
Reviewed-by: Gavin Hu 
Reviewed-by: Lijian Zhang 
Signed-off-by: Yanqin Wei 
---
 tests/dpif-netdev.at | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tests/dpif-netdev.at b/tests/dpif-netdev.at
index 0aeb4e788..12e468744 100644
--- a/tests/dpif-netdev.at
+++ b/tests/dpif-netdev.at
@@ -371,7 +371,7 @@ m4_define([DPIF_NETDEV_FLOW_HW_OFFLOAD],
   [AT_SETUP([dpif-netdev - partial hw offload - $1])
OVS_VSWITCHD_START(
  [add-port br0 p1 -- \
-  set interface p1 type=$1 ofport_request=1 options:pstream=punix:$OVS_RUNDIR/p1.sock options:ifindex=1 -- \
+  set interface p1 type=$1 ofport_request=1 options:pstream=punix:$OVS_RUNDIR/p1.sock options:ifindex=1100 -- \
   set bridge br0 datapath-type=dummy \
  other-config:datapath-id=1234 fail-mode=secure], [], [],
   [m4_if([$1], [dummy-pmd], [--dummy-numa="0,0,0,0,1,1,1,1"], [])])
@@ -434,7 +434,7 @@ m4_define([DPIF_NETDEV_FLOW_HW_OFFLOAD_OFFSETS],
   [AT_SETUP([dpif-netdev - partial hw offload with packet modifications - $1])
OVS_VSWITCHD_START(
  [add-port br0 p1 -- \
-  set interface p1 type=$1 ofport_request=1 options:pcap=p1.pcap options:ifindex=1 -- \
+  set interface p1 type=$1 ofport_request=1 options:pcap=p1.pcap options:ifindex=1101 -- \
   set bridge br0 datapath-type=dummy \
  other-config:datapath-id=1234 fail-mode=secure], [], [],
   [m4_if([$1], [dummy-pmd], [--dummy-numa="0,0,0,0,1,1,1,1"], [])])
-- 
2.17.1



[ovs-dev] [PATCH v2] lib: use acquire-release semantics for pvector size

2020-02-25 Thread Yanqin Wei
Read/write concurrency of pvector library is implemented by a temp vector
and RCU protection. Considering performance reason, insertion does not
follow this scheme.
In insertion function, a thread fence ensures size incrementation is done
after new entry is stored. But there is no barrier in the iteration
function (pvector_cursor_init). Entry point access may be reorderd before
loading vector size, so the invalid entry point may be loaded when vector
iteration.
This patch fixes it by acquire-release pair. It can guarantee new size is
observed by reader after new entry stored by writer. And this is
implemented by one-way barrier instead of two-way memory fence.

Reviewed-by: Gavin Hu 
Reviewed-by: Lijian Zhang 
Signed-off-by: Yanqin Wei 
---
 lib/pvector.c | 16 +++++++++-------
 lib/pvector.h | 10 ++++++----
 2 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/lib/pvector.c b/lib/pvector.c
index aaeee9214..d367079fd 100644
--- a/lib/pvector.c
+++ b/lib/pvector.c
@@ -33,7 +33,7 @@ pvector_impl_alloc(size_t size)
 struct pvector_impl *impl;
 
 impl = xmalloc(sizeof *impl + size * sizeof impl->vector[0]);
-impl->size = 0;
+atomic_init(&impl->size, 0);
 impl->allocated = size;
 
 return impl;
@@ -117,18 +117,20 @@ pvector_insert(struct pvector *pvec, void *ptr, int priority)
 {
 struct pvector_impl *temp = pvec->temp;
 struct pvector_impl *old = pvector_impl_get(pvec);
+size_t size;
 
 ovs_assert(ptr != NULL);
 
+atomic_read_relaxed(&old->size, &size);
+
 /* Check if can add to the end without reallocation. */
-if (!temp && old->allocated > old->size &&
-(!old->size || priority <= old->vector[old->size - 1].priority)) {
-old->vector[old->size].ptr = ptr;
-old->vector[old->size].priority = priority;
+if (!temp && old->allocated > size &&
+(!size || priority <= old->vector[size - 1].priority)) {
+old->vector[size].ptr = ptr;
+old->vector[size].priority = priority;
 /* Size increment must not be visible to the readers before the new
  * entry is stored. */
-atomic_thread_fence(memory_order_release);
-++old->size;
+atomic_store_explicit(&old->size, size + 1, memory_order_release);
 } else {
 if (!temp) {
 temp = pvector_impl_dup(old);
diff --git a/lib/pvector.h b/lib/pvector.h
index b990ed9d5..c65276638 100644
--- a/lib/pvector.h
+++ b/lib/pvector.h
@@ -69,8 +69,8 @@ struct pvector_entry {
 };
 
 struct pvector_impl {
-size_t size;   /* Number of entries in the vector. */
-size_t allocated;  /* Number of allocated entries. */
+atomic_size_t size;  /* Number of entries in the vector. */
+size_t allocated; /* Number of allocated entries. */
 struct pvector_entry vector[];
 };
 
@@ -181,12 +181,14 @@ pvector_cursor_init(const struct pvector *pvec,
 {
 const struct pvector_impl *impl;
 struct pvector_cursor cursor;
+size_t size;
 
 impl = ovsrcu_get(struct pvector_impl *, &pvec->impl);
 
-ovs_prefetch_range(impl->vector, impl->size * sizeof impl->vector[0]);
+atomic_read_explicit(&impl->size, &size, memory_order_acquire);
+ovs_prefetch_range(impl->vector, size * sizeof impl->vector[0]);
 
-cursor.size = impl->size;
+cursor.size = size;
 cursor.vector = impl->vector;
 cursor.entry_idx = -1;
 
-- 
2.17.1



Re: [ovs-dev] [PATCH v1] netdev: fix partial offloading test cases failure

2020-02-25 Thread Yanqin Wei


> -Original Message-
> From: Ilya Maximets 
> Sent: Tuesday, February 25, 2020 6:31 PM
> To: Yanqin Wei ; Ilya Maximets
> ; d...@openvswitch.org
> Cc: nd ; Lijian Zhang ; Gavin Hu
> 
> Subject: Re: [ovs-dev] [PATCH v1] netdev: fix partial offloading test cases
> failure
> 
> On 2/25/20 10:59 AM, Yanqin Wei wrote:
> > Hi Ilya,
> >
> >
> >> -Original Message-
> >> From: Ilya Maximets 
> >> Sent: Tuesday, February 25, 2020 5:25 PM
> >> To: Yanqin Wei ; d...@openvswitch.org
> >> Cc: nd ; Lijian Zhang ; Gavin Hu
> >> ; i.maxim...@ovn.org
> >> Subject: Re: [ovs-dev] [PATCH v1] netdev: fix partial offloading test
> >> cases failure
> >>
> >> On 2/25/20 2:46 AM, Yanqin Wei wrote:
> >>> Some partial offloading test cases are failing inconsistently. The
> >>> root cause is that dummy netdev is assigned with incorrect
> >>> offloading flow
> >> API.
> >>> dpif-netdev - partial hw offload - dummy dpif-netdev - partial hw
> >>> offload - dummy-pmd dpif-netdev - partial hw offload with packet
> >>> modifications - dummy dpif-netdev - partial hw offload with packet
> >>> modifications - dummy-pmd
> >>>
> >>> This patch fixes this issue by adding a specified flow api type in netdev.
> >>> Dummy netdev class can specify flow type in construct function. All
> >>> of the above cases can pass consistently.
> >>
> >> Could you, please, clarify which offload provider is assigned to
> >> dummy ports in your case and why this happens?
> >>
> >> In general, we need to fix offload providers to only accept ports
> >> that are usable for them instead of hardcoding the type.
> >>
> > [Yanqin] Sometimes " linux_tc" is assigned to dummy netdev by mistake.
> Currently, dpif traverses all offloading provider and select the first 
> provider
> with successful initialization. It makes the result uncertain and random.
> > I think the problem is who should take responsible for offloading
> > providers assignment. Both dpif and netdev class  impact the provider
> selection Firstly, netdev may specify offloading provider if the option is 
> unique.
> Secondly, dpif should determine if netdev has more than one option (instead
> of traverse all providers).
> > This patch implement the first one. The second one could be implemented in
> netdev_init_flow_api by means of dpif class and netdev class.
> > What do you think of this?
> 
> I think that current implementation of dynamic flow API assignment is
> somewhat OK and there is no significant issues that we should address.
> 
> For this particular issue, I think we just need to be a little bit more 
> careful with
> tests.  I believe that removing of 'options:ifindex=1' from the test along 
> with
> applying of the following patch:
> https://patchwork.ozlabs.org/patch/1226013/
> should fix the occasional assignment of linux_tc flow API for dummy port.
> 
> For now, as a workaround, to fix this particular case we could change
> 'options:ifindex=1' to some fairly big value instead of using '1', which is 
> likely
> used by some of the existing system interfaces.
> 
[Yanqin] Using ifindex to determine the offloading provider is a little
tricky. If more netdevs need to support offloading (ones that have a
valid ifindex but do not support tc offloading), the logic will need to
change.
 
But as you said, there is no significant issue so far. We could use the 
workaround you suggest and refine the provider assignment when required. I will 
update v2 based on this workaround.

> Best regards, Ilya Maximets.


Re: [ovs-dev] [PATCH v1] lib: use acquire-release semantics for pvector size

2020-02-25 Thread Yanqin Wei
Hi Ilya,

> -Original Message-
> From: Ilya Maximets 
> Sent: Tuesday, February 25, 2020 5:54 PM
> To: Yanqin Wei ; d...@openvswitch.org
> Cc: Lijian Zhang ; Gavin Hu ; nd
> ; i.maxim...@ovn.org; Ben Pfaff 
> Subject: Re: [ovs-dev] [PATCH v1] lib: use acquire-release semantics for 
> pvector
> size
> 
> On 2/25/20 2:52 AM, Yanqin Wei wrote:
> > Read/write concurrency of pvector library is implemented by a temp
> > vector and RCU protection. Considering performance reason, insertion
> > does not follow this scheme.
> > In insertion function, a thread fence ensures size incrementation is
> > done after new entry is stored. But there is no barrier in the
> > iteration fuction(pvector_cursor_init). Entry point access may be
> > reorderd before loading vector size, so the invalid entry point may be
> > loaded when vector iteration.
> > This patch fixes it by acquire-release pair. It can guarantee new size
> > is observed by reader after new entry stored by writer. And this is
> > implemented by one-way barrier instead of two-way memory fence.
> >
> > Reviewed-by: Gavin Hu 
> > Reviewed-by: Lijian Zhang 
> > Signed-off-by: Yanqin Wei 
> > ---
> >  lib/pvector.c | 14 +++++++-------
> >  lib/pvector.h | 12 +++++++-----
> >  2 files changed, 14 insertions(+), 12 deletions(-)
> >
> > diff --git a/lib/pvector.c b/lib/pvector.c
> > index aaeee9214..12c599c97 100644
> > --- a/lib/pvector.c
> > +++ b/lib/pvector.c
> > @@ -33,7 +33,7 @@ pvector_impl_alloc(size_t size)
> >  struct pvector_impl *impl;
> >
> >  impl = xmalloc(sizeof *impl + size * sizeof impl->vector[0]);
> > -impl->size = 0;
> > +atomic_init(&impl->size, 0);
> >  impl->allocated = size;
> >
> >  return impl;
> > @@ -117,18 +117,18 @@ pvector_insert(struct pvector *pvec, void *ptr, int priority)
> >  {
> >  struct pvector_impl *temp = pvec->temp;
> >  struct pvector_impl *old = pvector_impl_get(pvec);
> > +size_t size = old->size;
> 
> Why this is not an atomic read?  I understand that insertions are not thread-
> safe and must be protected by the mutex or be always executed from the same
> thread.
> However, if we're choosing to read this variable non-atomically, we could
> avoid introduction of additional variable here and at the same time avoid
> modification of most of the code lines in this function.  A comment, why we're
> reading it non-atomically might be good anyway since we should be consistent
> and use atomic operations for variables marked as atomic as possible.
> 
[Yanqin]  It makes sense for me.  I will update v2 for all of your comments. 
Thanks.

> >
> >  ovs_assert(ptr != NULL);
> >
> >  /* Check if can add to the end without reallocation. */
> > -if (!temp && old->allocated > old->size &&
> > -(!old->size || priority <= old->vector[old->size - 1].priority)) {
> > -old->vector[old->size].ptr = ptr;
> > -old->vector[old->size].priority = priority;
> > +if (!temp && old->allocated > size &&
> > +(!size || priority <= old->vector[size - 1].priority)) {
> > +old->vector[size].ptr = ptr;
> > +old->vector[size].priority = priority;
> >  /* Size increment must not be visible to the readers before the new
> >   * entry is stored. */
> > -atomic_thread_fence(memory_order_release);
> > -++old->size;
> > +atomic_store_explicit(&old->size, size + 1,
> > + memory_order_release);
> >  } else {
> >  if (!temp) {
> >  temp = pvector_impl_dup(old);
> > diff --git a/lib/pvector.h b/lib/pvector.h
> > index b990ed9d5..430bdf746 100644
> > --- a/lib/pvector.h
> > +++ b/lib/pvector.h
> > @@ -69,8 +69,8 @@ struct pvector_entry {  };
> >
> >  struct pvector_impl {
> > -size_t size;   /* Number of entries in the vector. */
> > -size_t allocated;  /* Number of allocated entries. */
> > +ATOMIC(size_t) size;  /* Number of entries in the vector. */
> 
> atomic_size_t
> 
> > +size_t allocated; /* Number of allocated entries. */
> >  struct pvector_entry vector[];
> >  };
> >
> > @@ -172,7 +172,7 @@ static inline void pvector_cursor_lookahead(const
> struct pvector_cursor *,
> >  #define PVECTOR_CURSOR_FOR_EACH_CONTINUE(PTR, CURSOR)   \
> >  for (; ((PTR) = pvector_cursor_next(CURSOR, INT_MIN, 0, 0)) !=
> > NULL; )
> >

Re: [ovs-dev] [PATCH v1] netdev: fix partial offloading test cases failure

2020-02-25 Thread Yanqin Wei
Hi Ilya,


> -Original Message-
> From: Ilya Maximets 
> Sent: Tuesday, February 25, 2020 5:25 PM
> To: Yanqin Wei ; d...@openvswitch.org
> Cc: nd ; Lijian Zhang ; Gavin Hu
> ; i.maxim...@ovn.org
> Subject: Re: [ovs-dev] [PATCH v1] netdev: fix partial offloading test cases
> failure
> 
> On 2/25/20 2:46 AM, Yanqin Wei wrote:
> > Some partial offloading test cases are failing inconsistently. The
> > root cause is that dummy netdev is assigned with incorrect offloading flow
> API.
> > dpif-netdev - partial hw offload - dummy dpif-netdev - partial hw
> > offload - dummy-pmd dpif-netdev - partial hw offload with packet
> > modifications - dummy dpif-netdev - partial hw offload with packet
> > modifications - dummy-pmd
> >
> > This patch fixes this issue by adding a specified flow api type in netdev.
> > Dummy netdev class can specify flow type in construct function. All of
> > the above cases can pass consistently.
> 
> Could you, please, clarify which offload provider is assigned to dummy ports 
> in
> your case and why this happens?
> 
> In general, we need to fix offload providers to only accept ports that are 
> usable
> for them instead of hardcoding the type.
> 
[Yanqin] Sometimes "linux_tc" is assigned to the dummy netdev by mistake.
Currently, dpif traverses all offloading providers and selects the first
provider that initializes successfully. This makes the result uncertain
and random.
I think the question is who should be responsible for offloading provider
assignment. Both the dpif and the netdev class impact the provider
selection. Firstly, a netdev may specify its offloading provider if the
option is unique. Secondly, dpif should determine whether a netdev has
more than one option (instead of traversing all providers).
This patch implements the first one. The second one could be implemented
in netdev_init_flow_api by means of the dpif class and netdev class.
What do you think of this?
Best Regards,
Wei Yanqin
> Best regards, Ilya Maximets.


[ovs-dev] [PATCH v1] lib: use acquire-release semantics for pvector size

2020-02-24 Thread Yanqin Wei
Read/write concurrency of pvector library is implemented by a temp vector
and RCU protection. Considering performance reason, insertion does not
follow this scheme.
In insertion function, a thread fence ensures size incrementation is done
after new entry is stored. But there is no barrier in the iteration
function (pvector_cursor_init). Entry point access may be reorderd before
loading vector size, so the invalid entry point may be loaded when vector
iteration.
This patch fixes it by acquire-release pair. It can guarantee new size is
observed by reader after new entry stored by writer. And this is
implemented by one-way barrier instead of two-way memory fence.

Reviewed-by: Gavin Hu 
Reviewed-by: Lijian Zhang 
Signed-off-by: Yanqin Wei 
---
 lib/pvector.c | 14 +++++++-------
 lib/pvector.h | 12 +++++++-----
 2 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/lib/pvector.c b/lib/pvector.c
index aaeee9214..12c599c97 100644
--- a/lib/pvector.c
+++ b/lib/pvector.c
@@ -33,7 +33,7 @@ pvector_impl_alloc(size_t size)
 struct pvector_impl *impl;
 
 impl = xmalloc(sizeof *impl + size * sizeof impl->vector[0]);
-impl->size = 0;
+atomic_init(&impl->size, 0);
 impl->allocated = size;
 
 return impl;
@@ -117,18 +117,18 @@ pvector_insert(struct pvector *pvec, void *ptr, int priority)
 {
 struct pvector_impl *temp = pvec->temp;
 struct pvector_impl *old = pvector_impl_get(pvec);
+size_t size = old->size;
 
 ovs_assert(ptr != NULL);
 
 /* Check if can add to the end without reallocation. */
-if (!temp && old->allocated > old->size &&
-(!old->size || priority <= old->vector[old->size - 1].priority)) {
-old->vector[old->size].ptr = ptr;
-old->vector[old->size].priority = priority;
+if (!temp && old->allocated > size &&
+(!size || priority <= old->vector[size - 1].priority)) {
+old->vector[size].ptr = ptr;
+old->vector[size].priority = priority;
 /* Size increment must not be visible to the readers before the new
  * entry is stored. */
-atomic_thread_fence(memory_order_release);
-++old->size;
+atomic_store_explicit(&old->size, size + 1, memory_order_release);
 } else {
 if (!temp) {
 temp = pvector_impl_dup(old);
diff --git a/lib/pvector.h b/lib/pvector.h
index b990ed9d5..430bdf746 100644
--- a/lib/pvector.h
+++ b/lib/pvector.h
@@ -69,8 +69,8 @@ struct pvector_entry {
 };
 
 struct pvector_impl {
-size_t size;   /* Number of entries in the vector. */
-size_t allocated;  /* Number of allocated entries. */
+ATOMIC(size_t) size;  /* Number of entries in the vector. */
+size_t allocated; /* Number of allocated entries. */
 struct pvector_entry vector[];
 };
 
@@ -172,7 +172,7 @@ static inline void pvector_cursor_lookahead(const struct pvector_cursor *,
 #define PVECTOR_CURSOR_FOR_EACH_CONTINUE(PTR, CURSOR)   \
 for (; ((PTR) = pvector_cursor_next(CURSOR, INT_MIN, 0, 0)) != NULL; )
 
-
+
 /* Inline implementations. */
 
 static inline struct pvector_cursor
@@ -181,12 +181,14 @@ pvector_cursor_init(const struct pvector *pvec,
 {
 const struct pvector_impl *impl;
 struct pvector_cursor cursor;
+size_t size;
 
 impl = ovsrcu_get(struct pvector_impl *, &pvec->impl);
 
-ovs_prefetch_range(impl->vector, impl->size * sizeof impl->vector[0]);
+atomic_read_explicit(&impl->size, &size, memory_order_acquire);
+ovs_prefetch_range(impl->vector, size * sizeof impl->vector[0]);
 
-cursor.size = impl->size;
+cursor.size = size;
 cursor.vector = impl->vector;
 cursor.entry_idx = -1;
 
-- 
2.17.1



[ovs-dev] [PATCH v1] netdev: fix partial offloading test cases failure

2020-02-24 Thread Yanqin Wei
Some partial offloading test cases are failing inconsistently. The root
cause is that dummy netdev is assigned with incorrect offloading flow API.
dpif-netdev - partial hw offload - dummy
dpif-netdev - partial hw offload - dummy-pmd
dpif-netdev - partial hw offload with packet modifications - dummy
dpif-netdev - partial hw offload with packet modifications - dummy-pmd

This patch fixes this issue by adding a specified flow api type in netdev.
Dummy netdev class can specify flow type in construct function. All of the
above cases can pass consistently.

Reviewed-by: Gavin Hu 
Reviewed-by: Lijian Zhang 
Signed-off-by: Yanqin Wei 
---
 lib/netdev-dummy.c    |  5 ++++-
 lib/netdev-offload.c  | 11 +++++++++++
 lib/netdev-provider.h |  1 +
 lib/netdev.c          |  1 +
 4 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/lib/netdev-dummy.c b/lib/netdev-dummy.c
index 71df29184..252b9d802 100644
--- a/lib/netdev-dummy.c
+++ b/lib/netdev-dummy.c
@@ -50,6 +50,8 @@ VLOG_DEFINE_THIS_MODULE(netdev_dummy);
 
 #define C_STATS_SIZE 2
 
+#define DUMMY_OFFLOAD_TYPE "dummy"
+
 struct reconnect;
 
 struct dummy_packet_stream {
@@ -709,6 +711,7 @@ netdev_dummy_construct(struct netdev *netdev_)
 
 dummy_packet_conn_init(&netdev->conn);
 
+netdev_->flow_api_type = DUMMY_OFFLOAD_TYPE;
 ovs_list_init(&netdev->rxes);
 hmap_init(&netdev->offloaded_flows);
 ovs_mutex_unlock(&netdev->mutex);
@@ -1588,7 +1591,7 @@ netdev_dummy_offloads_init_flow_api(struct netdev *netdev)
 }
 
 static const struct netdev_flow_api netdev_offload_dummy = {
-.type = "dummy",
+.type = DUMMY_OFFLOAD_TYPE,
 .flow_put = netdev_dummy_flow_put,
 .flow_del = netdev_dummy_flow_del,
 .init_flow_api = netdev_dummy_offloads_init_flow_api,
diff --git a/lib/netdev-offload.c b/lib/netdev-offload.c
index 32eab5910..c4bc71618 100644
--- a/lib/netdev-offload.c
+++ b/lib/netdev-offload.c
@@ -173,6 +173,17 @@ netdev_assign_flow_api(struct netdev *netdev)
 {
 struct netdev_registered_flow_api *rfa;
 
+if (netdev->flow_api_type &&
+(rfa = netdev_lookup_flow_api(netdev->flow_api_type))) {
+if (!rfa->flow_api->init_flow_api(netdev)) {
+ovs_refcount_ref(&rfa->refcnt);
+ovsrcu_set(&netdev->flow_api, rfa->flow_api);
+VLOG_INFO("%s: Specified flow API '%s'.",
+  netdev_get_name(netdev), rfa->flow_api->type);
+return 0;
+}
+}
+
 CMAP_FOR_EACH (rfa, cmap_node, &netdev_flow_apis) {
 if (!rfa->flow_api->init_flow_api(netdev)) {
 ovs_refcount_ref(&rfa->refcnt);
diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
index 22f4cde33..5942166a8 100644
--- a/lib/netdev-provider.h
+++ b/lib/netdev-provider.h
@@ -93,6 +93,7 @@ struct netdev {
 struct ovs_list saved_flags_list; /* Contains "struct netdev_saved_flags". */
 
 /* Functions to control flow offloading. */
+char *flow_api_type;
 OVSRCU_TYPE(const struct netdev_flow_api *) flow_api;
 struct netdev_hw_info hw_info; /* offload-capable netdev info */
 };
diff --git a/lib/netdev.c b/lib/netdev.c
index f95b19af4..5c799f854 100644
--- a/lib/netdev.c
+++ b/lib/netdev.c
@@ -423,6 +423,7 @@ netdev_open(const char *name, const char *type, struct netdev **netdevp)
 netdev->reconfigure_seq = seq_create();
 netdev->last_reconfigure_seq =
 seq_read(netdev->reconfigure_seq);
+netdev->flow_api_type = NULL;
 ovsrcu_set(&netdev->flow_api, NULL);
 netdev->hw_info.oor = false;
 netdev->node = shash_add(&netdev_shash, name, netdev);
-- 
2.17.1
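
A hypothetical example of what this enables for other netdev classes,
mirroring the netdev-dummy change above ('foo' is made up, not a class
in the tree, and registration of the flow API is elided):

    #include "netdev-provider.h"

    #define FOO_OFFLOAD_TYPE "foo"

    /* With flow_api_type set, netdev_assign_flow_api() tries the named
     * provider first instead of probing every registered one. */
    static int
    netdev_foo_construct(struct netdev *netdev_)
    {
        /* ... class-specific initialization elided ... */
        netdev_->flow_api_type = FOO_OFFLOAD_TYPE;
        return 0;
    }

    static const struct netdev_flow_api netdev_offload_foo = {
        .type = FOO_OFFLOAD_TYPE,
        /* .flow_put, .flow_del, .init_flow_api as appropriate. */
    };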



Re: [ovs-dev] OVS release - 2.13

2020-02-12 Thread Yanqin Wei
Hi Ilya,

Some users need a customized version based on an official release, and CI
can help them verify their modifications conveniently.
But this patch has not enabled the test-suite jobs for Arm (so it cannot
help users run functional tests on Arm), so it is OK for us to apply it
after the 2.13 release.

Anyway, we hope it can be enabled soon to prevent new issues from being
introduced during development.

Best Regards,
Wei Yanqin

> -Original Message-
> From: Ben Pfaff 
> Sent: Thursday, February 13, 2020 12:23 AM
> To: Ilya Maximets 
> Cc: Yanqin Wei ; d...@openvswitch.org; nd
> ; Malvika Gupta ; Lance Yang
> 
> Subject: Re: [ovs-dev] OVS release - 2.13
> 
> On Wed, Feb 12, 2020 at 03:45:44PM +0100, Ilya Maximets wrote:
> > On 2/12/20 4:41 AM, Yanqin Wei wrote:
> > > Hi Ben,
> > >
> > > There is a patch to enable travis CI on Arm. Is it possible to be put 
> > > into 2.13?
> > > https://patchwork.ozlabs.org/patch/1204923/
> >
> > Hi.
> >
> > Is it a strong requirement for you to have this in 2.13 release?
> >
> > I mean, travis scripts are not part of a source code and will not be
> > normally used after downloading release source archive.
> > So, I think (since this is CI only change) it could be applied to 2.13
> > branch even after the official release if needed, because it will not
> > affect anyone except our CI.
> >
> > Ben, what do you think?
> 
> Seems right to me.


Re: [ovs-dev] OVS release - 2.13

2020-02-11 Thread Yanqin Wei
Hi Ben,

There is a patch to enable travis CI on Arm. Is it possible to be put into 2.13?
https://patchwork.ozlabs.org/patch/1204923/


Best Regards,
Wei Yanqin


> -Original Message-
> From: dev  On Behalf Of Ben Pfaff
> Sent: Tuesday, February 11, 2020 1:44 AM
> To: d...@openvswitch.org
> Subject: [ovs-dev] OVS release - 2.13
> 
> Hi everyone!  The 2.13 release should, by our calendar, happen around Feb. 15.
> I would like to do it Feb. 14, if possible, because I'm planning to take next 
> week
> off.
> 
> Does anyone have anything they'd like to get into 2.13 or know a reason to
> hold off?
> 
> Thanks,
> 
> Ben.


Re: [ovs-dev] [PATCH v2 3/3] travis: Enable OvS Travis CI for arm

2020-02-05 Thread Yanqin Wei
Hi Ilya, 

Travis on Arm looks stable now; we have not observed any issue for more
than two weeks. And Travis also made some adjustments on their side; you
can see their reply here: https://travis-ci.community/t/segfaults-in-arm64-environment/5617/13
Could you reconsider this patch?

Best Regards,
Wei Yanqin

> -Original Message-
> From: Lance Yang 
> Sent: Tuesday, January 21, 2020 9:06 AM
> To: Ilya Maximets ; ovs-dev@openvswitch.org
> Cc: b...@ovn.org; Yanqin Wei ; dwil...@us.ibm.com;
> Gavin Hu ; Ruifeng Wang ;
> Jieqiang Wang ; Malvika Gupta
> ; nd 
> Subject: RE: [ovs-dev] [PATCH v2 3/3] travis: Enable OvS Travis CI for arm
> 
> 
> > -Original Message-
> > From: Ilya Maximets 
> > Sent: Saturday, December 7, 2019 12:39 AM
> > To: Lance Yang (Arm Technology China) ; ovs-
> > d...@openvswitch.org
> > Cc: i.maxim...@ovn.org; b...@ovn.org; Yanqin Wei (Arm Technology China)
> > ; dwil...@us.ibm.com; Gavin Hu (Arm Technology
> > China) ; Ruifeng Wang (Arm Technology China)
> > ; Jieqiang Wang (Arm Technology China)
> > ; Malvika Gupta ; nd
> > 
> > Subject: Re: [ovs-dev] [PATCH v2 3/3] travis: Enable OvS Travis CI for
> > arm
> >
> > On 06.12.2019 04:26, Lance Yang wrote:
> > > Enable part of travis jobs with gcc compiler for arm64 architecture
> > >
> > > 1. Add arm jobs into the matrix in .travis.yml configuration file 2.
> > > To enable OVS-DPDK jobs, set the build target according to different
> > > CPU architectures 3. Temporarily disable sparse checker because of
> > > static code checking failure on arm64
> > >
> > > Successful travis build jobs report:
> > > https://travis-ci.org/yzyuestc/ovs/builds/621037339
> > >
> > > Reviewed-by: Yanqin Wei 
> > > Reviewed-by: Jieqiang Wang 
> > > Reviewed-by: Gavin Hu 
> > > Signed-off-by: Lance Yang 
> > > ---
> >
> > Compiler crashed while building DPDK:
> >
> > /home/travis/build/ovsrobot/ovs/dpdk-dir/drivers/net/ixgbe/ixgbe_pf.c:
> > In function
> > ‘ixgbe_pf_host_configure’:
> > /home/travis/build/ovsrobot/ovs/dpdk-dir/drivers/net/ixgbe/ixgbe_pf.c:
> > 297:1: internal compiler error: Segmentation fault
> >
> > https://travis-ci.org/ovsrobot/ovs/jobs/621434216#L1999
> >
> > This is not good.
> > Need to check how frequently this happens.
> >
> > Best regards, Ilya Maximets.
> [Lance]
> Hi Ilya,
> 
> After you give us the feedback about the segmentation fault issue, we keep
> running the travis CI to observe the frequency. We run those arm jobs on a
> regular basis and clear cache on every single build. We have run the build at
> least 2 times a day for more than a month.
> 
> The good thing is that we haven't reproduce the segfault issue. Job reports 
> are
> available at: https://travis-ci.org/yzyuestc/ovs/builds/639064033
> 
> Travis CI community also made some adjustment on their side, you can see
> their reply here: https://travis-ci.community/t/segfaults-in-arm64-
> environment/5617/13
> 
> I think the segfault issue may occur by chance. At least, it is not a 
> frequent one,
> which is unlikely to cause issues to OVS developers. Could you please check it
> again?
> 
> Best Regards,
> Lance



Re: [ovs-dev] [PATCH v2 2/3] travis: Move x86-only addon packages to linux-prepare.sh

2020-02-05 Thread Yanqin Wei
Hi Ilya, 

Travis on Arm looks stable now; we have not observed any issue for more
than two weeks. And Travis also made some adjustments on their side; you
can see their reply here: https://travis-ci.community/t/segfaults-in-arm64-environment/5617/13
Could you reconsider this patch?

Best Regards,
Wei Yanqin

> -Original Message-
> From: Lance Yang (Arm Technology China) 
> Sent: Tuesday, December 17, 2019 9:54 AM
> To: Ilya Maximets ; dwilder 
> Cc: ovs-dev@openvswitch.org; b...@ovn.org; Yanqin Wei (Arm Technology
> China) ; Gavin Hu (Arm Technology China)
> ; Ruifeng Wang (Arm Technology China)
> ; Jieqiang Wang (Arm Technology China)
> ; Malvika Gupta ; nd
> 
> Subject: RE: [ovs-dev] [PATCH v2 2/3] travis: Move x86-only addon packages to
> linux-prepare.sh
> 
> 
> 
> > -Original Message-
> > From: Ilya Maximets 
> > Sent: Saturday, December 14, 2019 1:55 AM
> > To: dwilder ; Lance Yang (Arm Technology China)
> > 
> > Cc: ovs-dev@openvswitch.org; i.maxim...@ovn.org; b...@ovn.org; Yanqin
> > Wei (Arm Technology China) ; Gavin Hu (Arm
> > Technology China) ; Ruifeng Wang (Arm Technology
> > China) ; Jieqiang Wang (Arm Technology China)
> > ; Malvika Gupta ; nd
> > 
> > Subject: Re: [ovs-dev] [PATCH v2 2/3] travis: Move x86-only addon
> > packages to linux- prepare.sh
> >
> > On 06.12.2019 23:32, dwilder wrote:
> > > On 2019-12-05 19:26, Lance Yang wrote:
> > >> To enable multiple CPU architectures support, it is necessary to
> > >> move the x86-only addon packages from .travis.yml file. Otherwise,
> > >> the x86-only addon packages will break the builds on some other CPU
> architectures.
> > >>
> > >> Reviewed-by: Yanqin Wei 
> > >> Reviewed-by: Malvika Gupta 
> > >> Reviewed-by: Gavin Hu 
> > >> Reviewed-by: Ruifeng Wang 
> > >> Signed-off-by: Lance Yang 
> > >> ---
> > >>  .travis.yml  | 2 --
> > >>  .travis/linux-prepare.sh | 3 ++-
> > >>  2 files changed, 2 insertions(+), 3 deletions(-)
> > >>
> > >> diff --git a/.travis.yml b/.travis.yml index 482efd2..2dc4d43
> > >> 100644
> > >> --- a/.travis.yml
> > >> +++ b/.travis.yml
> > >> @@ -14,7 +14,6 @@ addons:
> > >>apt:
> > >>  packages:
> > >>- bc
> > >> -  - gcc-multilib
> > >>- libssl-dev
> > >>- llvm-dev
> > >>- libjemalloc1
> > >> @@ -26,7 +25,6 @@ addons:
> > >>- libelf-dev
> > >>- selinux-policy-dev
> > >>- libunbound-dev
> > >> -  - libunbound-dev:i386
> > >>- libunwind-dev
> > >>
> > >>  before_install: ./.travis/${TRAVIS_OS_NAME}-prepare.sh
> > >> diff --git a/.travis/linux-prepare.sh b/.travis/linux-prepare.sh
> > >> index 9e3ac0d..6421066 100755
> > >> --- a/.travis/linux-prepare.sh
> > >> +++ b/.travis/linux-prepare.sh
> > >> @@ -18,7 +18,8 @@ pip install --user --upgrade docutils  if [
> > >> "$M32" ]; then
> > >>  # 32-bit and 64-bit libunwind can not be installed at the same time.
> > >>  # This will remove the 64-bit libunwind and install 32-bit version.
> > >> -sudo apt-get install -y libunwind-dev:i386
> > >> +sudo apt-get install -y \
> > >> +libunwind-dev:i386 libunbound-dev:i386 gcc-multilib
> > >>  fi
> > >>
> > >>  # IPv6 is supported by kernel but disabled in TravisCI images:
> > >
> > > LGTM:
> > > With this patch applied ppc64le simply needs to be include into the 
> > > matrix.
> > > I will submit an updated ppc64le patch to be layered on top of this one.
> > >
> > > Acked-by: David Wilder 
> >
> > Thanks.  First two patches of this series are good even without
> > multiarch support so I went ahead and applied them (with minor
> > visual/spelling changes) to master.  We'll need to think more about
> > actual enabling of ppc/arm since they are not that stable as we would want
> them to be.
> >
> > Best regards, Ilya Maximets.
> [Lance]
> 
> Hi Ilya,
> 
> Thank you for all the comments. I am glad to that the patches can be merged.
> 
> We reported the segment fault issue to Travis CI community. You can see the
> threads for segment faults in arm64/ppc64 environment:
> https://travis-ci.community/t/segfaults-in-arm64-environment/5617/8
> https://travis-ci.community/t/arm64-ppc64le-segfaults/6158
> 
> There might be a period of time for the Travis CI community to investigate. We
> will keep you updated about the latest process.
> 
> Best regards, Lance



Re: [ovs-dev] [PATCH] dpif-netdev-perf: aarch64 support for accurate timing update of TSC cycle counter

2019-11-26 Thread Yanqin Wei (Arm Technology China)
Hi Ilya,

No, we didn't test this patch with OVS-AF_XDP, but we made a black build
to enable this in OVS-DPDK and tested it.
Currently DPDK-AF_XDP has been tested on the latest kernel (not yet
released), so I think OVS-AF_XDP is close to being supported on aarch64.

Furthermore, I found a document about the userspace-only mode of Open
vSwitch without DPDK:
http://docs.openvswitch.org/en/latest/intro/install/userspace/#using-the-userspace-datapath-with-ovs-vswitchd
So it seems the userspace datapath should be decoupled from networking
I/O; users can even customize this. Does it mean we need to implement all
the used DPDK APIs inside OVS?

Best Regards,
Wei Yanqin 


> -Original Message-
> From: dev  On Behalf Of Ilya Maximets
> Sent: Tuesday, November 26, 2019 11:38 PM
> To: Malvika Gupta ; d...@openvswitch.org
> Cc: nd ; Honnappa Nagarahalli
> 
> Subject: Re: [ovs-dev] [PATCH] dpif-netdev-perf: aarch64 support for accurate
> timing update of TSC cycle counter
> 
> On 13.11.2019 18:01, Malvika Gupta wrote:
> > The accurate timing implementation in this patch gets the wall clock
> > counter via
> > cntvct_el0 register access. This call is portable to all aarch64
> > architectures and has been verified on an 64-bit arm server.
> >
> > Suggested-by: Yanqin Wei 
> > Signed-off-by: Malvika Gupta 
> > ---
> 
> Thanks for the patch!
> 
> Are you trying to use AF_XDP on aarch64?  Asking because it's the only real
> scenario where this patch can be useful.
> 
> For the patch subject, I'd suggest to shorten it a little.
> 'timing', 'TSC' and 'cycle counter' are kind of synonyms here and doesn't make
> the sentence any clear.  Suggesting something like this:
> "dpif-netdev-perf: Accurate cycle counter update on aarch64."
> 
> What do you think?
> 
> One more comment inline.
> 
> >  lib/dpif-netdev-perf.h | 5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
> > index ce369375b..4ea7cc355 100644
> > --- a/lib/dpif-netdev-perf.h
> > +++ b/lib/dpif-netdev-perf.h
> > @@ -220,6 +220,11 @@ cycles_counter_update(struct pmd_perf_stats *s)
> >  asm volatile("rdtsc" : "=a" (l), "=d" (h));
> >
> >  return s->last_tsc = ((uint64_t) h << 32) | l;
> > +#elif !defined(_MSC_VER) && defined(__aarch64__)
> > +uint64_t tsc;
> > +asm volatile("mrs %0, cntvct_el0" : "=r" (tsc));
> > +
> > +return s->last_tsc = tsc;
> 
> I think we could drop the 'tsc' local variable here and write directly to
> s->last_tsc.  Less number of variables and operations.
> 
> Best regards, Ilya Maximets.
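
A standalone sketch of the simplification Ilya suggests above, with
pmd_perf_stats cut down to the single field used here (the real structure
lives in lib/dpif-netdev-perf.h):

    #include <stdint.h>

    struct pmd_perf_stats {
        uint64_t last_tsc;        /* Only field needed for this sketch. */
    };

    static inline uint64_t
    cycles_counter_update(struct pmd_perf_stats *s)
    {
    #if defined(__aarch64__) && !defined(_MSC_VER)
        /* Read the generic timer's virtual count straight into
         * last_tsc, avoiding the intermediate 'tsc' local. */
        asm volatile("mrs %0, cntvct_el0" : "=r" (s->last_tsc));
    #endif
        return s->last_tsc;
    }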


[ovs-dev] [PATCH v2] netdev: use acquire-release semantics for change_seq in netdev

2019-11-25 Thread Yanqin Wei
"rxq_enabled" of netdev is writen in the vhost thread and read by pmd
thread once it observes 'change_seq' is updated. This patch is to keep
order on aarch64 or other weak memory model CPU to ensure 'rxq_enabled' is
observed before 'change_seq'.

Reviewed-by: Gavin Hu 
Signed-off-by: Yanqin Wei 
---
 lib/netdev-provider.h | 13 +++++++++----
 lib/netdev.c          |  7 ++++++-
 2 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
index 1e5a40c89..f109c4e66 100644
--- a/lib/netdev-provider.h
+++ b/lib/netdev-provider.h
@@ -63,7 +63,7 @@ struct netdev {
  *
  * Minimally, the sequence number is required to change whenever
  * 'netdev''s flags, features, ethernet address, or carrier changes. */
-uint64_t change_seq;
+atomic_uint64_t change_seq;
 
 /* A netdev provider might be unable to change some of the device's
  * parameter (n_rxq, mtu) when the device is in use.  In this case
@@ -91,12 +91,17 @@ struct netdev {
 static inline void
 netdev_change_seq_changed(const struct netdev *netdev_)
 {
+uint64_t change_seq;
 struct netdev *netdev = CONST_CAST(struct netdev *, netdev_);
 seq_change(connectivity_seq_get());
-netdev->change_seq++;
-if (!netdev->change_seq) {
-netdev->change_seq++;
+
+atomic_read_relaxed(&netdev->change_seq, &change_seq);
+change_seq++;
+if (OVS_UNLIKELY(!change_seq)) {
+change_seq++;
 }
+atomic_store_explicit(&netdev->change_seq, change_seq,
+  memory_order_release);
 }
 
 static inline void
diff --git a/lib/netdev.c b/lib/netdev.c
index af8f8560d..405c98c68 100644
--- a/lib/netdev.c
+++ b/lib/netdev.c
@@ -2039,7 +2039,12 @@ restore_all_flags(void *aux OVS_UNUSED)
 uint64_t
 netdev_get_change_seq(const struct netdev *netdev)
 {
-return netdev->change_seq;
+uint64_t change_seq;
+
+atomic_read_explicit(&CONST_CAST(struct netdev *, netdev)->change_seq,
+&change_seq, memory_order_acquire);
+
+return change_seq;
 }
 
 #ifndef _WIN32
-- 
2.17.1
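
The handshake this patch fixes can be shown with a minimal C11 sketch.
'struct mini_netdev' is a hypothetical reduction of struct netdev to the
two fields involved, not the real OVS code:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    struct mini_netdev {
        bool rxq_enabled;             /* Written by the vhost thread. */
        _Atomic uint64_t change_seq;  /* Publishes changes to PMDs.   */
    };

    /* vhost thread: flip the queue state, then bump the sequence with
     * release semantics so the new rxq_enabled is made visible first. */
    static void
    vhost_set_rxq(struct mini_netdev *dev, bool enabled)
    {
        uint64_t seq = atomic_load_explicit(&dev->change_seq,
                                            memory_order_relaxed);
        dev->rxq_enabled = enabled;
        atomic_store_explicit(&dev->change_seq, seq + 1,
                              memory_order_release);
    }

    /* PMD thread: if the acquire-loaded sequence moved, the matching
     * rxq_enabled value is guaranteed to be observed as well. */
    static bool
    pmd_saw_change(struct mini_netdev *dev, uint64_t *last_seq,
                   bool *rxq_enabled)
    {
        uint64_t seq = atomic_load_explicit(&dev->change_seq,
                                            memory_order_acquire);
        if (seq == *last_seq) {
            return false;
        }
        *last_seq = seq;
        *rxq_enabled = dev->rxq_enabled;  /* Ordered after the load. */
        return true;
    }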



Re: [ovs-dev] [PATCH v1] netdev: use acquire-release semantics for change_seq in netdev

2019-11-25 Thread Yanqin Wei (Arm Technology China)
Hi Ben,

Thanks for your time reviewing. I am sorry I did not verify this patch
with the clang compiler, but it compiles successfully with gcc 7.4. Maybe
in some gcc versions the atomic type is necessary for atomic operations.
I will fix it in V2.

Best Regards,
Wei Yanqin

> -Original Message-
> From: Ben Pfaff 
> Sent: Tuesday, November 26, 2019 5:54 AM
> To: Yanqin Wei (Arm Technology China) 
> Cc: d...@openvswitch.org; ovs-dev@openvswitch.org; nd ;
> Gavin Hu (Arm Technology China) 
> Subject: Re: [ovs-dev] [PATCH v1] netdev: use acquire-release semantics for
> change_seq in netdev
> 
> On Mon, Nov 18, 2019 at 10:46:59AM +0800, Yanqin Wei wrote:
> > "rxq_enabled" of netdev is writen in the vhost thread and read by pmd
> > thread once it observes 'change_seq' is updated. This patch is to keep
> > order on aarch64 or other weak memory model CPU to ensure
> > 'rxq_enabled' is observed before 'change_seq'.
> >
> > Reviewed-by: Gavin Hu 
> > Signed-off-by: Yanqin Wei 
> 
> Thanks for the patch.
> 
> This does not compile.  Clang says:
> 
> In file included from ../lib/dpif-netdev.c:54:
> ../lib/netdev-provider.h:97:5: error: address argument to atomic operation
> must be a pointer to _Atomic type ('uint64_t *' (aka 'unsigned long *') 
> invalid)
> ../lib/ovs-atomic-clang.h:82:16: note: expanded from macro
> 'atomic_add_explicit'
> In file included from ../lib/dpif-netdev.c:54:
> ../lib/netdev-provider.h:99:9: error: address argument to atomic operation
> must be a pointer to _Atomic type ('uint64_t *' (aka 'unsigned long *') 
> invalid)
> ../lib/ovs-atomic-clang.h:82:16: note: expanded from macro
> 'atomic_add_explicit'
> ../lib/netdev.c:2044:5: error: address argument to atomic operation must 
> be
> a pointer to _Atomic type ('const uint64_t *' (aka 'const unsigned long *')
> invalid)
> ../lib/ovs-atomic-clang.h:53:15: note: expanded from macro
> 'atomic_read_explicit'
> 
> and many more instances.
> 
> GCC says:
> 
> ../lib/netdev.c:2044:5: error: incorrect type in argument 1 (different
> modifiers)
> ../lib/netdev.c:2044:5:expected void *
> ../lib/netdev.c:2044:5:got unsigned long const *
> ../lib/netdev.c:2044:5: error: incorrect type in argument 1 (different
> modifiers)
> ../lib/netdev.c:2044:5:expected void *
> ../lib/netdev.c:2044:5:got unsigned long const *
> 
> I do tend to think it's correct, otherwise.  I wonder how this has been missed
> for so long.
> 
> Thanks,
> 
> Ben.


Re: [ovs-dev] [PATCH v1 2/4] travis: Move x86-only addon packages

2019-11-22 Thread Yanqin Wei (Arm Technology China)
Hi Ilya,

Reply inline.

Best Regards,
Wei Yanqin

> -Original Message-
> From: dev  On Behalf Of Ilya Maximets
> Sent: Friday, November 22, 2019 3:30 AM
> To: Lance Yang (Arm Technology China) ;
> d...@openvswitch.org; ovs-dev@openvswitch.org
> Cc: Jieqiang Wang (Arm Technology China) ;
> Ruifeng Wang (Arm Technology China) ; Gavin Hu
> (Arm Technology China) ; Jingzhao Ni (Arm Technology
> China) ; nd 
> Subject: Re: [ovs-dev] [PATCH v1 2/4] travis: Move x86-only addon packages
> 
> On 20.11.2019 9:14, Lance Yang wrote:
> > To enable multiple CPU architectures support, it is necessary to move
> > the x86-only addon packages from .travis.yml file. Otherwise, the
> > x86-only addon packages will break the builds on some other CPU
> architectures.
> >
> > Reviewed-by: Yangqin Wei 
> > Reviewed-by: Malvika Gupta 
> > Reviewed-by: Gavin Hu 
> > Reviewed-by: Ruifeng Wang 
> > Signed-off-by: Lance Yang 
> > ---
> >  .travis.yml  |  2 --
> >  .travis/linux-prepare.sh | 12 
> >  2 files changed, 8 insertions(+), 6 deletions(-)
> 
> Common comment for all the patches in a series:
> * It's better to add a period in the end of a subject line.
[Yanqin] OK.
> 
> >
> > diff --git a/.travis.yml b/.travis.yml index 482efd2..2dc4d43 100644
> > --- a/.travis.yml
> > +++ b/.travis.yml
> > @@ -14,7 +14,6 @@ addons:
> >apt:
> >  packages:
> >- bc
> > -  - gcc-multilib
> >- libssl-dev
> >- llvm-dev
> >- libjemalloc1
> > @@ -26,7 +25,6 @@ addons:
> >- libelf-dev
> >- selinux-policy-dev
> >- libunbound-dev
> > -  - libunbound-dev:i386
> >- libunwind-dev
> >
> >  before_install: ./.travis/${TRAVIS_OS_NAME}-prepare.sh
> > diff --git a/.travis/linux-prepare.sh b/.travis/linux-prepare.sh index
> > 9e3ac0d..8096abe 100755
> > --- a/.travis/linux-prepare.sh
> > +++ b/.travis/linux-prepare.sh
> > @@ -15,10 +15,14 @@ cd ..
> >  pip install --disable-pip-version-check --user six flake8 hacking
> > pip install --user --upgrade docutils
> >
> > -if [ "$M32" ]; then
> > -# 32-bit and 64-bit libunwind can not be installed at the same time.
> > -# This will remove the 64-bit libunwind and install 32-bit version.
> > -sudo apt-get install -y libunwind-dev:i386
> > +if [[ "$TRAVIS_ARCH" == "amd64" ]] || [[ -z "$TRAVIS_ARCH" ]]; then
> 
> The same comment here as for previous ppc64le patch.
> Are you going to ever build 32bit binary on aarch64 on Travis?
> Is it really possible to build 32bit binary on aarch64 with '-m32' flag?
[Yanqin] Not yet. GCC for aarch64 does not support the -m32 flag. A cross
compiler is required to build 32-bit binaries on an aarch64 machine.

> 
> > +if [ "$M32" ]; then
> > +# 32-bit and 64-bit libunwind can not be installed at the same 
> > time.
> > +# This will remove the 64-bit libunwind and install 32-bit version.
> > +sudo apt-get install \
> > +-y libunwind-dev:i386 libunbound-dev:i386 gcc-multilib
> 
> Please, add additional indentation level for above line.
[Yanqin] Thanks, will be updated in V2.
> 
> > +fi
> > +
> >  fi
> >
> >  # IPv6 is supported by kernel but disabled in TravisCI images:
> >


Re: [ovs-dev] [bug-report] bfd decay unit case failure

2019-11-22 Thread Yanqin Wei (Arm Technology China)
Hi Ben,

This issue is difficult to reproduce on a local machine. We will try to do it.
Could you suggest which logs we need to collect or enable?

Best Regards,
Wei Yanqin

> -Original Message-
> From: dev  On Behalf Of Ben Pfaff
> Sent: Friday, November 22, 2019 1:34 AM
> To: Lance Yang (Arm Technology China) 
> Cc: ovs-dev@openvswitch.org
> Subject: Re: [ovs-dev] [bug-report] bfd decay unit case failure
>
> On Thu, Nov 21, 2019 at 09:25:49AM +, Lance Yang (Arm Technology
> China) wrote:
> > I encountered a unit test failure when I ran the Open vSwitch testsuite on
> > Travis CI aarch64 lxd container.
> > The environment can be found at line 7 (the build system information
> > section) in the report on https://travis-ci.org/yzyuestc/ovs/jobs/614941322 .
> > The unit test case name is "bfd decay"; you can find the unit test failure
> > details in the report after line 6520. The failure is not 100% reproducible
> > on Travis CI.
> > Could anyone give some hint on what is wrong for this unit test case?
>
> We do our best to make the Open vSwitch test cases resist differences in
> timing from one environment to another, but there still may be some that are
> sensitive to it.  This one may be such a test case.  These problems can be
> difficult to debug without being able to trigger them in interactive
> environments.  Have you been able to see it when you build by hand?


Re: [ovs-dev] [PATCH v1 2/2] cmap: non-blocking cmap lookup

2019-11-18 Thread Yanqin Wei (Arm Technology China)
Hi Ilya,

New node insertion does not always set a bucket slot. Only two cases trigger
slot updating:
1. a node is inserted into an empty slot.
2. cmap buckets are rearranged when there is no candidate bucket for the new
node; see "cmap_insert_bfs".

In the 1st case the slot is empty, so there is no other node in the list.
In the 2nd case, the slot movement has two steps:
a. copy a slot (hash, node list) into another candidate slot.
b. replace the old position with another slot.
So there is at least one complete slot copy in the cmap. If one slot is being
updated and is skipped via the bitmap, the node will be found in another
candidate bucket.
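
For clarity, this is how I read the reader side (a sketch reusing the helpers
from the patch below -- read_counter_slot_bitmap, cmap_find_in_bucket,
bucket_changed and CMAP_SLOT_MAP; illustrative only, not the final code):

static inline const struct cmap_node *
lookup_one_bucket(const struct cmap_bucket *b, uint32_t hash)
{
    const struct cmap_node *node;
    uint32_t c;

    do {
        /* Acquire load of the combined counter + valid bitmap. */
        c = read_counter_slot_bitmap(b);
        /* Scan only the slots whose valid bit is set. */
        node = cmap_find_in_bucket(b, hash, CMAP_SLOT_MAP(c));
        /* Retry only if a writer made progress in between. */
    } while (OVS_UNLIKELY(bucket_changed(b, c)));

    return node;
}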

Best Regards,
Wei Yanqin

> -Original Message-
> From: Ilya Maximets 
> Sent: Tuesday, November 19, 2019 12:06 AM
> To: Ola Liljedahl ; i.maxim...@ovn.org; Yanqin Wei
> (Arm Technology China) ; ovs-dev@openvswitch.org
> Cc: Gavin Hu (Arm Technology China) ; nd
> ; b...@ovn.org
> Subject: Re: [ovs-dev] [PATCH v1 2/2] cmap: non-blocking cmap lookup
> 
> On 18.11.2019 16:55, Ola Liljedahl wrote:
> > On Mon, 2019-11-18 at 16:45 +0100, Ilya Maximets wrote:
> >> On 18.11.2019 3:45, Yanqin Wei wrote:
> >>
> >> Currently cmap_bucket is protected by a counter. Cmap reader will be
> >> blocked until the counter becomes even, writer increments it by 1 to
> >> be odd, and after slot update, increments by 1 again to become even.
> >> If the writer is pending or scheduled out during the writer course,
> >> the reader will be blocked.
> >>
> >> This patch introduces a bitmap as guard variable for (hash,node) pair.
> >> Writer need set slot valid bit to 0 before each change. And after
> >> change, writer sets valid bit to 1 and increments counter. Thus,
> >> reader can ensure slot consistency by only scanning valid slot and
> >> then checking counter+bitmap does not change while examining the bucket.
> >>
> >> The only time a reader has to retry the read operation for single
> >> bucket is when 1)a writer clears a bit in the valid bitmap between a
> >> reader's first and second read of the counter+bitmap.
> >> 2)a writer increments the counter between a reader's first and second
> >> read of counter+bitmap.
> >> I.e. the read operation needs to be retried when some other thread
> >> has made progress (with a write).
> >> There is no spinning/waiting for other threads to complete. This
> >> makes the design non-blocking (for readers).
> >>
> >> And it has almost no additional overhead because counter and bitmap
> >> share one 32 bits variable. No additional load/store for reader and
> >> writer.
> >>
> >>
> >> IIUC, after this change if writer will start inserting the node,
> >> reader will not find any node with the same hash in cmap because it
> >> will check only "valid" slots.  This breaks the cmap API and could lead to
> crash.
> > Why wouldn't readers find other valid slots with the same hash value?
> > It is only the slot that is being updated that is cleared in the valid
> > bitmap duing the update. Other valid slots (irrespective of actual
> > hash values) are unchanged.
> >
> > Am I missing something? Do the users of cmap have some other
> > expectations that are not obvious from looking at the cmap code?
> 
> bucket->nodes[i] is not a single node, but the list of nodes with the
> same hash equal to bucket->hashes[i].
> 
> You may look at the implementation of
> CMAP_NODE_FOR_EACH/CMAP_FOR_EACH_WITH_HASH
> and read comments to 'struct cmap_node' and cmap.{c,h} in general.
> 
> While adding the new node to the list you're restricting access to the
> whole list, making it impossible to find any node there.
> 
> Best regards, Ilya Maximets.


[ovs-dev] [PATCH v1] netdev: use acquire-release semantics for change_seq in netdev

2019-11-17 Thread Yanqin Wei
"rxq_enabled" of netdev is writen in the vhost thread and read by pmd
thread once it observes 'change_seq' is updated. This patch is to keep
order on aarch64 or other weak memory model CPU to ensure 'rxq_enabled' is
observed before 'change_seq'.

Reviewed-by: Gavin Hu 
Signed-off-by: Yanqin Wei 
---
 lib/netdev-provider.h | 8 +---
 lib/netdev.c  | 7 ++-
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
index 1e5a40c89..ba809daa0 100644
--- a/lib/netdev-provider.h
+++ b/lib/netdev-provider.h
@@ -91,11 +91,13 @@ struct netdev {
 static inline void
 netdev_change_seq_changed(const struct netdev *netdev_)
 {
+uint64_t orig;
 struct netdev *netdev = CONST_CAST(struct netdev *, netdev_);
 seq_change(connectivity_seq_get());
-netdev->change_seq++;
-if (!netdev->change_seq) {
-netdev->change_seq++;
+atomic_add_explicit(&netdev->change_seq, 1, &orig, memory_order_release);
+if (OVS_UNLIKELY(!netdev->change_seq)) {
+atomic_add_explicit(&netdev->change_seq, 1, &orig,
+                    memory_order_release);
 }
 }
 
diff --git a/lib/netdev.c b/lib/netdev.c
index af8f8560d..1841889e7 100644
--- a/lib/netdev.c
+++ b/lib/netdev.c
@@ -2039,7 +2039,12 @@ restore_all_flags(void *aux OVS_UNUSED)
 uint64_t
 netdev_get_change_seq(const struct netdev *netdev)
 {
-return netdev->change_seq;
+uint64_t change_seq;
+
+atomic_read_explicit(&netdev->change_seq, &change_seq,
+                     memory_order_acquire);
+
+return change_seq;
 }
 
 #ifndef _WIN32
-- 
2.17.1



[ovs-dev] [PATCH v1 2/2] cmap: non-blocking cmap lookup

2019-11-17 Thread Yanqin Wei
Currently cmap_bucket is protected by a counter. Cmap reader will be
blocked until the counter becomes even, writer increments it by 1 to be
odd, and after slot update, increments by 1 again to become even. If the
writer is pending or scheduled out during the writer course, the reader
will be blocked.

This patch introduces a bitmap as a guard variable for each (hash, node) pair.
The writer sets a slot's valid bit to 0 before each change. After the change,
the writer sets the valid bit back to 1 and increments the counter. Thus, a
reader can ensure slot consistency by scanning only valid slots and then
checking that counter+bitmap did not change while it examined the bucket.

The only time a reader has to retry the read operation for a single bucket
is when
1) a writer clears a bit in the valid bitmap between a reader's first and
second read of the counter+bitmap.
2) a writer increments the counter between a reader's first and second
read of the counter+bitmap.
I.e. the read operation needs to be retried when some other thread has
made progress (with a write).
There is no spinning/waiting for other threads to complete. This makes
the design non-blocking (for readers).

And it has almost no additional overhead because the counter and bitmap
share one 32-bit variable. There is no additional load/store for the reader
or the writer.

Reviewed-by: Ola Liljedahl 
Reviewed-by: Gavin Hu 
Signed-off-by: Yanqin Wei 
---
 lib/cmap.c | 157 ++---
 1 file changed, 101 insertions(+), 56 deletions(-)

diff --git a/lib/cmap.c b/lib/cmap.c
index e41c40794..a36efc68c 100644
--- a/lib/cmap.c
+++ b/lib/cmap.c
@@ -123,18 +123,27 @@ COVERAGE_DEFINE(cmap_shrink);
 /* Number of entries per bucket: 7 on 32-bit, 5 on 64-bit for 64B cacheline. */
 #define CMAP_K ((CACHE_LINE_SIZE - 4) / CMAP_ENTRY_SIZE)
 
+/* "counter_slotValidMap" in cmap_bucket includes counter and slot valid map.
+ * Reserve the "CMAP_K" least significant bits for slot valid bitmap and use
+ * remaining bits for bucket counter. Adding 1<counter, , memory_order_acquire);
+atomic_read_explicit(>counter_slotValidMap, _slotValidMap,
+memory_order_relaxed);
 
-return counter;
+return CMAP_SLOT_MAP(counter_slotValidMap);
 }
 
 static inline uint32_t
-read_even_counter(const struct cmap_bucket *bucket)
+read_counter_slot_bitmap(const struct cmap_bucket *bucket_)
 {
-uint32_t counter;
+struct cmap_bucket *bucket = CONST_CAST(struct cmap_bucket *, bucket_);
+uint32_t counter_slotValidMap;
 
-do {
-counter = read_counter(bucket);
-} while (OVS_UNLIKELY(counter & 1));
+/* Both counter and slot valid map are changed each time a slot is updated,
+ * so counter_slotValidMap can be directly used for tracking hash updates. */
+atomic_read_explicit(&bucket->counter_slotValidMap, &counter_slotValidMap,
+                     memory_order_acquire);
 
-return counter;
+return counter_slotValidMap;
 }
 
 static inline bool
-counter_changed(const struct cmap_bucket *b_, uint32_t c)
+bucket_changed(const struct cmap_bucket *b_, uint32_t c)
 {
 struct cmap_bucket *b = CONST_CAST(struct cmap_bucket *, b_);
-uint32_t counter;
+uint32_t counter_slotValidMap;
 
 /* Need to make sure the counter read is not moved up, before the hash and
- * cmap_node_next().  Using atomic_read_explicit with memory_order_acquire
- * would allow prior reads to be moved after the barrier.
- * atomic_thread_fence prevents all following memory accesses from moving
- * prior to preceding loads. */
+ * cmap_node_next().
+ * atomic_thread_fence with acquire memory ordering prevents all following
+ * memory accesses from moving prior to preceding loads. */
 atomic_thread_fence(memory_order_acquire);
-atomic_read_relaxed(&b->counter, &counter);
+atomic_read_relaxed(&b->counter_slotValidMap, &counter_slotValidMap);
 
-return OVS_UNLIKELY(counter != c);
+return OVS_UNLIKELY(counter_slotValidMap != c);
 }
 
 static inline const struct cmap_node *
-cmap_find_in_bucket(const struct cmap_bucket *bucket, uint32_t hash)
+cmap_find_in_bucket(const struct cmap_bucket *bucket, uint32_t hash,
+uint32_t slot_map)
 {
-for (int i = 0; i < CMAP_K; i++) {
+int i;
+
+ULLONG_FOR_EACH_1 (i, slot_map) {
 if (bucket->hashes[i] == hash) {
 return cmap_node_next(&bucket->nodes[i]);
 }
 }
+
 return NULL;
 }
 
@@ -334,20 +349,23 @@ cmap_find__(const struct cmap_bucket *b1, const struct cmap_bucket *b2,
 
 do {
 do {
-c1 = read_even_counter(b1);
-node = cmap_find_in_bucket(b1, hash);
-} while (OVS_UNLIKELY(counter_changed(b1, c1)));
+c1 = read_counter_slot_bitmap(b1);
+node = cmap_find_in_bucket(b1, hash, CMAP_SLOT_MAP(c1));
+} while (OVS_UNLIKELY(bucket_changed(b1, c1)));
+
 if (node) {
 break;
 }
+
 do {
-c2 = read_even_counter(b2);
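
To summarize the writer side of the protocol described in the commit message,
here is a standalone C11 sketch (my reading of the patch; CMAP_K, the field
name and the plain slot stores are simplified for illustration, this is not
the actual OVS code):

#include <stdatomic.h>
#include <stdint.h>

#define CMAP_K 5                    /* Slots per bucket on a 64-bit build. */

struct bucket {
    _Atomic uint32_t counter_slot_map;  /* counter << CMAP_K | valid bits. */
    uint32_t hashes[CMAP_K];
    void *nodes[CMAP_K];
};

static void
set_slot(struct bucket *b, int i, uint32_t hash, void *node)
{
    uint32_t c = atomic_load_explicit(&b->counter_slot_map,
                                      memory_order_relaxed);

    /* 1. Clear slot i's valid bit so readers skip it from now on. */
    atomic_store_explicit(&b->counter_slot_map, c & ~(1u << i),
                          memory_order_relaxed);
    /* Keep the slot stores below the invalidation store. */
    atomic_thread_fence(memory_order_release);

    /* 2. Update the slot itself. */
    b->nodes[i] = node;
    b->hashes[i] = hash;

    /* 3. Re-validate slot i and bump the counter part in one release store. */
    atomic_store_explicit(&b->counter_slot_map,
                          (c | (1u << i)) + (1u << CMAP_K),
                          memory_order_release);
}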

[ovs-dev] [PATCH v1 1/2] cmap: add thread fence for slot update

2019-11-17 Thread Yanqin Wei
Bucket updates in the cmap lib are protected by a counter, but the hash store
can be reordered before the counter update. This patch fixes that issue.

Reviewed-by: Ola Liljedahl 
Reviewed-by: Gavin Hu 
Signed-off-by: Yanqin Wei 
---
 lib/cmap.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/lib/cmap.c b/lib/cmap.c
index c9eef3f4a..e41c40794 100644
--- a/lib/cmap.c
+++ b/lib/cmap.c
@@ -598,7 +598,9 @@ cmap_set_bucket(struct cmap_bucket *b, int i,
 uint32_t c;
 
 atomic_read_explicit(>counter, , memory_order_acquire);
-atomic_store_explicit(>counter, c + 1, memory_order_release);
+atomic_store_explicit(>counter, c + 1, memory_order_relaxed);
+/* Need to make sure setting hash is not moved up before counter update. */
+atomic_thread_fence(memory_order_release);
 ovsrcu_set(>nodes[i].next, node); /* Also atomic. */
 b->hashes[i] = hash;
 atomic_store_explicit(>counter, c + 2, memory_order_release);
-- 
2.17.1



[ovs-dev] [PATCH v1 0/2] improve cmap read-write concurrency

2019-11-17 Thread Yanqin Wei
1. add thread fence to ensure hash setting safe.
2. make reader not blocked by writer.

Yanqin Wei (2):
  [ovs-dev] [PATCH 1/2] cmap: add thread fence for slot update
  [ovs-dev] [PATCH 2/2] cmap: non-blocking cmap lookup

 lib/cmap.c | 159 ++---
 1 file changed, 103 insertions(+), 56 deletions(-)

-- 
2.17.1



Re: [ovs-dev] [PATCH v3] travis: support ppc64le builds

2019-11-08 Thread Yanqin Wei (Arm Technology China)
Hi David

> -Original Message-
> From: David Wilder 
> Sent: Thursday, November 7, 2019 3:21 AM
> To: ovs-dev@openvswitch.org
> Cc: i.maxim...@ovn.org; b...@ovn.org; Yanqin Wei (Arm Technology China)
> ; wil...@us.ibm.com
> Subject: [PATCH v3] travis: support ppc64le builds
>
> Add support for travis-ci ppc64le builds.
>
> - Updated matrix in .travis.yml to include an arch: ppc64le build.
> - Move package install needed for 32bit builds to .travis/linux-prepare.sh.
>
> To keep the total build time at an acceptable level only a single build job is
> included in the matrix for ppc64le.
>
> A build report example can be found here [1] [0] http://travis-ci.org/ [1]
> https://travis-ci.org/djlwilder/ovs/builds/607851729
>
> Signed-off-by: David Wilder 
> ---
> Addressed review comments:
> - Cleaned up linux-prepare.sh (v2)
> - Switch from os: linux-ppc64le to arch: ppc64le (v3)
>
>  .travis.yml  | 5 +++--
>  .travis/linux-prepare.sh | 5 -
>  2 files changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/.travis.yml b/.travis.yml
> index 482efd2d1..308c09635 100644
> --- a/.travis.yml
> +++ b/.travis.yml
> @@ -14,7 +14,6 @@ addons:
>apt:
>  packages:
>- bc
> -  - gcc-multilib
>- libssl-dev
>- llvm-dev
>- libjemalloc1
> @@ -26,7 +25,6 @@ addons:
>- libelf-dev
>- selinux-policy-dev
>- libunbound-dev
> -  - libunbound-dev:i386
>- libunwind-dev
>
>  before_install: ./.travis/${TRAVIS_OS_NAME}-prepare.sh
> @@ -52,6 +50,9 @@ matrix:
>  - os: osx
>compiler: clang
>env: OPTS="--disable-ssl"
> +- arch: ppc64le
> +  compiler: gcc
> +  env: OPTS="--disable-ssl"
>
>  script: ./.travis/${TRAVIS_OS_NAME}-build.sh $OPTS
>
> diff --git a/.travis/linux-prepare.sh b/.travis/linux-prepare.sh index
> 9e3ac0df7..d66f480c6 100755
> --- a/.travis/linux-prepare.sh
> +++ b/.travis/linux-prepare.sh
> @@ -18,7 +18,10 @@ pip install --user --upgrade docutils  if [ "$M32" ]; then
>  # 32-bit and 64-bit libunwind can not be installed at the same time.
>  # This will remove the 64-bit libunwind and install 32-bit version.
> -sudo apt-get install -y libunwind-dev:i386
> +sudo apt-get install -y \
> + gcc-multilib \
> + libunwind-dev:i386 \
> + libunbound-dev:i386
[Yanqin] They are x86-specific dependencies. It is better to use a
"$TRAVIS_ARCH" == "amd64" condition.
[Yanqin] Is gcc-multilib only required for 32-bit builds?
>  fi
>
>  # IPv6 is supported by kernel but disabled in TravisCI images:
> --
> 2.23.0.162.gf1d4a28



Re: [ovs-dev] [PATCH ] travis: support ppc64le builds

2019-11-07 Thread Yanqin Wei (Arm Technology China)
Hi David, 

I am sorry for the late reply. Yes, if Travis supports this configuration, it
should be a good solution for me.

Best regards,
Wei Yanqin

-Original Message-
From: dwilder  
Sent: Wednesday, November 6, 2019 4:32 AM
To: Yanqin Wei (Arm Technology China) 
Cc: Ilya Maximets ; ovs-dev@openvswitch.org; 
wil...@us.ibm.com; nd 
Subject: Re: RE: RE: [ovs-dev] [PATCH ] travis: support ppc64le builds

Hi Wei

If I change my matrix:include to use "arch: ppc64le" rather than "os: 
linux-ppc64le", will it eliminate your concern?

matrix:
   include:
-   - os: linux-ppc64le
+   - arch: ppc64le
   compiler: gcc
   env: OPTS="--disable-ssl"


Later when we want to enable the full matrix on all arch we would add:

1) arch:
   - amd64
   - ppc64le
   - arm64

2) eliminate the include: -arch: ppc64le

3) Add any exclude: - arch:XXX sections for any tests that dont apply.

For example I would add:

  exclude:
- arch: ppc64le
  env: M32=1 OPTS="--disable-ssl"

Regards,
   David

On 2019-11-02 02:09, Yanqin Wei (Arm Technology China) wrote:
> Hi David,
> 
> Thanks for your reply.  Yes, my concern is how to define arch and os 
> in .travis.yml after we cover all builds and cases for arm and ppc.
> This pattern can enable all builds and testsuits for x86 and arm
> arch:
>   - amd64
>   - arm64
> os:
>   - linux
> 
> This can enable all jobs for x86 and ppc.
> arch:
>   - amd64 //default
> os:
>   - linux
>   - linux-ppc64le
> 
> But it does not work to combine them.  This means four kinds of
> arch+os combinations in all.   Arm64+linux-ppc64le is invalid.
> arch:
>   - amd64
>   - arm64
> os:
>   - linux
>   - linux-ppc64le
> 
> So if we finally cover all the builds and cases for arm/ppc,  we have 
> to duplicate all jobs for different cpu arch in the matrix include.
> matrix:
>   include:
> - os: linux-ppc64le
>   env: job1
> - os: linux-ppc64le
>   env: job2
> ...
> - arch: arm64
>   env: job1
> - arch: arm64
>   env: job2
> ...
> 
> But currently either arm or ppc has not cover all the cases, so they 
> can coexist in build-matrix.  And there is no conflict in 
> build/prepare scripts, because both of them use TRAVIS_ARCH variable 
> to indicate cpu arch.
> The patch series to enable arm CI is under internal review. It will be 
> submitted when ready.
> 
> Best Regards,
> Wei Yanqin
> 
> 
> -Original Message-
> From: dwilder 
> Sent: Saturday, November 2, 2019 1:09 AM
> To: Yanqin Wei (Arm Technology China) 
> Cc: Ilya Maximets ; ovs-dev@openvswitch.org; 
> wil...@us.ibm.com; nd 
> Subject: Re: RE: [ovs-dev] [PATCH ] travis: support ppc64le builds
> 
> On 2019-10-30 19:04, Yanqin Wei (Arm Technology China) wrote:
>> Hi,
>> 
>> We are working to support arm64 build for ovs travis CI. It is indeed 
>> to use arch: arm64 to choose cpu architecture, because travis has 
>> provided native arm64 option now.
>> But in this patch it seems ppc64 builds run on the ppc-VM + x86 
>> native machine.
>> Currently arm only select a part of jobs to run, which is defined in 
>> matrix:include. But the final object is to run all jobs. It means 
>> that
>>  arch: arm64 will be moved out of marix. If ppc plans to do the same 
>> in the future, it will conflict with arm jobs.
>> 
>> Best Regards,
>> Wei Yanqin
> 
> Hi,
> I have added a build only test for ppc64le following the model used 
> for osx. I think this is a good start for getting multi-arch support 
> into Ci.
> 
> I agree that running all jobs on the matrix on every arch is good goal.
> I dont completely understand your issue, is your concern the use of os:
> vs arch: ?
> 
> I am glad to work with you to find a solution. Can you share your
> arm64 changes?  We can discuss off-list if you prefer.
> 
> 
>> 
>> -Original Message-
>> From: ovs-dev-boun...@openvswitch.org 
>>  On Behalf Of dwilder
>> Sent: Wednesday, October 30, 2019 1:55 AM
>> To: Ilya Maximets 
>> Cc: ovs-dev@openvswitch.org; wil...@us.ibm.com
>> Subject: Re: [ovs-dev] [PATCH ] travis: support ppc64le builds
>> 
>> On 2019-10-29 09:52, Ilya Maximets wrote:
>>> On 28.10.2019 22:22, David Wilder wrote:
>>>> Add support for travis-ci ppc64le builds.
>>>> 
>>>> - Updated matrix in .travis.yml to include a ppc64le build.
>>>> - Added support to install packages needed by specific 
>>>> architectures.
>>>> 
>>>> To keep the total build time at an acceptable level only a single 
>>>> build job is included in the matrix for ppc64le.

Re: [ovs-dev] [PATCH ] travis: support ppc64le builds

2019-11-02 Thread Yanqin Wei (Arm Technology China)
Hi David,

Thanks for your reply.  Yes, my concern is how to define arch and os in 
.travis.yml after we cover all builds and cases for arm and ppc.  
This pattern can enable all builds and testsuits for x86 and arm
arch:
  - amd64
  - arm64
os:
  - linux

This can enable all jobs for x86 and ppc. 
arch:
  - amd64 //default
os:
  - linux
  - linux-ppc64le

But it does not work to combine them.  This means four kinds of arch+os 
combinations in all.   Arm64+linux-ppc64le is invalid.
arch:
  - amd64
  - arm64
os:
  - linux
  - linux-ppc64le  

So if we finally cover all the builds and cases for arm/ppc,  we have to 
duplicate all jobs for different cpu arch in the matrix include.
matrix:
  include:
- os: linux-ppc64le
  env: job1
- os: linux-ppc64le
  env: job2
...
- arch: arm64
  env: job1
- arch: arm64
  env: job2
...

But currently neither arm nor ppc covers all the cases, so they can coexist in
the build matrix. And there is no conflict in the build/prepare scripts,
because both of them use the TRAVIS_ARCH variable to indicate the CPU arch.
The patch series to enable arm CI is under internal review. It will be
submitted when ready.

Best Regards,
Wei Yanqin


-Original Message-
From: dwilder  
Sent: Saturday, November 2, 2019 1:09 AM
To: Yanqin Wei (Arm Technology China) 
Cc: Ilya Maximets ; ovs-dev@openvswitch.org; 
wil...@us.ibm.com; nd 
Subject: Re: RE: [ovs-dev] [PATCH ] travis: support ppc64le builds

On 2019-10-30 19:04, Yanqin Wei (Arm Technology China) wrote:
> Hi,
> 
> We are working to support arm64 build for ovs travis CI. It is indeed 
> to use arch: arm64 to choose cpu architecture, because travis has 
> provided native arm64 option now.
> But in this patch it seems ppc64 builds run on the ppc-VM + x86 native 
> machine.
> Currently arm only selects a part of the jobs to run, which is defined in 
> matrix:include. But the final objective is to run all jobs. It means that
> arch: arm64 will be moved out of the matrix. If ppc plans to do the same 
> in the future, it will conflict with arm jobs.
> 
> Best Regards,
> Wei Yanqin

Hi,
I have added a build only test for ppc64le following the model used for osx. I 
think this is a good start for getting multi-arch support into Ci.

I agree that running all jobs in the matrix on every arch is a good goal.
I don't completely understand your issue; is your concern the use of os: 
vs arch: ?

I am glad to work with you to find a solution. Can you share your arm64 
changes?  We can discuss off-list if you prefer.


> 
> -Original Message-
> From: ovs-dev-boun...@openvswitch.org
>  On Behalf Of dwilder
> Sent: Wednesday, October 30, 2019 1:55 AM
> To: Ilya Maximets 
> Cc: ovs-dev@openvswitch.org; wil...@us.ibm.com
> Subject: Re: [ovs-dev] [PATCH ] travis: support ppc64le builds
> 
> On 2019-10-29 09:52, Ilya Maximets wrote:
>> On 28.10.2019 22:22, David Wilder wrote:
>>> Add support for travis-ci ppc64le builds.
>>> 
>>> - Updated matrix in .travis.yml to include a ppc64le build.
>>> - Added support to install packages needed by specific architectures.
>>> 
>>> To keep the total build time at an acceptable level only a single
>>> build job is included in the matrix for ppc64le.
>>> 
>>> A build report example can be found here [1]
>>> [0] http://travis-ci.org/
>>> [1] https://travis-ci.org/djlwilder/ovs/builds/604098141
>>> Signed-off-by: David Wilder 
>> 
>> Hi David,
>> Thanks for working on this. I have a couple of question regarding
>> ppc64le support by TravisCI.  It seems that they are not supporting
>> this architecture officially and refusing[1] to solve any issues that
>> appears while using it. There also no official documentation.
>> It's kind of a hidden feature that some projects are using for their
>> own risk [2]. Do you know why this happens or maybe you have some
>> insights about what is going on/how it works?
> 
> Work is going on to increase ppc64le support on Travis by the end of
> the year.  I dont have any details yet. My plan is to keep this to
> build-only ci until then.  Important, ppc64le VM are only available on
> travis-ci.org, they are not available on travis-ci.com.
> 
> The API is also a bit strange because Travis started to officially
> support arm builds, but this is done via 'arch' knob, not the 'os'.
> Will it be changed over time for ppc64le?

Re: [ovs-dev] [PATCH ] travis: support ppc64le builds

2019-10-30 Thread Yanqin Wei (Arm Technology China)
Hi,

We are working to support arm64 build for ovs travis CI. It is indeed to use 
arch: arm64 to choose cpu architecture, because travis has provided native 
arm64 option now. 
But in this patch it seems ppc64 builds run on the ppc-VM + x86 native machine. 
 
Currently arm only selects a part of the jobs to run, which is defined in
matrix:include. But the final objective is to run all jobs. It means that arch:
arm64 will be moved out of the matrix. If ppc plans to do the same in the
future, it will conflict with arm jobs.

Best Regards,
Wei Yanqin

-Original Message-
From: ovs-dev-boun...@openvswitch.org  On 
Behalf Of dwilder
Sent: Wednesday, October 30, 2019 1:55 AM
To: Ilya Maximets 
Cc: ovs-dev@openvswitch.org; wil...@us.ibm.com
Subject: Re: [ovs-dev] [PATCH ] travis: support ppc64le builds

On 2019-10-29 09:52, Ilya Maximets wrote:
> On 28.10.2019 22:22, David Wilder wrote:
>> Add support for travis-ci ppc64le builds.
>> 
>> - Updated matrix in .travis.yml to include a ppc64le build.
>> - Added support to install packages needed by specific architectures.
>> 
>> To keep the total build time at an acceptable level only a single 
>> build job is included in the matrix for ppc64le.
>> 
>> A build report example can be found here [1]
>> [0] http://travis-ci.org/
>> [1] https://travis-ci.org/djlwilder/ovs/builds/604098141
>> Signed-off-by: David Wilder 
> 
> Hi David,
> Thanks for working on this. I have a couple of question regarding 
> ppc64le support by TravisCI.  It seems that they are not supporting 
> this architecture officially and refusing[1] to solve any issues that 
> appears while using it. There also no official documentation.
> It's kind of a hidden feature that some projects are using for their 
> own risk [2]. Do you know why this happens or maybe you have some 
> insights about what is going on/how it works?

Work is going on to increase ppc64le support on Travis by the end of the year.  
I dont have any details yet. My plan is to keep this to build-only ci until 
then.  Important, ppc64le VM are only available on travis-ci.org, they are not 
available on travis-ci.com.

> The API is also a bit strange because Travis started to officially 
> support arm builds, but this is done via 'arch' knob, not the 'os'.
> Will it be changed over time for ppc64le?
> 

Sorry, I dont know.

> [1]
> https://travis-ci.community/t/ppc64le-arch-support-on-travis-ci-com-vs-travis-ci-org/2898/2
> [2]
> https://github.com/openssl/openssl/commit/13da3ad00c80e1da816ca27f6c15b0ecee1bb0b8
> 
> Few code comments inline.
> 
>> ---
>>   .travis.yml  |  5 +++--
>>   .travis/linux-prepare.sh | 18 ++
>>   2 files changed, 17 insertions(+), 6 deletions(-)
>> 
>> diff --git a/.travis.yml b/.travis.yml index 5676d9748..c99f26815 
>> 100644
>> --- a/.travis.yml
>> +++ b/.travis.yml
>> @@ -14,7 +14,6 @@ addons:
>> apt:
>>   packages:
>> - bc
>> -  - gcc-multilib
>> - libssl-dev
>> - llvm-dev
>> - libjemalloc1
>> @@ -24,7 +23,6 @@ addons:
>> - libelf-dev
>> - selinux-policy-dev
>> - libunbound-dev
>> -  - libunbound-dev:i386
>> - libunwind-dev
>> before_install: ./.travis/${TRAVIS_OS_NAME}-prepare.sh
>> @@ -50,6 +48,9 @@ matrix:
>>   - os: osx
>> compiler: clang
>> env: OPTS="--disable-ssl"
>> +- os: linux-ppc64le
>> +  compiler: gcc
>> +  env: OPTS="--disable-ssl"
>> script: ./.travis/${TRAVIS_OS_NAME}-build.sh $OPTS
>>   diff --git a/.travis/linux-prepare.sh b/.travis/linux-prepare.sh 
>> index e546d32cb..f3a9a6d44 100755
>> --- a/.travis/linux-prepare.sh
>> +++ b/.travis/linux-prepare.sh
>> @@ -15,8 +15,18 @@ cd ..
>>   pip install --disable-pip-version-check --user six flake8 hacking
>>   pip install --user --upgrade docutils
>>   -if [ "$M32" ]; then
>> -# 32-bit and 64-bit libunwind can not be installed at the same 
>> time.
>> -# This will remove the 64-bit libunwind and install 32-bit 
>> version.
>> -sudo apt-get install -y libunwind-dev:i386
>> +# Include packages needed by specific architectures.

Re: [ovs-dev] SIGILL ovs branch-2.12/arm64/DPDK

2019-09-05 Thread Yanqin Wei (Arm Technology China)
Thanks James,
I can understand the root cause now.

Best Regards,
Wei Yanqin

From: James Page 
Sent: Thursday, September 5, 2019 3:42 PM
To: Yanqin Wei (Arm Technology China) 
Cc: Ilya Maximets ; ovs-dev@openvswitch.org
Subject: Re: [ovs-dev] SIGILL ovs branch-2.12/arm64/DPDK

On Thu, Sep 5, 2019 at 8:34 AM Yanqin Wei (Arm Technology China) 
mailto:yanqin@arm.com>> wrote:
Is it possible a configuration issue for building OVS userspace program , which 
should not enable crc for xgene cpu?  DPDK library is linked with OVS program 
during OVS building. It should not import compiling configuration from DPDK, 
right?

Compiler flags will be generated by the pkg-config provided by DPDK - the CPU 
flags exceed the features of the xgene CPU.

That said, in the context of compiling a DPDK enabled version of OVS for arm64 
for general consumption, having a practical baseline for CPU features that are 
actually useful to arm64 users is important.

Canonical have some fairly earlier arm64 hardware in the Ubuntu build 
infrastructure which is probably below the sensible baseline for DPDK support.



Re: [ovs-dev] SIGILL ovs branch-2.12/arm64/DPDK

2019-09-05 Thread Yanqin Wei (Arm Technology China)
Hi Ilya,
Thanks for the clarification. It looks like the application is strongly coupled
with DPDK. I will follow your discussion under this bug.

Best Regards,
Wei Yanqin

-Original Message-
From: Ilya Maximets 
Sent: Thursday, September 5, 2019 3:41 PM
To: Yanqin Wei (Arm Technology China) ; James Page 

Cc: ovs-dev@openvswitch.org
Subject: Re: [ovs-dev] SIGILL ovs branch-2.12/arm64/DPDK

On 05.09.2019 10:34, Yanqin Wei (Arm Technology China) wrote:
> Is it possible a configuration issue for building OVS userspace program , 
> which should not enable crc for xgene cpu?

There is an arm64-xgene1-linux-gcc target that can be used with the legacy
make-based build system. The new meson-based build doesn't support that.

> DPDK library is linked with OVS program during OVS building. It should not 
> import compiling configuration from DPDK, right?

DPDK exports its machine flags, extra defines and libraries to the application 
build process because big and important parts of DPDK code are in header 
inlines and built within the application build process.

>
> Best Regards,
> Wei Yanqin
>
> -Original Message-
> From: Ilya Maximets 
> Sent: Wednesday, September 4, 2019 8:46 PM
> To: Yanqin Wei (Arm Technology China) ; James Page
> 
> Cc: ovs-dev@openvswitch.org
> Subject: Re: [ovs-dev] SIGILL ovs branch-2.12/arm64/DPDK
>
> BTW, I submitted a bug to DPDK:
> https://bugs.dpdk.org/show_bug.cgi?id=344
>
> Best regards, Ilya Maximets.
>
> On 04.09.2019 12:53, Yanqin Wei (Arm Technology China) wrote:
>> Understood. Thanks for the information.
>>
>>
>>
>> Best Regards,
>>
>> Wei Yanqin
>>
>>
>>
>> *From:* James Page 
>> *Sent:* Wednesday, September 4, 2019 5:45 PM
>> *To:* Yanqin Wei (Arm Technology China) 
>> *Cc:* Ilya Maximets ; ovs-dev@openvswitch.org
>> *Subject:* Re: [ovs-dev] SIGILL ovs branch-2.12/arm64/DPDK
>>
>>
>>
>> Hi
>>
>>
>>
>> On Wed, Sep 4, 2019 at 10:30 AM Yanqin Wei (Arm Technology China) 
>> mailto:yanqin@arm.com>> wrote:
>>
>>
>> CRC32 is not a mandatory feature for Armv8.0. What is the arm64 CPU used 
>> in your platform?
>> I'm not sure if it's necessary to use mandatory feature (neon) to 
>> optimize hash libraries, because I thought most arm64 CPUs supported CRC32.
>>
>>
>>
>> The builders we use in Launchpad for package builds are fully virtualized 
>> via OpenStack; I checked and we have some older X-gene CPU's which don't 
>> have crc32 support.
>>
>>
>>


Re: [ovs-dev] SIGILL ovs branch-2.12/arm64/DPDK

2019-09-05 Thread Yanqin Wei (Arm Technology China)
Is it possibly a configuration issue when building the OVS userspace program,
which should not enable crc for the xgene cpu? The DPDK library is linked with
the OVS program during the OVS build. It should not import compile
configuration from DPDK, right?

Best Regards,
Wei Yanqin

-Original Message-
From: Ilya Maximets 
Sent: Wednesday, September 4, 2019 8:46 PM
To: Yanqin Wei (Arm Technology China) ; James Page 

Cc: ovs-dev@openvswitch.org
Subject: Re: [ovs-dev] SIGILL ovs branch-2.12/arm64/DPDK

BTW, I submitted a bug to DPDK:
https://bugs.dpdk.org/show_bug.cgi?id=344

Best regards, Ilya Maximets.

On 04.09.2019 12:53, Yanqin Wei (Arm Technology China) wrote:
> Understood. Thanks for the information.
>
>
>
> Best Regards,
>
> Wei Yanqin
>
>
>
> *From:* James Page 
> *Sent:* Wednesday, September 4, 2019 5:45 PM
> *To:* Yanqin Wei (Arm Technology China) 
> *Cc:* Ilya Maximets ; ovs-dev@openvswitch.org
> *Subject:* Re: [ovs-dev] SIGILL ovs branch-2.12/arm64/DPDK
>
>
>
> Hi
>
>
>
> On Wed, Sep 4, 2019 at 10:30 AM Yanqin Wei (Arm Technology China) 
> mailto:yanqin@arm.com>> wrote:
>
>
> CRC32 is not a mandatory feature for Armv8.0. What is the arm64 CPU used 
> in your platform?
> I'm not sure if it's necessary to use mandatory feature (neon) to 
> optimize hash libraries, because I thought most arm64 CPUs supported CRC32.
>
>
>
> The builders we use in Launchpad for package builds are fully virtualized via 
> OpenStack; I checked and we have some older X-gene CPU's which don't have 
> crc32 support.
>
>
>


Re: [ovs-dev] SIGILL ovs branch-2.12/arm64/DPDK

2019-09-04 Thread Yanqin Wei (Arm Technology China)
Understood. Thanks for the information.

Best Regards,
Wei Yanqin

From: James Page 
Sent: Wednesday, September 4, 2019 5:45 PM
To: Yanqin Wei (Arm Technology China) 
Cc: Ilya Maximets ; ovs-dev@openvswitch.org
Subject: Re: [ovs-dev] SIGILL ovs branch-2.12/arm64/DPDK

Hi

On Wed, Sep 4, 2019 at 10:30 AM Yanqin Wei (Arm Technology China) 
mailto:yanqin@arm.com>> wrote:

CRC32 is not a mandatory feature for Armv8.0. What is the arm64 CPU used in 
your platform?
I'm not sure if it's necessary to use mandatory feature (neon) to optimize hash 
libraries, because I thought most arm64 CPUs supported CRC32.

The builders we use in Launchpad for package builds are fully virtualized via 
OpenStack; I checked and we have some older X-gene CPU's which don't have crc32 
support.



Re: [ovs-dev] SIGILL ovs branch-2.12/arm64/DPDK

2019-09-04 Thread Yanqin Wei (Arm Technology China)
Hi James,

CRC32 is not a mandatory feature for Armv8.0. What is the arm64 CPU used on
your platform?
I'm not sure it's necessary to use a mandatory feature (neon) to optimize the
hash libraries, because I thought most arm64 CPUs supported CRC32.
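
In case it helps triage, a small runtime probe for the CRC32 extension on
aarch64 Linux (illustrative; uses getauxval() from glibc and HWCAP_CRC32 from
the kernel headers):

#include <sys/auxv.h>
#include <asm/hwcap.h>
#include <stdio.h>

int
main(void)
{
    /* HWCAP bits mirror the "Features" line of /proc/cpuinfo. */
    if (getauxval(AT_HWCAP) & HWCAP_CRC32) {
        puts("crc32 supported");
    } else {
        puts("crc32 NOT supported (e.g. older X-Gene parts)");
    }
    return 0;
}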

Best Regards,
Wei Yanqin

-Original Message-
From: ovs-dev-boun...@openvswitch.org  On 
Behalf Of James Page
Sent: Wednesday, September 4, 2019 4:50 PM
To: Ilya Maximets 
Cc: ovs-dev@openvswitch.org
Subject: Re: [ovs-dev] SIGILL ovs branch-2.12/arm64/DPDK

On Tue, Sep 3, 2019 at 5:48 PM Ilya Maximets  wrote:

> > Hi
> >
> > I've been testing non-x86 architecture builds for the upcoming 2.12
> release
> > of OVS and I'm hitting an issue with DPDK enabled builds on the
> > arm64 architecture.
> >
> > branch-2.12 includes improvements for native hashing under arm64;
> > these appear to work fine when DPDK is not in the mix, but with DPDK
> > enabled, I get a SIGILL in lib/hash.c:
>
> Hi.
>
> What is your target platform?
> One explanation I could imagine is that DPDK blindly forces
> -march=armv8-a+crc defining __ARM_FEATURE_CRC32 while your cpu doesn't
> support crc32 extension.
>
> Do you have crc32 in the list of cpu features in /proc/cpuinfo ?
>

It would appear that I do not:

Features : fp asimd evtstrm cpuid

Thanks for the pointer.


[ovs-dev] [PATCH v3] flow: fix incorrect padding length checking in ipv6_sanity_check

2019-09-02 Thread Yanqin Wei
The padding length is (packet size - ipv6 header length - ipv6 plen).  This
patch fixes incorrect padding size checking in ipv6_sanity_check.

Reviewed-by: Gavin Hu 
Signed-off-by: Yanqin Wei 
---
 lib/flow.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/flow.c b/lib/flow.c
index ac6a4e1..0413c67 100644
--- a/lib/flow.c
+++ b/lib/flow.c
@@ -699,7 +699,7 @@ ipv6_sanity_check(const struct ovs_16aligned_ip6_hdr *nh, 
size_t size)
 return false;
 }
 /* Jumbo Payload option not supported yet. */
-if (OVS_UNLIKELY(size - plen > UINT8_MAX)) {
+if (OVS_UNLIKELY(size - (plen + IPV6_HEADER_LEN) > UINT8_MAX)) {
 return false;
 }
 
-- 
2.7.4



Re: [ovs-dev] [PATCH v2] flow: miniflow_extract metadata branchless optimization

2019-08-30 Thread Yanqin Wei (Arm Technology China)
Hi 

I ran a basic NIC2NIC case for compiler branch stripping. The improvement is
around 0.8%.
I observed a 0.37% branch miss rate when directly calling miniflow_extract__
(not inlined here) and a 0.27% branch miss rate when enabling multi-instance
miniflow_extract.

The downside is that the code size gets bigger, which can increase I-cache
misses. In this test that was not observed, because only
miniflow_extract_firstpass is invoked. For some cases such as tunnels or
connection tracking, it might have some impact. But in those cases, branch
prediction failures should also increase if branch stripping is not enabled.

So the impact is platform- (cache size, branch prediction, compiler) and
deployment- (L3 forwarding/tunnel/...) dependent.
On the whole, the greater the difference between first pass and recirculation,
the more benefit there is in enabling multi-instance.
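
For reference, a minimal sketch of the constant-folding trick being measured
here (the names below are made up for illustration; the real code is in the
patch):

#include <stdbool.h>

struct pkt;                                /* Opaque, for illustration. */
void parse_metadata(struct pkt *);         /* Assumed helpers. */
void parse_headers(struct pkt *);

static inline __attribute__((always_inline)) void
extract__(struct pkt *p, const bool md_valid)
{
    if (md_valid) {
        /* Folded away at compile time in the firstpass instance. */
        parse_metadata(p);
    }
    parse_headers(p);
}

/* Two specialized instances; each body is branch-free after inlining. */
void extract_firstpass(struct pkt *p) { extract__(p, false); }
void extract(struct pkt *p)           { extract__(p, true); }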

Best Regards,
Wei Yanqin

-Original Message-
From: Ilya Maximets  
Sent: Thursday, August 29, 2019 9:46 PM
To: Yanqin Wei (Arm Technology China) ; Ben Pfaff 

Cc: d...@openvswitch.org; nd ; Gavin Hu (Arm Technology China) 

Subject: Re: [ovs-dev] [PATCH v2] flow: miniflow_extract metadata branchless 
optimization

On 29.08.2019 12:21, Yanqin Wei (Arm Technology China) wrote:
> Hi Ben,
> Thanks for the feedback.  It is indeed related with userspace datapath. 
> 
> Hi Ilya,
> Could you take a look at this patch when you have time?
> I knew first-pass and recirculating traffic share the same packet handling.
> It makes code common and maintainable.
> But if we can introduce some data-oriented and well-defined flag to bypass 
> some time-consuming handling, it can improve performance a lot.

Hi. I had a quick look at the patch.
Few thoughts:
* 'md' is actually always valid inside 'miniflow_extract', so the
   variable 'md_valid' should be renamed to not be misleading.
   Maybe something like 'md_is_full'? I'm not sure about the name.

* How much is the performance benefit of the compiler code stripping?
  I mean, what is the difference between direct call
 miniflow_extract__(packet, dst, md_is_valid);
  where 'md_is_valid == false' and the call to
 miniflow_extract_firstpass(packet, dst);
  ?
  Asking because this complicates dfc_processing() function.
>   I actually have a patch locally to combine rss_hash
  calculation functions to reduce code duplication, so I'm trying
  to figure out what are the possibilities here.

Best regards, Ilya Maximets.

> 
> Best Regards,
> Wei Yanqin
> 
> -Original Message-
> From: Ben Pfaff 
> Sent: Thursday, August 29, 2019 6:11 AM
> To: Yanqin Wei (Arm Technology China) ; Ilya 
> Maximets 
> Cc: d...@openvswitch.org; nd ; Gavin Hu (Arm Technology 
> China) 
> Subject: Re: [ovs-dev] [PATCH v2] flow: miniflow_extract metadata 
> branchless optimization
> 
> This fails to apply to current master.
> 
> Whereas most of the time I'd be comfortable with reviewing 'flow'
> patches myself, this one is closed related to the dpif-netdev code and I'd 
> prefer to have someone who understands that code well, as well as the 
> tradeoffs between performance and maintainability, review it.  Ilya (added to 
> the To line) is a good choice.
> 
> On Thu, Aug 22, 2019 at 06:09:16PM +0800, Yanqin Wei wrote:
>> "miniflow_extract" is a branch heavy implementation for packet header 
>> and metadata parsing. There is a lot of meta data handling for all traffic.
>> But this should not be applicable for packets from interface.
>> This patch adds a layer of inline encapsulation to miniflow_extract 
>> and introduces constant "md_valid" input parameter as a branch condition.
>> The new branch will be removed by the compiler at compile time. Two 
>> instances of miniflow_extract with different branches will be generated.
>>
>> This patch is tested on an arm64 platform. It improves more than 3.5% 
>> performance in P2P forwarding cases.
>>
>> Reviewed-by: Gavin Hu 
>> Signed-off-by: Yanqin Wei 
>> ---
>>  lib/dpif-netdev.c |  13 +++---
>>  lib/flow.c| 116 
>> --
>>  lib/flow.h|   2 +
>>  3 files changed, 79 insertions(+), 52 deletions(-)
>>
>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
>> d0a1c58..6686b14 100644
>> --- a/lib/dpif-netdev.c
>> +++ b/lib/dpif-netdev.c
>> @@ -6508,12 +6508,15 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
>>  }
>>  }
>>  
>> -miniflow_extract(packet, &key->mf);
>> +if (!md_is_valid) {
>> +miniflow_extract_firstpass(packet, &key->mf);
>> +key->hash =
>> +dpif_netdev_packet_get_rss_hash_orig_pkt(packet, &key->mf);

Re: [ovs-dev] [PATCH v2] flow: miniflow_extract metadata branchless optimization

2019-08-29 Thread Yanqin Wei (Arm Technology China)
Hi Ben,
Thanks for the feedback.  It is indeed related with userspace datapath. 

Hi Ilya, 
Could you take a look at this patch when you have time? 

I knew first-pass and recirculating traffic share the same packet handling. It 
makes code common and maintainable. 
But if we can introduce some data-oriented and well-defined flag to bypass some 
time-consuming handling, it can improve performance a lot.

Best Regards,
Wei Yanqin

-Original Message-
From: Ben Pfaff  
Sent: Thursday, August 29, 2019 6:11 AM
To: Yanqin Wei (Arm Technology China) ; Ilya Maximets 

Cc: d...@openvswitch.org; nd ; Gavin Hu (Arm Technology China) 

Subject: Re: [ovs-dev] [PATCH v2] flow: miniflow_extract metadata branchless 
optimization

This fails to apply to current master.

Whereas most of the time I'd be comfortable with reviewing 'flow'
patches myself, this one is closed related to the dpif-netdev code and I'd 
prefer to have someone who understands that code well, as well as the tradeoffs 
between performance and maintainability, review it.  Ilya (added to the To 
line) is a good choice.

On Thu, Aug 22, 2019 at 06:09:16PM +0800, Yanqin Wei wrote:
> "miniflow_extract" is a branch heavy implementation for packet header 
> and metadata parsing. There is a lot of meta data handling for all traffic.
> But this should not be applicable for packets from interface.
> This patch adds a layer of inline encapsulation to miniflow_extract 
> and introduces constant "md_valid" input parameter as a branch condition.
> The new branch will be removed by the compiler at compile time. Two 
> instances of miniflow_extract with different branches will be generated.
> 
> This patch is tested on an arm64 platform. It improves more than 3.5% 
> performance in P2P forwarding cases.
> 
> Reviewed-by: Gavin Hu 
> Signed-off-by: Yanqin Wei 
> ---
>  lib/dpif-netdev.c |  13 +++---
>  lib/flow.c| 116 
> --
>  lib/flow.h|   2 +
>  3 files changed, 79 insertions(+), 52 deletions(-)
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index d0a1c58..6686b14 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -6508,12 +6508,15 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
>  }
>  }
>  
> -miniflow_extract(packet, >mf);
> +if (!md_is_valid) {
> +miniflow_extract_firstpass(packet, >mf);
> +key->hash =
> +dpif_netdev_packet_get_rss_hash_orig_pkt(packet, >mf);
> +} else {
> +miniflow_extract(packet, >mf);
> +key->hash = dpif_netdev_packet_get_rss_hash(packet, >mf);
> +}
>  key->len = 0; /* Not computed yet. */
> -key->hash =
> -(md_is_valid == false)
> -? dpif_netdev_packet_get_rss_hash_orig_pkt(packet, >mf)
> -: dpif_netdev_packet_get_rss_hash(packet, >mf);
>  
>  /* If EMC is disabled skip emc_lookup */
>  flow = (cur_min != 0) ? emc_lookup(&pmd->emc_cache, key) : NULL;
> diff --git a/lib/flow.c b/lib/flow.c
> index e54fd2e..e5b554b 100644
> --- a/lib/flow.c
> +++ b/lib/flow.c
> @@ -707,7 +707,8 @@ ipv6_sanity_check(const struct 
> ovs_16aligned_ip6_hdr *nh, size_t size)  }
>  
>  /* Initializes 'dst' from 'packet' and 'md', taking the packet type 
> into
> - * account.  'dst' must have enough space for FLOW_U64S * 8 bytes.
> + * account.  'dst' must have enough space for FLOW_U64S * 8 bytes. 
> + Metadata
> + * initialization should be bypassed if "md_valid" is false.
>   *
>   * Initializes the layer offsets as follows:
>   *
> @@ -732,8 +733,9 @@ ipv6_sanity_check(const struct ovs_16aligned_ip6_hdr *nh, 
> size_t size)
>   *  present and the packet has at least the content used for the fields
>   *  of interest for the flow, otherwise UINT16_MAX.
>   */
> -void
> -miniflow_extract(struct dp_packet *packet, struct miniflow *dst)
> +static inline ALWAYS_INLINE void
> +miniflow_extract__(struct dp_packet *packet, struct miniflow *dst,
> +const bool md_valid)
>  {
>  /* Add code to this function (or its callees) to extract new fields. */
>  BUILD_ASSERT_DECL(FLOW_WC_SEQ == 41); @@ -752,54 +754,60 @@ 
> miniflow_extract(struct dp_packet *packet, struct miniflow *dst)
>  ovs_be16 ct_tp_src = 0, ct_tp_dst = 0;
>  
>  /* Metadata. */
> -if (flow_tnl_dst_is_set(&md->tunnel)) {
> -miniflow_push_words(mf, tunnel, &md->tunnel,
> -offsetof(struct flow_tnl, metadata) /
> -sizeof(uint64_t));
> -
> -if (!(md->tunnel.flags & FLOW_TNL_F_UDPIF)) {

Re: [ovs-dev] [PATCH v2] flow: fix incorrect padding length checking and combine branch in ipv6_sanity_check

2019-08-29 Thread Yanqin Wei (Arm Technology China)
Hi Ben,

Thanks for the comments. I'm sorry I didn't notice the risk of doing the 
calculation with mixed types.
The original reason to use 'int' for the size check is that the compiler can 
combine the two conditions into one and save a branch instruction here, 
because a negative value, reinterpreted as unsigned, is always greater than 
UINT8_MAX.
> +if (OVS_UNLIKELY(pad_len < 0 || pad_len > UINT8_MAX)) {

But now I realize it introduces risk and is not clean C code even with a 
cast. So I will drop this performance tweak and keep only the bug fix for 
the padding length calculation in this patch. What do you think?
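
For reference, a tiny standalone sketch (made-up values, not OVS code) of
both points: why the mixed-type subtraction is risky, and how the signed
form lets the compiler fold the two range checks into one unsigned compare:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        size_t size = 40;     /* packet shorter than header + payload len */
        uint16_t plen = 100;
        const int hdr = 40;   /* stand-in for IPV6_HEADER_LEN */

        /* Arithmetic happens in size_t: logically -100, actually
         * SIZE_MAX - 99; the conversion to int is implementation-defined. */
        int pad_len_risky = size - hdr - plen;

        /* Ben's suggestion: force signed arithmetic explicitly. */
        int pad_len = (int) size - hdr - (int) plen;

        /* One branch: a negative pad_len wraps above UINT8_MAX here. */
        int invalid = (unsigned int) pad_len > UINT8_MAX;

        printf("risky=%d safe=%d invalid=%d\n",
               pad_len_risky, pad_len, invalid);
        return 0;
    }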

Best Regards,
Wei Yanqin

-Original Message-
From: Ben Pfaff  
Sent: Thursday, August 29, 2019 5:58 AM
To: Yanqin Wei (Arm Technology China) 
Cc: d...@openvswitch.org; nd ; Gavin Hu (Arm Technology China) 

Subject: Re: [ovs-dev] [PATCH v2] flow: fix incorrect padding length checking 
and combine branch in ipv6_sanity_check

On Thu, Aug 22, 2019 at 06:09:34PM +0800, Yanqin Wei wrote:
> The padding length is (packet size - ipv6 header length - ipv6 plen).
> This patch fixes the incorrect ipv6 size check and improves it by combining
> two branches into one.
> 
> Reviewed-by: Gavin Hu 
> Signed-off-by: Yanqin Wei 
> ---
>  lib/flow.c | 10 --
>  1 file changed, 4 insertions(+), 6 deletions(-)
> 
> diff --git a/lib/flow.c b/lib/flow.c
> index e5b554b..1b21f51 100644
> --- a/lib/flow.c
> +++ b/lib/flow.c
> @@ -688,18 +688,16 @@ ipv4_get_nw_frag(const struct ip_header *nh)  
> static inline bool  ipv6_sanity_check(const struct 
> ovs_16aligned_ip6_hdr *nh, size_t size)  {
> -uint16_t plen;
> +int pad_len;
>  
>  if (OVS_UNLIKELY(size < sizeof *nh)) {
>  return false;
>  }
>  
> -plen = ntohs(nh->ip6_plen);
> -if (OVS_UNLIKELY(plen + IPV6_HEADER_LEN > size)) {
> -return false;
> -}
> +pad_len = size - IPV6_HEADER_LEN - ntohs(nh->ip6_plen);

The types in the above calculation worry me.  Writing the type of each
term:

int = size_t - int - uint16_t

The most likely type of the right side of the expression is size_t.
Assigning this to an 'int' might do "the right thing" for negative values, but 
it's risky--especially since size_t and int might be different widths.  I think 
it would be safer to cast the first and third terms to int, e.g.:

pad_len = (int) size - IPV6_HEADER_LEN - (int) ntohs(nh->ip6_plen);

>  /* Jumbo Payload option not supported yet. */
> -if (OVS_UNLIKELY(size - plen > UINT8_MAX)) {
> +if (OVS_UNLIKELY(pad_len < 0 || pad_len > UINT8_MAX)) {
>  return false;
>  }

Thanks,

Ben.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v2 ] flow: save "vlan_hdrs" memset for untagged traffic

2019-08-22 Thread Yanqin Wei
For untagged traffic, it is unnecessary to clear vlan_hdrs, which costs a
32-byte memset. This patch improves that by postponing the clearing of each
vlan_hdrs entry until the ethertype check confirms a tag. It benefits both
untagged and single-tagged traffic; testing shows no performance impact on
dual-tagged traffic.

Reviewed-by: Gavin Hu 
Signed-off-by: Yanqin Wei 
---
 lib/flow.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/flow.c b/lib/flow.c
index 1b21f51..4d895e5 100644
--- a/lib/flow.c
+++ b/lib/flow.c
@@ -343,7 +343,6 @@ parse_vlan(const void **datap, size_t *sizep, union 
flow_vlan_hdr *vlan_hdrs)
 {
 const ovs_be16 *eth_type;
 
-memset(vlan_hdrs, 0, sizeof(union flow_vlan_hdr) * FLOW_MAX_VLAN_HEADERS);
 data_pull(datap, sizep, ETH_ADDR_LEN * 2);
 
 eth_type = *datap;
@@ -354,6 +353,7 @@ parse_vlan(const void **datap, size_t *sizep, union 
flow_vlan_hdr *vlan_hdrs)
 break;
 }
 
+memset(vlan_hdrs + n, 0, sizeof(union flow_vlan_hdr));
 const ovs_16aligned_be32 *qp = data_pull(datap, sizep, sizeof *qp);
 vlan_hdrs[n].qtag = get_16aligned_be32(qp);
 vlan_hdrs[n].tci |= htons(VLAN_CFI);
-- 
2.7.4
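
The pattern generalizes: instead of zeroing the whole array up front, zero
each element only once the loop commits to filling it. A tiny self-contained
illustration (generic C with hypothetical types, not the OVS code):

    #include <stdio.h>
    #include <string.h>

    #define MAX_HDRS 2

    struct hdr { unsigned int qtag; };

    /* Returns the number of headers parsed; untouched slots stay
     * uninitialized, so callers may only read the first 'n' slots. */
    static int parse(const unsigned int *tags, int n_tags,
                     struct hdr hdrs[MAX_HDRS])
    {
        int n;
        for (n = 0; n < MAX_HDRS && n < n_tags; n++) {
            memset(&hdrs[n], 0, sizeof hdrs[n]);   /* zero this slot only */
            hdrs[n].qtag = tags[n];
        }
        return n;
    }

    int main(void)
    {
        struct hdr hdrs[MAX_HDRS];
        unsigned int tags[] = { 0x8100a001 };      /* one tagged header */
        int n = parse(tags, 1, hdrs);
        printf("parsed %d header(s), qtag0=%#x\n", n, hdrs[0].qtag);
        return 0;
    }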

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v2] flow: fix incorrect padding length checking and combine branch in ipv6_sanity_check

2019-08-22 Thread Yanqin Wei
The padding length is (packet size - ipv6 header length - ipv6 plen).  This
patch fixes the incorrect ipv6 size check and improves it by combining two
branches into one.

Reviewed-by: Gavin Hu 
Signed-off-by: Yanqin Wei 
---
 lib/flow.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/lib/flow.c b/lib/flow.c
index e5b554b..1b21f51 100644
--- a/lib/flow.c
+++ b/lib/flow.c
@@ -688,18 +688,16 @@ ipv4_get_nw_frag(const struct ip_header *nh)
 static inline bool
 ipv6_sanity_check(const struct ovs_16aligned_ip6_hdr *nh, size_t size)
 {
-uint16_t plen;
+int pad_len;
 
 if (OVS_UNLIKELY(size < sizeof *nh)) {
 return false;
 }
 
-plen = ntohs(nh->ip6_plen);
-if (OVS_UNLIKELY(plen + IPV6_HEADER_LEN > size)) {
-return false;
-}
+pad_len = size - IPV6_HEADER_LEN - ntohs(nh->ip6_plen);
+
 /* Jumbo Payload option not supported yet. */
-if (OVS_UNLIKELY(size - plen > UINT8_MAX)) {
+if (OVS_UNLIKELY(pad_len < 0 || pad_len > UINT8_MAX)) {
 return false;
 }
 
-- 
2.7.4

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v2] flow: miniflow_extract metadata branchless optimization

2019-08-22 Thread Yanqin Wei
"miniflow_extract" is a branch heavy implementation for packet header and
metadata parsing. There is a lot of meta data handling for all traffic.
But this should not be applicable for packets from interface.
This patch adds a layer of inline encapsulation to miniflow_extract and
introduces constant "md_valid" input parameter as a branch condition.
The new branch will be removed by the compiler at compile time. Two
instances of miniflow_extract with different branches will be generated.

This patch is tested on an arm64 platform. It improves more than 3.5%
performance in P2P forwarding cases.

Reviewed-by: Gavin Hu 
Signed-off-by: Yanqin Wei 
---
 lib/dpif-netdev.c |  13 +++---
 lib/flow.c| 116 --
 lib/flow.h|   2 +
 3 files changed, 79 insertions(+), 52 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index d0a1c58..6686b14 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -6508,12 +6508,15 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
 }
 }
 
-miniflow_extract(packet, &key->mf);
+if (!md_is_valid) {
+miniflow_extract_firstpass(packet, &key->mf);
+key->hash =
+dpif_netdev_packet_get_rss_hash_orig_pkt(packet, &key->mf);
+} else {
+miniflow_extract(packet, &key->mf);
+key->hash = dpif_netdev_packet_get_rss_hash(packet, &key->mf);
+}
 key->len = 0; /* Not computed yet. */
-key->hash =
-(md_is_valid == false)
-? dpif_netdev_packet_get_rss_hash_orig_pkt(packet, &key->mf)
-: dpif_netdev_packet_get_rss_hash(packet, &key->mf);
 
 /* If EMC is disabled skip emc_lookup */
 flow = (cur_min != 0) ? emc_lookup(&cache->emc_cache, key) : NULL;
diff --git a/lib/flow.c b/lib/flow.c
index e54fd2e..e5b554b 100644
--- a/lib/flow.c
+++ b/lib/flow.c
@@ -707,7 +707,8 @@ ipv6_sanity_check(const struct ovs_16aligned_ip6_hdr *nh, 
size_t size)
 }
 
 /* Initializes 'dst' from 'packet' and 'md', taking the packet type into
- * account.  'dst' must have enough space for FLOW_U64S * 8 bytes.
+ * account.  'dst' must have enough space for FLOW_U64S * 8 bytes. Metadata
+ * initialization should be bypassed if "md_valid" is false.
  *
  * Initializes the layer offsets as follows:
  *
@@ -732,8 +733,9 @@ ipv6_sanity_check(const struct ovs_16aligned_ip6_hdr *nh, 
size_t size)
  *  present and the packet has at least the content used for the fields
  *  of interest for the flow, otherwise UINT16_MAX.
  */
-void
-miniflow_extract(struct dp_packet *packet, struct miniflow *dst)
+static inline ALWAYS_INLINE void
+miniflow_extract__(struct dp_packet *packet, struct miniflow *dst,
+const bool md_valid)
 {
 /* Add code to this function (or its callees) to extract new fields. */
 BUILD_ASSERT_DECL(FLOW_WC_SEQ == 41);
@@ -752,54 +754,60 @@ miniflow_extract(struct dp_packet *packet, struct 
miniflow *dst)
 ovs_be16 ct_tp_src = 0, ct_tp_dst = 0;
 
 /* Metadata. */
-if (flow_tnl_dst_is_set(&md->tunnel)) {
-miniflow_push_words(mf, tunnel, &md->tunnel,
-offsetof(struct flow_tnl, metadata) /
-sizeof(uint64_t));
-
-if (!(md->tunnel.flags & FLOW_TNL_F_UDPIF)) {
-if (md->tunnel.metadata.present.map) {
-miniflow_push_words(mf, tunnel.metadata, &md->tunnel.metadata,
-sizeof md->tunnel.metadata /
-sizeof(uint64_t));
-}
-} else {
-if (md->tunnel.metadata.present.len) {
-miniflow_push_words(mf, tunnel.metadata.present,
-&md->tunnel.metadata.present, 1);
-miniflow_push_words(mf, tunnel.metadata.opts.gnv,
-md->tunnel.metadata.opts.gnv,
-
DIV_ROUND_UP(md->tunnel.metadata.present.len,
- sizeof(uint64_t)));
+if (md_valid) {
+if (flow_tnl_dst_is_set(&md->tunnel)) {
+miniflow_push_words(mf, tunnel, &md->tunnel,
+offsetof(struct flow_tnl, metadata) /
+sizeof(uint64_t));
+
+if (!(md->tunnel.flags & FLOW_TNL_F_UDPIF)) {
+if (md->tunnel.metadata.present.map) {
+miniflow_push_words(mf, tunnel.metadata,
+&md->tunnel.metadata,
+sizeof md->tunnel.metadata /
+sizeof(uint64_t));
+}
+} else {
+if (md->tunnel.metadata.present.len) {
+miniflow_push_words(mf, tunnel.metadat
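
The diff is truncated here in the archive. The overall shape of the change,
reconstructed from the hunks above (a sketch, not the verbatim patch): one
shared inline body whose 'md_valid' argument is a compile-time constant at
each call site, so the compiler folds the branch away and emits two
specialized copies.

    static inline ALWAYS_INLINE void
    miniflow_extract__(struct dp_packet *packet, struct miniflow *dst,
                       const bool md_valid)
    {
        if (md_valid) {
            /* Parse tunnel/conntrack/recirculation metadata. */
        }
        /* Parse the packet headers themselves. */
    }

    void
    miniflow_extract(struct dp_packet *packet, struct miniflow *dst)
    {
        miniflow_extract__(packet, dst, true);   /* metadata already valid */
    }

    void
    miniflow_extract_firstpass(struct dp_packet *packet, struct miniflow *dst)
    {
        miniflow_extract__(packet, dst, false);  /* fresh from the NIC */
    }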

[ovs-dev] [PATCH v1 ] flow: save "vlan_hdrs" memset for untagged traffic

2019-08-22 Thread Yanqin Wei
For untagged traffic, it is unnecessary to clear vlan_hdrs, which costs a
32-byte memset. This patch improves that by postponing the clearing of each
vlan_hdrs entry until the ethertype check confirms a tag. It benefits both
untagged and single-tagged traffic; testing shows no performance impact on
dual-tagged traffic.

Change-Id: I6d503c904d0354c94882196d7720a574b2d07e44
Reviewed-by: Gavin Hu 
Signed-off-by: Yanqin Wei 
---
 lib/flow.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/flow.c b/lib/flow.c
index 1b21f51..4d895e5 100644
--- a/lib/flow.c
+++ b/lib/flow.c
@@ -343,7 +343,6 @@ parse_vlan(const void **datap, size_t *sizep, union 
flow_vlan_hdr *vlan_hdrs)
 {
 const ovs_be16 *eth_type;
 
-memset(vlan_hdrs, 0, sizeof(union flow_vlan_hdr) * FLOW_MAX_VLAN_HEADERS);
 data_pull(datap, sizep, ETH_ADDR_LEN * 2);
 
 eth_type = *datap;
@@ -354,6 +353,7 @@ parse_vlan(const void **datap, size_t *sizep, union 
flow_vlan_hdr *vlan_hdrs)
 break;
 }
 
+memset(vlan_hdrs + n, 0, sizeof(union flow_vlan_hdr));
 const ovs_16aligned_be32 *qp = data_pull(datap, sizep, sizeof *qp);
 vlan_hdrs[n].qtag = get_16aligned_be32(qp);
 vlan_hdrs[n].tci |= htons(VLAN_CFI);
-- 
2.7.4

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v1] flow: fix incorrect padding length checking and combine branch in ipv6_sanity_check

2019-08-22 Thread Yanqin Wei
The padding length is (packet size - ipv6 header length - ipv6 plen).  This
patch fixes the incorrect ipv6 size check and improves it by combining two
branches into one.

Change-Id: I7a6212872f89ef9c11e3fbcec4dbecbcc6e89c47
Reviewed-by: Gavin Hu 
Signed-off-by: Yanqin Wei 
---
 lib/flow.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/lib/flow.c b/lib/flow.c
index e5b554b..1b21f51 100644
--- a/lib/flow.c
+++ b/lib/flow.c
@@ -688,18 +688,16 @@ ipv4_get_nw_frag(const struct ip_header *nh)
 static inline bool
 ipv6_sanity_check(const struct ovs_16aligned_ip6_hdr *nh, size_t size)
 {
-uint16_t plen;
+int pad_len;
 
 if (OVS_UNLIKELY(size < sizeof *nh)) {
 return false;
 }
 
-plen = ntohs(nh->ip6_plen);
-if (OVS_UNLIKELY(plen + IPV6_HEADER_LEN > size)) {
-return false;
-}
+pad_len = size - IPV6_HEADER_LEN - ntohs(nh->ip6_plen);
+
 /* Jumbo Payload option not supported yet. */
-if (OVS_UNLIKELY(size - plen > UINT8_MAX)) {
+if (OVS_UNLIKELY(pad_len < 0 || pad_len > UINT8_MAX)) {
 return false;
 }
 
-- 
2.7.4

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v1] flow: miniflow_extract metadata branchless optimization

2019-08-22 Thread Yanqin Wei
"miniflow_extract" is a branch heavy implementation for packet header and
metadata parsing. There is a lot of meta data handling for all traffic.
But this should not be applicable for packets from interface.
This patch adds a layer of inline encapsulation to miniflow_extract and
introduces constant "md_valid" input parameter as a branch condition.
The new branch will be removed by the compiler at compile time. Two
instances of miniflow_extract with different branches will be generated.

This patch is tested on an arm64 platform. It improves more than 3.5%
performance in P2P forwarding cases.

Change-Id: I5d606afb52bfa68e8afa6f886d69b9665cdad51a
Reviewed-by: Gavin Hu 
Signed-off-by: Yanqin Wei 
---
 lib/dpif-netdev.c |  13 +++---
 lib/flow.c| 116 --
 lib/flow.h|   2 +
 3 files changed, 79 insertions(+), 52 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index d0a1c58..6686b14 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -6508,12 +6508,15 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
 }
 }
 
-miniflow_extract(packet, &key->mf);
+if (!md_is_valid) {
+miniflow_extract_firstpass(packet, &key->mf);
+key->hash =
+dpif_netdev_packet_get_rss_hash_orig_pkt(packet, &key->mf);
+} else {
+miniflow_extract(packet, &key->mf);
+key->hash = dpif_netdev_packet_get_rss_hash(packet, &key->mf);
+}
 key->len = 0; /* Not computed yet. */
-key->hash =
-(md_is_valid == false)
-? dpif_netdev_packet_get_rss_hash_orig_pkt(packet, &key->mf)
-: dpif_netdev_packet_get_rss_hash(packet, &key->mf);
 
 /* If EMC is disabled skip emc_lookup */
 flow = (cur_min != 0) ? emc_lookup(&cache->emc_cache, key) : NULL;
diff --git a/lib/flow.c b/lib/flow.c
index e54fd2e..e5b554b 100644
--- a/lib/flow.c
+++ b/lib/flow.c
@@ -707,7 +707,8 @@ ipv6_sanity_check(const struct ovs_16aligned_ip6_hdr *nh, 
size_t size)
 }
 
 /* Initializes 'dst' from 'packet' and 'md', taking the packet type into
- * account.  'dst' must have enough space for FLOW_U64S * 8 bytes.
+ * account.  'dst' must have enough space for FLOW_U64S * 8 bytes. Metadata
+ * initialization should be bypassed if "md_valid" is false.
  *
  * Initializes the layer offsets as follows:
  *
@@ -732,8 +733,9 @@ ipv6_sanity_check(const struct ovs_16aligned_ip6_hdr *nh, 
size_t size)
  *  present and the packet has at least the content used for the fields
  *  of interest for the flow, otherwise UINT16_MAX.
  */
-void
-miniflow_extract(struct dp_packet *packet, struct miniflow *dst)
+static inline ALWAYS_INLINE void
+miniflow_extract__(struct dp_packet *packet, struct miniflow *dst,
+const bool md_valid)
 {
 /* Add code to this function (or its callees) to extract new fields. */
 BUILD_ASSERT_DECL(FLOW_WC_SEQ == 41);
@@ -752,54 +754,60 @@ miniflow_extract(struct dp_packet *packet, struct 
miniflow *dst)
 ovs_be16 ct_tp_src = 0, ct_tp_dst = 0;
 
 /* Metadata. */
-if (flow_tnl_dst_is_set(&md->tunnel)) {
-miniflow_push_words(mf, tunnel, &md->tunnel,
-offsetof(struct flow_tnl, metadata) /
-sizeof(uint64_t));
-
-if (!(md->tunnel.flags & FLOW_TNL_F_UDPIF)) {
-if (md->tunnel.metadata.present.map) {
-miniflow_push_words(mf, tunnel.metadata, &md->tunnel.metadata,
-sizeof md->tunnel.metadata /
-sizeof(uint64_t));
-}
-} else {
-if (md->tunnel.metadata.present.len) {
-miniflow_push_words(mf, tunnel.metadata.present,
-&md->tunnel.metadata.present, 1);
-miniflow_push_words(mf, tunnel.metadata.opts.gnv,
-md->tunnel.metadata.opts.gnv,
-
DIV_ROUND_UP(md->tunnel.metadata.present.len,
- sizeof(uint64_t)));
+if (md_valid) {
+if (flow_tnl_dst_is_set(&md->tunnel)) {
+miniflow_push_words(mf, tunnel, &md->tunnel,
+offsetof(struct flow_tnl, metadata) /
+sizeof(uint64_t));
+
+if (!(md->tunnel.flags & FLOW_TNL_F_UDPIF)) {
+if (md->tunnel.metadata.present.map) {
+miniflow_push_words(mf, tunnel.metadata,
+&md->tunnel.metadata,
+sizeof md->tunnel.metadata /
+sizeof(uint64_t));
+}
+} else {
+if (md->tunnel.metadata.present.len) {
+

[ovs-dev] [PATCH v1] util: implement count_1bits with Neon intrinsics or gcc built-in for aarch64.

2019-06-13 Thread Yanqin Wei
The userspace datapath needs to traverse miniflow values many times. In this
process, the 'count_1bits' operation on a 'flowmap' significantly impacts
performance. On arm, this function used the portable implementation because
gcc for arm does not support the popcnt feature.
But on aarch64, the VCNT Neon instruction can accelerate count_1bits, and
from GCC 7 on the built-in function is implemented with that Neon instruction.
With this patch, count_1bits is implemented with the gcc built-in from GCC 7
onwards, and with Neon intrinsics on GCC 6.
Performance tests were run on two aarch64 machines. In the NIC2NIC test, the
one-tuple dpcls lookup case achieves around 4% throughput improvement and the
10-tuple (average) case achieves around 5% improvement.

Tested-by: Malvika Gupta 
Signed-off-by: Yanqin Wei 
---
 lib/util.h | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/lib/util.h b/lib/util.h
index 53354f1..2fd01f4 100644
--- a/lib/util.h
+++ b/lib/util.h
@@ -29,6 +29,9 @@
 #include "compiler.h"
 #include "util.h"
 #include "openvswitch/util.h"
+#if defined(__aarch64__) && __GNUC__ >= 6
+#include <arm_neon.h>
+#endif
 
 extern char *program_name;
 
@@ -353,8 +356,10 @@ log_2_ceil(uint64_t n)
 static inline unsigned int
 count_1bits(uint64_t x)
 {
-#if __GNUC__ >= 4 && __POPCNT__
+#if (__GNUC__ >= 4 && __POPCNT__) || (defined(__aarch64__) && __GNUC__ >= 7)
 return __builtin_popcountll(x);
+#elif defined(__aarch64__) && __GNUC__ >= 6
+return vaddv_u8(vcnt_u8(vcreate_u8(x)));
 #else
 /* This portable implementation is the fastest one we know of for 64
  * bits, and about 3x faster than GCC 4.7 __builtin_popcountll(). */
-- 
2.7.4
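
To sanity-check the Neon path against the portable one, a minimal standalone
program (assuming an aarch64 toolchain that provides arm_neon.h; the sample
values are arbitrary):

    #include <stdio.h>
    #include <stdint.h>
    #include <arm_neon.h>

    /* Clear-lowest-set-bit popcount, used as the reference. */
    static unsigned int popcount_ref(uint64_t x)
    {
        unsigned int n = 0;
        for (; x; x &= x - 1) {
            n++;
        }
        return n;
    }

    int main(void)
    {
        uint64_t samples[] = { 0, 1, 0xff, 0x8000000000000001ULL, UINT64_MAX };
        for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
            /* Per-byte popcount (CNT), then horizontal add (ADDV). */
            unsigned int neon = vaddv_u8(vcnt_u8(vcreate_u8(samples[i])));
            printf("%016llx: neon=%u ref=%u\n",
                   (unsigned long long) samples[i], neon,
                   popcount_ref(samples[i]));
        }
        return 0;
    }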

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] OVS compiling issue(gcc version 6.5.0 )

2019-06-04 Thread Yanqin Wei (Arm Technology China)
Hi Ben,

After reconfiguring with the compiler explicitly set (CC=gcc), this issue is
solved. I tried to reproduce it and found the reproducing condition is
switching the default gcc version after running the configure script:
1. The default gcc is gcc-8; run the configure script.
2. Change the default gcc to gcc-6.
3. Build.  <-- compiling error

4. Specify the C compiler to the configure script (CC=gcc).  <-- build succeeds
5. Change the default gcc to gcc-8 and run the configure script.
6. Change the default gcc to gcc-6 and build.  <-- compiling error

In the config.log after step 3, the detected gcc version is 8.2, so the
configure script chose compiler options that gcc-6 does not support.
configure:4038: checking for C compiler version
configure:4047: gcc --version >&5
gcc (GCC) 8.2.0
Copyright (C) 2018 Free Software Foundation, Inc.

So I think it is not a bug but a configuration issue. Thanks for your
support.

Best Regards,
Wei Yanqin

-Original Message-
From: Ben Pfaff 
Sent: Tuesday, June 4, 2019 7:11 AM
To: Yanqin Wei (Arm Technology China) 
Cc: ovs-disc...@openvswitch.org; d...@openvswitch.org
Subject: Re: [ovs-dev] OVS compiling issue(gcc version 6.5.0 )

On Mon, Jun 03, 2019 at 11:18:23AM +, Yanqin Wei (Arm Technology China) 
wrote:
> I am trying to compile OVS(master branch) via gcc 6.5.0, but there is gcc 
> option error.
> gcc: error: unrecognized command line option '-Wmultistatement-macros'; did 
> you mean '-Wunused-macros'?

What's in config.log?

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] OVS compiling issue(gcc version 6.5.0 )

2019-06-03 Thread Yanqin Wei (Arm Technology China)
Hi,

I am trying to compile OVS (master branch) with gcc 6.5.0, but there is a gcc
option error:
gcc: error: unrecognized command line option '-Wmultistatement-macros'; did you 
mean '-Wunused-macros'?

My question is: which versions of GCC are supported by OVS? Is this a bug in
the OVS build? The build output is attached.


gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/6/lto-wrapper
Target: aarch64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro 
6.5.0-2ubuntu1~16.04' --with-bugurl=file:///usr/share/doc/gcc-6/README.Bugs 
--enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr 
--with-as=/usr/bin/aarch64-linux-gnu-as --with-ld=/usr/bin/aarch64-linux-gnu-ld 
--program-suffix=-6 --program-prefix=aarch64-linux-gnu- --enable-shared 
--enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext 
--enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ 
--enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes 
--with-default-libstdcxx-abi=new --enable-gnu-unique-object 
--disable-libquadmath --enable-plugin --with-system-zlib 
--disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo 
--with-java-home=/usr/lib/jvm/java-1.5.0-gcj-6-arm64/jre --enable-java-home 
--with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-6-arm64 
--with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-6-arm64 
--with-arch-directory=aarch64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar 
--enable-multiarch --enable-fix-cortex-a53-843419 --disable-werror 
--enable-checking=release --build=aarch64-linux-gnu --host=aarch64-linux-gnu 
--target=aarch64-linux-gnu
Thread model: posix
gcc version 6.5.0 20181026 (Ubuntu/Linaro 6.5.0-2ubuntu1~16.04)


Best Regards,
Wei Yanqin



:~/repo/openvswitch$ sudo -E make install -j 20 LDFLAGS="-libverbs"
make  install-recursive
make[1]: Entering directory '/home/wei/repo/openvswitch'
Making install in datapath
make[2]: Entering directory '/home/wei/repo/openvswitch/datapath'
make[3]: Entering directory '/home/wei/repo/openvswitch/datapath'
make[4]: Entering directory '/home/wei/repo/openvswitch/datapath'
make[4]: Nothing to be done for 'install-exec-am'.
make[4]: Nothing to be done for 'install-data-am'.
make[4]: Leaving directory '/home/wei/repo/openvswitch/datapath'
make[3]: Leaving directory '/home/wei/repo/openvswitch/datapath'
make[2]: Leaving directory '/home/wei/repo/openvswitch/datapath'
make[2]: Entering directory '/home/wei/repo/openvswitch'
/bin/bash ./libtool  --tag=CC   --mode=compile gcc -DHAVE_CONFIG_H -I.-I 
./include -I ./include -I ./lib -I ./lib-Wstrict-prototypes -Wall -Wextra 
-Wno-sign-compare -Wpointer-arith -Wformat -Wformat-security -Wswitch-enum 
-Wunused-parameter -Wbad-function-cast -Wcast-align -Wstrict-prototypes 
-Wold-style-definition -Wmissing-prototypes -Wmissing-field-initializers 
-fno-strict-aliasing -Wswitch-bool -Wlogical-not-parentheses 
-Wsizeof-array-argument -Wbool-compare -Wshift-negative-value -Wduplicated-cond 
-Wshadow -Wmultistatement-macros -Wcast-align=strict 
-I/home/wei/dpdk-stable/arm64-armv8a-linuxapp-gcc/include 
-D_FILE_OFFSET_BITS=64  -Wno-unused -Wno-unused-parameter -O3 -march=native -MT 
lib/lib_libsflow_la-sflow_agent.lo -MD -MP -MF 
lib/.deps/lib_libsflow_la-sflow_agent.Tpo -c -o 
lib/lib_libsflow_la-sflow_agent.lo `test -f 'lib/sflow_agent.c' || echo 
'./'`lib/sflow_agent.c
/bin/bash ./libtool  --tag=CC   --mode=compile gcc -DHAVE_CONFIG_H -I.-I 
./include -I ./include -I ./lib -I ./lib-Wstrict-prototypes -Wall -Wextra 
-Wno-sign-compare -Wpointer-arith -Wformat -Wformat-security -Wswitch-enum 
-Wunused-parameter -Wbad-function-cast -Wcast-align -Wstrict-prototypes 
-Wold-style-definition -Wmissing-prototypes -Wmissing-field-initializers 
-fno-strict-aliasing -Wswitch-bool -Wlogical-not-parentheses 
-Wsizeof-array-argument -Wbool-compare -Wshift-negative-value -Wduplicated-cond 
-Wshadow -Wmultistatement-macros -Wcast-align=strict 
-I/home/wei/dpdk-stable/arm64-armv8a-linuxapp-gcc/include 
-D_FILE_OFFSET_BITS=64  -Wno-unused -Wno-unused-parameter -O3 -march=native -MT 
lib/lib_libsflow_la-sflow_sampler.lo -MD -MP -MF 
lib/.deps/lib_libsflow_la-sflow_sampler.Tpo -c -o 
lib/lib_libsflow_la-sflow_sampler.lo `test -f 'lib/sflow_sampler.c' || echo 
'./'`lib/sflow_sampler.c
/bin/bash ./libtool  --tag=CC   --mode=compile gcc -DHAVE_CONFIG_H -I.-I 
./include -I ./include -I ./lib -I ./lib-Wstrict-prototypes -Wall -Wextra 
-Wno-sign-compare -Wpointer-arith -Wformat -Wformat-security -Wswitch-enum 
-Wunused-parameter -Wbad-function-cast -Wcast-align 

[ovs-dev] [PATCH v4 2/2] dpif-netdev: dfc_process performance optimization by prefetching EMC entry.

2019-06-03 Thread Yanqin Wei
It is observed that throughput with a medium number of flows (9-8191) is
worse than with a low number of flows (1-8) in the EMC NIC2NIC test.
This is because of increasing CPU cache misses in the EMC lookup: each flow
needs to load at least one EMC entry into the CPU L1 cache (several cache
lines) and compare it with the packet miniflow.
This patch improves that by prefetching the EMC entry in advance. The hash
value can be obtained from the DPDK RSS hash, so this step can be moved
ahead of miniflow_extract() and the EMC entry prefetched there. Testing on
several kinds of CPU with 32K L1 caches (x86-64 and arm64) shows that
prefetching starts to improve performance from 8~10 flows onwards. To
benefit most modern CPUs, the minimum threshold is set to 20. The maximum
threshold is set to EM_FLOW_HASH_ENTRIES-1 because entry prefetching becomes
a net loss with a huge number of flows. So this patch prefetches one EMC
cache line only when the EMC counter is in the range 20-8191, which ensures
no side effects in any case.
Performance tests were run on several arm and x86 platforms. The medium-flow
cases achieved around 2-3% improvement in RFC2544 tests on x86 and arm. A
high number of flows (>8191) also benefits during EMC insertion, achieving
around 2% improvement in the 100k-flow RFC2544 test. A low number of flows
shows almost no performance impact.
Signed-off-by: Yanqin Wei 
Reviewed-by: Gavin Hu 
---
 lib/dpif-netdev.c | 66 ---
 1 file changed, 43 insertions(+), 23 deletions(-)
 mode change 100644 => 100755 lib/dpif-netdev.c

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
old mode 100644
new mode 100755
index c74cc02..dc2ad64
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -189,6 +189,9 @@ struct netdev_flow_key {
 #define DEFAULT_EM_FLOW_INSERT_MIN (UINT32_MAX / \
 DEFAULT_EM_FLOW_INSERT_INV_PROB)
 
+/* Prefetch minimum threshold. */
+#define EMC_PREFETCH_MIN_THRESHOLD 10
+
 struct emc_entry {
 struct dp_netdev_flow *flow;
 struct netdev_flow_key key;   /* key.hash used for emc hash value. */
@@ -215,6 +218,11 @@ struct dfc_cache {
 struct smc_cache smc_cache;
 };
 
+/* Prefetch in case of [EMC_PREFETCH_MIN_THRESHOLD, EM_FLOW_HASH_ENTRIES) entries. */
+#define EMC_PREFETCH_IN_RANGE(DFC_CACHE)\
+((DFC_CACHE)->emc_cache.counter >= EMC_PREFETCH_MIN_THRESHOLD \
+&& (DFC_CACHE)->emc_cache.counter < EM_FLOW_HASH_ENTRIES)
+
 /* Iterate in the exact match cache through every entry that might contain a
  * miniflow with hash 'HASH'. */
 #define EMC_FOR_EACH_POS_WITH_HASH(EMC, CURRENT_ENTRY, HASH) \
@@ -6172,41 +6180,41 @@ dp_netdev_upcall(struct dp_netdev_pmd_thread *pmd, 
struct dp_packet *packet_,
 }
 
 static inline uint32_t
-dpif_netdev_packet_get_rss_hash_orig_pkt(struct dp_packet *packet,
-const struct miniflow *mf)
+dpif_netdev_packet_get_5tuple_hash(struct dp_packet *packet,
+   const struct miniflow *mf,
+   bool account_recirc_id)
 {
-uint32_t hash;
+uint32_t hash, recirc_depth;
 
-if (OVS_LIKELY(dp_packet_rss_valid(packet))) {
-hash = dp_packet_get_rss_hash(packet);
-} else {
-hash = miniflow_hash_5tuple(mf, 0);
-dp_packet_set_rss_hash(packet, hash);
+hash = miniflow_hash_5tuple(mf, 0);
+
+if (account_recirc_id) {
+/* The RSS hash must account for the recirculation depth to avoid
+ * collisions in the exact match cache */
+recirc_depth = *recirc_depth_get_unsafe();
+hash = hash_finish(hash, recirc_depth);
 }
 
+dp_packet_set_rss_hash(packet, hash);
 return hash;
 }
 
 static inline uint32_t
 dpif_netdev_packet_get_rss_hash(struct dp_packet *packet,
-const struct miniflow *mf)
+bool account_recirc_id)
 {
 uint32_t hash, recirc_depth;
 
-if (OVS_LIKELY(dp_packet_rss_valid(packet))) {
-hash = dp_packet_get_rss_hash(packet);
-} else {
-hash = miniflow_hash_5tuple(mf, 0);
-dp_packet_set_rss_hash(packet, hash);
-}
+hash = dp_packet_get_rss_hash(packet);
 
-/* The RSS hash must account for the recirculation depth to avoid
- * collisions in the exact match cache */
-recirc_depth = *recirc_depth_get_unsafe();
-if (OVS_UNLIKELY(recirc_depth)) {
+if (account_recirc_id) {
+/* The RSS hash must account for the recirculation depth to avoid
+ * collisions in the exact match cache */
+recirc_depth = *recirc_depth_get_unsafe();
 hash = hash_finish(hash, recirc_depth);
 dp_packet_set_rss_hash(packet, hash);
 }
+
 return hash;
 }
 
@@ -6396,6 +6404,8 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
 bool smc_enable_db;
 size_t map_cnt = 0;
 bool batch_enable = true;
+bool rss_valid;
+bool prefetch_emc = cur_min &
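
The diff is truncated here in the archive; the ordering it implements can be
sketched as follows (a simplified illustration, not the patch itself;
emc_entry_addr() and emc_counter_in_prefetch_range() are hypothetical
stand-ins for the EMC_FOR_EACH_POS_WITH_HASH indexing and the counter check
introduced above):

    /* Per-packet fast path, simplified. */
    uint32_t hash = dp_packet_get_rss_hash(packet);       /* from NIC RSS */

    if (emc_counter_in_prefetch_range(cache)) {           /* hypothetical */
        /* Start fetching the EMC entry's cache line now; the fetch
         * overlaps with the parsing work below. */
        __builtin_prefetch(emc_entry_addr(cache, hash));  /* hypothetical */
    }

    miniflow_extract(packet, &key->mf);                   /* header parsing */

    flow = emc_lookup(&cache->emc_cache, key);            /* entry now warm */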

[ovs-dev] [PATCH v4 1/2] dpif-netdev: add EMC entries counter

2019-06-03 Thread Yanqin Wei
Implement an entry counter for the EMC. It can be used to improve the EMC
lookup.

Signed-off-by: Yanqin Wei 
Reviewed-by: Gavin Hu 
---
 lib/dpif-netdev.c | 22 ++
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 5a6f2ab..c74cc02 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -197,6 +197,7 @@ struct emc_entry {
 struct emc_cache {
 struct emc_entry entries[EM_FLOW_HASH_ENTRIES];
 int sweep_idx;/* For emc_cache_slow_sweep(). */
+uint32_t counter;
 };
 
 struct smc_bucket {
@@ -826,7 +827,7 @@ static int dpif_netdev_xps_get_tx_qid(const struct 
dp_netdev_pmd_thread *pmd,
   struct tx_port *tx);
 
 static inline bool emc_entry_alive(struct emc_entry *ce);
-static void emc_clear_entry(struct emc_entry *ce);
+static void emc_clear_entry(struct emc_cache *cache, struct emc_entry *ce);
 static void smc_clear_entry(struct smc_bucket *b, int idx);
 
 static void dp_netdev_request_reconfigure(struct dp_netdev *dp);
@@ -840,6 +841,7 @@ emc_cache_init(struct emc_cache *flow_cache)
 {
 int i;
 
+flow_cache->counter = 0;
 flow_cache->sweep_idx = 0;
 for (i = 0; i < ARRAY_SIZE(flow_cache->entries); i++) {
 flow_cache->entries[i].flow = NULL;
@@ -872,8 +874,9 @@ emc_cache_uninit(struct emc_cache *flow_cache)
 {
 int i;
 
+flow_cache->counter = 0;
 for (i = 0; i < ARRAY_SIZE(flow_cache->entries); i++) {
-emc_clear_entry(&flow_cache->entries[i]);
+emc_clear_entry(flow_cache, &flow_cache->entries[i]);
 }
 }
 
@@ -904,7 +907,7 @@ emc_cache_slow_sweep(struct emc_cache *flow_cache)
 struct emc_entry *entry = &flow_cache->entries[flow_cache->sweep_idx];
 
 if (!emc_entry_alive(entry)) {
-emc_clear_entry(entry);
+emc_clear_entry(flow_cache,entry);
 }
 flow_cache->sweep_idx = (flow_cache->sweep_idx + 1) & EM_FLOW_HASH_MASK;
 }
@@ -2771,25 +2774,28 @@ emc_entry_alive(struct emc_entry *ce)
 }
 
 static void
-emc_clear_entry(struct emc_entry *ce)
+emc_clear_entry(struct emc_cache *cache, struct emc_entry *ce)
 {
 if (ce->flow) {
 dp_netdev_flow_unref(ce->flow);
 ce->flow = NULL;
+cache->counter--;
 }
 }
 
 static inline void
-emc_change_entry(struct emc_entry *ce, struct dp_netdev_flow *flow,
- const struct netdev_flow_key *key)
+emc_change_entry(struct emc_cache *cache, struct emc_entry *ce,
+struct dp_netdev_flow *flow, const struct netdev_flow_key *key)
 {
 if (ce->flow != flow) {
 if (ce->flow) {
 dp_netdev_flow_unref(ce->flow);
+cache->counter--;
 }
 
 if (dp_netdev_flow_ref(flow)) {
 ce->flow = flow;
+cache->counter++;
 } else {
 ce->flow = NULL;
 }
@@ -2809,7 +2815,7 @@ emc_insert(struct emc_cache *cache, const struct 
netdev_flow_key *key,
 EMC_FOR_EACH_POS_WITH_HASH(cache, current_entry, key->hash) {
 if (netdev_flow_key_equal(&current_entry->key, key)) {
 /* We found the entry with the 'mf' miniflow */
-emc_change_entry(current_entry, flow, NULL);
+emc_change_entry(cache,current_entry, flow, NULL);
 return;
 }
 
@@ -2825,7 +2831,7 @@ emc_insert(struct emc_cache *cache, const struct 
netdev_flow_key *key,
 /* We didn't find the miniflow in the cache.
  * The 'to_be_replaced' entry is where the new flow will be stored */
 
-emc_change_entry(to_be_replaced, flow, key);
+emc_change_entry(cache,to_be_replaced, flow, key);
 }
 
 static inline void
-- 
2.7.4

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v4 0/2] dfc_process optimization by prefetching EMC entry

2019-06-03 Thread Yanqin Wei
It is observed that throughput with a medium number of flows (9-8191) is
worse than with a low number of flows (1-8) in the EMC NIC2NIC test.
This patch improves it by prefetching the EMC entry in advance. In order not
to affect the low and high flow-count scenarios, an EMC counter is introduced
so that prefetching happens only in the medium flow-count scenario.

Yanqin Wei (2):
  dpif-netdev: add EMC entries counter
  dpif-netdev: dfc_process performance optimization by prefetching EMC
entry.

 lib/dpif-netdev.c | 88 +++
 1 file changed, 57 insertions(+), 31 deletions(-)
 mode change 100644 => 100755 lib/dpif-netdev.c

-- 
2.7.4

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCHv9] netdev-afxdp: add new netdev type for AF_XDP.

2019-05-26 Thread Yanqin Wei (Arm Technology China)
Hi William,

I think the main objective of this patch is to introduce an AF_XDP socket
option for the OVS userspace datapath, as an alternative to the DPDK PMD
library. Is it possible to continue using other DPDK libraries until XDP
provides the corresponding functionality?

Best Regards,
Wei Yanqin

-Original Message-
From: ovs-dev-boun...@openvswitch.org  On 
Behalf Of William Tu
Sent: Saturday, May 25, 2019 4:37 AM
To: Ben Pfaff 
Cc:  ; Ilya Maximets 

Subject: Re: [ovs-dev] [PATCHv9] netdev-afxdp: add new netdev type for AF_XDP.

On Fri, May 24, 2019 at 11:32 AM Ben Pfaff  wrote:
>
> On Fri, May 24, 2019 at 11:03:33AM -0700, William Tu wrote:
> > The patch introduces experimental AF_XDP support for OVS netdev.
> > AF_XDP, the Address Family of the eXpress Data Path, is a new Linux
> > socket type built upon the eBPF and XDP technology.  It is aims to
> > have comparable performance to DPDK but cooperate better with
> > existing kernel's networking stack.  An AF_XDP socket receives and
> > sends packets from an eBPF/XDP program attached to the netdev,
> > by-passing a couple of Linux kernel's subsystems As a result, AF_XDP
> > socket shows much better performance than AF_PACKET For more details
> > about AF_XDP, please see linux kernel's
> > Documentation/networking/af_xdp.rst. Note that by default, this feature is 
> > not compiled in.
> >
> > Signed-off-by: William Tu 
>
> How heavy is the x86(_64)-only dependency?  It seems undesirable. is

Now in my patch, cycles_counter_update has x86-specific instructions.
Other parts of the code have no x86-only dependency.

The reason I made this x86-only is that AF_XDP is rarely tested on
non-x86 systems, so I'm not sure whether it works or not.

Regards,
William
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v3] dpif-netdev: dfc_process optimization by prefetching EMC entry.

2019-04-01 Thread Yanqin Wei (Arm Technology China)
So the main concern is the performance drop for 1~8 flows, because the EMC is
effective there and would be enabled.
On the other hand, EMC prefetching does not take effect when the EMC is
disabled, and with a large number of flows the EMC is likely to be disabled.
So the performance drop there can be avoided by configuration. Is my
understanding correct?

Best Regards,
Wei Yanqin

-Original Message-
From: Ilya Maximets  
Sent: Monday, April 1, 2019 3:15 PM
To: Yanqin Wei (Arm Technology China) ; 
d...@openvswitch.org; ian.sto...@intel.com
Cc: nd ; Gavin Hu (Arm Technology China) 
Subject: Re: [ovs-dev][PATCH v3] dpif-netdev: dfc_process optimization by 
prefetching EMC entry.

On 01.04.2019 9:52, Yanqin Wei (Arm Technology China) wrote:
> Hi Ilya,
> 
> Really appreciate your time in doing benchmarking for this patch. And many 
> thanks for your valuable comments.
> It is quite possible that prefetching the second, unneeded cache line causes
> the performance drop in the case of a big number of flows (64K - 256K).
> This impacts the real-world use case; I will try to improve it. For the 1-8
> flow cases, it should be less important because of the lack of actual
> deployments.

I'm not actually sure which of these cases is more real.
We're suggesting to disable EMC for cases where it's not effective.
From the other side, if the VM encapsulates all the incoming traffic and sends
it for further processing, there will be only a few outgoing flows.

> 
> Best Regards,
> Wei Yanqin
> 
> -Original Message-
> From: Ilya Maximets 
> Sent: Monday, April 1, 2019 2:16 PM
> To: Yanqin Wei (Arm Technology China) ; 
> d...@openvswitch.org; ian.sto...@intel.com
> Cc: nd ; Gavin Hu (Arm Technology China) 
> 
> Subject: Re: [ovs-dev][PATCH v3] dpif-netdev: dfc_process optimization by 
> prefetching EMC entry.
> 
> On 29.03.2019 20:33, Ilya Maximets wrote:
>> Hi.
>> I made a few tests on a PVP with bonded PHY setup and found no 
>> significant difference in maximum performance with low and medium 
>> number of flows
>> (8 - 8192).
> 
> Correction: I see a slight performance drop (~1%) for the cases with a very
> low number of flows (1-8), and a performance increase (~ 2-3%) for the medium
> low number of flows (64 - 4096).
> In this scenario packets from the VM already have a calculated hash on
> recirculation, so the prefetching optimization doesn't work for the first
> pass through the datapath, but works for the second.
> Degradation with a single or very low number of flows is probably caused by
> the additional checks and instructions for prefetching memory that is
> already in cache.
> 
>> In the case of a big number of flows (64K - 256K) I see a performance drop
>> of about 2-3%. I think that is because of prefetching a second cacheline,
>> which is unneeded: current_entry->key.hash is likely != key->hash, so we
>> don't need to compare miniflows in emc_lookup. Changing the code to
>> prefetch only the first cacheline decreases the drop to ~1% (however,
>> this increases the consumed processing cycles for the medium numbers of
>> flows described below).
> 
> In general, this patch decreases the performance for cases where EMC is not 
> efficient and improves it for cases where EMC is efficient, except the cases 
> with very low numbers of flows, which could be the main concern.
> 
>>
>> OTOH, I see a slight decrease (~1%) in consumed cycles per packet for
>> the thread that polls HW NICs and sends packets to the VM, which is good.
>> This improvement is observed for a medium-small number of flows: 512 - 8192.
>> For low (1 - 256) and high (64K - 256K) numbers of flows, the consumed
>> processing cycles per packet for this thread were not affected by the
>> patch.
>>
>> Tests were made with average 512B packets, EMC enabled, SMC disabled, and a
>> TX flushing interval of 50us. Note that the bottleneck in this case is the
>> VM --> bonded PHY part, which is the case with 5-tuple hash calculation.
>>
>> See review comments inline.
>>
>> Best regards, Ilya Maximets.
>>
>> On 22.03.2019 11:44, Yanqin Wei (Arm Technology China) wrote:
>>> Hi , OVS Maintainers,
>>>
>>> Could you help to have a look at this patch? Thanks a lot.
>>>
>>> Best Regards,
>>> Wei Yanqin
>>>
>>> -Original Message-
>>> From: Yanqin Wei 
>>> Sent: Wednesday, March 13, 2019 1:28 PM
>>> To: d...@openvswitch.org
>>> Cc: nd ; Gavin Hu (Arm Technology China) 
>>> ; Yanqin Wei (Arm Technology China) 
>>> 
>>> Subject: [ovs-dev][PATCH v3] dpif-netdev: dfc_process optimization by 
>>> prefetching EMC entry.
>>>
>>

Re: [ovs-dev] [PATCH v3] dpif-netdev: dfc_process optimization by prefetching EMC entry.

2019-04-01 Thread Yanqin Wei (Arm Technology China)
Hi Ilya,

I really appreciate your time benchmarking this patch, and many thanks for
your valuable comments.
It is quite possible that prefetching the second, unneeded cache line causes
the performance drop in the case of a big number of flows (64K - 256K).
This impacts the real-world use case; I will try to improve it. For the 1-8
flow cases, it should be less important because of the lack of actual
deployments.

Best Regards,
Wei Yanqin

-Original Message-
From: Ilya Maximets  
Sent: Monday, April 1, 2019 2:16 PM
To: Yanqin Wei (Arm Technology China) ; 
d...@openvswitch.org; ian.sto...@intel.com
Cc: nd ; Gavin Hu (Arm Technology China) 
Subject: Re: [ovs-dev][PATCH v3] dpif-netdev: dfc_process optimization by 
prefetching EMC entry.

On 29.03.2019 20:33, Ilya Maximets wrote:
> Hi.
> I made a few tests on a PVP with bonded PHY setup and found no significant 
> difference in maximum performance with low and medium number of flows
> (8 - 8192).

Correction: I see a slight performance drop (~1%) for the cases with a very
low number of flows (1-8), and a performance increase (~ 2-3%) for the medium
low number of flows (64 - 4096).
In this scenario packets from the VM already have a calculated hash on
recirculation, so the prefetching optimization doesn't work for the first pass
through the datapath, but works for the second.
Degradation with a single or very low number of flows is probably caused by
the additional checks and instructions for prefetching memory that is already
in cache.

> In the case of a big number of flows (64K - 256K) I see a performance drop
> of about 2-3%. I think that is because of prefetching a second cacheline,
> which is unneeded: current_entry->key.hash is likely != key->hash, so we
> don't need to compare miniflows in emc_lookup. Changing the code to prefetch
> only the first cacheline decreases the drop to ~1% (however, this increases
> the consumed processing cycles for the medium numbers of flows described
> below).

In general, this patch decreases the performance for cases where EMC is not 
efficient and improves it for cases where EMC is efficient, except the cases 
with very low numbers of flows, which could be the main concern.

> 
> OTOH, I see a slight decrease (~1%) in consumed cycles per packet for
> the thread that polls HW NICs and sends packets to the VM, which is good.
> This improvement is observed for a medium-small number of flows: 512 - 8192.
> For low (1 - 256) and high (64K - 256K) numbers of flows, the consumed
> processing cycles per packet for this thread were not affected by the
> patch.
> 
> Tests were made with average 512B packets, EMC enabled, SMC disabled, and a
> TX flushing interval of 50us. Note that the bottleneck in this case is the
> VM --> bonded PHY part, which is the case with 5-tuple hash calculation.
> 
> See review comments inline.
> 
> Best regards, Ilya Maximets.
> 
> On 22.03.2019 11:44, Yanqin Wei (Arm Technology China) wrote:
>> Hi , OVS Maintainers,
>>
>> Could you help to have a look at this patch? Thanks a lot.
>>
>> Best Regards,
>> Wei Yanqin
>>
>> -Original Message-
>> From: Yanqin Wei 
>> Sent: Wednesday, March 13, 2019 1:28 PM
>> To: d...@openvswitch.org
>> Cc: nd ; Gavin Hu (Arm Technology China) 
>> ; Yanqin Wei (Arm Technology China) 
>> 
>> Subject: [ovs-dev][PATCH v3] dpif-netdev: dfc_process optimization by 
>> prefetching EMC entry.
>>
>> It is observed that the throughput of multi-flow is worse than single-flow 
>> in the EMC NIC to NIC cases. It is because CPU cache-miss increasing in EMC 
>> lookup. Each flow need load at least one EMC entry to CPU cache(several 
>> cache lines) and compare it with packet miniflow.
>> This patch improve it by prefetching EMC entry in advance. Hash value 
>> can be obtained from dpdk rss hash, so this step can be advanced 
>> ahead of
>> miniflow_extract() and prefetch EMC entry there. The prefetching size is 
>> defined as ROUND_UP(128,CACHE_LINE_SIZE), which can cover majority traffic 
>> including TCP/UDP protocol and need 2 cache lines in most modern CPU.
>> Performance test was run in some arm platform. 1000/1 flows NIC2NIC test 
>> achieved around 10% throughput improvement in thunderX2(aarch64 platform).
>>
>> Signed-off-by: Yanqin Wei 
>> Reviewed-by: Gavin Hu 
>> ---
>>  lib/dpif-netdev.c | 80 
>> ---
>>  1 file changed, 52 insertions(+), 28 deletions(-)
>>
>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 
>> 4d6d0c3..982082c 100644
>> --- a/lib/dpif-netdev.c
>> +++ b/lib/dpif-netdev.c
>> @@ -189,6 +189,10 @@ struct netdev_flow_key {
>>  #define DEFAULT_EM_FLOW_INSERT

Re: [ovs-dev] [PATCH v3] dpif-netdev: dfc_process optimization by prefetching EMC entry.

2019-03-25 Thread Yanqin Wei (Arm Technology China)
Hi Ilya,

Thanks for your reply. We can have a look at the x86 test results then.
I understand patch 1054571. If both patches are applied to master, I can
rebase the EMC prefetch patch on top of it.
Mandatory hash computation at ingress simplifies the logic a lot, and it only
costs a small price even in the worst case (EMC/SMC disabled and no hash
feature/load balancing enabled).

Best Regards,
Wei Yanqin

-Original Message-
From: Ilya Maximets  
Sent: Friday, March 22, 2019 9:12 PM
To: Yanqin Wei (Arm Technology China) ; 
d...@openvswitch.org; ian.sto...@intel.com
Cc: nd 
Subject: Re: [ovs-dev][PATCH v3] dpif-netdev: dfc_process optimization by 
prefetching EMC entry.

On 22.03.2019 11:44, Yanqin Wei (Arm Technology China) wrote:
> Hi , OVS Maintainers,
> 
> Could you help to have a look at this patch? Thanks a lot.

Hi. Thanks for improving performance, and sorry for the delay. The review
process here in OVS is a bit slow due to a lack of reviewers.

I plan to test this patch a bit next week. I want to check the performance
impact on PVP cases on x86.

BTW, I have a patch that affects the same code. Maybe it'll be interesting
to you: https://patchwork.ozlabs.org/patch/1054571/

Best regards, Ilya Maximets.

> 
> Best Regards,
> Wei Yanqin
> 
> -Original Message-
> From: Yanqin Wei 
> Sent: Wednesday, March 13, 2019 1:28 PM
> To: d...@openvswitch.org
> Cc: nd ; Gavin Hu (Arm Technology China) 
> ; Yanqin Wei (Arm Technology China) 
> 
> Subject: [ovs-dev][PATCH v3] dpif-netdev: dfc_process optimization by 
> prefetching EMC entry.
> 
> It is observed that multi-flow throughput is worse than single-flow in the
> EMC NIC-to-NIC cases. This is because of increasing CPU cache misses in the
> EMC lookup. Each flow needs to load at least one EMC entry into the CPU
> cache (several cache lines) and compare it with the packet miniflow.
> This patch improves that by prefetching the EMC entry in advance. The hash
> value can be obtained from the DPDK RSS hash, so this step can be moved
> ahead of miniflow_extract() and the EMC entry prefetched there. The
> prefetch size is defined as ROUND_UP(128,CACHE_LINE_SIZE), which covers the
> majority of traffic including the TCP/UDP protocols and needs 2 cache lines
> on most modern CPUs.
> Performance tests were run on some arm platforms. The 1000/1 flows NIC2NIC
> test achieved around 10% throughput improvement on ThunderX2 (an aarch64
> platform).
> 
> Signed-off-by: Yanqin Wei 
> Reviewed-by: Gavin Hu 
> ---
>  lib/dpif-netdev.c | 80 
> ---
>  1 file changed, 52 insertions(+), 28 deletions(-)
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 
> 4d6d0c3..982082c 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -189,6 +189,10 @@ struct netdev_flow_key {
>  #define DEFAULT_EM_FLOW_INSERT_MIN (UINT32_MAX / \
>  DEFAULT_EM_FLOW_INSERT_INV_PROB)
>  
> +/* DEFAULT_EMC_PREFETCH_SIZE can cover majority traffic including 
> +TCP/UDP
> + * protocol. */
> +#define DEFAULT_EMC_PREFETCH_SIZE ROUND_UP(128,CACHE_LINE_SIZE)
> +
>  struct emc_entry {
>  struct dp_netdev_flow *flow;
>  struct netdev_flow_key key;   /* key.hash used for emc hash value. */
> @@ -6166,15 +6170,20 @@ dp_netdev_upcall(struct dp_netdev_pmd_thread 
> *pmd, struct dp_packet *packet_,  }
>  
>  static inline uint32_t
> -dpif_netdev_packet_get_rss_hash_orig_pkt(struct dp_packet *packet,
> -const struct miniflow *mf)
> +dpif_netdev_packet_get_packet_rss_hash(struct dp_packet *packet,
> +bool md_is_valid)
>  {
> -uint32_t hash;
> +uint32_t hash,recirc_depth;
>  
> -if (OVS_LIKELY(dp_packet_rss_valid(packet))) {
> -hash = dp_packet_get_rss_hash(packet);
> -} else {
> -hash = miniflow_hash_5tuple(mf, 0);
> +hash = dp_packet_get_rss_hash(packet);
> +
> +if (md_is_valid) {
> +/* The RSS hash must account for the recirculation depth to avoid
> + * collisions in the exact match cache */
> +recirc_depth = *recirc_depth_get_unsafe();
> +if (OVS_UNLIKELY(recirc_depth)) {
> +hash = hash_finish(hash, recirc_depth);
> +}
>  dp_packet_set_rss_hash(packet, hash);
>  }
>  
> @@ -6182,24 +6191,23 @@ 
> dpif_netdev_packet_get_rss_hash_orig_pkt(struct dp_packet *packet,  }
>  
>  static inline uint32_t
> -dpif_netdev_packet_get_rss_hash(struct dp_packet *packet,
> -const struct miniflow *mf)
> +dpif_netdev_packet_get_hash_5tuple(struct dp_packet *packet,
> +const struct miniflow *mf,
> +

Re: [ovs-dev] [PATCH v3] dpif-netdev: dfc_process optimization by prefetching EMC entry.

2019-03-24 Thread Yanqin Wei (Arm Technology China)
Hi Ian,

I also observed a minor throughput drop (around 1%) with a single flow on an
arm platform, but not a 25% drop. Maybe the additional prefetch operation
causes it. Anyway, when you come back next week, let's discuss this patch
again.

Best Regards,
Wei Yanqin

-Original Message-
From: Ian Stokes  
Sent: Monday, March 25, 2019 6:16 AM
To: Yanqin Wei (Arm Technology China) ; d...@openvswitch.org
Cc: nd ; Gavin Hu (Arm Technology China) ; Ilya 
Maximets 
Subject: Re: [ovs-dev] [PATCH v3] dpif-netdev: dfc_process optimization by 
prefetching EMC entry.

On 3/13/2019 5:27 AM, Yanqin Wei wrote:
> It is observed that multi-flow throughput is worse than single-flow in the
> EMC NIC-to-NIC cases. This is because of increasing CPU cache misses in the
> EMC lookup. Each flow needs to load at least one EMC entry into the CPU
> cache (several cache lines) and compare it with the packet miniflow.
> This patch improves that by prefetching the EMC entry in advance. The hash
> value can be obtained from the DPDK RSS hash, so this step can be moved
> ahead of miniflow_extract() and the EMC entry prefetched there. The
> prefetch size is defined as ROUND_UP(128,CACHE_LINE_SIZE), which covers the
> majority of traffic including the TCP/UDP protocols and needs 2 cache lines
> on most modern CPUs.
> Performance tests were run on some arm platforms. The 1000/1 flows NIC2NIC
> test achieved around 10% throughput improvement on ThunderX2 (an aarch64
> platform).
> 

Thanks for this Wei, not a full review; please see some minor comments below
WRT style issues.

I've also run some benchmarks on this. I was typically seeing a ~3% drop on
x86 with single flows with RFC2544. However, once or twice I saw a drop of up
to 25% in the achievable lossless packet rate, but I suspect it could be an
anomaly in my setup.

Ilya, if you are testing this week on x86, it would be great if you could
confirm whether you see something similar in your benchmarks.

For vsperf phy2phy_scalability flow tests on x86 I saw an improvement of
+3% after applying the patch for zero-loss tests, and +5% in the case of
phy2phy_scalability_cont, so this looks promising.

As an FYI, I'll be out of office this coming week, so I will not have an
opportunity to investigate further until I'm back in the office. I'll be
able to review and benchmark further then.


> Signed-off-by: Yanqin Wei 
> Reviewed-by: Gavin Hu 
Although it doesn't appear here or in patchwork, after downloading the
patch, the sign-off and review tags above appear duplicated once applied.
Examining the mbox I can confirm they are duplicated; can you check this on
your side also?

> ---
>   lib/dpif-netdev.c | 80 
> ---
>   1 file changed, 52 insertions(+), 28 deletions(-)
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 4d6d0c3..982082c 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -189,6 +189,10 @@ struct netdev_flow_key {
>   #define DEFAULT_EM_FLOW_INSERT_MIN (UINT32_MAX / \
>   DEFAULT_EM_FLOW_INSERT_INV_PROB)
>   
> +/* DEFAULT_EMC_PREFETCH_SIZE can cover majority traffic including TCP/UDP
> + * protocol. */
> +#define DEFAULT_EMC_PREFETCH_SIZE ROUND_UP(128,CACHE_LINE_SIZE)
> +
>   struct emc_entry {
>   struct dp_netdev_flow *flow;
>   struct netdev_flow_key key;   /* key.hash used for emc hash value. */
> @@ -6166,15 +6170,20 @@ dp_netdev_upcall(struct dp_netdev_pmd_thread *pmd, 
> struct dp_packet *packet_,
>   }
>   
>   static inline uint32_t
> -dpif_netdev_packet_get_rss_hash_orig_pkt(struct dp_packet *packet,
> -const struct miniflow *mf)
> +dpif_netdev_packet_get_packet_rss_hash(struct dp_packet *packet,
> +bool md_is_valid)
>   {
> -uint32_t hash;
> +uint32_t hash,recirc_depth;
>   
> -if (OVS_LIKELY(dp_packet_rss_valid(packet))) {
> -hash = dp_packet_get_rss_hash(packet);
> -} else {
> -hash = miniflow_hash_5tuple(mf, 0);
> +hash = dp_packet_get_rss_hash(packet);
> +
> +if (md_is_valid) {
> +/* The RSS hash must account for the recirculation depth to avoid
> + * collisions in the exact match cache */
Minor, comment style, missing period at end of comment.

> +recirc_depth = *recirc_depth_get_unsafe();
> +if (OVS_UNLIKELY(recirc_depth)) {
> +hash = hash_finish(hash, recirc_depth);
> +}
>   dp_packet_set_rss_hash(packet, hash);
>   }
>   
> @@ -6182,24 +6191,23 @@ dpif_netdev_packet_get_rss_hash_orig_pkt(struct 
> dp_packet *packet,
>   }
>   
>   static inline uint32_t
> -dpif_netdev_packet_get_rss_hash(struct dp_packet *packet,
> -const struct miniflow *mf)
>

Re: [ovs-dev] [PATCH v3] dpif-netdev: dfc_process optimization by prefetching EMC entry.

2019-03-22 Thread Yanqin Wei (Arm Technology China)
Hi , OVS Maintainers,

Could you help to have a look at this patch? Thanks a lot.

Best Regards,
Wei Yanqin

-Original Message-
From: Yanqin Wei  
Sent: Wednesday, March 13, 2019 1:28 PM
To: d...@openvswitch.org
Cc: nd ; Gavin Hu (Arm Technology China) ; 
Yanqin Wei (Arm Technology China) 
Subject: [ovs-dev][PATCH v3] dpif-netdev: dfc_process optimization by 
prefetching EMC entry.

It is observed that multi-flow throughput is worse than single-flow in the
EMC NIC-to-NIC cases. This is because of increasing CPU cache misses in the
EMC lookup. Each flow needs to load at least one EMC entry into the CPU
cache (several cache lines) and compare it with the packet miniflow.
This patch improves that by prefetching the EMC entry in advance. The hash
value can be obtained from the DPDK RSS hash, so this step can be moved ahead
of miniflow_extract() and the EMC entry prefetched there. The prefetch size
is defined as ROUND_UP(128,CACHE_LINE_SIZE), which covers the majority of
traffic including the TCP/UDP protocols and needs 2 cache lines on most
modern CPUs.
Performance tests were run on some arm platforms. The 1000/1 flows NIC2NIC
test achieved around 10% throughput improvement on ThunderX2 (an aarch64
platform).

Signed-off-by: Yanqin Wei 
Reviewed-by: Gavin Hu 
---
 lib/dpif-netdev.c | 80 ---
 1 file changed, 52 insertions(+), 28 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 4d6d0c3..982082c 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -189,6 +189,10 @@ struct netdev_flow_key {
 #define DEFAULT_EM_FLOW_INSERT_MIN (UINT32_MAX / \
 DEFAULT_EM_FLOW_INSERT_INV_PROB)
 
+/* DEFAULT_EMC_PREFETCH_SIZE can cover majority traffic including 
+TCP/UDP
+ * protocol. */
+#define DEFAULT_EMC_PREFETCH_SIZE ROUND_UP(128,CACHE_LINE_SIZE)
+
 struct emc_entry {
 struct dp_netdev_flow *flow;
 struct netdev_flow_key key;   /* key.hash used for emc hash value. */
@@ -6166,15 +6170,20 @@ dp_netdev_upcall(struct dp_netdev_pmd_thread *pmd, 
struct dp_packet *packet_,  }
 
 static inline uint32_t
-dpif_netdev_packet_get_rss_hash_orig_pkt(struct dp_packet *packet,
-const struct miniflow *mf)
+dpif_netdev_packet_get_packet_rss_hash(struct dp_packet *packet,
+bool md_is_valid)
 {
-uint32_t hash;
+uint32_t hash,recirc_depth;
 
-if (OVS_LIKELY(dp_packet_rss_valid(packet))) {
-hash = dp_packet_get_rss_hash(packet);
-} else {
-hash = miniflow_hash_5tuple(mf, 0);
+hash = dp_packet_get_rss_hash(packet);
+
+if (md_is_valid) {
+/* The RSS hash must account for the recirculation depth to avoid
+ * collisions in the exact match cache */
+recirc_depth = *recirc_depth_get_unsafe();
+if (OVS_UNLIKELY(recirc_depth)) {
+hash = hash_finish(hash, recirc_depth);
+}
 dp_packet_set_rss_hash(packet, hash);
 }
 
@@ -6182,24 +6191,23 @@ dpif_netdev_packet_get_rss_hash_orig_pkt(struct 
dp_packet *packet,  }
 
 static inline uint32_t
-dpif_netdev_packet_get_rss_hash(struct dp_packet *packet,
-const struct miniflow *mf)
+dpif_netdev_packet_get_hash_5tuple(struct dp_packet *packet,
+const struct miniflow *mf,
+bool md_is_valid)
 {
-uint32_t hash, recirc_depth;
+uint32_t hash,recirc_depth;
 
-if (OVS_LIKELY(dp_packet_rss_valid(packet))) {
-hash = dp_packet_get_rss_hash(packet);
-} else {
-hash = miniflow_hash_5tuple(mf, 0);
-dp_packet_set_rss_hash(packet, hash);
-}
+hash = miniflow_hash_5tuple(mf, 0);
+dp_packet_set_rss_hash(packet, hash);
 
-/* The RSS hash must account for the recirculation depth to avoid
- * collisions in the exact match cache */
-recirc_depth = *recirc_depth_get_unsafe();
-if (OVS_UNLIKELY(recirc_depth)) {
-hash = hash_finish(hash, recirc_depth);
-dp_packet_set_rss_hash(packet, hash);
+if (md_is_valid) {
+/* The RSS hash must account for the recirculation depth to avoid
+ * collisions in the exact match cache */
+recirc_depth = *recirc_depth_get_unsafe();
+if (OVS_UNLIKELY(recirc_depth)) {
+hash = hash_finish(hash, recirc_depth);
+dp_packet_set_rss_hash(packet, hash);
+}
 }
 return hash;
 }
@@ -6390,6 +6398,7 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
 bool smc_enable_db;
 size_t map_cnt = 0;
 bool batch_enable = true;
+bool is_5tuple_hash_needed;
 
atomic_read_relaxed(&pmd->dp->smc_enable_db, &smc_enable_db);
 pmd_perf_update_counter(>perf_stats,
@@ -6436,16 +6445,31 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
 }
 }
 
-miniflow_extract(packet, >mf);
-key->len = 0; /* Not computed yet. */
 /* If EMC and SMC disabled skip h
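
In essence, the change amounts to the sketch below (condensed; the EMC bucket
addressing is simplified to a single position, whereas the real EMC probes
EM_FLOW_HASH_SEGS positions per hash):

    /* Once the NIC-provided RSS hash is known, the EMC entry it maps to
     * can be prefetched before the comparatively slow miniflow
     * extraction, hiding the EMC cache-miss latency behind useful work. */
    uint32_t hash = dp_packet_get_rss_hash(packet);
    struct emc_entry *entry = &cache->entries[hash & EM_FLOW_HASH_MASK];

    OVS_PREFETCH(entry);                            /* first cache line */
    OVS_PREFETCH((char *) entry + CACHE_LINE_SIZE); /* second cache line */

    miniflow_extract(packet, &key->mf);  /* overlaps with the prefetch */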

Re: [ovs-dev] [ovs-dev, v2] netdev-dpdk: dfc_process optimization by

2019-03-12 Thread Yanqin Wei (Arm Technology China)
Thanks for the comments. They will be fixed in the upcoming patch v3.

-Original Message-
From: Ilya Maximets  
Sent: Tuesday, March 12, 2019 9:31 PM
To: Yanqin Wei (Arm Technology China) ; d...@openvswitch.org
Cc: nd ; Gavin Hu (Arm Technology China) 
Subject: Re: [ovs-dev,v2] netdev-dpdk: dfc_process optimization by

Hi.

Thanks for working on this.
Not a full review, just a few notes about formatting.

1. Looks like your subject line was accidentally cropped.
2. This change is local to generic parts of 'dpif-netdev', so the "area" in
   a subject line should be 'dpif-netdev'. There is nothing DPDK specific here.

On 11.03.2019 14:44, Yanqin Wei wrote:
> It is observed that the throughput of multi-flow is worse than 
> single-flow in the EMC NIC to NIC cases. It is because CPU cache-miss 
> increasing in EMC lookup. Each flow need load at least one EMC entry 
> to CPU cache(several cache lines) and compare it with packet miniflow.
> This patch improve it by prefetching EMC entry in advance. Hash value 
> can be obtained from dpdk rss hash, so this step can be advanced ahead 
> of
> miniflow_extract() and prefetch EMC entry there. The prefetching size 
> is defined as ROUND_UP(128,CACHE_LINE_SIZE), which can cover majority 
> traffic including TCP/UDP protocol and need 2 cache lines in most modern CPU.
> Performance test was run in some arm platform. 1000/1 flows 
> NIC2NIC test achieved around 10% throughput improvement in 
> thunderX2(aarch64 platform).
> 
> Signed-off-by: Yanqin Wei 
> Reviewed-by: Gavin Hu 
> ---
>  lib/dpif-netdev.c | 80 
> ---
>  1 file changed, 52 insertions(+), 28 deletions(-)  mode change 100644 
> => 100755 lib/dpif-netdev.c
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c old mode 100644 new 
> mode 100755

3. Please, don't change the file mode.

Best regards, Ilya Maximets.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v3] dpif-netdev: dfc_process optimization by prefetching EMC entry.

2019-03-12 Thread Yanqin Wei
It is observed that multi-flow throughput is worse than single-flow throughput
in the EMC NIC-to-NIC cases. This is because CPU cache misses increase during
EMC lookup: each flow needs to load at least one EMC entry (several cache
lines) into the CPU cache and compare it with the packet miniflow.
This patch improves this by prefetching the EMC entry in advance. The hash
value can be obtained from the DPDK RSS hash, so this step can be moved ahead
of miniflow_extract() and the EMC entry prefetched there. The prefetch size is
defined as ROUND_UP(128,CACHE_LINE_SIZE), which covers the majority of
traffic, including TCP/UDP, and needs 2 cache lines on most modern CPUs.
Performance tests were run on some Arm platforms. The 1000/1 flows NIC2NIC
test achieved around 10% throughput improvement on ThunderX2 (an aarch64
platform).

Signed-off-by: Yanqin Wei 
Reviewed-by: Gavin Hu 
---
 lib/dpif-netdev.c | 80 ---
 1 file changed, 52 insertions(+), 28 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 4d6d0c3..982082c 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -189,6 +189,10 @@ struct netdev_flow_key {
 #define DEFAULT_EM_FLOW_INSERT_MIN (UINT32_MAX / \
 DEFAULT_EM_FLOW_INSERT_INV_PROB)
 
+/* DEFAULT_EMC_PREFETCH_SIZE can cover majority traffic including TCP/UDP
+ * protocol. */
+#define DEFAULT_EMC_PREFETCH_SIZE ROUND_UP(128,CACHE_LINE_SIZE)
+
 struct emc_entry {
 struct dp_netdev_flow *flow;
 struct netdev_flow_key key;   /* key.hash used for emc hash value. */
@@ -6166,15 +6170,20 @@ dp_netdev_upcall(struct dp_netdev_pmd_thread *pmd, 
struct dp_packet *packet_,
 }
 
 static inline uint32_t
-dpif_netdev_packet_get_rss_hash_orig_pkt(struct dp_packet *packet,
-const struct miniflow *mf)
+dpif_netdev_packet_get_packet_rss_hash(struct dp_packet *packet,
+bool md_is_valid)
 {
-uint32_t hash;
+uint32_t hash,recirc_depth;
 
-if (OVS_LIKELY(dp_packet_rss_valid(packet))) {
-hash = dp_packet_get_rss_hash(packet);
-} else {
-hash = miniflow_hash_5tuple(mf, 0);
+hash = dp_packet_get_rss_hash(packet);
+
+if (md_is_valid) {
+/* The RSS hash must account for the recirculation depth to avoid
+ * collisions in the exact match cache */
+recirc_depth = *recirc_depth_get_unsafe();
+if (OVS_UNLIKELY(recirc_depth)) {
+hash = hash_finish(hash, recirc_depth);
+}
 dp_packet_set_rss_hash(packet, hash);
 }
 
@@ -6182,24 +6191,23 @@ dpif_netdev_packet_get_rss_hash_orig_pkt(struct 
dp_packet *packet,
 }
 
 static inline uint32_t
-dpif_netdev_packet_get_rss_hash(struct dp_packet *packet,
-const struct miniflow *mf)
+dpif_netdev_packet_get_hash_5tuple(struct dp_packet *packet,
+const struct miniflow *mf,
+bool md_is_valid)
 {
-uint32_t hash, recirc_depth;
+uint32_t hash,recirc_depth;
 
-if (OVS_LIKELY(dp_packet_rss_valid(packet))) {
-hash = dp_packet_get_rss_hash(packet);
-} else {
-hash = miniflow_hash_5tuple(mf, 0);
-dp_packet_set_rss_hash(packet, hash);
-}
+hash = miniflow_hash_5tuple(mf, 0);
+dp_packet_set_rss_hash(packet, hash);
 
-/* The RSS hash must account for the recirculation depth to avoid
- * collisions in the exact match cache */
-recirc_depth = *recirc_depth_get_unsafe();
-if (OVS_UNLIKELY(recirc_depth)) {
-hash = hash_finish(hash, recirc_depth);
-dp_packet_set_rss_hash(packet, hash);
+if (md_is_valid) {
+/* The RSS hash must account for the recirculation depth to avoid
+ * collisions in the exact match cache */
+recirc_depth = *recirc_depth_get_unsafe();
+if (OVS_UNLIKELY(recirc_depth)) {
+hash = hash_finish(hash, recirc_depth);
+dp_packet_set_rss_hash(packet, hash);
+}
 }
 return hash;
 }
@@ -6390,6 +6398,7 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
 bool smc_enable_db;
 size_t map_cnt = 0;
 bool batch_enable = true;
+bool is_5tuple_hash_needed;
 
atomic_read_relaxed(&pmd->dp->smc_enable_db, &smc_enable_db);
 pmd_perf_update_counter(>perf_stats,
@@ -6436,16 +6445,31 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
 }
 }
 
-miniflow_extract(packet, >mf);
-key->len = 0; /* Not computed yet. */
 /* If EMC and SMC disabled skip hash computation */
 if (smc_enable_db == true || cur_min != 0) {
-if (!md_is_valid) {
-key->hash = dpif_netdev_packet_get_rss_hash_orig_pkt(packet,
-&key->mf);
+if (OVS_LIKELY(dp_packet_rss_valid(packet))) {
+is_5tuple_hash_needed = false;
+key->hash =
+   dpif_netdev_packet_ge
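
The recirculation handling in both hash helpers reduces to mixing a nonzero
recirculation depth into the packet hash, as in this condensed sketch (using
only helpers visible in the diff):

    uint32_t hash = dp_packet_get_rss_hash(packet);
    uint32_t depth = *recirc_depth_get_unsafe();

    if (OVS_UNLIKELY(depth)) {
        /* The same 5-tuple seen at different recirculation depths must
         * not collide in the exact match cache, so fold the depth in. */
        hash = hash_finish(hash, depth);
        dp_packet_set_rss_hash(packet, hash);
    }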

[ovs-dev] [PATCH v2] netdev-dpdk: dfc_process optimization by

2019-03-11 Thread Yanqin Wei
It is observed that multi-flow throughput is worse than single-flow throughput
in the EMC NIC-to-NIC cases. This is because CPU cache misses increase during
EMC lookup: each flow needs to load at least one EMC entry (several cache
lines) into the CPU cache and compare it with the packet miniflow.
This patch improves this by prefetching the EMC entry in advance. The hash
value can be obtained from the DPDK RSS hash, so this step can be moved ahead
of miniflow_extract() and the EMC entry prefetched there. The prefetch size is
defined as ROUND_UP(128,CACHE_LINE_SIZE), which covers the majority of
traffic, including TCP/UDP, and needs 2 cache lines on most modern CPUs.
Performance tests were run on some Arm platforms. The 1000/1 flows NIC2NIC
test achieved around 10% throughput improvement on ThunderX2 (an aarch64
platform).

Signed-off-by: Yanqin Wei 
Reviewed-by: Gavin Hu 
---
 lib/dpif-netdev.c | 80 ---
 1 file changed, 52 insertions(+), 28 deletions(-)
 mode change 100644 => 100755 lib/dpif-netdev.c

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
old mode 100644
new mode 100755
index 4d6d0c3..982082c
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -189,6 +189,10 @@ struct netdev_flow_key {
 #define DEFAULT_EM_FLOW_INSERT_MIN (UINT32_MAX / \
 DEFAULT_EM_FLOW_INSERT_INV_PROB)
 
+/* DEFAULT_EMC_PREFETCH_SIZE can cover majority traffic including TCP/UDP
+ * protocol. */
+#define DEFAULT_EMC_PREFETCH_SIZE ROUND_UP(128,CACHE_LINE_SIZE)
+
 struct emc_entry {
 struct dp_netdev_flow *flow;
 struct netdev_flow_key key;   /* key.hash used for emc hash value. */
@@ -6166,15 +6170,20 @@ dp_netdev_upcall(struct dp_netdev_pmd_thread *pmd, 
struct dp_packet *packet_,
 }
 
 static inline uint32_t
-dpif_netdev_packet_get_rss_hash_orig_pkt(struct dp_packet *packet,
-const struct miniflow *mf)
+dpif_netdev_packet_get_packet_rss_hash(struct dp_packet *packet,
+bool md_is_valid)
 {
-uint32_t hash;
+uint32_t hash,recirc_depth;
 
-if (OVS_LIKELY(dp_packet_rss_valid(packet))) {
-hash = dp_packet_get_rss_hash(packet);
-} else {
-hash = miniflow_hash_5tuple(mf, 0);
+hash = dp_packet_get_rss_hash(packet);
+
+if (md_is_valid) {
+/* The RSS hash must account for the recirculation depth to avoid
+ * collisions in the exact match cache */
+recirc_depth = *recirc_depth_get_unsafe();
+if (OVS_UNLIKELY(recirc_depth)) {
+hash = hash_finish(hash, recirc_depth);
+}
 dp_packet_set_rss_hash(packet, hash);
 }
 
@@ -6182,24 +6191,23 @@ dpif_netdev_packet_get_rss_hash_orig_pkt(struct 
dp_packet *packet,
 }
 
 static inline uint32_t
-dpif_netdev_packet_get_rss_hash(struct dp_packet *packet,
-const struct miniflow *mf)
+dpif_netdev_packet_get_hash_5tuple(struct dp_packet *packet,
+const struct miniflow *mf,
+bool md_is_valid)
 {
-uint32_t hash, recirc_depth;
+uint32_t hash,recirc_depth;
 
-if (OVS_LIKELY(dp_packet_rss_valid(packet))) {
-hash = dp_packet_get_rss_hash(packet);
-} else {
-hash = miniflow_hash_5tuple(mf, 0);
-dp_packet_set_rss_hash(packet, hash);
-}
+hash = miniflow_hash_5tuple(mf, 0);
+dp_packet_set_rss_hash(packet, hash);
 
-/* The RSS hash must account for the recirculation depth to avoid
- * collisions in the exact match cache */
-recirc_depth = *recirc_depth_get_unsafe();
-if (OVS_UNLIKELY(recirc_depth)) {
-hash = hash_finish(hash, recirc_depth);
-dp_packet_set_rss_hash(packet, hash);
+if (md_is_valid) {
+/* The RSS hash must account for the recirculation depth to avoid
+ * collisions in the exact match cache */
+recirc_depth = *recirc_depth_get_unsafe();
+if (OVS_UNLIKELY(recirc_depth)) {
+hash = hash_finish(hash, recirc_depth);
+dp_packet_set_rss_hash(packet, hash);
+}
 }
 return hash;
 }
@@ -6390,6 +6398,7 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
 bool smc_enable_db;
 size_t map_cnt = 0;
 bool batch_enable = true;
+bool is_5tuple_hash_needed;
 
atomic_read_relaxed(&pmd->dp->smc_enable_db, &smc_enable_db);
 pmd_perf_update_counter(>perf_stats,
@@ -6436,16 +6445,31 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
 }
 }
 
-miniflow_extract(packet, >mf);
-key->len = 0; /* Not computed yet. */
 /* If EMC and SMC disabled skip hash computation */
 if (smc_enable_db == true || cur_min != 0) {
-if (!md_is_valid) {
-key->hash = dpif_netdev_packet_get_rss_hash_orig_pkt(packet,
-&key->mf);
+if (OVS_LIKELY(dp_packet_rss_valid(packet))) {
+is_5tuple_hash_needed = false;
+ 
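
The hash selection that dfc_processing() ends up with can be summarized as
follows (a sketch, not the literal diff):

    if (OVS_LIKELY(dp_packet_rss_valid(packet))) {
        /* The NIC already computed a hash, so the EMC entry can be
         * located (and prefetched) before miniflow extraction. */
        is_5tuple_hash_needed = false;
        key->hash = dpif_netdev_packet_get_packet_rss_hash(packet,
                                                           md_is_valid);
    } else {
        /* No valid RSS hash: the 5-tuple hash needs the miniflow, so it
         * must be computed after miniflow_extract(). */
        is_5tuple_hash_needed = true;
    }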

[ovs-dev] [PATCH v1 1/1] hash: Enable hash_bytes128 optimization for aarch64.

2019-02-27 Thread Yanqin Wei
"hash_bytes128" has two versions for 64 bits and 32 bits system. This
should be common optimization for their respective platforms. But 64 bits
version was only enabled in x86-64. This patch enable it for aarch64
platform.
Micro benchmarking test was run in two kinds of arm platform. It was
observed that 50% performance improvement in thunderX2 and 40% improvement
in TaiShan(Cortex-A72).
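
For reference, a minimal caller of the affected function (a usage sketch;
'data' and 'len' are placeholders):

    #include "hash.h"

    /* Computes a 128-bit hash of 'len' bytes at 'data'; with this patch,
     * aarch64 now takes the faster 64-bit code path. */
    ovs_u128 out;
    hash_bytes128(data, len, 0, &out);
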

Signed-off-by: Yanqin Wei 
---
 lib/hash.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/lib/hash.c b/lib/hash.c
index c64f25e..06f8339 100644
--- a/lib/hash.c
+++ b/lib/hash.c
@@ -72,7 +72,7 @@ hash_words64__(const uint64_t p[], size_t n_words, uint32_t 
basis)
 return hash_words64_inline(p, n_words, basis);
 }
 
-#if !(defined(__x86_64__))
+#if !(defined(__x86_64__)) && !(defined(__aarch64__))
 void
 hash_bytes128(const void *p_, size_t len, uint32_t basis, ovs_u128 *out)
 {
@@ -233,7 +233,7 @@ hash_bytes128(const void *p_, size_t len, uint32_t basis, 
ovs_u128 *out)
 out->u32[3] = h4;
 }
 
-#else /* __x86_64__ */
+#else /* __x86_64__ or __aarch64__*/
 
 static inline uint64_t
 hash_rot64(uint64_t x, int8_t r)
@@ -361,4 +361,4 @@ hash_bytes128(const void *p_, size_t len, uint32_t basis, 
ovs_u128 *out)
 out->u64.lo = h1;
 out->u64.hi = h2;
 }
-#endif /* __x86_64__ */
+#endif /* __x86_64__ or __aarch64__*/
-- 
2.7.4

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v1 1/1] hash: Enable hash_bytes128 optimization for aarch64.

2019-02-26 Thread Yanqin Wei (Arm Technology China)
Hi Ben,

Sorry for the format issue. I will resubmit it by "git send-email".

Best Regards,
Wei Yanqin

-Original Message-
From: Ben Pfaff 
Sent: Saturday, February 23, 2019 5:03 AM
To: Yanqin Wei (Arm Technology China) 
Cc: d...@openvswitch.org
Subject: Re: [ovs-dev] [PATCH v1 1/1] hash: Enable hash_bytes128 optimization 
for aarch64.

On Mon, Feb 18, 2019 at 05:46:01AM +0000, Yanqin Wei (Arm Technology China) 
wrote:
> "hash_bytes128" has two versions for 64 bits and 32 bits system. This
> should be common optimization for their respective platforms.
> But 64 bits version was only enabled in x86-64. This patch enable it
> for
> aarch64 platform.
> Micro benchmarking test was run in two kinds of arm platform.  It was
> observed that 50% performance improvement in thunderX2 and 40%
> improvement in TaiShan(Cortex-A72).
>
> Signed-off-by: Yanqin Wei 

Thanks for working to make OVS better.

This patch is whitespace damaged.  For instance, lines that should begin with 
spaces lack them.  I cannot apply it.

You can send the patch another way (for example, with "git send-email") or use 
a Github pull request.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v1 1/1] hash: Enable hash_bytes128 optimization for aarch64.

2019-02-17 Thread Yanqin Wei (Arm Technology China)
"hash_bytes128" has two versions for 64 bits and 32 bits system. This
should be common optimization for their respective platforms.
But 64 bits version was only enabled in x86-64. This patch enable it for
aarch64 platform.
Micro benchmarking test was run in two kinds of arm platform.  It was
observed that 50% performance improvement in thunderX2 and 40%
improvement in TaiShan(Cortex-A72).

Signed-off-by: Yanqin Wei 

---
lib/hash.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/lib/hash.c b/lib/hash.c
index c64f25e..06f8339 100644
--- a/lib/hash.c
+++ b/lib/hash.c
@@ -72,7 +72,7 @@ hash_words64__(const uint64_t p[], size_t n_words, uint32_t 
basis)
 return hash_words64_inline(p, n_words, basis);
}

-#if !(defined(__x86_64__))
+#if !(defined(__x86_64__)) && !(defined(__aarch64__))
void
hash_bytes128(const void *p_, size_t len, uint32_t basis, ovs_u128 *out)
{
@@ -233,7 +233,7 @@ hash_bytes128(const void *p_, size_t len, uint32_t basis, 
ovs_u128 *out)
 out->u32[3] = h4;
}

-#else /* __x86_64__ */
+#else /* __x86_64__ or __aarch64__*/

static inline uint64_t
hash_rot64(uint64_t x, int8_t r)
@@ -361,4 +361,4 @@ hash_bytes128(const void *p_, size_t len, uint32_t basis, 
ovs_u128 *out)
 out->u64.lo = h1;
 out->u64.hi = h2;
}
-#endif /* __x86_64__ */
+#endif /* __x86_64__ or __aarch64__*/
--
2.7.4
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v1 1/1] hash: Implement hash for aarch64 using CRC32c intrinsics.

2019-01-25 Thread Yanqin Wei (Arm Technology China)
This commit adds lib/hash-aarch64.h to implement hashing for aarch64.
It is based on the aarch64 built-in CRC32c intrinsics, which accelerate
the hash functions and improve datapath performance.

Tests:
1. The "test-hash" case passed on an aarch64 platform.
2. An OVS-DPDK datapath performance test was run (NIC to NIC).
   Test bed: aarch64 (Centriq 2400) platform.
   Test case: DPCLS forwarding (EMC disabled + avg. 10 subtable lookups).
   Test result: around 10% improvement.
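
A minimal sketch of how the new CRC32c primitives chain (hash_two_u64() is an
illustrative helper, not part of the patch; assumes a compiler targeting
armv8-a+crc):

    #include <arm_acle.h>

    static inline uint32_t
    hash_two_u64(uint64_t a, uint64_t b, uint32_t basis)
    {
        uint32_t h = __crc32cd(basis, a);  /* fold in the first word */
        h = __crc32cd(h, b);               /* fold in the second word */
        return hash_finish(h, 2);          /* finishing mix, defined below */
    }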

Signed-off-by: Yanqin Wei 

---
lib/automake.mk|   1 +
lib/hash-aarch64.h | 151 +
lib/hash.h |   5 +-
3 files changed, 156 insertions(+), 1 deletion(-)
create mode 100644 lib/hash-aarch64.h

diff --git a/lib/automake.mk b/lib/automake.mk
index b1ff495ff..ba1041095 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -100,6 +100,7 @@ lib_libopenvswitch_la_SOURCES = \
lib/guarded-list.h \
lib/hash.c \
lib/hash.h \
+   lib/hash-aarch64.h \
lib/hindex.c \
lib/hindex.h \
lib/hmap.c \
diff --git a/lib/hash-aarch64.h b/lib/hash-aarch64.h
new file mode 100644
index 0..6993e2a66
--- /dev/null
+++ b/lib/hash-aarch64.h
@@ -0,0 +1,151 @@
+/*
+ * Copyright (c) 2019 Arm Limited
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/* This header implements HASH operation primitives on aarch64. */
+#ifndef HASH_AARCH64_H
+#define HASH_AARCH64_H 1
+
+#ifndef HASH_H
+#error "This header should only be included indirectly via hash.h."
+#endif
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <arm_acle.h>
+
+static inline uint32_t hash_add(uint32_t hash, uint32_t data)
+{
+return __crc32cw(hash, data);
+}
+
+/* Add the halves of 'data' in the memory order. */
+static inline uint32_t hash_add64(uint32_t hash, uint64_t data)
+{
+return __crc32cd(hash, data);
+}
+
+static inline uint32_t hash_finish(uint32_t hash, uint64_t final)
+{
+/* The finishing multiplier 0x805204f3 has been experimentally
+ * derived to pass the testsuite hash tests. */
+hash = __crc32cd(hash, final) * 0x805204f3;
+return hash ^ hash >> 16; /* Increase entropy in LSBs. */
+}
+
+/* Returns the hash of the 'n' 32-bit words at 'p_', starting from 'basis'.
+ * We access 'p_' as a uint64_t pointer.
+ *
+ * This is inlined for the compiler to have access to the 'n_words', which
+ * in many cases is a constant. */
+static inline uint32_t
+hash_words_inline(const uint32_t p_[], size_t n_words, uint32_t basis)
+{
+const uint64_t *p = (const void *)p_;
+uint32_t hash1 = basis;
+uint32_t hash2 = 0;
+uint32_t hash3 = n_words;
+const uint32_t *endp = (const uint32_t *)p + n_words;
+const uint64_t *limit = p + n_words / 2 - 3;
+
+while (p <= limit) {
+hash1 = __crc32cd(hash1, p[0]);
+hash2 = __crc32cd(hash2, p[1]);
+hash3 = __crc32cd(hash3, p[2]);
+p += 3;
+}
+switch (endp - (const uint32_t *)p) {
+case 1:
+hash1 = __crc32cw(hash1, *(const uint32_t *) &p[0]);
+break;
+case 2:
+hash1 = __crc32cd(hash1, p[0]);
+break;
+case 3:
+hash1 = __crc32cd(hash1, p[0]);
+hash2 = __crc32cw(hash2, *(const uint32_t *) &p[1]);
+break;
+case 4:
+hash1 = __crc32cd(hash1, p[0]);
+hash2 = __crc32cd(hash2, p[1]);
+break;
+case 5:
+hash1 = __crc32cd(hash1, p[0]);
+hash2 = __crc32cd(hash2, p[1]);
+hash3 = __crc32cw(hash3, *(const uint32_t *) &p[2]);
+break;
+}
+return hash_finish(hash1, (uint64_t)hash2 << 32 | hash3);
+}
+
+/* A simpler version for 64-bit data.
+ * 'n_words' is the count of 64-bit words, basis is 64 bits. */
+static inline uint32_t
+hash_words64_inline(const uint64_t p[], size_t n_words, uint32_t basis)
+{
+uint32_t hash1 = basis;
+uint32_t hash2 = 0;
+uint32_t hash3 = n_words;
+const uint64_t *endp = p + n_words;
+const uint64_t *limit = endp - 3;
+
+while (p <= limit) {
+hash1 = __crc32cd(hash1, p[0]);
+hash2 = __crc32cd(hash2, p[1]);
+hash3 = __crc32cd(hash3, p[2]);
+p += 3;
+}
+switch (endp - p) {
+case 1:
+hash1 = __crc32cd(hash1, p[0]);
+break;
+case 2:
+hash1 = __crc32cd(hash1, p[0]);
+hash2 = __crc32cd(hash2, p[1]);
+break;
+}
+return hash_finish(hash1, (uint64_t)hash2 << 32