Re: [ovs-dev] [PATCH 2/3] ofproto-dpif: Improve dp_hash selection method for select groups

2018-04-10 Thread ychen
Hi, Jan:
When I tested dp_hash with the new patch, vswitchd was killed by a
segmentation fault under some conditions:
1. add a group with no buckets; then winner will be NULL
2. add buckets with weight 0; then winner will also be NULL
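Both crashes come from the same spot in the hunk below: "winner" is only
assigned when a bucket's value is strictly greater than 0.0, so with no
buckets (or only zero-weight buckets) it is never set before being used.
A rough sketch of the failing path, mirroring the patch:

    struct webster *winner;                 /* uninitialized before the fix */
    double max_val = 0.0;

    for (i = 0; i < n_buckets; i++) {       /* never iterates if n_buckets == 0 */
        if (webster[i].value > max_val) {   /* never true if every weight is 0 */
            max_val = webster[i].value;
            winner = &webster[i];
        }
    }
    /* winner may still be unset/NULL here; the later winner->... access faults. */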


I made a small modification to the patch; could you help check whether it is correct?


diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c
index 8f6070d..b3a9639 100755
--- a/ofproto/ofproto-dpif.c
+++ b/ofproto/ofproto-dpif.c
@@ -4773,6 +4773,8 @@ group_setup_dp_hash_table(struct group_dpif *group, 
size_t max_hash)
 webster[i].value = bucket->weight;
 i++;
 }
+//consider bucket weight equal to 0
+if (!min_weight) min_weight = 1;


 uint32_t min_slots = ceil(total_weight / min_weight);
 n_hash = MAX(16, 1L << log_2_ceil(min_slots));
@@ -4794,11 +4796,12 @@ group_setup_dp_hash_table(struct group_dpif *group, 
size_t max_hash)
 for (int hash = 0; hash < n_hash; hash++) {
 VLOG_DBG("Hash value: %d", hash);
 double max_val = 0.0;
-struct webster *winner;
+struct webster *winner = NULL;
 for (i = 0; i < n_buckets; i++) {
 VLOG_DBG("Webster[%d]: divisor=%d value=%.2f",
  i, webster[i].divisor, webster[i].value);
-if (webster[i].value > max_val) {
+// use >= in case there is only one bucket, with weight 0
+if (webster[i].value >= max_val) {
 max_val = webster[i].value;
 winner = &webster[i];
 }
@@ -4827,7 +4830,8 @@ group_set_selection_method(struct group_dpif *group)
 group->selection_method = SEL_METHOD_DEFAULT;
 } else if (!strcmp(selection_method, "dp_hash")) {
 /* Try to use dp_hash if possible at all. */
-if (group_setup_dp_hash_table(group, 64)) {
+uint32_t n_buckets = group->up.n_buckets;
+if (n_buckets && group_setup_dp_hash_table(group, 64)) {
 group->selection_method = SEL_METHOD_DP_HASH;
 group->hash_alg = props->selection_method_param >> 32;
 if (group->hash_alg >= __OVS_HASH_MAX) {




Another question: I found that in the functions xlate_default_select_group and
xlate_hash_fields_select_group, when group_best_live_bucket() returns NULL,
ofproto_group_unref() is called. Why does the dp_hash function not need to call
it when no best bucket is found (e.g. a group with no buckets)?





At 2018-03-21 02:16:17, "Jan Scheurich"  wrote:
>The current implementation of the "dp_hash" selection method suffers
>from two deficiencies: 1. The hash mask and hence the number of dp_hash
>values is just large enough to cover the number of group buckets, but
>does not consider the case that buckets have different weights. 2. The
>xlate-time selection of best bucket from the masked dp_hash value often
>results in bucket load distributions that are quite different from the
>bucket weights because the number of available masked dp_hash values
>is too small (2-6 bits compared to 32 bits of a full hash in the default
>hash selection method).
>
>This commit provides a more accurate implementation of the dp_hash
>select group by applying the well known Webster method for distributing
>a small number of "seats" fairly over the weighted "parties"
>(see https://en.wikipedia.org/wiki/Webster/Sainte-Lagu%C3%AB_method).
>The dp_hash mask is automatically chosen large enough to provide good
>enough accuracy even with widely differing weights.
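For illustration, the Webster/Sainte-Laguë allocation referred to above can be
sketched as the following standalone program (the names, the fixed array size,
and the example weights are made up for the sketch; this is not the patch's code):

    #include <stdio.h>

    /* Distribute 'n_slots' hash-table slots over 'n' buckets in proportion to
     * their weights using the Webster/Sainte-Lague method: repeatedly give the
     * next slot to the bucket with the largest weight/divisor quotient, then
     * increase that bucket's divisor by 2. */
    static void
    webster_allocate(const unsigned weights[], unsigned slots[], int n, int n_slots)
    {
        unsigned divisors[16];              /* Sketch assumes n <= 16. */

        for (int i = 0; i < n; i++) {
            divisors[i] = 1;
            slots[i] = 0;
        }
        for (int s = 0; s < n_slots; s++) {
            int winner = 0;
            double best = -1.0;

            for (int i = 0; i < n; i++) {
                double q = (double) weights[i] / divisors[i];

                if (q > best) {
                    best = q;
                    winner = i;
                }
            }
            slots[winner]++;
            divisors[winner] += 2;
        }
    }

    int
    main(void)
    {
        unsigned weights[3] = { 1, 2, 5 };  /* Example bucket weights. */
        unsigned slots[3];

        webster_allocate(weights, slots, 3, 16);   /* 16 masked dp_hash values. */
        for (int i = 0; i < 3; i++) {
            printf("bucket %d -> %u of 16 slots\n", i, slots[i]);
        }
        return 0;
    }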
>
>This distribution happens at group modification time and the resulting
>table is stored with the group-dpif struct. At xlation time, we use the
>masked dp_hash values as index to look up the assigned bucket.
>
>If the bucket should not be live, we do a circular search over the
>mapping table until we find the first live bucket. As the buckets in
>the table are by construction in pseudo-random order with a frequency
>according to their weight, this method maintains correct distribution
>even if one or more buckets are non-live.
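A rough sketch of that xlate-time lookup, assuming a precomputed table that maps
each masked dp_hash value to a bucket index (the names and the callback are
illustrative, not the patch's actual structures):

    #include <stdbool.h>
    #include <stdint.h>

    /* 'hash_map[]' has 'n_hash' entries (a power of two) mapping masked
     * dp_hash values to bucket indexes, filled in proportion to the bucket
     * weights at group modification time.  If the selected bucket is not
     * live, walk the table circularly until a live one is found; because
     * buckets occur in the table with a frequency matching their weight,
     * the load distribution stays weight-proportional even with dead
     * buckets. */
    static int
    select_live_bucket(const uint8_t hash_map[], int n_hash, uint32_t dp_hash,
                       bool (*bucket_is_live)(int bucket))
    {
        int start = dp_hash & (n_hash - 1);

        for (int i = 0; i < n_hash; i++) {
            int bucket = hash_map[(start + i) & (n_hash - 1)];

            if (bucket_is_live(bucket)) {
                return bucket;
            }
        }
        return -1;                          /* No live bucket at all. */
    }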
>
>Xlation is further simplified by storing some derived select group state
>at group construction in struct group-dpif in a form better suited for
>xlation purposes.
>
>Signed-off-by: Jan Scheurich 
>Signed-off-by: Nitin Katiyar 
>Co-authored-by: Nitin Katiyar 
>Signed-off-by: Jan Scheurich 
>---
> include/openvswitch/ofp-group.h |   1 +
> ofproto/ofproto-dpif-xlate.c|  70 
> ofproto/ofproto-dpif.c  | 142 
> ofproto/ofproto-dpif.h  |  13 
> 4 files changed, 200 insertions(+), 26 deletions(-)
>
>diff --git a/include/openvswitch/ofp-group.h b/include/openvswitch/ofp-group.h
>index 8d893a5..af4033d 100644
>--- a/include/openvswitch/ofp-group.h
>+++ b/include/openvswitch/ofp-group.h
>@@ -47,6 +47,7 @@ struct bucket_counter {
> /* Bucket for use in groups. */
> 

Re: [ovs-dev] [PATCH v2] ovn: Avoid nb_cfg update notification flooding

2018-04-10 Thread Han Zhou
On Tue, Apr 10, 2018 at 5:04 PM, Ben Pfaff  wrote:
>
> On Fri, Apr 06, 2018 at 02:40:21PM -0700, Han Zhou wrote:
> > On Fri, Apr 6, 2018 at 1:54 PM, Ben Pfaff  wrote:
> > >
> > > Thanks for working on making OVN faster and scale better.
> > >
> > > I see what you mean about how nb_cfg can be a scale problem.  Really,
> > > each hypervisor only cares about its own row, but each of them is being
> > > notified about the current value in every row.  Ideally, we want the HVs
> > > to be able to say to ovsdb-server, "Give me all the data in the Chassis
> > > table, except don't bother with nb_cfg in any rows except my own row."
> > >
> > > Actually, there's a way for ovn-controller to do that: the OVSDB
> > > protocol definition of monitor_cond allows the client to specify
> > > multiple sets of columns to monitor in a table, and each set of columns
> > > has an independently attached condition.  This can be used to specify
> > > that most of the columns are monitored unconditionally but nb_cfg is
> > > monitored only for a particular row.
> > >
> > > But there are still problems:
> > >
> > > 1. The implementation in ovsdb-server looks broken.  I don't think it
> > >    implements the spec.  It certainly isn't tested--none of the existing
> > >    clients ever passes more than a single set of columns.  I think that
> > >    it will work if all the sets of columns use the same condition, but
> > >    that doesn't help with this case.  It will need to be fixed.
> > >
> > > 2. The client implementation in ovsdb-idl doesn't anticipate difference
> > >    conditions for different columns.  It will need to be enhanced to
> > >    support this use.
> > >
> > > In the short term this is going to be more work than just creating a new
> > > table.  In the long term, though, it will allow us to be more flexible
> > > and more agile, since improvements similar to the one in this patch will
> > > require only client changes, not database schema changes.
> > >
> > > Comments?
> >
> > Thanks for the valuable information! I didn't even know that ovsdb is
> > supposed to support multiple set columns with different monitor conditions.
> > It surely would be the right way to go if it is supported. However, as you
> > mentioned the mechanism doesn't work yet and I am not sure how much work is
> > needed to make it work, it seems reasonable to me to "workaround" it at
> > least for short term until that is ready in the future. Then we could
> > simply resurrect the old column in the old table with the new monitor
> > conditions. What do you think?
>
> This kind of workaround would essentially persist forever.  We'd never
> be able to get rid of it--or, more to the point, we would never be able
> to justify the work.  We'd be stuck with something ugly.
>
> So: if you think the workaround is valuable, then one way to justify it
> would be to reformulate it in a way that isn't so ugly.  As currently
> written, when a person looks at the table structure, (s)he notices a
> table that has a single column and is named for that single column.
> That sticks out, clearly, as some kind of compromise, and the
> documentation even calls it out as an optimization.  But there might be
> a logical, conceptual reason why it makes sense to have a separate table
> for some kinds of chassis-related data.  If there is, then that would
> suggest a name for the table, and for the class of data that should
> should go in there, and it would make it possible to have something that
> is both faster and a reasonable design.
>
> I'm not saying that there is necessarily a logical, conceptual reason
> that one can come up here, but if there is then it would certainly make
> it much easier to support the patch.

Hi Ben, I'm not sure I got your point completely. A logical reason for
this separate table might be to hold chassis-private data, i.e. data
that other chassis are not interested in. Of course, the reason for holding
such private data in a separate table is still the huge performance
difference shown in the test results. So I think the table could be
named Chassis_Private_Data, and I can document in ovn-sb.xml more about
the performance considerations and the future improvement around
monitor_cond. Would this address your point, or did I totally miss your
point?

Thanks,
Han


Re: [ovs-dev] [patch v1] conntrack-tcp: Handle tcp session reuse.

2018-04-10 Thread Darrell Ball
Not yet; I will check with Justin when he gets back from PTO

On 4/10/18, 4:50 PM, "Ben Pfaff"  wrote:

Did you find anyone to take a look?

On Fri, Mar 09, 2018 at 07:29:06PM +, Darrell Ball wrote:
> Windows folks have also been looking at this, as this file is mostly a common port from bsd.
> I’ll check with Sai
> 
> On 3/9/18, 11:22 AM, "ovs-dev-boun...@openvswitch.org on behalf of Ben Pfaff"  wrote:
> 
> On Wed, Feb 28, 2018 at 11:25:50PM -0800, Darrell Ball wrote:
> > Fix tcp sequence tracking for session reuse cases.  This can happen,
> > for example by doing VM migration, where sequence tracking needs to
> > be more permissive.  The solution is to be more permissive for
> > session restart and session start only.  We don't differentiate
> > session start here where we could be more strict, although we could,
> > because the gain in protection is almost zero and the code modularity
> > would be lessened and code complexity increased.
> > This issue originates in release 2.7.
> > 
> > Signed-off-by: Darrell Ball 
> 
> Darrell, who should review this?  Justin, are you the right person?




Re: [ovs-dev] [PATCH v2] ovn: Avoid nb_cfg update notification flooding

2018-04-10 Thread Ben Pfaff
On Fri, Apr 06, 2018 at 02:40:21PM -0700, Han Zhou wrote:
> On Fri, Apr 6, 2018 at 1:54 PM, Ben Pfaff  wrote:
> >
> > Thanks for working on making OVN faster and scale better.
> >
> > I see what you mean about how nb_cfg can be a scale problem.  Really,
> > each hypervisor only cares about its own row, but each of them is being
> > notified about the current value in every row.  Ideally, we want the HVs
> > to be able to say to ovsdb-server, "Give me all the data in the Chassis
> > table, except don't bother with nb_cfg in any rows except my own row."
> >
> > Actually, there's a way for ovn-controller to do that: the OVSDB
> > protocol definition of monitor_cond allows the client to specify
> > multiple sets of columns to monitor in a table, and each set of columns
> > has an independently attached condition.  This can be used to specify
> > that most of the columns are monitored unconditionally but nb_cfg is
> > monitored only for a particular row.
> >
> > But there are still problems:
> >
> > 1. The implementation in ovsdb-server looks broken.  I don't think it
> >implements the spec.  It certainly isn't tested--none of the existing
> >clients ever passes more than a single set of columns.  I think that
> >it will work if all the sets of columns use the same condition, but
> >that doesn't help with this case.  It will need to be fixed.
> >
> > 2. The client implementation in ovsdb-idl doesn't anticipate difference
> >conditions for different columns.  It will need to be enhanced to
> >support this use.
> >
> > In the short term this is going to be more work than just creating a new
> > table.  In the long term, though, it will allow us to be more flexible
> > and more agile, since improvements similar to the one in this patch will
> > require only client changes, not database schema changes.
> >
> > Comments?
> 
> Thanks for the valuable information! I didn't even know that ovsdb is
> supposed to support multiple set columns with different monitor conditions.
> It surely would be the right way to go if it is supported. However, as you
> mentioned the mechanism doesn't work yet and I am not sure how much work is
> needed to make it work, it seems reasonable to me to "workaround" it at
> least for short term until that is ready in the future. Then we could
> simply resurrect the old column in the old table with the new monitor
> conditions. What do you think?

This kind of workaround would essentially persist forever.  We'd never
be able to get rid of it--or, more to the point, we would never be able
to justify the work.  We'd be stuck with something ugly.

So: if you think the workaround is valuable, then one way to justify it
would be to reformulate it in a way that isn't so ugly.  As currently
written, when a person looks at the table structure, (s)he notices a
table that has a single column and is named for that single column.
That sticks out, clearly, as some kind of compromise, and the
documentation even calls it out as an optimization.  But there might be
a logical, conceptual reason why it makes sense to have a separate table
for some kinds of chassis-related data.  If there is, then that would
suggest a name for the table, and for the class of data that should
should go in there, and it would make it possible to have something that
is both faster and a reasonable design.

I'm not saying that there is necessarily a logical, conceptual reason
that one can come up here, but if there is then it would certainly make
it much easier to support the patch.


Re: [ovs-dev] [PATCH v1] tests: Added NSH related unit test cases for datapath

2018-04-10 Thread Gregory Rose

On 4/6/2018 7:35 AM, Gregory Rose wrote:



On 4/4/2018 10:23 AM, Ben Pfaff wrote:

On Thu, Mar 29, 2018 at 04:46:09PM -0700, Ashish Varma wrote:
Added test cases for encap, decap, replace and forwarding of NSH 
packets.

Also added a python script 'sendpkt.py' to send hex ethernet frames.

Signed-off-by: Ashish Varma 

Ashish, do you have a suggestion who should review this?  Greg, are you
the right person?
I can have a look at them now that I'm back but need to get caught up 
with some other issues first.


Ashish,

I continue to be otherwise occupied so I'm sorry for the delay in 
reviewing your patch.  It is still on my radar!


Thanks,

- Greg



Thanks,

- Greg




Re: [ovs-dev] [patch v1] conntrack-tcp: Handle tcp session reuse.

2018-04-10 Thread Ben Pfaff
Did you find anyone to take a look?

On Fri, Mar 09, 2018 at 07:29:06PM +, Darrell Ball wrote:
> Windows folks have also been looking at this, as this file is mostly a common 
> port from bsd.
> I’ll check with Sai
> 
> On 3/9/18, 11:22 AM, "ovs-dev-boun...@openvswitch.org on behalf of Ben Pfaff" 
>  wrote:
> 
> On Wed, Feb 28, 2018 at 11:25:50PM -0800, Darrell Ball wrote:
> > Fix tcp sequence tracking for session reuse cases.  This can happen,
> > for example by doing VM migration, where sequence tracking needs to
> > be more permissive.  The solution is to be more permissive for
> > session restart and session start only.  We don't differentiate
> > session start here where we could be more strict, although we could,
> > because the gain in protection is almost zero and the code modularity
> > would be lessened and code complexity increased.
> > This issue originates in release 2.7.
> > 
> > Signed-off-by: Darrell Ball 
> 
> Darrell, who should review this?  Justin, are you the right person?


Re: [ovs-dev] [PATCH v2 0/2] add icmp6 action support to ovn acl framework

2018-04-10 Thread Ben Pfaff
On Mon, Apr 09, 2018 at 12:00:23PM +0200, Lorenzo Bianconi wrote:
> Changes since v1:
> - squashed ACTION_OPCODE_ICMP4 and ACTION_OPCODE_ICMP6 in ACTION_OPCODE_ICMP
> - updated ovn-northd manpage
> - added a NEWS item that describes the new features
> 
> Lorenzo Bianconi (2):
>   OVN: add icmp6{} action support
>   OVN: add icmp6 action to ovn acl reject support

Thanks a lot, I applied this series to master.


Re: [ovs-dev] [PATCH] Handle gratuitous ARP requests and replies in tnl_arp_snoop()

2018-04-10 Thread Ben Pfaff
On Thu, Apr 05, 2018 at 12:20:27PM +, Manohar Krishnappa Chidambaraswamy 
wrote:
> Problem:
> 
> In user-space tunneling implementation, tnl_arp_snoop() snoops only ARP
> *reply* packets to resolve tunnel nexthop IP addresses to MAC addresses.
> Normally the ARP requests are periodically sent by the local host IP stack,
> so that the ARP cache in OVS is refreshed and entries do not time out.
> However, if the remote tunnel nexthop is a VRRP IP, and the gateway
> periodically sends gratuitous ARP *requests* to announce itself,
> tnl_arp_snoop() treats them as INVALID. Consequently, the ARP cache in OVS
> expires after 10 minutes, which results in dropping of the next packet(s)
> until a new ARP request is responded to.
> 
> Fix:
> 
> Enhance the tunnel neighbor resolution logic in OVS to not only snoop on
> ARP replies but also on gratuitous ARP requests.
> 
> Signed-off-by: Jan Scheurich 
> From: Manohar K C 
> CC: Jan Scheurich 

Thanks, applied to master.


Re: [ovs-dev] [RFC] [PATCH] 1/1 Add multi-column index support for the Python IDL

2018-04-10 Thread Ben Pfaff
On Mon, Apr 09, 2018 at 03:03:03PM -0500, twil...@redhat.com wrote:
> From: Terry Wilson 
> 
> This adds multi-column index support for the Python IDL that is
> similar to the feature in the C IDL.
> 
> Signed-off-by: Terry Wilson 

Thanks for working on this.

I think that this adds a new Python module dependency on
"sortedcontainers".  I didn't have it installed for python2 or
python3--at least on Debian, it is not installed by default--so many
tests failed.  I guess that we should at least document that, and
probably it means that there should be new dependencies.

(When I installed the Python module, everything was fine.)

Thanks,

Ben.


Re: [ovs-dev] [PATCH] docs: Fix 7 byte octets MAC addresses in dpdk.rst

2018-04-10 Thread Ben Pfaff
On Mon, Apr 09, 2018 at 05:18:55PM +0200, Timothy Redaelli wrote:
> Currently the code relies on the standard 6 byte octets, but the
> documentation uses wrong 7-byte octets.
> This commit fixes the documentation in order to use the correct 6 byte
> octets syntax.
> 
> Fixes: 5e7588186839 ("netdev-dpdk: fix port addition for ports sharing same 
> PCI id")
> 
> Signed-off-by: Timothy Redaelli 

I find the use of "byte octets" odd here, since a byte and an octet are
the same thing.  I would tend to say more like "docs: Fix 7-octet MAC
addresses".

I assume Ian will put this in his next DPDK pull request.


Re: [ovs-dev] [PATCH] stopwatch: Fix Windows incompatibility

2018-04-10 Thread Ben Pfaff
On Mon, Apr 09, 2018 at 12:07:20PM -0500, Mark Michelson wrote:
> Stopwatch was implemented using a Unix-only pipe structure. This commit
> changes to using a guarded list and latch in order to pass data between
> threads.
> 
> Signed-off-by: Mark Michelson 

Thanks, applied to master.


Re: [ovs-dev] [PATCH 1/1] netdev-dpdk: fix RX jumbo for NICs not supporting scatter

2018-04-10 Thread Stokes, Ian
> Currently, RX of jumbo packets fails for NICs not supporting scatter.
> Scatter is not strictly needed for jumbo support on RX. This change fixes
> the issue by only enabling scatter for NICs supporting it.
> 
> Reported-by: Louis Peens 
> Signed-off-by: Pablo Cascón 
> Reviewed-by: Simon Horman 
> ---
>  lib/netdev-dpdk.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index ee39cbe..28b20b5
> 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -694,11 +694,14 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev,
> int n_rxq, int n_txq)
>  int diag = 0;
>  int i;
>  struct rte_eth_conf conf = port_conf;
> +struct rte_eth_dev_info info;
> 
>  /* For some NICs (e.g. Niantic), scatter_rx mode needs to be
> explicitly
>   * enabled. */
>  if (dev->mtu > ETHER_MTU) {
> -conf.rxmode.enable_scatter = 1;
> +rte_eth_dev_info_get(dev->port_id, &info);
> +if (info.rx_offload_capa & DEV_RX_OFFLOAD_SCATTER)
> +conf.rxmode.enable_scatter = 1;

Thanks for this, quick note:

conf.rxmode.enable_scatter = 1; should be enclosed in braces as per OVS coding 
style.
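
For reference, the braced form would look roughly like:

    if (info.rx_offload_capa & DEV_RX_OFFLOAD_SCATTER) {
        conf.rxmode.enable_scatter = 1;
    }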

I'll have some time to test this tomorrow. I take it this should be backported 
to OVS 2.9 and OVS 2.8 also?

Ian

>  }
> 
>  conf.rxmode.hw_ip_checksum = (dev->hw_ol_features &
> --
> 2.7.4
> 


Re: [ovs-dev] [PATCH v8 2/6] dpif-netdev: retrieve flow directly from the flow mark

2018-04-10 Thread Stokes, Ian
> Subject: [PATCH v8 2/6] dpif-netdev: retrieve flow directly from the flow
> mark
> 
> From: Yuanhan Liu 
> 
> So that we could skip some very costly CPU operations, including but not
> limited to miniflow_extract, emc lookup, dpcls lookup, etc. Thus,
> performance could be greatly improved.
> 
> A PHY-PHY forwarding test with 1000 mega flows (udp,tp_src=1000-1999) and
> 1 million streams (tp_src=1000-1999, tp_dst=2000-2999) shows more than a
> 260% performance boost.
> 
> Note that although the heavy miniflow_extract is skipped, we still have to
> do per-packet checking, because we have to check the tcp_flags.
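
In other words, the per-packet loop body in emc_processing() gains roughly the
following fast path ahead of the normal miniflow_extract()/EMC lookup
(paraphrasing the hunks below; all names are from the patch):

    uint32_t mark;

    if (dp_packet_has_flow_mark(packet, &mark)) {
        struct dp_netdev_flow *flow = mark_to_flow_find(pmd, mark);

        if (flow) {
            /* The NIC already classified the packet: skip miniflow_extract()
             * and the EMC/dpcls lookups, but still read the TCP flags. */
            uint16_t tcp_flags = parse_tcp_flags(packet);

            dp_netdev_queue_batches(packet, flow, tcp_flags,
                                    batches, n_batches);
            continue;
        }
    }
    /* Otherwise fall through to miniflow_extract() and the EMC as before. */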
> 
> Co-authored-by: Finn Christensen 
> Signed-off-by: Yuanhan Liu 
> Signed-off-by: Finn Christensen 
> Signed-off-by: Shahaf Shuler 
> ---
>  lib/dp-packet.h   |  13 +
>  lib/dpif-netdev.c |  44 --
>  lib/flow.c| 155 +++--
>  lib/flow.h|   1 +
>  4 files changed, 175 insertions(+), 38 deletions(-)
> 
> diff --git a/lib/dp-packet.h b/lib/dp-packet.h index 21c8ca5..dd3f17b
> 100644
> --- a/lib/dp-packet.h
> +++ b/lib/dp-packet.h
> @@ -691,6 +691,19 @@ reset_dp_packet_checksum_ol_flags(struct dp_packet
> *p)  #define reset_dp_packet_checksum_ol_flags(arg)
>  #endif
> 
> +static inline bool
> +dp_packet_has_flow_mark(struct dp_packet *p OVS_UNUSED,
> +uint32_t *mark OVS_UNUSED)
> +{
> +#ifdef DPDK_NETDEV
> +if (p->mbuf.ol_flags & PKT_RX_FDIR_ID) {
> +*mark = p->mbuf.hash.fdir.hi;
> +return true;
> +}
> +#endif
> +return false;
> +}
> +
>  enum { NETDEV_MAX_BURST = 32 }; /* Maximum number packets in a batch. */
> 
>  struct dp_packet_batch {
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 2fdb6ef..7489a2f
> 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -2015,6 +2015,23 @@ flow_mark_flush(struct dp_netdev_pmd_thread *pmd)
>  }
>  }
> 
> +static struct dp_netdev_flow *
> +mark_to_flow_find(const struct dp_netdev_pmd_thread *pmd,
> +  const uint32_t mark)
> +{
> +struct dp_netdev_flow *flow;
> +
> +CMAP_FOR_EACH_WITH_HASH (flow, mark_node, hash_int(mark, 0),
> + &flow_mark.mark_to_flow) {
> +if (flow->mark == mark && flow->pmd_id == pmd->core_id &&
> +flow->dead == false) {
> +return flow;
> +}
> +}
> +
> +return NULL;
> +}
> +
>  static void
>  dp_netdev_pmd_remove_flow(struct dp_netdev_pmd_thread *pmd,
>struct dp_netdev_flow *flow) @@ -5204,10
> +5221,10 @@ struct packet_batch_per_flow {  static inline void
> packet_batch_per_flow_update(struct packet_batch_per_flow *batch,
>   struct dp_packet *packet,
> - const struct miniflow *mf)
> + uint16_t tcp_flags)
>  {
>  batch->byte_count += dp_packet_size(packet);
> -batch->tcp_flags |= miniflow_get_tcp_flags(mf);
> +batch->tcp_flags |= tcp_flags;
>  batch->array.packets[batch->array.count++] = packet;  }
> 
> @@ -5241,7 +5258,7 @@ packet_batch_per_flow_execute(struct
> packet_batch_per_flow *batch,
> 
>  static inline void
>  dp_netdev_queue_batches(struct dp_packet *pkt,
> -struct dp_netdev_flow *flow, const struct
> miniflow *mf,
> +struct dp_netdev_flow *flow, uint16_t
> + tcp_flags,
>  struct packet_batch_per_flow *batches,
>  size_t *n_batches)  { @@ -5252,7 +5269,7 @@
> dp_netdev_queue_batches(struct dp_packet *pkt,
>  packet_batch_per_flow_init(batch, flow);
>  }
> 
> -packet_batch_per_flow_update(batch, pkt, mf);
> +packet_batch_per_flow_update(batch, pkt, tcp_flags);
>  }
> 
>  /* Try to process all ('cnt') the 'packets' using only the exact match
> cache @@ -5283,6 +5300,7 @@ emc_processing(struct dp_netdev_pmd_thread
> *pmd,
>  const size_t cnt = dp_packet_batch_size(packets_);
>  uint32_t cur_min;
>  int i;
> +uint16_t tcp_flags;
> 
>  atomic_read_relaxed(&pmd->dp->emc_insert_min, &cur_min);
>  pmd_perf_update_counter(&pmd->perf_stats,
> @@ -5291,6 +5309,7 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
> 
>  DP_PACKET_BATCH_REFILL_FOR_EACH (i, cnt, packet, packets_) {
>  struct dp_netdev_flow *flow;
> +uint32_t mark;
> 
>  if (OVS_UNLIKELY(dp_packet_size(packet) < ETH_HEADER_LEN)) {
>  dp_packet_delete(packet);
> @@ -5298,6 +5317,16 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
>  continue;
>  }
> 
> +if (dp_packet_has_flow_mark(packet, &mark)) {
> +flow = mark_to_flow_find(pmd, mark);
> +if (flow) {
> +tcp_flags = parse_tcp_flags(packet);
> +dp_netdev_queue_batches(packet, flow, tcp_flags, batches,
> +   

Re: [ovs-dev] [PATCH v8 3/6] netdev-dpdk: implement flow offload with rte flow

2018-04-10 Thread Stokes, Ian
> The basic yet the major part of this patch is to translate the "match"
> to rte flow patterns. And then, we create a rte flow with MARK + RSS
> actions. Afterwards, all packets matching the flow will have the mark id in
> the mbuf.
> 
> The reason RSS is needed is, for most NICs, a MARK only action is not
> allowed. It has to be used together with some other actions, such as
> QUEUE, RSS, etc. However, QUEUE action can specify one queue only, which
> may break the rss. The RSS action is currently the best we can use now.
> Thus, the RSS action is chosen.
> 
> For any unsupported flows, such as MPLS, -1 is returned, meaning the flow
> offload is failed and then skipped.
> 
> Co-authored-by: Yuanhan Liu 
> Signed-off-by: Finn Christensen 
> Signed-off-by: Yuanhan Liu 
> Signed-off-by: Shahaf Shuler 
> ---
>  lib/netdev-dpdk.c | 563 -
>  1 file changed, 562 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index af9843a..df4d480
> 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -38,7 +38,9 @@
>  #include 
>  #include 
>  #include 
> +#include 
> 
> +#include "cmap.h"
>  #include "dirs.h"
>  #include "dp-packet.h"
>  #include "dpdk.h"
> @@ -51,6 +53,7 @@
>  #include "openvswitch/list.h"
>  #include "openvswitch/ofp-print.h"
>  #include "openvswitch/vlog.h"
> +#include "openvswitch/match.h"
>  #include "ovs-numa.h"
>  #include "ovs-thread.h"
>  #include "ovs-rcu.h"
> @@ -60,6 +63,7 @@
>  #include "sset.h"
>  #include "unaligned.h"
>  #include "timeval.h"
> +#include "uuid.h"
>  #include "unixctl.h"
> 
>  enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM}; @@ -170,6 +174,17 @@ static
> const struct rte_eth_conf port_conf = {  };
> 
>  /*
> + * A mapping from ufid to dpdk rte_flow.
> + */
> +static struct cmap ufid_to_rte_flow = CMAP_INITIALIZER;
> +
> +struct ufid_to_rte_flow_data {
> +struct cmap_node node;
> +ovs_u128 ufid;
> +struct rte_flow *rte_flow;
> +};
> +
> +/*
>   * These callbacks allow virtio-net devices to be added to vhost ports
> when
>   * configuration has been fully completed.
>   */
> @@ -3709,6 +3724,552 @@ unlock:
>  return err;
>  }
> 
> +
> +/* Find rte_flow with @ufid */
> +static struct rte_flow *
> +ufid_to_rte_flow_find(const ovs_u128 *ufid) {
> +size_t hash = hash_bytes(ufid, sizeof(*ufid), 0);
> +struct ufid_to_rte_flow_data *data;
> +
> +CMAP_FOR_EACH_WITH_HASH (data, node, hash, &ufid_to_rte_flow) {
> +if (ovs_u128_equals(*ufid, data->ufid)) {
> +return data->rte_flow;
> +}
> +}
> +
> +return NULL;
> +}
> +
> +static inline void
> +ufid_to_rte_flow_associate(const ovs_u128 *ufid,
> +   struct rte_flow *rte_flow) {
> +size_t hash = hash_bytes(ufid, sizeof(*ufid), 0);
> +struct ufid_to_rte_flow_data *data = xzalloc(sizeof(*data));
> +
> +/*
> + * We should not simply overwrite an existing rte flow.
> + * We should have deleted it first before re-adding it.
> + * Thus, if following assert triggers, something is wrong:
> + * the rte_flow is not destroyed.
> + */
> +ovs_assert(ufid_to_rte_flow_find(ufid) == NULL);
> +
> +data->ufid = *ufid;
> +data->rte_flow = rte_flow;
> +
> +cmap_insert(&ufid_to_rte_flow,
> +CONST_CAST(struct cmap_node *, &data->node), hash);
> +}
> +
> +static inline void
> +ufid_to_rte_flow_disassociate(const ovs_u128 *ufid) {
> +size_t hash = hash_bytes(ufid, sizeof(*ufid), 0);
> +struct ufid_to_rte_flow_data *data;
> +
> +CMAP_FOR_EACH_WITH_HASH (data, node, hash, &ufid_to_rte_flow) {
> +if (ovs_u128_equals(*ufid, data->ufid)) {
> +cmap_remove(&ufid_to_rte_flow,
> +CONST_CAST(struct cmap_node *, &data->node),
> +hash);
> +free(data);
> +return;
> +}
> +}
> +
> +VLOG_WARN("ufid "UUID_FMT" is not associated with an rte flow\n",
> +  UUID_ARGS((struct uuid *)ufid)); }
> +
> +/*
> + * To avoid individual xrealloc calls for each new element, a 'current_max'
> + * is used to keep track of the currently allocated number of elements.
> + * Starts at 8 and doubles on each xrealloc call. */
> +struct flow_patterns {
> +struct rte_flow_item *items;
> +int cnt;
> +int current_max;
> +};
> +
> +struct flow_actions {
> +struct rte_flow_action *actions;
> +int cnt;
> +int current_max;
> +};
> +
> +static void
> +add_flow_pattern(struct flow_patterns *patterns, enum rte_flow_item_type
> type,
> + const void *spec, const void *mask) {
> +int cnt = patterns->cnt;
> +
> +if (cnt == 0) {
> +patterns->current_max = 8;
> +patterns->items = xcalloc(patterns->current_max, sizeof(struct
> rte_flow_item));

Above line exceeds the 79 character limit.

> +} else if (cnt == patterns->current_max) {
> +patterns->current_max *= 2;
> 

Re: [ovs-dev] [PATCH v8 5/6] dpif-netdev: do hw flow offload in a thread

2018-04-10 Thread Stokes, Ian
> Currently, the major trigger for hw flow offload is at upcall handling,
> which is actually in the datapath. Moreover, the hw offload installation
> and modification is not that lightweight. Meaning, if there are so many
> flows being added or modified frequently, it could stall the datapath,
> which could result in packet loss.
> 
> To diminish that, all those flow operations will be recorded and appended
> to a list. A thread is then introduced to process this list (to do the
> real flow offloading put/del operations). This could leave the datapath as
> lightweight as possible.
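
A rough sketch of what the offload thread's main loop looks like, based on the
dp_flow_offload list/mutex/cond and the dp_flow_offload_item introduced in the
diff below; the per-operation handling is elided and the function name may
differ from the actual patch:

    static void *
    dp_netdev_flow_offload_main(void *arg OVS_UNUSED)
    {
        for (;;) {
            struct dp_flow_offload_item *offload;
            struct ovs_list *list;

            ovs_mutex_lock(&dp_flow_offload.mutex);
            while (ovs_list_is_empty(&dp_flow_offload.list)) {
                /* Wait for the datapath to queue an offload request. */
                ovs_mutex_cond_wait(&dp_flow_offload.cond,
                                    &dp_flow_offload.mutex);
            }
            list = ovs_list_pop_front(&dp_flow_offload.list);
            offload = CONTAINER_OF(list, struct dp_flow_offload_item, node);
            ovs_mutex_unlock(&dp_flow_offload.mutex);

            switch (offload->op) {
            case DP_NETDEV_FLOW_OFFLOAD_OP_ADD:
            case DP_NETDEV_FLOW_OFFLOAD_OP_MOD:
                /* Install or modify the flow in hardware. */
                break;
            case DP_NETDEV_FLOW_OFFLOAD_OP_DEL:
                /* Remove the mark association and the hardware flow. */
                break;
            default:
                OVS_NOT_REACHED();
            }
            dp_netdev_free_flow_offload(offload);
        }
        return NULL;
    }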

Just a general query and not related to this patch specifically, but have you 
given any thought to statistics for the HWOL usecase? Should they be tracked in 
any way for OVS or if tracked by the NIC can they be accessed by OVS?

More comments inline below.
> 
> Signed-off-by: Yuanhan Liu 
> Signed-off-by: Shahaf Shuler 
> ---
>  lib/dpif-netdev.c | 348 -
>  1 file changed, 258 insertions(+), 90 deletions(-)
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 7489a2f..8300286
> 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -345,6 +345,12 @@ enum rxq_cycles_counter_type {
>  RXQ_N_CYCLES
>  };
> 
> +enum {
> +DP_NETDEV_FLOW_OFFLOAD_OP_ADD,
> +DP_NETDEV_FLOW_OFFLOAD_OP_MOD,
> +DP_NETDEV_FLOW_OFFLOAD_OP_DEL,
> +};
> +
>  #define XPS_TIMEOUT 50LL/* In microseconds. */
> 
>  /* Contained by struct dp_netdev_port's 'rxqs' member.  */ @@ -721,6
> +727,8 @@ static inline bool emc_entry_alive(struct emc_entry *ce);
> static void emc_clear_entry(struct emc_entry *ce);
> 
>  static void dp_netdev_request_reconfigure(struct dp_netdev *dp);
> +static void queue_netdev_flow_del(struct dp_netdev_pmd_thread *pmd,
> +  struct dp_netdev_flow *flow);
> 
>  static void
>  emc_cache_init(struct emc_cache *flow_cache) @@ -1854,13 +1862,11 @@
> struct flow_mark {
>  struct cmap megaflow_to_mark;
>  struct cmap mark_to_flow;
>  struct id_pool *pool;
> -struct ovs_mutex mutex;
>  };
> 
>  static struct flow_mark flow_mark = {
>  .megaflow_to_mark = CMAP_INITIALIZER,
>  .mark_to_flow = CMAP_INITIALIZER,
> -.mutex = OVS_MUTEX_INITIALIZER,
>  };
> 
>  static uint32_t
> @@ -2010,7 +2016,7 @@ flow_mark_flush(struct dp_netdev_pmd_thread *pmd)
> 
>  CMAP_FOR_EACH (flow, mark_node, &flow_mark.mark_to_flow) {
>  if (flow->pmd_id == pmd->core_id) {
> -mark_to_flow_disassociate(pmd, flow);
> +queue_netdev_flow_del(pmd, flow);
>  }
>  }
>  }
> @@ -2032,6 +2038,251 @@ mark_to_flow_find(const struct
> dp_netdev_pmd_thread *pmd,
>  return NULL;
>  }
> 
> +struct dp_flow_offload_item {
> +struct dp_netdev_pmd_thread *pmd;
> +struct dp_netdev_flow *flow;
> +int op;
> +struct match match;
> +struct nlattr *actions;
> +size_t actions_len;
> +
> +struct ovs_list node;
> +};
> +
> +struct dp_flow_offload {
> +struct ovs_mutex mutex;
> +struct ovs_list list;
> +pthread_cond_t cond;
> +};
> +
> +static struct dp_flow_offload dp_flow_offload = {
> +.mutex = OVS_MUTEX_INITIALIZER,
> +.list  = OVS_LIST_INITIALIZER(&dp_flow_offload.list),
> +};
> +
> +static struct ovsthread_once offload_thread_once
> += OVSTHREAD_ONCE_INITIALIZER;

The structs above are declared mid file after the pre-existing 
mark_to_flow_find function.

It would look cleaner if declared toward the beginning with the enums etc, so 
as to keep functions and structs separate.
> +
> +static struct dp_flow_offload_item *
> +dp_netdev_alloc_flow_offload(struct dp_netdev_pmd_thread *pmd,
> + struct dp_netdev_flow *flow,
> + int op)
> +{
> +struct dp_flow_offload_item *offload;
> +
> +offload = xzalloc(sizeof(*offload));
> +offload->pmd = pmd;
> +offload->flow = flow;
> +offload->op = op;
> +
> +dp_netdev_flow_ref(flow);
> +dp_netdev_pmd_try_ref(pmd);
> +
> +return offload;
> +}
> +
> +static void
> +dp_netdev_free_flow_offload(struct dp_flow_offload_item *offload) {
> +dp_netdev_pmd_unref(offload->pmd);
> +dp_netdev_flow_unref(offload->flow);
> +
> +free(offload->actions);
> +free(offload);
> +}
> +
> +static void
> +dp_netdev_append_flow_offload(struct dp_flow_offload_item *offload) {
> +ovs_mutex_lock(&dp_flow_offload.mutex);
> +ovs_list_push_back(&dp_flow_offload.list, &offload->node);
> +xpthread_cond_signal(&dp_flow_offload.cond);
> +ovs_mutex_unlock(&dp_flow_offload.mutex);
> +}
> +
> +static int
> +dp_netdev_flow_offload_del(struct dp_flow_offload_item *offload) {
> +return mark_to_flow_disassociate(offload->pmd, offload->flow); }
> +
> +/*
> + * There are two flow offload operations here: addition and modification.
> + *
> + * For flow addition, this function does:
> + * - allocate a new flow mark id
> + * - 

Re: [ovs-dev] [PATCH 1/2] vswitchd: Allow user to directly specify sFlow agent address.

2018-04-10 Thread Ben Pfaff
It's currently IPv4 only but it's a good idea to add IPv6 support.  I'll
put that on my to-do list.

It's also a good idea to warn about the potential perils of this
feature.  I'll do that too.

I'll try to get to this soon, neither feature should be much work.

On Tue, Apr 10, 2018 at 10:13:24AM -0700, Neil McKee wrote:
> No objections.  Sounds useful for testing,  and for situations where
> the agent-address selection has been made by a higher controller.
> Just one question: it seems like it is only IPv4: can an IPv6
> agent-address be configured this way too?
> 
> I'll admit to this triggering bad memories of a situation when two
> switches were sending sFlow using each other's agent address,  so
> perhaps a warning not to use this "manually" would be appropriate?
> 
> --
> Neil McKee
> InMon Corp.
> http://www.inmon.com
> 
> 
> On Sat, Mar 31, 2018 at 5:12 PM, Ben Pfaff  wrote:
> > At least for testing purposes, and perhaps in production, it is useful to
> > be able to fix the agent IP address directly, rather than indirecting it
> > through a device name or the routing table.
> >
> > This commit uses this feature to fix the agent IP address used in the unit
> > tests.  This will be particularly useful in an upcoming commit that
> > disables the use of the system routing table in the unit tests, to make
> > the tests' results independent of the host's routes.
> >
> > CC: Neil McKee 
> > Signed-off-by: Ben Pfaff 
> > ---
> >  ofproto/ofproto-dpif-sflow.c |  3 ++-
> >  tests/ofproto-dpif.at|  8 
> >  vswitchd/vswitch.xml | 25 ++---
> >  3 files changed, 24 insertions(+), 12 deletions(-)
> >
> > diff --git a/ofproto/ofproto-dpif-sflow.c b/ofproto/ofproto-dpif-sflow.c
> > index 60e1b4e0a661..5d8c0e19f8e3 100644
> > --- a/ofproto/ofproto-dpif-sflow.c
> > +++ b/ofproto/ofproto-dpif-sflow.c
> > @@ -453,7 +453,8 @@ sflow_choose_agent_address(const char *agent_device,
> >  agent_addr->type = SFLADDRESSTYPE_IP_V4;
> >
> >  if (agent_device) {
> > -if (!netdev_get_in4_by_name(agent_device, )) {
> > +if (!netdev_get_in4_by_name(agent_device, )
> > +|| !lookup_ip(agent_device, )) {
> >  goto success;
> >  }
> >  }
> > diff --git a/tests/ofproto-dpif.at b/tests/ofproto-dpif.at
> > index d2058eddd3eb..60f28e2a08dd 100644
> > --- a/tests/ofproto-dpif.at
> > +++ b/tests/ofproto-dpif.at
> > @@ -6333,7 +6333,7 @@ ovs-vsctl \
> >set Interface p1 options:ifindex=1003 -- \
> >set Bridge br0 sflow=@sf --  \
> >--id=@sf create sflow targets=\"127.0.0.1:$SFLOW_PORT\"  \
> > -  header=128 sampling=1 polling=1
> > +  header=128 sampling=1 polling=1 agent=127.0.0.1
> >
> >  dnl sleep long enough to get the sFlow datagram flushed out (may be 
> > delayed for up to 1 second)
> >  AT_CHECK([ovs-appctl time/warp 2000 100], [0], [ignore])
> > @@ -6382,7 +6382,7 @@ AT_CHECK([ovs-ofctl add-flow br0 action=3])
> >  dnl enable sflow
> >  ovs-vsctl \
> > set Bridge br0 sflow=@sf -- \
> > -   --id=@sf create sflow targets=\"127.0.0.1:$SFLOW_PORT\" \
> > +   --id=@sf create sflow targets=\"127.0.0.1:$SFLOW_PORT\" agent=127.0.0.1 
> > \
> >   header=128 sampling=1 polling=0
> >
> >  dnl introduce a packet that will be flooded to the tunnel
> > @@ -6474,7 +6474,7 @@ AT_CHECK([ovs-appctl netdev-dummy/receive p0 
> > 'recirc_id(0),in_port(1),eth(src=f8
> >  dnl configure sflow on int-br only
> >  ovs-vsctl \
> > set Bridge int-br sflow=@sf -- \
> > -   --id=@sf create sflow targets=\"127.0.0.1:$SFLOW_PORT\" \
> > +   --id=@sf create sflow targets=\"127.0.0.1:$SFLOW_PORT\" agent=127.0.0.1 
> > \
> >   header=128 sampling=1 polling=0
> >
> >  dnl set up route to 192.168.1.2 via br0
> > @@ -6553,7 +6553,7 @@ ovs-appctl time/stop
> >  dnl configure sflow
> >  ovs-vsctl \
> > set Bridge br0 sflow=@sf -- \
> > -   --id=@sf create sflow targets=\"127.0.0.1:$SFLOW_PORT\" \
> > +   --id=@sf create sflow targets=\"127.0.0.1:$SFLOW_PORT\" agent=127.0.0.1 
> > \
> >   header=128 sampling=1 polling=0
> >
> >  AT_CHECK([ovs-appctl netdev-dummy/receive p1 
> > 'in_port(1),eth(src=50:54:00:00:00:09,dst=50:54:00:00:00:0a),eth_type(0x8847),mpls(label=11,tc=3,ttl=64,bos=1)'])
> > diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> > index f899a19764a4..9c2a8263e604 100644
> > --- a/vswitchd/vswitch.xml
> > +++ b/vswitchd/vswitch.xml
> > @@ -5099,13 +5099,24 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 
> > type=patch options:peer=p1 \
> >  monitoring of switches.
> >
> >  
> > -  Name of the network device whose IP address should be reported as the
> > -  ``agent address'' to collectors.  If not specified, the agent device 
> > is
> > -  figured from the first target address and the routing table.  If the
> > -  routing table does not contain a route to the target, the IP 

Re: [ovs-dev] [branch-2.9 PATCH 1/2] netdev-dpdk: Free mempool only when no in-use mbufs.

2018-04-10 Thread Kevin Traynor
On 04/09/2018 03:36 PM, Kevin Traynor wrote:
> On 04/06/2018 04:51 PM, Ilya Maximets wrote:
 DPDK mempools are freed when they are no longer needed.
 This can happen when a port is removed or a port's mtu is reconfigured so
 that a new mempool is used.

 It is possible that an mbuf is attempted to be returned to a freed mempool
 from NIC Tx queues and this can lead to a segfault.

 In order to prevent this, only free mempools when they are not needed and
 have no in-use mbufs. As this might not be possible immediately, sweep the
 mempools anytime a port tries to get a mempool.

>>>
>>> Thanks for working on this Kevin. Just a query below. From a testing POV I 
>>> didn't come across any issues.
>>>
 Fixes: 8d38823bdf8b ("netdev-dpdk: fix memory leak")
 Cc: mark.b.kavanagh81 at gmail.com
 Signed-off-by: Kevin Traynor 
 ---

 Found on OVS 2.6, but code is operationally similar on recent the branch-
 2.*'s. If the patch is acceptable I can backport to older branches.
>>
>> This issue was actually raised back in January while discussing the mempool issue.
>>
> 
> Hmmm, thanks for pointing it out. Seems the last mails in the thread
> coincided with when I was traveling and I didn't read them back then.
> 
>>>
>>> Sounds good to me.

  lib/netdev-dpdk.c | 40 +++-
  1 file changed, 27 insertions(+), 13 deletions(-)

 diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index c19cedc..9b8e732
 100644
 --- a/lib/netdev-dpdk.c
 +++ b/lib/netdev-dpdk.c
 @@ -590,8 +590,24 @@ dpdk_mp_create(int socket_id, int mtu)  }

 +/* Free unused mempools. */
 +static void
 +dpdk_mp_sweep(void) OVS_REQUIRES(dpdk_mp_mutex) {
 +struct dpdk_mp *dmp, *next;
 +
 +LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_list) {
 +if (!dmp->refcount && rte_mempool_full(dmp->mp)) {
>>>
>>> I hadn't looked at rte_mempool_full() before. I noticed the documentation 
>>> warns not to use it as part of the datapath but for debug purposes only.
>>
>> Yes, rte_mempool_full() is racy and could return false-positives, but it's
>> the best and only useful function in DPDK API. In practice we can never be
>> sure if someone still uses the memory pool. I also used this function in
>> my solution for branch 2.8 posted here:
>>
>> https://mail.openvswitch.org/pipermail/ovs-discuss/2018-January/046100.html
>>
> 
> The similarities and differences are interesting!
> 
>> Memory handling in DPDK is awful. In fact, noone is able to use 
>> rte_mempool_free(),
>> because there is no correct way to check if a memory pool is still in use.
>>
>> There is one dirty hack to check if mempool is really full. It based on the
>> internal mempool implementation and the fact that we're not allocating any
>> mbufs from the pool:
>> 1.  Check rte_mempool_ops_get_count(). <-- this exposed externally for some 
>> reason
>> 2.  Check rte_mempool_full().
>> 3.  Check rte_mempool_ops_get_count() again.
>> 4.  If rte_mempool_full() was 'true' and both calls to
>>     rte_mempool_ops_get_count() returned the same value --> mempool is really
>>     full.
>> 5.  Here we could safely call rte_mempool_free().
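
For reference, a small sketch of that check; the helper name is made up, while
rte_mempool_full() and rte_mempool_ops_get_count() are existing DPDK calls:

    #include <stdbool.h>
    #include <rte_mempool.h>

    /* Returns true only if 'mp' looks full and the count of objects in its
     * backing ring did not change across the rte_mempool_full() call, i.e.
     * no mbufs were moving between per-lcore caches and the ring while we
     * looked (steps 1-4 above). */
    static bool
    dpdk_mp_really_full(const struct rte_mempool *mp)
    {
        unsigned int count = rte_mempool_ops_get_count(mp);

        if (!rte_mempool_full(mp)) {
            return false;
        }
        return rte_mempool_ops_get_count(mp) == count;
    }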
>>
> 
> Yes, that would work.
> 

I've just sent a v2 (for branch-2.9 even though I forgot it in the
subject this time). It calls rte_mempool_full() twice before freeing.
Similar to above, as we know we are not getting mbufs from the mempool
anymore, the second call will confirm that we didn't have a race in the
first call.

Ilya, if you are ok with the patch, let me know if I should add you as
co-author as you had similar patch previously.

Kevin.

> The ring count is made before the cache count in rte_mempool_full() and
> as we are not getting mbufs I think we should only have the possibility
> of mbufs moving from a cache to the ring. In that case we may get a
> false negative but wouldn't get a false positive for rte_mempool_full().
> Obviously the downside is that it is implementation specific - but maybe
> it could be argued the implementation will not change on already
> released DPDK versions. What do you think?

>>> Does its use add to the downtime in a noticeable way while we reconfigure? 
>>> I'm thinking of a deployment where the number of lcores is high and we 
>>> reconfigure often. When cache is enabled, it has to browse the length of 
>>> all lcores. There may not be an issue here but was curious.
>>
> 
> I think it should be insignificant compared to actually
> allocating/freeing a mempool. I will try to check it.
> 
>> That is the interesting question. In my version (referenced above) I used
>> watchdog thread to periodically free mempools with zero refcount.
>> Not sure which solution is better.
>>
> 
> The advantage I see of being on the watchdog timer thread is that
> probably no one cares how long it takes :-) The advantage of being in
> the mempool config is that it runs (and guarantees 

[ovs-dev] [PATCH v2 2/2] netdev-dpdk: Add mempool reuse/free debug.

2018-04-10 Thread Kevin Traynor
There is debug when a new mempool is created, but not
when it is reused or freed. Add these as it is very
difficult to debug mempool issues from logs without
them.

Signed-off-by: Kevin Traynor 
---
 lib/netdev-dpdk.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 00306c4..88ef156 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -605,4 +605,5 @@ dpdk_mp_sweep(void) OVS_REQUIRES(dpdk_mp_mutex)
  */
 if (OVS_LIKELY(rte_mempool_full(dmp->mp))) {
+VLOG_DBG("Freeing mempool \"%s\"", dmp->mp->name);
 ovs_list_remove(&dmp->list_node);
 rte_mempool_free(dmp->mp);
@@ -622,4 +623,5 @@ dpdk_mp_get(int socket_id, int mtu)
 LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
 if (dmp->socket_id == socket_id && dmp->mtu == mtu) {
+VLOG_DBG("Reusing mempool \"%s\"", dmp->mp->name);
 dmp->refcount++;
 reuse = true;
-- 
1.8.3.1



[ovs-dev] [PATCH v2 1/2] netdev-dpdk: Free mempool only when no in-use mbufs.

2018-04-10 Thread Kevin Traynor
DPDK mempools are freed when they are no longer needed.
This can happen when a port is removed or a port's mtu
is reconfigured so that a new mempool is used.

It is possible that an mbuf is attempted to be returned
to a freed mempool from NIC Tx queues and this can lead
to a segfault.

In order to prevent this, only free mempools when they
are not needed and have no in-use mbufs. As this might
not be possible immediately, sweep the mempools anytime
a port tries to get a mempool.

Fixes: 8d38823bdf8b ("netdev-dpdk: fix memory leak")
Cc: mark.b.kavanag...@gmail.com
Cc: Ilya Maximets 
Reported-by: Venkatesan Pradeep 
Signed-off-by: Kevin Traynor 
---

v2: Add second call to rte_mempool_full() as it is not atomic
so we can't trust one call to it.

 lib/netdev-dpdk.c | 48 +++-
 1 file changed, 35 insertions(+), 13 deletions(-)

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index c19cedc..00306c4 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -590,8 +590,32 @@ dpdk_mp_create(int socket_id, int mtu)
 }
 
+/* Free unused mempools. */
+static void
+dpdk_mp_sweep(void) OVS_REQUIRES(dpdk_mp_mutex)
+{
+struct dpdk_mp *dmp, *next;
+
+LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_list) {
+if (!dmp->refcount && rte_mempool_full(dmp->mp)) {
+/*
+ * rte_mempool_full() is not atomic, but at this point
+ * we will not be getting any more mbufs from the mempool
+ * so if it reports full again on a subsequent call, then
+ * we can be sure that it really is full.
+ */
+if (OVS_LIKELY(rte_mempool_full(dmp->mp))) {
+ovs_list_remove(&dmp->list_node);
+rte_mempool_free(dmp->mp);
+rte_free(dmp);
+}
+}
+}
+}
+
 static struct dpdk_mp *
 dpdk_mp_get(int socket_id, int mtu)
 {
 struct dpdk_mp *dmp;
+bool reuse = false;
 
 ovs_mutex_lock(&dpdk_mp_mutex);
@@ -599,15 +623,18 @@ dpdk_mp_get(int socket_id, int mtu)
 if (dmp->socket_id == socket_id && dmp->mtu == mtu) {
 dmp->refcount++;
-goto out;
+reuse = true;
+break;
 }
 }
+/* Sweep mempools after reuse or before create. */
+dpdk_mp_sweep();
 
-dmp = dpdk_mp_create(socket_id, mtu);
-if (dmp) {
-ovs_list_push_back(&dpdk_mp_list, &dmp->list_node);
+if (!reuse) {
+dmp = dpdk_mp_create(socket_id, mtu);
+if (dmp) {
+ovs_list_push_back(&dpdk_mp_list, &dmp->list_node);
+}
 }
 
-out:
-
 ovs_mutex_unlock(&dpdk_mp_mutex);
 
@@ -615,5 +642,5 @@ out:
 }
 
-/* Release an existing mempool. */
+/* Decrement reference to a mempool. */
 static void
 dpdk_mp_put(struct dpdk_mp *dmp)
@@ -625,10 +652,5 @@ dpdk_mp_put(struct dpdk_mp *dmp)
 ovs_mutex_lock(&dpdk_mp_mutex);
 ovs_assert(dmp->refcount);
-
-if (!--dmp->refcount) {
-ovs_list_remove(&dmp->list_node);
-rte_mempool_free(dmp->mp);
-rte_free(dmp);
- }
+dmp->refcount--;
 ovs_mutex_unlock(&dpdk_mp_mutex);
 }
-- 
1.8.3.1



Re: [ovs-dev] [PATCH 1/2] vswitchd: Allow user to directly specify sFlow agent address.

2018-04-10 Thread Neil McKee
No objections.  Sounds useful for testing,  and for situations where
the agent-address selection has been made by a higher controller.
Just one question: it seems like it is only IPv4: can an IPv6
agent-address be configured this way too?

I'll admit to this triggering bad memories of a situation when two
switches were sending sFlow using each other's agent address,  so
perhaps a warning not to use this "manually" would be appropriate?

--
Neil McKee
InMon Corp.
http://www.inmon.com


On Sat, Mar 31, 2018 at 5:12 PM, Ben Pfaff  wrote:
> At least for testing purposes, and perhaps in production, it is useful to
> be able to fix the agent IP address directly, rather than indirecting it
> through a device name or the routing table.
>
> This commit uses this feature to fix the agent IP address used in the unit
> tests.  This will be particularly useful in an upcoming commit that
> disables the use of the system routing table in the unit tests, to make
> the tests' results independent of the host's routes.
>
> CC: Neil McKee 
> Signed-off-by: Ben Pfaff 
> ---
>  ofproto/ofproto-dpif-sflow.c |  3 ++-
>  tests/ofproto-dpif.at|  8 
>  vswitchd/vswitch.xml | 25 ++---
>  3 files changed, 24 insertions(+), 12 deletions(-)
>
> diff --git a/ofproto/ofproto-dpif-sflow.c b/ofproto/ofproto-dpif-sflow.c
> index 60e1b4e0a661..5d8c0e19f8e3 100644
> --- a/ofproto/ofproto-dpif-sflow.c
> +++ b/ofproto/ofproto-dpif-sflow.c
> @@ -453,7 +453,8 @@ sflow_choose_agent_address(const char *agent_device,
>  agent_addr->type = SFLADDRESSTYPE_IP_V4;
>
>  if (agent_device) {
> -if (!netdev_get_in4_by_name(agent_device, )) {
> +if (!netdev_get_in4_by_name(agent_device, )
> +|| !lookup_ip(agent_device, )) {
>  goto success;
>  }
>  }
> diff --git a/tests/ofproto-dpif.at b/tests/ofproto-dpif.at
> index d2058eddd3eb..60f28e2a08dd 100644
> --- a/tests/ofproto-dpif.at
> +++ b/tests/ofproto-dpif.at
> @@ -6333,7 +6333,7 @@ ovs-vsctl \
>set Interface p1 options:ifindex=1003 -- \
>set Bridge br0 sflow=@sf --  \
>--id=@sf create sflow targets=\"127.0.0.1:$SFLOW_PORT\"  \
> -  header=128 sampling=1 polling=1
> +  header=128 sampling=1 polling=1 agent=127.0.0.1
>
>  dnl sleep long enough to get the sFlow datagram flushed out (may be delayed 
> for up to 1 second)
>  AT_CHECK([ovs-appctl time/warp 2000 100], [0], [ignore])
> @@ -6382,7 +6382,7 @@ AT_CHECK([ovs-ofctl add-flow br0 action=3])
>  dnl enable sflow
>  ovs-vsctl \
> set Bridge br0 sflow=@sf -- \
> -   --id=@sf create sflow targets=\"127.0.0.1:$SFLOW_PORT\" \
> +   --id=@sf create sflow targets=\"127.0.0.1:$SFLOW_PORT\" agent=127.0.0.1 \
>   header=128 sampling=1 polling=0
>
>  dnl introduce a packet that will be flooded to the tunnel
> @@ -6474,7 +6474,7 @@ AT_CHECK([ovs-appctl netdev-dummy/receive p0 
> 'recirc_id(0),in_port(1),eth(src=f8
>  dnl configure sflow on int-br only
>  ovs-vsctl \
> set Bridge int-br sflow=@sf -- \
> -   --id=@sf create sflow targets=\"127.0.0.1:$SFLOW_PORT\" \
> +   --id=@sf create sflow targets=\"127.0.0.1:$SFLOW_PORT\" agent=127.0.0.1 \
>   header=128 sampling=1 polling=0
>
>  dnl set up route to 192.168.1.2 via br0
> @@ -6553,7 +6553,7 @@ ovs-appctl time/stop
>  dnl configure sflow
>  ovs-vsctl \
> set Bridge br0 sflow=@sf -- \
> -   --id=@sf create sflow targets=\"127.0.0.1:$SFLOW_PORT\" \
> +   --id=@sf create sflow targets=\"127.0.0.1:$SFLOW_PORT\" agent=127.0.0.1 \
>   header=128 sampling=1 polling=0
>
>  AT_CHECK([ovs-appctl netdev-dummy/receive p1 
> 'in_port(1),eth(src=50:54:00:00:00:09,dst=50:54:00:00:00:0a),eth_type(0x8847),mpls(label=11,tc=3,ttl=64,bos=1)'])
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index f899a19764a4..9c2a8263e604 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -5099,13 +5099,24 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 
> type=patch options:peer=p1 \
>  monitoring of switches.
>
>  
> -  Name of the network device whose IP address should be reported as the
> -  ``agent address'' to collectors.  If not specified, the agent device is
> -  figured from the first target address and the routing table.  If the
> -  routing table does not contain a route to the target, the IP address
> -  defaults to the  in the
> -  collector's .  If an agent IP address cannot 
> be
> -  determined any of these ways, sFlow is disabled.
> +  
> +Determines the agent address, that is, the IP address reported to
> +collectors as the source of the sFlow data.  It may be an IP address 
> or
> +the name of a network device.  In the latter case, the network 
> device's
> +IP address is used,
> +  
> +
> +  
> +If not specified, the agent device is figured from the first target

[ovs-dev] Home Office or Working from Home

2018-04-10 Thread Contras, desventajas y limitaciones
 

 
Home Office Implementation 
April 17 - Interactive webinar

Introduction:

Companies have noticed that being flexible about working hours and workspaces 
brings them multiple benefits: more productivity from their employees, savings 
on resources and facilities, a better working environment... 



Topics to be covered: 

1. Home office, or working from home, the new way of working.
2. Cons, disadvantages and limitations.
3. Company policies.
4. Supervision and control of the employee who works from home.
5. Overview of home office around the world.


 
 
Agenda and registration:

Reply to this message with "Home" + PHONE + NAME, or call:

045 + 5515546630 
 

 




Re: [ovs-dev] [branch-2.9 PATCH 1/2] netdev-dpdk: Free mempool only when no in-use mbufs.

2018-04-10 Thread Kevin Traynor
On 04/09/2018 03:36 PM, Kevin Traynor wrote:
> On 04/06/2018 04:51 PM, Ilya Maximets wrote:
 DPDK mempools are freed when they are no longer needed.
 This can happen when a port is removed or a port's mtu is reconfigured so
 that a new mempool is used.

 It is possible that an mbuf is attempted to be returned to a freed mempool
 from NIC Tx queues and this can lead to a segfault.

 In order to prevent this, only free mempools when they are not needed and
 have no in-use mbufs. As this might not be possible immediately, sweep the
 mempools anytime a port tries to get a mempool.

>>>
>>> Thanks for working on this Kevin. Just a query below. From a testing POV I 
>>> didn't come across any issues.
>>>
 Fixes: 8d38823bdf8b ("netdev-dpdk: fix memory leak")
 Cc: mark.b.kavanagh81 at gmail.com
 Signed-off-by: Kevin Traynor 
 ---

 Found on OVS 2.6, but code is operationally similar on recent the branch-
 2.*'s. If the patch is acceptable I can backport to older branches.
>>
>> This issue was actually raised back in January while discussing the mempool issue.
>>
> 
> Hmmm, thanks for pointing it out. Seems the last mails in the thread
> coincided with when I was traveling and I didn't read them back then.
> 
>>>
>>> Sounds good to me.

  lib/netdev-dpdk.c | 40 +++-
  1 file changed, 27 insertions(+), 13 deletions(-)

 diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index c19cedc..9b8e732
 100644
 --- a/lib/netdev-dpdk.c
 +++ b/lib/netdev-dpdk.c
 @@ -590,8 +590,24 @@ dpdk_mp_create(int socket_id, int mtu)  }

 +/* Free unused mempools. */
 +static void
 +dpdk_mp_sweep(void) OVS_REQUIRES(dpdk_mp_mutex)
 +{
 +struct dpdk_mp *dmp, *next;
 +
 +LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_list) {
 +if (!dmp->refcount && rte_mempool_full(dmp->mp)) {
>>>
>>> I hadn't looked at rte_mempool_full() before. I noticed the documentation 
>>> warns not to use it as part of the datapath but for debug purposes only.
>>
>> Yes, rte_mempool_full() is racy and could return false-positives, but it's
>> the best and only useful function in DPDK API. In practice we can never be
>> sure if someone still uses the memory pool. I also used this function in
>> my solution for branch 2.8 posted here:
>>
>> https://mail.openvswitch.org/pipermail/ovs-discuss/2018-January/046100.html
>>
> 
> The similarities and differences are interesting!
> 
>> Memory handling in DPDK is awful. In fact, no one is able to use 
>> rte_mempool_free(),
>> because there is no correct way to check if a memory pool is still in use.
>>
>> There is one dirty hack to check if a mempool is really full. It is based on the
>> internal mempool implementation and the fact that we're not allocating any
>> mbufs from the pool:
>> 1.  Check rte_mempool_ops_get_count(). <-- this is exposed externally for some 
>> reason
>> 2.  Check rte_mempool_full().
>> 3.  Check rte_mempool_ops_get_count() again.
>> 4.  If rte_mempool_full() was 'true' and both calls to
>> rte_mempool_ops_get_count() returned the same value --> the mempool is really
>> full.
>> 5.  Here we could safely call rte_mempool_free().
>>
> 
> Yes, that would work.
> 
> The ring count is made before the cache count in rte_mempool_full() and
> as we are not getting mbufs I think we should only have the possibility
> of mbufs moving from a cache to the ring. In that case we may get a
> false negative but wouldn't get a false positive for rte_mempool_full().
> Obviously the downside is that it is implementation specific - but maybe
> it could be argued the implementation will not change on already
> released DPDK versions. What do you think?
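
To make that sequence concrete, here is a minimal sketch (illustrative only,
not from any posted patch; it assumes nothing allocates from the pool while
the check runs):

#include <stdbool.h>
#include <rte_mempool.h>

static bool
dpdk_mp_really_full(struct rte_mempool *mp)
{
    /* Step 1/3: snapshot the backing ring's object count. */
    unsigned int count_before = rte_mempool_ops_get_count(mp);
    /* Step 2: the racy "is it full?" check (ring + per-lcore caches). */
    bool full = rte_mempool_full(mp);
    /* Step 3: the count must not have moved, or an allocation raced us. */
    unsigned int count_after = rte_mempool_ops_get_count(mp);

    return full && count_before == count_after;
}

Only if this returns true would rte_mempool_free() be called, matching step 5
above.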
> 
>>>
>>> Does its use add to the downtime in a noticeable way while we reconfigure? 
>>> I'm thinking of a deployment where the number of lcores is high and we 
>>> reconfigure often. When cache is enabled, it has to browse the length of 
>>> all lcores. There may not be an issue here but was curious.
>>
> 
> I think it should be insignificant compared to actually
> allocating/freeing a mempool. I will try to check it.
> 

I measured 4K-5K cycles, so extra time for rte_mempool_full() is not an
issue in the control path.

>> That is the interesting question. In my version (referenced above) I used
>> a watchdog thread to periodically free mempools with zero refcount.
>> Not sure which solution is better.
>>
> 
> The advantage I see of being on the watchdog timer thread is that
> probably no one cares how long it takes :-) The advantage of being in
> the mempool config is that it runs (and guarantees to run) just before a
> new mempool is going to be requested, so there is smaller risk of
> additional memory requirements.
> 
> In the short term I think we should use a fix like the patches that
> don't cover every possible corner case, or the get_count() approach that isn't the
> most elegant. Longer term we can think 

Re: [ovs-dev] [PATCH] rhel: user/group openvswitch does not exist

2018-04-10 Thread Timothy Redaelli
On Tue, 10 Apr 2018 09:49:36 -0400
Aaron Conole  wrote:

> From: Alan Pevec 
> 
> Default ownership[1] for config files is failing on an empty system:
>   Running scriptlet: openvswitch-2.9.0-3.fc28.x86_64
> warning: user openvswitch does not exist - using root
> warning: group openvswitch does not exist - using root
> ...
> 
> Required user/group need to be created in %pre as documented in
> Fedora guideline[2]
> 
> [1]
> https://github.com/openvswitch/ovs/commit/951d79e638ecdb3b1dcd19df1adb2ff91fe61af8
> 
> [2]
> https://fedoraproject.org/wiki/Packaging:UsersAndGroups#Dynamic_allocation
> 
> Submitted-at: https://github.com/openvswitch/ovs/pull/223
> Signed-off-by: Alan Pevec 
> Co-authored-by: Aaron Conole 
> Signed-off-by: Aaron Conole 
> ---
> NOTE: This differs from the pull request upstream as I've also moved
> the dpdk section to %pre, after talking with Alan.
> 
>  rhel/openvswitch-fedora.spec.in | 19 +--
>  1 file changed, 13 insertions(+), 6 deletions(-)
> 
> diff --git a/rhel/openvswitch-fedora.spec.in
> b/rhel/openvswitch-fedora.spec.in index 4f2398d97..4b9831674 100644
> --- a/rhel/openvswitch-fedora.spec.in
> +++ b/rhel/openvswitch-fedora.spec.in
> @@ -92,6 +92,7 @@ Requires: openssl hostname iproute module-init-tools
>  #Upstream kernel commit 4f647e0a3c37b8d5086214128614a136064110c3
>  #Requires: kernel >= 3.15.0-0
>  
> +Requires(pre): shadow-utils
>  Requires(post): /usr/bin/getent
>  Requires(post): /usr/sbin/useradd
>  Requires(post): /usr/bin/sed

"Requires(post): /usr/bin/getent" and
"Requires(post): /usr/sbin/useradd" should be removed since they are
deprecated by "Requires(pre): shadow-utils" and not used anymore in
%post. Besides that it looks good to me.

> @@ -384,17 +385,23 @@ rm -rf $RPM_BUILD_ROOT
>  fi
>  %endif
>  
> +%pre
> +getent group openvswitch >/dev/null || groupadd -r openvswitch
> +getent passwd openvswitch >/dev/null || \
> +useradd -r -g openvswitch -d / -s /sbin/nologin \
> +-c "Open vSwitch Daemons" openvswitch
> +
> +%if %{with dpdk}
> +getent group hugetlbfs >/dev/null || groupadd hugetlbfs
> +usermod -a -G hugetlbfs openvswitch
> +%endif
> +exit 0
>  %post
>  if [ $1 -eq 1 ]; then
> -getent passwd openvswitch >/dev/null || \
> -useradd -r -d / -s /sbin/nologin -c "Open vSwitch Daemons" openvswitch
> -
>  sed -i 's:^#OVS_USER_ID=:OVS_USER_ID=:' /etc/sysconfig/openvswitch 
>  %if %{with dpdk}
> -getent group hugetlbfs >/dev/null || \
> -groupadd hugetlbfs
> -usermod -a -G hugetlbfs openvswitch
>  sed -i \
>  
> 's@OVS_USER_ID="openvswitch:openvswitch"@OVS_USER_ID="openvswitch:hugetlbfs"@'\
>  /etc/sysconfig/openvswitch



-- 
Timothy Redaelli
Software Engineer
Red Hat Italia


Re: [ovs-dev] [PATCH] rhel: user/group openvswitch does not exist

2018-04-10 Thread Markos Chandras
On 10/04/18 14:49, Aaron Conole wrote:
> +%pre
> +getent group openvswitch >/dev/null || groupadd -r openvswitch
> +getent passwd openvswitch >/dev/null || \
> +useradd -r -g openvswitch -d / -s /sbin/nologin \
> +-c "Open vSwitch Daemons" openvswitch
> +
> +%if %{with dpdk}
> +getent group hugetlbfs >/dev/null || groupadd hugetlbfs
> +usermod -a -G hugetlbfs openvswitch
> +%endif
> +exit 0

Maybe this 'exit 0' is not actually needed here?

-- 
markos

SUSE LINUX GmbH | GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg) Maxfeldstr. 5, D-90409, Nürnberg



Re: [ovs-dev] [PATCH] rhel: user/group openvswitch does not exist

2018-04-10 Thread Markos Chandras
On 10/04/18 14:49, Aaron Conole wrote:
> From: Alan Pevec 
> 
> Default ownership[1] for config files is failing on an empty system:
>   Running scriptlet: openvswitch-2.9.0-3.fc28.x86_64
> warning: user openvswitch does not exist - using root
> warning: group openvswitch does not exist - using root
> ...
> 
> Required user/group need to be created in %pre as documented in
> Fedora guideline[2]
> 
> [1] 
> https://github.com/openvswitch/ovs/commit/951d79e638ecdb3b1dcd19df1adb2ff91fe61af8
> 
> [2] https://fedoraproject.org/wiki/Packaging:UsersAndGroups#Dynamic_allocation
> 
> Submitted-at: https://github.com/openvswitch/ovs/pull/223
> Signed-off-by: Alan Pevec 
> Co-authored-by: Aaron Conole 
> Signed-off-by: Aaron Conole 
> ---

Looks good to me.

Reviewed-by: Markos Chandras 

-- 
markos

SUSE LINUX GmbH | GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg) Maxfeldstr. 5, D-90409, Nürnberg



[ovs-dev] [PATCH] rhel: user/group openvswitch does not exist

2018-04-10 Thread Aaron Conole
From: Alan Pevec 

Default ownership[1] for config files is failing on an empty system:
  Running scriptlet: openvswitch-2.9.0-3.fc28.x86_64
warning: user openvswitch does not exist - using root
warning: group openvswitch does not exist - using root
...

Required user/group need to be created in %pre as documented in
Fedora guideline[2]

[1] 
https://github.com/openvswitch/ovs/commit/951d79e638ecdb3b1dcd19df1adb2ff91fe61af8

[2] https://fedoraproject.org/wiki/Packaging:UsersAndGroups#Dynamic_allocation

Submitted-at: https://github.com/openvswitch/ovs/pull/223
Signed-off-by: Alan Pevec 
Co-authored-by: Aaron Conole 
Signed-off-by: Aaron Conole 
---
NOTE: This differs from the pull request upstream as I've also moved the dpdk
  section to %pre, after talking with Alan.

 rhel/openvswitch-fedora.spec.in | 19 +--
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/rhel/openvswitch-fedora.spec.in b/rhel/openvswitch-fedora.spec.in
index 4f2398d97..4b9831674 100644
--- a/rhel/openvswitch-fedora.spec.in
+++ b/rhel/openvswitch-fedora.spec.in
@@ -92,6 +92,7 @@ Requires: openssl hostname iproute module-init-tools
 #Upstream kernel commit 4f647e0a3c37b8d5086214128614a136064110c3
 #Requires: kernel >= 3.15.0-0
 
+Requires(pre): shadow-utils
 Requires(post): /usr/bin/getent
 Requires(post): /usr/sbin/useradd
 Requires(post): /usr/bin/sed
@@ -384,17 +385,23 @@ rm -rf $RPM_BUILD_ROOT
 fi
 %endif
 
+%pre
+getent group openvswitch >/dev/null || groupadd -r openvswitch
+getent passwd openvswitch >/dev/null || \
+useradd -r -g openvswitch -d / -s /sbin/nologin \
+-c "Open vSwitch Daemons" openvswitch
+
+%if %{with dpdk}
+getent group hugetlbfs >/dev/null || groupadd hugetlbfs
+usermod -a -G hugetlbfs openvswitch
+%endif
+exit 0
+
 %post
 if [ $1 -eq 1 ]; then
-getent passwd openvswitch >/dev/null || \
-useradd -r -d / -s /sbin/nologin -c "Open vSwitch Daemons" openvswitch
-
 sed -i 's:^#OVS_USER_ID=:OVS_USER_ID=:' /etc/sysconfig/openvswitch
 
 %if %{with dpdk}
-getent group hugetlbfs >/dev/null || \
-groupadd hugetlbfs
-usermod -a -G hugetlbfs openvswitch
 sed -i \
 
's@OVS_USER_ID="openvswitch:openvswitch"@OVS_USER_ID="openvswitch:hugetlbfs"@'\
 /etc/sysconfig/openvswitch
-- 
2.14.3
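
(A quick way to double-check the resulting scriptlets after a build;
illustrative commands, not part of the patch:)

# inspect the %pre/%post scriptlets baked into the built package
rpm -qp --scripts openvswitch-2.9.*.rpm
# or, on an installed system
rpm -q --scripts openvswitch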



Re: [ovs-dev] [PATCH 1/1] netdev-dpdk: fix RX jumbo for NICs not supporting scatter

2018-04-10 Thread Eelco Chaudron

On 10/04/18 11:36, Pablo Cascón wrote:

Currently, RX of jumbo packets fails for NICs not supporting scatter.
Scatter is not strictly needed for jumbo support on RX. This change
fixes the issue by only enabling scatter for NICs supporting it.

Reported-by: Louis Peens 
Signed-off-by: Pablo Cascón 
Reviewed-by: Simon Horman 
---
  lib/netdev-dpdk.c | 5 -
  1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index ee39cbe..28b20b5 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -694,11 +694,14 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int 
n_rxq, int n_txq)
  int diag = 0;
  int i;
  struct rte_eth_conf conf = port_conf;
+struct rte_eth_dev_info info;
  
  /* For some NICs (e.g. Niantic), scatter_rx mode needs to be explicitly
   * enabled. */
  if (dev->mtu > ETHER_MTU) {
-conf.rxmode.enable_scatter = 1;
+rte_eth_dev_info_get(dev->port_id, &info);
+if (info.rx_offload_capa & DEV_RX_OFFLOAD_SCATTER)
+conf.rxmode.enable_scatter = 1;
  }
  
  conf.rxmode.hw_ip_checksum = (dev->hw_ol_features &


Change looks good to me.


Acked-by: Eelco Chaudron 




Re: [ovs-dev] [RFC 0/2] dpdk: minor refactor of the initialization step

2018-04-10 Thread Mooney, Sean K


> -Original Message-
> From: Aaron Conole [mailto:acon...@redhat.com]
> Sent: Monday, April 9, 2018 4:32 PM
> To: Mooney, Sean K 
> Cc: d...@openvswitch.org; Stokes, Ian ; Kevin
> Traynor ; Ilya Maximets ;
> Loftus, Ciara ; Terry Wilson
> ; Assaf Muller 
> Subject: Re: [RFC 0/2] dpdk: minor refactor of the initialization step
> 
> "Mooney, Sean K"  writes:
> 
> > So just from a deployment tools point of view I would like to point
> > out that this change could break existing workflows that deploy ovs in
> > a docker container.
> > Kolla ansible assumes that if the docker ovs_vswitchd container is
> > still running, then it is in fact running in dpdk mode when we set
> > dpdk-init=true.
> 
> Is there a way to test this out and see the behavior?
[Mooney, Sean K] well you could use kolla to deploy ovs-dpdk :)
When I wrote the code I relied on the existing behavior:
when kolla ansible is deploying openstack, we first deploy the ovsdb.
https://github.com/openstack/kolla-ansible/blob/4c39ea7eccd946775722646eec19f9ea5cbe6eb5/ansible/roles/ovs-dpdk/handlers/main.yml#L2-L37
Then we start the ovs-vswitchd container
https://github.com/openstack/kolla-ansible/blob/4c39ea7eccd946775722646eec19f9ea5cbe6eb5/ansible/roles/ovs-dpdk/handlers/main.yml#L55-L73
finally we configure the bridges and physical interfaces.
https://github.com/openstack/kolla-ansible/blob/4c39ea7eccd946775722646eec19f9ea5cbe6eb5/ansible/roles/ovs-dpdk/handlers/main.yml#L75-L90

the "- name: Ensuring ovsdpdk bridges are properly setup named" task does not 
use --no-wait when creating bridges and adding interfaces so it
will fail if the vswitchd is not running. This will result in ansible stopping 
to run any futher task on that node and reporting the error
to the user. If for some reason the ensure bridge task passed the next task 
that check an ip is assigned to the ovs bride would fail.
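
(Illustrative commands only, not taken from kolla-ansible, to show why the
failure surfaces there:)

# Without --no-wait, ovs-vsctl blocks until ovs-vswitchd has applied the
# change, so it fails (or times out) if the daemon has crashed:
ovs-vsctl --timeout=10 add-br br-ex
# With --no-wait, only the database is updated and the command "succeeds"
# even when ovs-vswitchd is not running:
ovs-vsctl --no-wait add-br br-ex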
> 
> It does seem strange that for a possible configuration error we abort()
[Mooney, Sean K] Why? I would expect this to be standard behavior for any daemon,
e.g. the daemon would validate that its config is correct and exit if it is invalid.
If we don't abort, the vswitch is in an undefined state. Is it still using
hugepages, for example, if the eal init fails after they are allocated?
> running the vswitchd (and with --monitor set, it will continue to
> abort() over and over - so I guess you're also not using the monitor
> thread?).  In the case that an abort does happen, does the Kolla script
> distinguish between issues where dpdk setup failed vs. some other
> software issue?
> 
> > Can I request that if you make this change you add something along
> the
> > lines of dpdk-init-is-fatal=true/false so that we can explicitly say
> which behavior we want.
> > I would not be surprised if people have built monitoring around "is
> > the ovs-vswitchd running"
> 
> I think they have, but I don't know that they use it to infer such low-
> level details (meaning a crash implies that dpdk configuration is
> wrong).
[Mooney, Sean K] They don't use it to infer that the dpdk configuration is wrong,
but rather that some configuration was wrong. dpdk-init is currently considered
fatal if it fails, so it was treated the same as any other error that would have
caused the vswitchd process to exit. I believe the opnfv community used
the liveliness of the vswitchd process, and in the future the dpdk keepalive
functionality, to set the dataplane status field in neutron for the host.
This allows openstack to be aware of dataplane outages.
> 
> > To infer at least at a high level that "everything is fine", whereas
> > the log message/db field proposed here would invalidate that.
> 
> I've added Assaf Mueller from our Open Stack team as well - maybe he
> has some additional details on those mechanisms outside of Kolla (maybe
> it exists in some kind of director / other software too, as you point
> out).
> 
> > it would be easy to check that field but it's work that needs to be
> > done in multiple places.
> 
> I think such a knob wouldn't be useful.  I believe it would either have
> to be defaulted to 'dpdk-init-is-fatal=true' to abort on failure (which
> most users would want to change making it an undesirable default)
[Mooney, Sean K] I would argue against that; I would never deploy with
dpdk-init-is-fatal=false. If your dataplane does not start, what is the point
of running ovs at all? It will not be able to forward packets.
, or
> the Kolla ansible scripts (and other detection mechanisms for dpdk
> failure - if they exist) would need to change.  Maybe there's another
> approach, though?
> 
> >> -Original Message-
> >> From: Aaron Conole [mailto:acon...@redhat.com]
> >> Sent: Thursday, April 5, 2018 10:23 PM
> >> To: d...@openvswitch.org
> >> Cc: Stokes, Ian ; Kevin Traynor
> >> 

Re: [ovs-dev] Mempool issue for OVS 2.9

2018-04-10 Thread Stokes, Ian
> >> -Original Message-
> >> From: Ilya Maximets [mailto:i.maxim...@samsung.com]
> >> Sent: Monday, 29 January, 2018 09:35
> >> To: Jan Scheurich ; Venkatesan Pradeep
> >> ; Stokes, Ian
> >> ; d...@openvswitch.org
> >> Cc: Kevin Traynor ; Flavio Leitner
> >> ; Loftus, Ciara ; Kavanagh,
> >> Mark B ; Ben Pfaff (b...@ovn.org)
> >> ; acon...@redhat.com; disc...@openvswitch.org
> >> Subject: Re: Mempool issue for OVS 2.9
> >>
> >> On 29.01.2018 11:19, Jan Scheurich wrote:
> >>> Hi,
> >>>
> >>> I'd like to take one step back and look at how many mbufs we
> >>> actually need.
> >>>
> >>> Today mbufs are consumed in the following places:
> >>>
> >>>  1. Rx queues of **physical** dpdk ports:
> >>>     dev->requested_n_rxq * dev->requested_rxq_size
> >>>     Note 1: These mbufs are hogged up at all times.
> >>>     Note 2: There is little point in configuring more rx queues per
> >>>     phy port than there are PMDs to poll them.
> >>>     Note 3: The rx queues of vhostuser ports exist as virtqueues in
> >>>     the guest and do not hog mbufs.
> >>>  2. One batch per PMD during processing: #PMD * NETDEV_MAX_BURST
> >>>  3. One batch per tx queue with time-based tx batching:
> >>>     dev->requested_n_txq * NETDEV_MAX_BURST
> >>>  4. Tx queues of **physical** ports:
> >>>     dev->requested_n_txq * expected peak tx queue fill level
> >>>     Note 1: The maximum of 2K mbufs per tx queue can only be reached
> >>>     if the OVS transmit rate exceeds the line rate for a long time.
> >>>     This can only happen for large packets and when the traffic
> >>>     originates from VMs on the compute node. This would be a case of
> >>>     under-dimensioning and packets would be dropped in any case. Excluding
> >>>     that scenario, a typical peak tx queue fill level would be when all PMDs
> >>>     transmit a full batch at the same time: #PMDs * NETDEV_MAX_BURST.
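
(A purely illustrative back-of-the-envelope example, with assumed numbers that
are not from this thread: 2 physical ports, 2 rx and 4 tx queues per port,
2048 rx descriptors per queue, 4 PMDs, NETDEV_MAX_BURST = 32, and the peak tx
fill level assumed above:)

  1. phy rx queues:   2 * 2 * 2048     = 8192 mbufs
  2. per-PMD batches: 4 * 32           =  128 mbufs
  3. tx batching:     2 * 4 * 32       =  256 mbufs
  4. phy tx queues:   2 * 4 * (4 * 32) = 1024 mbufs
  total                                ~ 9600 mbufs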
> >>
> >> Above assumption is wrong. Just look at ixgbe driver:
> >> drivers/net/ixgbe/ixgbe_rxtx.c: tx_xmit_pkts():
> >>
> >>    /*
> >>     * Begin scanning the H/W ring for done descriptors when the
> >>     * number of available descriptors drops below tx_free_thresh. For
> >>     * each done descriptor, free the associated buffer.
> >>     */
> >>    if (txq->nb_tx_free < txq->tx_free_thresh)
> >>        ixgbe_tx_free_bufs(txq);
> >>
> >> The default value for 'tx_free_thresh' is 32. So, if I'll configure
> >> number of TX descriptors to 4096, driver will start to free mbufs
> >> only when it will have more than 4063 mbufs inside its TX queue. No
> >> matter how frequent calls to send() function.
> >
> > OK, but that doesn't change my general argument. The mbufs hogged in the
> tx side of the phy port driver are coming from all ports (least likely the
> port itself). Considering them in dimensioning the port's private mempool
> is conceptually wrong. In my simplified dimensioning formula below I have
> already assumed full occupancy of the tx queue for phy ports. The second
> key observation is that vhostuser ports do not hog mbufs at all. And vhost
> zero copy doesn't change that.
> 
> The formula below may be good for a static environment. I want to change the number
> of PMD threads dynamically in my deployments, and this works in the current
> per-port model and with oversized shared pool. If we'll try to reduce
> memory consumption of the shared pool we'll have to reconfigure all the
> devices each time we change the number of PMD threads. This would be
> really bad.
> So, size of the memory pool should not depend on dynamic characteristics
> of the datapath or other ports to avoid unexpected interrupts in traffic
> flows in case of random changes in configuration. Of course, it could
> depend on characteristics of the port itself in case of per-port model. In
> case of shared mempool model the size should only depend on static
> datapath configuration.

Hi all,

Now seems a good time to kick-start this conversation again as there are a few 
patches floating around for mempools on master and 2.9.
I'm happy to work on a solution for this but before starting I'd like to agree 
on the requirements so we're all comfortable with the solution.

I see two use cases above, static and dynamic. Each has its own requirements 
(I'm keeping OVS 2.10 in mind here as it's an issue we need to resolve).

Static environment
1. For a given deployment, the 2.10 mempool design should use the same or 
less memory as the shared mempool design of 2.9.
2. Memory pool size can depend on static datapath configurations, but the 
previous provisioning used in OVS 2.9 is acceptable also.

I think the shared mempool model suits the static environment, it's a rough way 
of provisioning memory but it works for the majority involved in the discussion 
to date.

Dynamic environment
1. Mempool size should not depend on dynamic characteristics (number 

[ovs-dev] [PATCH 1/1] netdev-dpdk: fix RX jumbo for NICs not supporting scatter

2018-04-10 Thread Pablo Cascón
Currently, RX of jumbo packets fails for NICs not supporting scatter.
Scatter is not strictly needed for jumbo support on RX. This change
fixes the issue by only enabling scatter for NICs supporting it.

Reported-by: Louis Peens 
Signed-off-by: Pablo Cascón 
Reviewed-by: Simon Horman 
---
 lib/netdev-dpdk.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index ee39cbe..28b20b5 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -694,11 +694,14 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int 
n_rxq, int n_txq)
 int diag = 0;
 int i;
 struct rte_eth_conf conf = port_conf;
+struct rte_eth_dev_info info;
 
 /* For some NICs (e.g. Niantic), scatter_rx mode needs to be explicitly
  * enabled. */
 if (dev->mtu > ETHER_MTU) {
-conf.rxmode.enable_scatter = 1;
+rte_eth_dev_info_get(dev->port_id, &info);
+if (info.rx_offload_capa & DEV_RX_OFFLOAD_SCATTER)
+conf.rxmode.enable_scatter = 1;
 }
 
 conf.rxmode.hw_ip_checksum = (dev->hw_ol_features &
-- 
2.7.4



Re: [ovs-dev] [Suspected-Phishing] [PATCH v8 0/6] OVS-DPDK flow offload with rte_flow

2018-04-10 Thread Stokes, Ian
> Tuesday, March 27, 2018 10:55 AM, Shahaf Shuler:
> 
> Hi,
> 
> Any comments on this version?

I should have some time to look at this today. I would also echo the request 
for anyone else in the community interested to review.

Ian
> 
> >
> > Hi,
> >
> > Here is a joint work from Mellanox and Napatech, to enable the flow hw
> > offload with the DPDK generic flow interface (rte_flow).
> >
> > The basic idea is to associate the flow with a mark id (a uint32_t
> > number).
> > Later, we then get the flow directly from the mark id, which could
> > bypass some heavy CPU operations, including but not limited to mini
> > flow extract, emc lookup, dpcls lookup, etc.
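
(A self-contained toy sketch of that fast path; all names here are
hypothetical, not the series' actual API:)

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy mark-to-flow table: the NIC writes a mark into the packet metadata
 * and the datapath uses it to find the flow directly, skipping miniflow
 * extract, EMC and dpcls lookups. */
#define MAX_MARKS 4096

struct toy_flow {
    uint32_t mark;
    /* ... matched fields, actions, stats ... */
};

static struct toy_flow *mark_to_flow[MAX_MARKS];

static struct toy_flow *
flow_lookup_by_mark(bool has_mark, uint32_t mark)
{
    if (has_mark && mark < MAX_MARKS) {
        return mark_to_flow[mark];  /* O(1) hit: bypass the heavy path */
    }
    return NULL;                    /* miss: fall back to the full lookup */
}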
> >
> > The association is done with CMAP in patch 1. The CPU workload
> > bypassing is done in patch 2. The flow offload is done in patch 3,
> > which mainly does two
> > things:
> >
> > - translate the ovs match to DPDK rte flow patterns
> > - bind those patterns with a RSS + MARK action.
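
(A sketch of the general shape of such an action list in the rte_flow API;
illustrative only. The series builds the pattern array from the OVS match,
which is omitted here, and the rss_conf layout differs between DPDK releases,
so it is passed as an opaque, assumed-valid pointer:)

#include <rte_flow.h>

static struct rte_flow *
offload_with_mark(uint16_t port_id, const struct rte_flow_item pattern[],
                  uint32_t flow_mark, const void *rss_conf)
{
    const struct rte_flow_attr attr = { .ingress = 1 };
    const struct rte_flow_action_mark mark = { .id = flow_mark };
    const struct rte_flow_action actions[] = {
        { .type = RTE_FLOW_ACTION_TYPE_MARK, .conf = &mark },
        { .type = RTE_FLOW_ACTION_TYPE_RSS,  .conf = rss_conf },
        { .type = RTE_FLOW_ACTION_TYPE_END },
    };
    struct rte_flow_error error;

    return rte_flow_create(port_id, &attr, pattern, actions, &error);
}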
> >
> > Patch 5 makes the offload work happen in another thread, for leaving
> > the datapath as light as possible.
> >
> > A PHY-PHY forwarding with 1000 mega flows (udp,tp_src=1000-1999) and 1
> > million streams (tp_src=1000-1999, tp_dst=2000-2999) shows more than a
> > 260% performance boost.
> >
> > Note that it's disabled by default, which can be enabled by:
> >
> > $ ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
> >
> > v8: - enhanced documentation with more details on supported protocols
> > - fixed VLOG to start with capital letter
> > - fixed compilation issues
> > - fixed coding style
> > - addressed the rest of Ian's comments
> >
> > v7: - fixed wrong hash for mark_to_flow that has been refactored in v6
> > - set the rss_conf for rss action to NULL, to workaround a mlx5
> change
> >   in DPDK v17.11. Note that it will obey the rss settings OVS-DPDK
> has
> >   set in the beginning. Thus, nothing should be affected.
> >
> > v6: - fixed a sparse warning
> > - added documentation
> > - used hash_int to compute mark to flow hash
> > - added more comments
> > - added lock for pot lookup
> > - rebased on top of the latest code
> >
> > v5: - fixed an issue that it took too long if we do flow add/remove
> >   repeatedly.
> > - removed an unused mutex lock
> > - turned most of the log level to DBG
> > - rebased on top of the latest code
> >
> > v4: - use RSS action instead of QUEUE action with MARK
> > - make it work with multiple queue (see patch 1)
> > - rebased on top of latest code
> >
> > v3: - The mark and id association is done with array instead of CMAP.
> > - Added a thread to do hw offload operations
> > - Removed macros completely
> > - dropped the patch to set FDIR_CONF, which is a workaround for some
> >   Intel NICs.
> > - Added a debug patch to show all flow patterns we have created.
> > - Misc fixes
> >
> > v2: - workaround the queue action issue
> > - fixed the tcp_flags being skipped issue, which also fixed the
> >   build warnings
> > - fixed l2 patterns for Intel nic
> > - Converted some macros to functions
> > - did not hardcode the max number of flow/action
> > - rebased on top of the latest code
> >
> > Thanks.
> >
> > ---
> >
> > Finn Christensen (1):
> >   netdev-dpdk: implement flow offload with rte flow
> >
> > Yuanhan Liu (5):
> >   dpif-netdev: associate flow with a mark id
> >   dpif-netdev: retrieve flow directly from the flow mark
> >   netdev-dpdk: add debug for rte flow patterns
> >   dpif-netdev: do hw flow offload in a thread
> >   Documentation: document ovs-dpdk flow offload
> >
> >  Documentation/howto/dpdk.rst |  22 ++
> >  NEWS |   3 +-
> >  lib/dp-packet.h  |  13 +
> >  lib/dpif-netdev.c| 497 -
> >  lib/flow.c   | 155 ++--
> >  lib/flow.h   |   1 +
> >  lib/netdev-dpdk.c| 740
> > +-
> >  lib/netdev.h |   6 +
> >  8 files changed, 1397 insertions(+), 40 deletions(-)
> >
> > --
> > 2.7.4
> >