Re: [ovs-dev] ovn-controller is taking 100% CPU all the time in one deployment

2019-08-30 Thread Han Zhou
On Thu, Aug 29, 2019 at 11:36 PM Numan Siddique  wrote:
>
>
>
> On Fri, Aug 30, 2019 at 1:04 AM Han Zhou  wrote:
>>
>>
>>
>> On Thu, Aug 29, 2019 at 12:16 PM Numan Siddique 
wrote:
>>>
>>>
>>>
>>> On Fri, Aug 30, 2019 at 12:37 AM Han Zhou  wrote:



 On Thu, Aug 29, 2019 at 11:40 AM Numan Siddique 
wrote:
 >
 > Hello Everyone,
 >
 > In one of the OVN deployments, we are seeing 100% CPU usage by
ovn-controllers all the time.
 >
 > After investigations we found the below
 >
 >  - ovn-controller is taking more than 20 seconds to complete full
loop (mainly in lflow_run() function)
 >
 >  - The physical switch is sending GARPs periodically every 10
seconds.
 >
 >  - There is ovn-bridge-mappings configured and these GARP packets
reaches br-int via the patch port.
 >
 >  - We have a flow in router pipeline which applies the action -
put_arp
 > if it is arp packet.
 >
 >  - ovn-controller pinctrl thread receives these garps, stores the
learnt mac-ips in the 'put_mac_bindings' hmap and notifies the
ovn-controller main thread by incrementing the seq no.
 >
 >  - In the ovn-controller main thread, after lflow_run() finishes,
pinctrl_wait() is called. This function calls - poll_immediate_wake() as
'put_mac_bindings' hmap is not empty.
 >
 > - This causes the ovn-controller poll_block() to not sleep at all
and this repeats all the time resulting in 100% cpu usage.
 >
 > The deployment has OVS/OVN 2.9.  We have back ported the
pinctrl_thread patch.
 >
 > Some time back I had reported an issue about lflow_run() taking lot
of time -
https://mail.openvswitch.org/pipermail/ovs-dev/2019-July/360414.html
 >
 > I think we need to improve the logical processing sooner or later.
 >
 > But to fix this issue urgently, we are thinking of the below
approach.
 >
 >  - pinctrl_thread will locally cache the mac_binding entries (just
like it caches the dns entries). (Please note pinctrl_thread can not access
the SB DB IDL).
 >
 > - Upon receiving any arp packet (via the put_arp action),
pinctrl_thread will check the local mac_binding cache and will only wake up
the main ovn-controller thread only if the mac_binding update is required.
 >
 > This approach will solve the issue since the MAC sent by the
physical switches will not change. So there is no need to wake up
ovn-controller main thread.
 >
 > In the present master/2.12 these GARPs will not cause this 100% cpu
loop issue because incremental processing will not recompute flows.
 >
 > Even though the above approach is not really required for
master/2.12, I think it is still Ok to have this as there is no harm.
 >
 > I would like to know your comments and any concerns if any.
 >
 > Thanks
 > Numan
 >

 Hi Numan,

 I think this approach should work. Just to make sure, to update the
cache efficiently (to avoid another kind of recompute), it should use ovsdb
change-tracking to update it incrementally.
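
For reference, a rough sketch of what change-tracking-driven maintenance of
such a cache could look like (the SBREC_* macros and ovsdb_idl_track_* calls
are the existing IDL facilities; the cache helpers in the comments are
hypothetical, and the real patch may look different):

#include "ovn/lib/ovn-sb-idl.h"   /* generated SB IDL bindings */
#include "ovsdb-idl.h"

static void
sync_local_mac_binding_cache(struct ovsdb_idl *sb_idl)
{
    const struct sbrec_mac_binding *mb;

    /* Only rows that changed since the last call are visited. */
    SBREC_MAC_BINDING_FOR_EACH_TRACKED (mb, sb_idl) {
        if (sbrec_mac_binding_is_deleted(mb)) {
            /* local_cache_remove(mb->logical_port, mb->ip); */
        } else {
            /* local_cache_add_or_update(mb->logical_port, mb->ip, mb->mac); */
        }
    }
}

/* During initialization:           ovsdb_idl_track_add_all(sb_idl);
 * after each main-loop iteration:  ovsdb_idl_track_clear(sb_idl);  */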

 Regarding master/2.12, it is not harmful except that it will add some
more code and increase memory footprint. For our current use cases, there
can be easily 10,000s mac_bindings, but it may still be ok because each
entry is very small. However, is there any benefit for doing this in
master/2.12?
>>>
>>>
>>> I don't see much benefit. But I can't submit a patch to branch 2.9
without the fix getting merged in master first, right?
>>> Maybe once it is merged in branch 2.9, we can consider deleting it?
>>>
>> I think it is just about how you would maintain a downstream branch.
Since it is downstream, I don't think you need a change to be in upstream
before fixing a problem. In this case it may be *no harm*, but what if the
upstream is completely changed and incompatible with such a fix any more? It
shouldn't prevent you from fixing your downstream. (Of course it is better
to not have downstream at all, but sometimes it is useful to have it for a
temporary period, and since you (and us, too) are already there ... :)
>
>
> The downstream 2.9 we have is OVS 2.9.0 + a bunch of patches (to
fix issues) which are already merged upstream (preferably an upstream branch
or at least upstream master).  Any downstream-only patch is frowned upon.
When we upgrade to 2.10 or higher versions there is a risk of functional
changes if the patch is not upstream.
>
> If we apply the approach I described above to downstream 2.9, then
there is definitely some functional change. When such GARPs are received,
in the case of our downstream 2.9 we will not wake up the ovn-controller main
thread,
> but with 2.12/master, we wake up the ovn-controller main thread.
>
> I still see no harm in having this in upstream master. Maybe instead of
having a complete clone of mac_bindings, we can have a subset of
mac_bindings cached only if those mac_bindings are learnt by an
ovn-controller.
>
> I will explore 

Re: [ovs-dev] [ovs-discuss] ovn-controller is taking 100% CPU all the time in one deployment

2019-08-30 Thread Han Zhou
On Fri, Aug 30, 2019 at 1:25 PM Numan Siddique  wrote:
>
> Hi Han,
>
> I am thinking of this approach to solve this problem. I still need to
test it.
> If you have any comments or concerns do let me know.
>
>
> **
> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
> index 9a282..a83b56362 100644
> --- a/northd/ovn-northd.c
> +++ b/northd/ovn-northd.c
> @@ -6552,6 +6552,41 @@ build_lrouter_flows(struct hmap *datapaths, struct
hmap *ports,
>
>  }
>
> +        /* Handle GARP reply packets received on a distributed router
> +         * gateway port. GARP reply broadcast packets could be sent by
> +         * external switches. We don't want them to be handled by all the
> +         * ovn-controllers if they receive it. So add a priority-92 flow to
> +         * apply the put_arp action on a redirect chassis and drop it on
> +         * other chassis.
> +         * Note that we are already adding a priority-90 logical flow in
> +         * the table S_ROUTER_IN_IP_INPUT to apply the put_arp action if
> +         * arp.op == 2.
> +         * */
> +        if (op->od->l3dgw_port && op == op->od->l3dgw_port
> +            && op->od->l3redirect_port) {
> +            for (int i = 0; i < op->lrp_networks.n_ipv4_addrs; i++) {
> +                ds_clear(&match);
> +                ds_put_format(&match,
> +                              "inport == %s && is_chassis_resident(%s) && "
> +                              "eth.bcast && arp.op == 2 && arp.spa == %s/%u",
> +                              op->json_key,
> +                              op->od->l3redirect_port->json_key,
> +                              op->lrp_networks.ipv4_addrs[i].network_s,
> +                              op->lrp_networks.ipv4_addrs[i].plen);
> +                ovn_lflow_add(lflows, op->od, S_ROUTER_IN_IP_INPUT, 92,
> +                              ds_cstr(&match),
> +                              "put_arp(inport, arp.spa, arp.sha);");
> +                ds_clear(&match);
> +                ds_put_format(&match,
> +                              "inport == %s && !is_chassis_resident(%s) && "
> +                              "eth.bcast && arp.op == 2 && arp.spa == %s/%u",
> +                              op->json_key,
> +                              op->od->l3redirect_port->json_key,
> +                              op->lrp_networks.ipv4_addrs[i].network_s,
> +                              op->lrp_networks.ipv4_addrs[i].plen);
> +                ovn_lflow_add(lflows, op->od, S_ROUTER_IN_IP_INPUT, 92,
> +                              ds_cstr(&match), "drop;");
> +            }
> +        }
> +
>          /* A set to hold all load-balancer vips that need ARP responses. */
>          struct sset all_ips = SSET_INITIALIZER(&all_ips);
>          int addr_family;
> *
>
> If a physical switch sends GARP request packets we have existing logical
flows
> which handle them only on the gateway chassis.
>
> But if the physical switch sends GARP reply packets, then these packets
> are handled by ovn-controllers where bridge mappings are configured.
> I think it's good enough if the gateway chassis handles these packets.
>
> In the deployment where we are seeing this issue, the physical switch
sends GARP reply
> packets.
>
> Thanks
> Numan
>
>
Hi Numan,

I think both GARP request and reply should be handled on all chassis. It
should work not only for physical switches, but also for virtual workloads.
At least our current use cases rely on that.

Thanks,
Han


Re: [ovs-dev] [ovs-discuss] ovn-controller is taking 100% CPU all the time in one deployment

2019-08-30 Thread Numan Siddique
Hi Han,

I am thinking of this approach to solve this problem. I still need to test
it.
If you have any comments or concerns do let me know.


**
diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
index 9a282..a83b56362 100644
--- a/northd/ovn-northd.c
+++ b/northd/ovn-northd.c
@@ -6552,6 +6552,41 @@ build_lrouter_flows(struct hmap *datapaths, struct
hmap *ports,

 }

+        /* Handle GARP reply packets received on a distributed router
+         * gateway port. GARP reply broadcast packets could be sent by
+         * external switches. We don't want them to be handled by all the
+         * ovn-controllers if they receive it. So add a priority-92 flow to
+         * apply the put_arp action on a redirect chassis and drop it on
+         * other chassis.
+         * Note that we are already adding a priority-90 logical flow in
+         * the table S_ROUTER_IN_IP_INPUT to apply the put_arp action if
+         * arp.op == 2.
+         * */
+        if (op->od->l3dgw_port && op == op->od->l3dgw_port
+            && op->od->l3redirect_port) {
+            for (int i = 0; i < op->lrp_networks.n_ipv4_addrs; i++) {
+                ds_clear(&match);
+                ds_put_format(&match,
+                              "inport == %s && is_chassis_resident(%s) && "
+                              "eth.bcast && arp.op == 2 && arp.spa == %s/%u",
+                              op->json_key,
+                              op->od->l3redirect_port->json_key,
+                              op->lrp_networks.ipv4_addrs[i].network_s,
+                              op->lrp_networks.ipv4_addrs[i].plen);
+                ovn_lflow_add(lflows, op->od, S_ROUTER_IN_IP_INPUT, 92,
+                              ds_cstr(&match),
+                              "put_arp(inport, arp.spa, arp.sha);");
+                ds_clear(&match);
+                ds_put_format(&match,
+                              "inport == %s && !is_chassis_resident(%s) && "
+                              "eth.bcast && arp.op == 2 && arp.spa == %s/%u",
+                              op->json_key,
+                              op->od->l3redirect_port->json_key,
+                              op->lrp_networks.ipv4_addrs[i].network_s,
+                              op->lrp_networks.ipv4_addrs[i].plen);
+                ovn_lflow_add(lflows, op->od, S_ROUTER_IN_IP_INPUT, 92,
+                              ds_cstr(&match), "drop;");
+            }
+        }
+
         /* A set to hold all load-balancer vips that need ARP responses. */
         struct sset all_ips = SSET_INITIALIZER(&all_ips);
         int addr_family;
*

If a physical switch sends GARP request packets we have existing logical
flows
which handle them only on the gateway chassis.

But if the physical switch sends GARP reply packets, then these packets
are handled by ovn-controllers where bridge mappings are configured.
I think it's good enough if the gateway chassis handles these packets.

In the deployment where we are seeing this issue, the physical switch sends
GARP reply
packets.

Thanks
Numan
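
For illustration, with the patch above and hypothetical names (a distributed
gateway router port "lrp-ext" whose chassisredirect port is "cr-lrp-ext", and
a 172.24.4.0/24 network configured on it), the two generated matches would
look like:

inport == "lrp-ext" && is_chassis_resident("cr-lrp-ext") && eth.bcast &&
arp.op == 2 && arp.spa == 172.24.4.0/24   -> put_arp(inport, arp.spa, arp.sha);

inport == "lrp-ext" && !is_chassis_resident("cr-lrp-ext") && eth.bcast &&
arp.op == 2 && arp.spa == 172.24.4.0/24   -> drop;

so only the chassis where the chassisredirect port is resident processes the
GARP reply, and every other chassis drops it in the same table.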


On Fri, Aug 30, 2019 at 11:50 PM Han Zhou  wrote:

> On Fri, Aug 30, 2019 at 6:46 AM Mark Michelson 
> wrote:
> >
> > On 8/30/19 5:39 AM, Daniel Alvarez Sanchez wrote:
> > > On Thu, Aug 29, 2019 at 10:01 PM Mark Michelson 
> wrote:
> > >>
> > >> On 8/29/19 2:39 PM, Numan Siddique wrote:
> > >>> Hello Everyone,
> > >>>
> > >>> In one of the OVN deployments, we are seeing 100% CPU usage by
> > >>> ovn-controllers all the time.
> > >>>
> > >>> After investigations we found the below
> > >>>
> > >>>- ovn-controller is taking more than 20 seconds to complete full
> loop
> > >>> (mainly in lflow_run() function)
> > >>>
> > >>>- The physical switch is sending GARPs periodically every 10
> seconds.
> > >>>
> > >>>- There is ovn-bridge-mappings configured and these GARP packets
> > >>> reaches br-int via the patch port.
> > >>>
> > >>>- We have a flow in router pipeline which applies the action -
> put_arp
> > >>> if it is arp packet.
> > >>>
> > >>>- ovn-controller pinctrl thread receives these garps, stores the
> > >>> learnt mac-ips in the 'put_mac_bindings' hmap and notifies the
> > >>> ovn-controller main thread by incrementing the seq no.
> > >>>
> > >>>- In the ovn-controller main thread, after lflow_run() finishes,
> > >>> pinctrl_wait() is called. This function calls - poll_immediate_wake()
> as
> > >>> 'put_mac_bindings' hmap is not empty.
> > >>>
> > >>> - This causes the ovn-controller poll_block() to not sleep at all and
> > >>> this repeats all the time resulting in 100% cpu usage.
> > >>>
> > >>> The deployment has OVS/OVN 2.9.  We have back ported the
> pinctrl_thread
> > >>> patch.
> > >>>
> > >>> Some time back I had reported an issue about lflow_run() taking lot
> of
> > >>> time -
> https://mail.openvswitch.org/pipermail/ovs-dev/2019-July/360414.html
> > >>>
> > >>> I think we need to improve the logical processing sooner or later.
> > >>
> > >> I agree 

Re: [ovs-dev] [ovs-discuss] ovn-controller is taking 100% CPU all the time in one deployment

2019-08-30 Thread Han Zhou
On Fri, Aug 30, 2019 at 6:46 AM Mark Michelson  wrote:
>
> On 8/30/19 5:39 AM, Daniel Alvarez Sanchez wrote:
> > On Thu, Aug 29, 2019 at 10:01 PM Mark Michelson 
wrote:
> >>
> >> On 8/29/19 2:39 PM, Numan Siddique wrote:
> >>> Hello Everyone,
> >>>
> >>> In one of the OVN deployments, we are seeing 100% CPU usage by
> >>> ovn-controllers all the time.
> >>>
> >>> After investigations we found the below
> >>>
> >>>- ovn-controller is taking more than 20 seconds to complete full
loop
> >>> (mainly in lflow_run() function)
> >>>
> >>>- The physical switch is sending GARPs periodically every 10
seconds.
> >>>
> >>>- There is ovn-bridge-mappings configured and these GARP packets
> >>> reaches br-int via the patch port.
> >>>
> >>>- We have a flow in router pipeline which applies the action -
put_arp
> >>> if it is arp packet.
> >>>
> >>>- ovn-controller pinctrl thread receives these garps, stores the
> >>> learnt mac-ips in the 'put_mac_bindings' hmap and notifies the
> >>> ovn-controller main thread by incrementing the seq no.
> >>>
> >>>- In the ovn-controller main thread, after lflow_run() finishes,
> >>> pinctrl_wait() is called. This function calls - poll_immediate_wake()
as
> >>> 'put_mac_bindings' hmap is not empty.
> >>>
> >>> - This causes the ovn-controller poll_block() to not sleep at all and
> >>> this repeats all the time resulting in 100% cpu usage.
> >>>
> >>> The deployment has OVS/OVN 2.9.  We have back ported the
pinctrl_thread
> >>> patch.
> >>>
> >>> Some time back I had reported an issue about lflow_run() taking lot of
> >>> time -
https://mail.openvswitch.org/pipermail/ovs-dev/2019-July/360414.html
> >>>
> >>> I think we need to improve the logical processing sooner or later.
> >>
> >> I agree that this is very important. I know that logical flow
processing
> >> is the biggest bottleneck for ovn-controller, but 20 seconds is just
> >> ridiculous. In your scale testing, you found that lflow_run() was
taking
> >> 10 seconds to complete.
> > I support this statement 100% (20 seconds is just ridiculous). To be
> > precise, in this deployment we see over 23 seconds for the main loop
> > to process and I've seen even 30 seconds some times. I've been talking
> > to Numan these days about this issue and I support profiling this
> > actual deployment so that we can figure out how incremental processing
> > would help.
> >
> >>
> >> I'm curious if there are any factors in this particular deployment's
> >> configuration that might contribute to this. For instance, does this
> >> deployment have a glut of ACLs? Are they not using port groups?
> > They're not using port groups because it's 2.9 and it is not there.
> > However, I don't think port groups would make a big difference in
> > terms of ovn-controller computation. I might be wrong but Port Groups
> > help reduce the number of ACLs in the NB database while the # of
> > Logical Flows would still remain the same. We'll try to get the
> > contents of the NB database and figure out what's killing it.
> >
>
> You're right that port groups won't reduce the number of logical flows.

I think port groups reduce the number of logical flows significantly, and also
reduce OVS flows when conjunctive matches are effective.
Please see my calculation here:
https://www.slideshare.net/hanzhou1978/large-scale-overlay-networks-with-ovn-problems-and-solutions/30
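
As a concrete illustration of the difference (hypothetical names; the exact
savings depend on the NB contents): with port groups, a single ACL such as

outport == @pg_web && ip4.src == $pg_web_ip4 && tcp.dst == 443

is written once against the group, so northd emits one logical flow for the
ACL instead of one per port/address pair, and ovn-controller can translate
the port-group and address-set references into OpenFlow conjunctive matches
rather than expanding the full cross product of ports and addresses.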

> However, it can reduce the computation in ovn-controller. The reason is
> that the logical flows generated by ACLs that use port groups may result
> in conjunctive matches being used. If you want a bit more information,
> see the "Port groups" section of this blog post I wrote:
>
>
https://developers.redhat.com/blog/2019/01/02/performance-improvements-in-ovn-past-and-future/
>
> The TL;DR is that with port groups, I saw the number of OpenFlow flows
> generated by ovn-controller drop by 3 orders of magnitude. And that
> meant that flow processing was 99% faster for large networks.
>
> You may not see the same sort of improvement for this deployment, mainly
> because my test case was tailored to illustrate how port groups help.
> There may be other factors in this deployment that complicate flow
> processing.
>
> >>
> >> This particular deployment's configuration may give us a good scenario
> >> for our testing to improve lflow processing time.
> > Absolutely!
> >>
> >>>
> >>> But to fix this issue urgently, we are thinking of the below approach.
> >>>
> >>>- pinctrl_thread will locally cache the mac_binding entries (just
like
> >>> it caches the dns entries). (Please note pinctrl_thread can not access
> >>> the SB DB IDL).
> >>
> >>>
> >>> - Upon receiving any arp packet (via the put_arp action),
pinctrl_thread
> >>> will check the local mac_binding cache and will only wake up the main
> >>> ovn-controller thread only if the mac_binding update is required.
> >>>
> >>> This approach will solve the issue since the MAC sent by the physical
> >>> switches will not change. So there is 

[ovs-dev] [patch v2] conntrack: Fix 'reverse_nat_packet()' variable datatype.

2019-08-30 Thread Darrell Ball
The variable 'pad' in the function 'reverse_nat_packet()' was incorrectly
declared as 'char' instead of 'uint8_t'. This can affect reverse natting
of icmpX packets with padding > 127 bytes.  At the same time, add some
comments regarding 'extract_l3_ipvX' usage in this function.  Found by
inspection.

Fixes: edd1bef468c0 ("dpdk: Add more ICMP Related NAT support.")
Signed-off-by: Darrell Ball 
---

v2: Elaborate added comments.

 lib/conntrack.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/lib/conntrack.c b/lib/conntrack.c
index e5266e5..6452d82 100644
--- a/lib/conntrack.c
+++ b/lib/conntrack.c
@@ -688,7 +688,7 @@ static void
 reverse_nat_packet(struct dp_packet *pkt, const struct conn *conn)
 {
 char *tail = dp_packet_tail(pkt);
-char pad = dp_packet_l2_pad_size(pkt);
+uint8_t pad = dp_packet_l2_pad_size(pkt);
 struct conn_key inner_key;
 const char *inner_l4 = NULL;
 uint16_t orig_l3_ofs = pkt->l3_ofs;
@@ -698,6 +698,8 @@ reverse_nat_packet(struct dp_packet *pkt, const struct conn 
*conn)
 struct ip_header *nh = dp_packet_l3(pkt);
 struct icmp_header *icmp = dp_packet_l4(pkt);
 struct ip_header *inner_l3 = (struct ip_header *) (icmp + 1);
+/* This call is already verified to succeed during the code path from
+ * 'conn_key_extract()' which calls 'extract_l4_icmp()'. */
extract_l3_ipv4(&inner_key, inner_l3, tail - ((char *)inner_l3) - pad,
&inner_l4, false);
 pkt->l3_ofs += (char *) inner_l3 - (char *) nh;
@@ -719,6 +721,8 @@ reverse_nat_packet(struct dp_packet *pkt, const struct conn 
*conn)
 struct icmp6_error_header *icmp6 = dp_packet_l4(pkt);
 struct ovs_16aligned_ip6_hdr *inner_l3_6 =
 (struct ovs_16aligned_ip6_hdr *) (icmp6 + 1);
+/* This call is already verified to succeed during the code path from
+ * 'conn_key_extract()' which calls 'extract_l4_icmp6()'. */
extract_l3_ipv6(&inner_key, inner_l3_6,
tail - ((char *)inner_l3_6) - pad,
&inner_l4);
-- 
1.9.1
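
To see why the 'char' pad mattered, a standalone illustration (not part of the
patch): with a signed 8-bit 'char', an L2 pad size above 127 wraps to a
negative value, so a "tail - (char *) inner_l3 - pad" style length passed to
extract_l3_ipvX() grows instead of shrinking.

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    uint8_t real_pad = 200;   /* pad sizes > 127 are possible */
    char as_char = real_pad;  /* implementation-defined; typically -56 */
    uint8_t as_u8 = real_pad; /* always 200 */
    int inner_len = 1000;     /* pretend distance from inner L3 to tail */

    printf("len with char pad:    %d\n", inner_len - as_char); /* e.g. 1056 */
    printf("len with uint8_t pad: %d\n", inner_len - as_u8);   /* 800 */
    return 0;
}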



Re: [ovs-dev] Regarding TSO using AF_PACKET in OVS

2019-08-30 Thread William Tu
Hi Ramana,

I'm trying to understand your setup.

On Wed, Aug 28, 2019 at 4:11 AM Ramana Reddy  wrote:
>
> Hi Ben, Justin, Jesse and All,
>
> Hope everyone doing great.
>
> During my work, I create a socket using AF_PACKET with virtio_net_hdr and
> filled all the fields in the virtio_net_hdr
> and the flag to VIRTIO_NET_GSO_TCPV4 to enable TSO using af_packet.
>
> vnet->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
>
> The code is not working when I am trying to send large packets over OVS
> VXLAN tunnel. It's dropping the large packets and
> the application is resending the MTU sized packets. The code is working
> fine in non ovs case (directly sending without ovs).

Do you create AF_PACKET and bind its socket to OVS's vxlan device,
ex: vxlan_sys_4789?

In the non-ovs working case, do you bind AF_PACKET to the vxlan device created
by using ip link command?

>
> I understood that UDP tunnel offloading (VXLAN) is not happening because the
> NIC may not support this offloading feature,
> but when I send iperf (AF_INET) traffic, though the offloading is not
> happening, the large packets are fragmented and the
> VXLAN tunnel sends the fragmented packets.  Why are we seeing this
> different behavior?

OVS's vxlan device is a light-weight tunnel device, so it might not
have all the
features.

>
> What is the expected behavior in AF_PACKET in OVS? Does OVS support
> AF_PACKET offloading mechanism?

AF_PACKET (see net/packet/af_packet.c) just sends packets from userspace into
the kernel and to the device you bind to.  It creates an skb and invokes the
device's xmit
function.
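
For context, a minimal sketch of the kind of sender described above: an
AF_PACKET socket bound to a device with PACKET_VNET_HDR enabled, so that each
frame is prefixed by a struct virtio_net_hdr requesting TSO.  Error handling
is omitted, and the interface name, header lengths and MSS are placeholders;
this is not code from the thread.

#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <linux/virtio_net.h>
#include <net/if.h>
#include <arpa/inet.h>
#include <string.h>
#include <sys/socket.h>

int
open_vnet_packet_socket(const char *ifname)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    int on = 1;

    /* Ask the kernel to expect a virtio_net_hdr in front of each frame. */
    setsockopt(fd, SOL_PACKET, PACKET_VNET_HDR, &on, sizeof on);

    struct sockaddr_ll sll;
    memset(&sll, 0, sizeof sll);
    sll.sll_family = AF_PACKET;
    sll.sll_protocol = htons(ETH_P_ALL);
    sll.sll_ifindex = if_nametoindex(ifname);
    bind(fd, (struct sockaddr *) &sll, sizeof sll);
    return fd;
}

/* The sender then prepends something like:
 *
 *     struct virtio_net_hdr vnet = {
 *         .flags = VIRTIO_NET_HDR_F_NEEDS_CSUM,
 *         .gso_type = VIRTIO_NET_HDR_GSO_TCPV4,
 *         .hdr_len = eth_len + ip_len + tcp_len,
 *         .gso_size = mss,
 *         .csum_start = eth_len + ip_len,
 *         .csum_offset = 16,       // offsetof(struct tcphdr, check)
 *     };
 *
 * and writes vnet followed by the oversized frame in a single send(); whether
 * the device it is bound to (here, the OVS vxlan port) can consume such an
 * skb is exactly the question being discussed. */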

William

> If not, how can we avoid the retransmission of the packets? What needs
> to be done so that the kernel fragments
> large packets and sends them out without dropping them (in case the
> offloading feature is not available)?
>
> Looking forward to your reply.
>
> Regards,
> Ramana


Re: [ovs-dev] [PATCH branch-2.8 2/2] datapath: Clear the L4 portion of the key for "later" fragments

2019-08-30 Thread Gregory Rose


On 8/29/2019 10:44 PM, Justin Pettit wrote:

On Aug 29, 2019, at 6:46 PM, Gregory Rose  wrote:


On 8/29/2019 3:25 PM, Justin Pettit wrote:

On Aug 29, 2019, at 10:55 AM, Greg Rose  wrote:

diff --git a/datapath/flow.c b/datapath/flow.c
index 083288f..92fc6ac 100644
--- a/datapath/flow.c
+++ b/datapath/flow.c
@@ -529,6 +529,7 @@ static int key_extract_l3l4(struct sk_buff *skb, struct 
sw_flow_key *key)
offset = nh->frag_off & htons(IP_OFFSET);
if (offset) {
key->ip.frag = OVS_FRAG_TYPE_LATER;
+memset(&key->tp, 0, sizeof(key->tp));
return 0;
}
if (nh->frag_off & htons(IP_MF) ||
@@ -647,8 +648,11 @@ static int key_extract_l3l4(struct sk_buff *skb, struct 
sw_flow_key *key)
return error;
}

-if (key->ip.frag == OVS_FRAG_TYPE_LATER)
+if (key->ip.frag == OVS_FRAG_TYPE_LATER) {
+memset(&key->tp, 0, sizeof(key->tp));
return 0;
+}
+#ifdef HAVE_SKB_GSO_UDP

My system's kernel is too new to be built against an OVS this old, but I 
noticed this patch for OVS versions 2.5 through 2.9 introduces this #ifdef 
without a corresponding #endif.  Was this intentional?  Does it even build?

--Justin



Conflict resolution error - I'll probably have to resend branches 2.8 through 
2.5.

Thanks.  Can you please at least compile-check them before sending out v2?



I'll have to wait for a travis check to finish - I can't seem to get any 
branches older than 2.9 to compile on my
own system anymore due to some Python issues.  So I'll just let travis 
do the work.


Sorry about that, trying to do too many things at once...

- Greg



Re: [ovs-dev] [PATCH v9] ovsdb-tool: Convert clustered db to standalone db.

2019-08-30 Thread aginwala
Thanks Han. Addressed the minor comment in v10.

On Thu, Aug 29, 2019 at 10:31 PM Han Zhou  wrote:

> Thanks for the update.
>
> On Thu, Aug 29, 2019 at 6:20 PM  wrote:
> >
> > From: Aliasgar Ginwala 
> >
> > Add support in ovsdb-tool for migrating clustered dbs to standalone dbs.
> > E.g. usage to migrate nb/sb db to standalone db from raft:
> > ovsdb-tool cluster-to-standalone ovnnb_db.db ovnnb_db_cluster.db
> >
> > Signed-off-by: Aliasgar Ginwala 
> > ---
> >  Documentation/ref/ovsdb.7.rst |   3 +
> >  NEWS  |   3 +
> >  ovsdb/ovsdb-tool.1.in |   8 +++
> >  ovsdb/ovsdb-tool.c| 101 +-
> >  tests/ovsdb-tool.at   |  43 +++
> >  5 files changed, 157 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/ref/ovsdb.7.rst
> b/Documentation/ref/ovsdb.7.rst
> > index cd1c63d64..b12d8066c 100644
> > --- a/Documentation/ref/ovsdb.7.rst
> > +++ b/Documentation/ref/ovsdb.7.rst
> > @@ -514,6 +514,9 @@ standalone database from the contents of a running
> clustered database.
> >  When the cluster is down and cannot be revived, ``ovsdb-client backup``
> will
> >  not work.
> >
> > +Use ``ovsdb-tool cluster-to-standalone`` to convert clustered database
> to
> > +standalone database when the cluster is down and cannot be revived.
> > +
> >  Upgrading or Downgrading a Database
> >  ---
> >
> > diff --git a/NEWS b/NEWS
> > index c5caa13d6..a02f9f1a6 100644
> > --- a/NEWS
> > +++ b/NEWS
> > @@ -49,6 +49,9 @@ v2.12.0 - xx xxx 
> > quickly after a brief disconnection, saving bandwidth and CPU
> time.
> > See section 4.1.15 of ovsdb-server(7) for details of related
> OVSDB
> > protocol extension.
> > + * Support to convert from a clustered database to a standalone database
> > +   is now available when the cluster is down and cannot be revived, using
> > +   ovsdb-tool. Check "Database Migration Commands" in the ovsdb-tool man
> > +   section.
> > - OVN:
> >   * IPAM/MACAM:
> > - select IPAM mac_prefix in a random manner if not provided by
> the user
> > diff --git a/ovsdb/ovsdb-tool.1.in b/ovsdb/ovsdb-tool.1.in
> > index ec85e14c4..31a918d90 100644
> > --- a/ovsdb/ovsdb-tool.1.in
> > +++ b/ovsdb/ovsdb-tool.1.in
> > @@ -147,6 +147,14 @@ avoid this possibility, specify
> \fB\-\-cid=\fIuuid\fR, where
> >  \fIuuid\fR is the cluster ID of the cluster to join, as printed by
> >  \fBovsdb\-tool get\-cid\fR.
> >  .
> > +.SS "Database Migration Commands"
> > +This command converts a clustered database to a standalone database.
> > +.
> > +.IP "\fBcluster\-to\-standalone\fI db clusterdb"
> > +Use this command to convert to standalone database from clustered
> database
> > +when the cluster is down and cannot be revived. It creates new
> standalone
> > +\fIdb\fR file from the given cluster \fIdb\fR file.
> > +.
> >  .SS "Version Management Commands"
> >  .so ovsdb/ovsdb-schemas.man
> >  .PP
> > diff --git a/ovsdb/ovsdb-tool.c b/ovsdb/ovsdb-tool.c
> > index 438f97590..3bbf4c8bc 100644
> > --- a/ovsdb/ovsdb-tool.c
> > +++ b/ovsdb/ovsdb-tool.c
> > @@ -173,6 +173,9 @@ usage(void)
> > "  compare-versions A OP B  compare OVSDB schema version
> numbers\n"
> > "  query [DB] TRNS execute read-only transaction on
> DB\n"
> > "  transact [DB] TRNS  execute read/write transaction on
> DB\n"
> > +   "  cluster-to-standalone DB DBConvert clustered DB to\n"
> > +   "  standalone DB when cluster is down and cannot be\n"
> > +   "revived\n"
> > "  [-m]... show-log [DB]   print DB's log entries\n"
> > "The default DB is %s.\n"
> > "The default SCHEMA is %s.\n",
> > @@ -942,6 +945,55 @@ print_raft_record(const struct raft_record *r,
> >  }
> >  }
> >
> > +static void
> > +raft_header_to_standalone_log(const struct raft_header *h,
> > +  struct ovsdb_log *db_log_data)
> > +{
> > +if (h->snap_index) {
> > +if (!h->snap.data || json_array(h->snap.data)->n != 2) {
> > +ovs_fatal(0, "Incorrect raft header data array length");
> > +}
> > +
> > +struct json *schema_json = json_array(h->snap.data)->elems[0];
> > +if (schema_json->type != JSON_NULL) {
> > +struct ovsdb_schema *schema;
> > +check_ovsdb_error(ovsdb_schema_from_json(schema_json, &schema));
> > +ovsdb_schema_destroy(schema);
> > +check_ovsdb_error(ovsdb_log_write_and_free(db_log_data,
> > +   schema_json));
> > +}
> > +
> > +struct json *data_json = json_array(h->snap.data)->elems[1];
> > +if (!data_json || data_json->type != JSON_OBJECT) {
> > +ovs_fatal(0, "Invalid raft header data");
> > +}
> > +if (data_json->type != JSON_NULL) {
> > +

[ovs-dev] [PATCH v10] ovsdb-tool: Convert clustered db to standalone db.

2019-08-30 Thread amginwal
From: Aliasgar Ginwala 

Add support in ovsdb-tool for migrating clustered dbs to standalone dbs.
E.g. usage to migrate nb/sb db to standalone db from raft:
ovsdb-tool cluster-to-standalone ovnnb_db.db ovnnb_db_cluster.db

Signed-off-by: Aliasgar Ginwala 
Acked-by: Han Zhou 
---
 Documentation/ref/ovsdb.7.rst |   3 +
 NEWS  |   3 +
 ovsdb/ovsdb-tool.1.in |   8 +++
 ovsdb/ovsdb-tool.c| 101 +-
 tests/ovsdb-tool.at   |  38 +
 5 files changed, 152 insertions(+), 1 deletion(-)

diff --git a/Documentation/ref/ovsdb.7.rst b/Documentation/ref/ovsdb.7.rst
index cd1c63d64..b12d8066c 100644
--- a/Documentation/ref/ovsdb.7.rst
+++ b/Documentation/ref/ovsdb.7.rst
@@ -514,6 +514,9 @@ standalone database from the contents of a running 
clustered database.
 When the cluster is down and cannot be revived, ``ovsdb-client backup`` will
 not work.
 
+Use ``ovsdb-tool cluster-to-standalone`` to convert clustered database to
+standalone database when the cluster is down and cannot be revived.
+
 Upgrading or Downgrading a Database
 ---
 
diff --git a/NEWS b/NEWS
index c5caa13d6..a02f9f1a6 100644
--- a/NEWS
+++ b/NEWS
@@ -49,6 +49,9 @@ v2.12.0 - xx xxx 
quickly after a brief disconnection, saving bandwidth and CPU time.
See section 4.1.15 of ovsdb-server(7) for details of related OVSDB
protocol extension.
+ * Support to convert from a clustered database to a standalone database is
+   now available when the cluster is down and cannot be revived, using
+   ovsdb-tool. Check "Database Migration Commands" in the ovsdb-tool man
+   section.
- OVN:
  * IPAM/MACAM:
- select IPAM mac_prefix in a random manner if not provided by the user
diff --git a/ovsdb/ovsdb-tool.1.in b/ovsdb/ovsdb-tool.1.in
index ec85e14c4..31a918d90 100644
--- a/ovsdb/ovsdb-tool.1.in
+++ b/ovsdb/ovsdb-tool.1.in
@@ -147,6 +147,14 @@ avoid this possibility, specify \fB\-\-cid=\fIuuid\fR, 
where
 \fIuuid\fR is the cluster ID of the cluster to join, as printed by
 \fBovsdb\-tool get\-cid\fR.
 .
+.SS "Database Migration Commands"
+This command converts a clustered database to a standalone database.
+.
+.IP "\fBcluster\-to\-standalone\fI db clusterdb"
+Use this command to convert to standalone database from clustered database
+when the cluster is down and cannot be revived. It creates new standalone
+\fIdb\fR file from the given cluster \fIdb\fR file.
+.
 .SS "Version Management Commands"
 .so ovsdb/ovsdb-schemas.man
 .PP
diff --git a/ovsdb/ovsdb-tool.c b/ovsdb/ovsdb-tool.c
index 438f97590..3bbf4c8bc 100644
--- a/ovsdb/ovsdb-tool.c
+++ b/ovsdb/ovsdb-tool.c
@@ -173,6 +173,9 @@ usage(void)
"  compare-versions A OP B  compare OVSDB schema version numbers\n"
"  query [DB] TRNS execute read-only transaction on DB\n"
"  transact [DB] TRNS  execute read/write transaction on DB\n"
+   "  cluster-to-standalone DB DBConvert clustered DB to\n"
+   "  standalone DB when cluster is down and cannot be\n"
+   "revived\n"
"  [-m]... show-log [DB]   print DB's log entries\n"
"The default DB is %s.\n"
"The default SCHEMA is %s.\n",
@@ -942,6 +945,55 @@ print_raft_record(const struct raft_record *r,
 }
 }
 
+static void
+raft_header_to_standalone_log(const struct raft_header *h,
+  struct ovsdb_log *db_log_data)
+{
+if (h->snap_index) {
+if (!h->snap.data || json_array(h->snap.data)->n != 2) {
+ovs_fatal(0, "Incorrect raft header data array length");
+}
+
+struct json *schema_json = json_array(h->snap.data)->elems[0];
+if (schema_json->type != JSON_NULL) {
+struct ovsdb_schema *schema;
+check_ovsdb_error(ovsdb_schema_from_json(schema_json, &schema));
+ovsdb_schema_destroy(schema);
+check_ovsdb_error(ovsdb_log_write_and_free(db_log_data,
+   schema_json));
+}
+
+struct json *data_json = json_array(h->snap.data)->elems[1];
+if (!data_json || data_json->type != JSON_OBJECT) {
+ovs_fatal(0, "Invalid raft header data");
+}
+if (data_json->type != JSON_NULL) {
+check_ovsdb_error(ovsdb_log_write_and_free(db_log_data,
+   data_json));
+}
+}
+}
+
+static void
+raft_record_to_standalone_log(const struct raft_record *r,
+  struct ovsdb_log *db_log_data)
+{
+if (r->type == RAFT_REC_ENTRY) {
+if (!r->entry.data) {
+return;
+}
+if (json_array(r->entry.data)->n != 2) {
+ovs_fatal(0, "Incorrect raft record array length");
+}
+
+struct json *data_json = json_array(r->entry.data)->elems[1];
+if (data_json->type != JSON_NULL) {

Re: [ovs-dev] [ovs-discuss] ovn-controller is taking 100% CPU all the time in one deployment

2019-08-30 Thread Mark Michelson

On 8/30/19 5:39 AM, Daniel Alvarez Sanchez wrote:

On Thu, Aug 29, 2019 at 10:01 PM Mark Michelson  wrote:


On 8/29/19 2:39 PM, Numan Siddique wrote:

Hello Everyone,

In one of the OVN deployments, we are seeing 100% CPU usage by
ovn-controllers all the time.

After investigations we found the below

   - ovn-controller is taking more than 20 seconds to complete full loop
(mainly in lflow_run() function)

   - The physical switch is sending GARPs periodically every 10 seconds.

   - There is ovn-bridge-mappings configured and these GARP packets
reaches br-int via the patch port.

   - We have a flow in router pipeline which applies the action - put_arp
if it is arp packet.

   - ovn-controller pinctrl thread receives these garps, stores the
learnt mac-ips in the 'put_mac_bindings' hmap and notifies the
ovn-controller main thread by incrementing the seq no.

   - In the ovn-controller main thread, after lflow_run() finishes,
pinctrl_wait() is called. This function calls - poll_immediate_wake() as
'put_mac_bindings' hmap is not empty.

- This causes the ovn-controller poll_block() to not sleep at all and
this repeats all the time resulting in 100% cpu usage.
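
In code terms, the interaction described above is roughly the following (a
sketch using the OVS poll-loop API; the names follow the report, and this is
not the actual ovn-controller code):

#include "openvswitch/hmap.h"
#include "poll-loop.h"

static struct hmap put_mac_bindings = HMAP_INITIALIZER(&put_mac_bindings);

static void
pinctrl_wait_sketch(void)
{
    if (!hmap_is_empty(&put_mac_bindings)) {
        /* Asks the next poll_block() to return immediately. */
        poll_immediate_wake();
    }
}

/* Main-loop shape in the affected 2.9 backport:
 *
 *     lflow_run(...);          // ~20 s in this deployment
 *     pinctrl_wait_sketch();   // hmap refilled by GARPs -> immediate wake
 *     poll_block();            // returns at once, so the loop never sleeps
 */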

The deployment has OVS/OVN 2.9.  We have back ported the pinctrl_thread
patch.

Some time back I had reported an issue about lflow_run() taking lot of
time - https://mail.openvswitch.org/pipermail/ovs-dev/2019-July/360414.html

I think we need to improve the logical processing sooner or later.


I agree that this is very important. I know that logical flow processing
is the biggest bottleneck for ovn-controller, but 20 seconds is just
ridiculous. In your scale testing, you found that lflow_run() was taking
10 seconds to complete.

I support this statement 100% (20 seconds is just ridiculous). To be
precise, in this deployment we see over 23 seconds for the main loop
to process and I've seen even 30 seconds some times. I've been talking
to Numan these days about this issue and I support profiling this
actual deployment so that we can figure out how incremental processing
would help.



I'm curious if there are any factors in this particular deployment's
configuration that might contribute to this. For instance, does this
deployment have a glut of ACLs? Are they not using port groups?

They're not using port groups because it's 2.9 and it is not there.
However, I don't think port groups would make a big difference in
terms of ovn-controller computation. I might be wrong but Port Groups
help reduce the number of ACLs in the NB database while the # of
Logical Flows would still remain the same. We'll try to get the
contents of the NB database and figure out what's killing it.



You're right that port groups won't reduce the number of logical flows. 
However, it can reduce the computation in ovn-controller. The reason is 
that the logical flows generated by ACLs that use port groups may result 
in conjunctive matches being used. If you want a bit more information, 
see the "Port groups" section of this blog post I wrote:


https://developers.redhat.com/blog/2019/01/02/performance-improvements-in-ovn-past-and-future/

The TL;DR is that with port groups, I saw the number of OpenFlow flows 
generated by ovn-controller drop by 3 orders of magnitude. And that 
meant that flow processing was 99% faster for large networks.


You may not see the same sort of improvement for this deployment, mainly 
because my test case was tailored to illustrate how port groups help. 
There may be other factors in this deployment that complicate flow 
processing.




This particular deployment's configuration may give us a good scenario
for our testing to improve lflow processing time.

Absolutely!




But to fix this issue urgently, we are thinking of the below approach.

   - pinctrl_thread will locally cache the mac_binding entries (just like
it caches the dns entries). (Please note pinctrl_thread can not access
the SB DB IDL).




- Upon receiving any arp packet (via the put_arp action), pinctrl_thread
will check the local mac_binding cache and will only wake up the main
ovn-controller thread only if the mac_binding update is required.

This approach will solve the issue since the MAC sent by the physical
switches will not change. So there is no need to wake up ovn-controller
main thread.


I think this can work well. We have a lot of what's needed already in
pinctrl at this point. We have the hash table of mac bindings already.
Currently, we flush this table after we write the data to the southbound
database. Instead, we would keep the bindings in memory. We would need
to ensure that the in-memory MAC bindings eventually get deleted if they
become stale.
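
A rough sketch of what such a pinctrl-local cache lookup could look like
(hypothetical structure and function names, not the eventual patch; only the
basic OVS hmap/hash/packets helpers are assumed):

#include <stdbool.h>
#include <string.h>
#include "openvswitch/hmap.h"
#include "hash.h"
#include "packets.h"
#include "util.h"

struct cached_mac_binding {
    struct hmap_node node;      /* Keyed on <logical port, IP>. */
    char *logical_port;
    char *ip;
    struct eth_addr mac;
    long long int last_used;    /* For expiring stale entries. */
};

/* Returns true if the main thread needs to be notified (new binding or
 * changed MAC); false if the GARP only repeats what we already know. */
static bool
mac_binding_needs_update(struct hmap *cache, const char *logical_port,
                         const char *ip, const struct eth_addr mac,
                         long long int now)
{
    uint32_t hash = hash_string(ip, hash_string(logical_port, 0));
    struct cached_mac_binding *e;

    HMAP_FOR_EACH_WITH_HASH (e, node, hash, cache) {
        if (!strcmp(e->ip, ip) && !strcmp(e->logical_port, logical_port)) {
            e->last_used = now;
            if (eth_addr_equals(e->mac, mac)) {
                return false;   /* Known binding: don't wake the main thread. */
            }
            e->mac = mac;       /* MAC moved: update and notify. */
            return true;
        }
    }

    e = xzalloc(sizeof *e);
    e->logical_port = xstrdup(logical_port);
    e->ip = xstrdup(ip);
    e->mac = mac;
    e->last_used = now;
    hmap_insert(cache, &e->node, hash);
    return true;                /* New binding: notify. */
}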



In the present master/2.12 these GARPs will not cause this 100% cpu loop
issue because incremental processing will not recompute flows.


Another mitigating factor for master is something I'm currently working
on. I've got the beginnings of a patch series going where I am
separating pinctrl into a separate process from 

Re: [ovs-dev] [PATCH v7] Detailed packet drop statistics per dpdk and vhostuser ports

2019-08-30 Thread Ilya Maximets
On 28.08.2019 15:58, Sriram Vatala wrote:
> OVS may be unable to transmit packets for multiple reasons and
> today there is a single counter to track packets dropped due to
> any of those reasons. The most common reason is that a VM is
> unable to read packets fast enough causing the vhostuser port
> transmit queue on the OVS side to become full. This manifests
> as a problem with VNFs not receiving all packets. Having a
> separate drop counter to track packets dropped because the
> transmit queue is full will clearly indicate that the problem
> is on the VM side and not in OVS. Similarly maintaining separate
> counters for all possible drops helps in indicating sensible
> cause for packet drops.
> 
> This patch adds custom stats counters to track packets dropped
> at port level and these counters are displayed along with
> other stats in "ovs-vsctl get interface  statistics"
> command. The detailed stats will be available for both dpdk and
> vhostuser ports.
> 
> Signed-off-by: Sriram Vatala 
> ---

Hi. Thanks for a new version.

It would be good if you include the version history right here under
the '---' cut line, so it'll be easier to track changes.
Like this:

Version 7:
  * Implemented something.
  * Fixed something.
  * Patch renamed. Previous name: ...

Version 6:
  * ...


Another general comment is that, IMHO, it's better to rename this
patch to "netdev-dpdk: Detailed packet drop statistics.".
This patch adds statistics for all the dpdk based interfaces, so
there is no need to list them in the commit name. Prefix clearly
describes the patch area. (Please, do not drop the version number
because of patch re-naming. Just add a note about renaming in a
version history.)

More comments inline.

Best regards, Ilya Maximets.

>  lib/netdev-dpdk.c  | 99 
> +++---
>  utilities/bugtool/automake.mk  |  3 +-
>  utilities/bugtool/ovs-bugtool-get-iface-stats  | 25 ++
>  .../bugtool/plugins/network-status/openvswitch.xml |  1 +
>  vswitchd/vswitch.xml   | 24 ++
>  5 files changed, 139 insertions(+), 13 deletions(-)
>  create mode 100755 utilities/bugtool/ovs-bugtool-get-iface-stats
> 
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index bc20d68..a679c5b 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -413,8 +413,17 @@ struct netdev_dpdk {
>  
>  PADDED_MEMBERS(CACHE_LINE_SIZE,
>  struct netdev_stats stats;
> -/* Custom stat for retries when unable to transmit. */
> +/* Custom device stats */
> +/* No. of retries when unable to transmit. */
>  uint64_t tx_retries;
> +/* Pkts dropped when unable to transmit; Probably Tx queue is full */
> +uint64_t tx_failure_drops;
> +/* Pkts len greater than max dev MTU */
> +uint64_t tx_mtu_exceeded_drops;
> +/* Pkt drops in egress policer processing */
> +uint64_t tx_qos_drops;
> +/* Pkt drops in ingress policer processing */
> +uint64_t rx_qos_drops;
>  /* Protects stats */
>  rte_spinlock_t stats_lock;
>  /* 4 pad bytes here. */

Padding still needs re-calculation.
Another option is to move all the counters to separate structure
like 'struct netdev_dpdk_sw_stats' and keep a pointer in 'struct netdev_dpdk'.

Another point is about naming. We, probably, should add some prefix to these
stats to distinguish them from HW/driver stats and avoid possible collisions.
This could be done here in variable names or while reporting them.
I guess 'netdev_dpdk_' prefix might be suitable to clearly highlight the
origin of the counter (It's, probably, better to add prefix like this
while reporting because this is not very useful inside the netdev-dpdk code).
Suggestions are welcome.
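
One possible shape for that suggestion (illustrative only; the actual layout
and naming would be settled in a later revision of the patch):

#include <stdint.h>

struct netdev_dpdk_sw_stats {
    uint64_t tx_retries;            /* Retries when unable to transmit. */
    uint64_t tx_failure_drops;      /* Tx queue full. */
    uint64_t tx_mtu_exceeded_drops; /* Packet longer than device MTU. */
    uint64_t tx_qos_drops;          /* Dropped by egress policer. */
    uint64_t rx_qos_drops;          /* Dropped by ingress policer. */
};

/* 'struct netdev_dpdk' would then keep only a pointer, which leaves the
 * PADDED_MEMBERS() cache-line layout undisturbed:
 *
 *     struct netdev_dpdk_sw_stats *sw_stats;   // protected by stats_lock
 *
 * and each counter could be reported with a prefix such as
 * "netdev_dpdk_tx_failure_drops" to keep it apart from HW/driver stats. */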


> @@ -2171,6 +2180,7 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
>  struct ingress_policer *policer = netdev_dpdk_get_ingress_policer(dev);
>  uint16_t nb_rx = 0;
>  uint16_t dropped = 0;
> +uint16_t qos_drops = 0;
>  int qid = rxq->queue_id * VIRTIO_QNUM + VIRTIO_TXQ;
>  int vid = netdev_dpdk_get_vid(dev);
>  
> @@ -2202,11 +2212,13 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
>  (struct rte_mbuf **) batch->packets,
>  nb_rx, true);
>  dropped -= nb_rx;
> +qos_drops = dropped;
>  }
>  
>  rte_spinlock_lock(&dev->stats_lock);
>  netdev_dpdk_vhost_update_rx_counters(&dev->stats, batch->packets,
>   nb_rx, dropped);
> +dev->rx_qos_drops += qos_drops;
>  rte_spinlock_unlock(&dev->stats_lock);
>  
>  batch->count = nb_rx;
> @@ -2232,6 +2244,7 @@ netdev_dpdk_rxq_recv(struct netdev_rxq *rxq, struct 
> dp_packet_batch *batch,
>  struct ingress_policer *policer = netdev_dpdk_get_ingress_policer(dev);
>  int nb_rx;
>  int dropped = 0;
> +int qos_drops = 0;
>  
>  

[ovs-dev] [PATCH ovn] Fix configure help string for ovs source/build path

2019-08-30 Thread Lorenzo Bianconi
Signed-off-by: Lorenzo Bianconi 
---
 acinclude.m4 | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/acinclude.m4 b/acinclude.m4
index d5552d708..31e4f0b54 100644
--- a/acinclude.m4
+++ b/acinclude.m4
@@ -1212,10 +1212,10 @@ dnl
 dnl Check for OVS sources
 AC_DEFUN([OVN_CHECK_OVS], [
   AC_ARG_WITH([ovs-source],
-  [AC_HELP_STRING([--ovs-source=/path/to/ovs/src/dir],
+  [AC_HELP_STRING([--with-ovs-source=/path/to/ovs/src/dir],
   [Specify the OVS src directory])])
   AC_ARG_WITH([ovs-build],
-  [AC_HELP_STRING([--ovs-build=/path/to/ovs/build/dir],
+  [AC_HELP_STRING([--with-ovs-build=/path/to/ovs/build/dir],
   [Specify the OVS build directory])])
 
   AC_MSG_CHECKING([for OVS source directory])
-- 
2.21.0



Re: [ovs-dev] [PATCH v2] Make pid_exists() more robust against empty pid argument

2019-08-30 Thread Ilya Maximets
On 29.08.2019 17:46, Ben Pfaff wrote:
> On Tue, Aug 27, 2019 at 02:43:21PM +0300, Ilya Maximets wrote:
>> On 14.08.2019 18:47, Michele Baldessari wrote:
>>> In some of our destructive testing of ovn-dbs inside containers managed
>>> by pacemaker we reached a situation where /var/run/openvswitch had
>>> empty .pid files. The current code does not deal well with them
>>> and pidfile_is_running() returns true in such a case and this confuses
>>> the OCF resource agent.
>>>
>>> - Before this change:
>>> Inside a container run:
>>>   killall ovsdb-server;
>>>   echo -n '' > /var/run/openvswitch/ovnnb_db.pid; echo -n '' > 
>>> /var/run/openvswitch/ovnsb_db.pid
>>
>> What about whitespaces?
>> I mean, if you'll  write ' ' instead of '', the check 'test -n "$1"'
>> will succeed and the test will fail.
> 
> I think that is OK, because test -d "/proc/ " will also fail.

Oh, I see. Good point.

> 
>> To handle this case we need to trim off whitespaces by the 'tr' utility
>> or change the proc checker to something like 'test -f /proc/"$1"/status'.
> 
> I guess to be absolutely certain we'd need something like
> 
> case $1 in
>   '')   false ;;# Reject empty string
>   *[!0-9]*) false ;;# Reject anything with non-digits
>   *)test -d /proc/$1 ;;
> esac
> 
> Anyway, I think that this is an improvement, so I applied it to master
> and backported it.

Agree. Thanks!

Best regards, Ilya Maximets.


Re: [ovs-dev] [PATCH v7] Detailed packet drop statistics per dpdk and vhostuser ports

2019-08-30 Thread Sriram Vatala via dev
Hi All,
Please consider this as a gentle reminder.

Thanks & Regards,
Sriram.

-Original Message-
From: Sriram Vatala  
Sent: 28 August 2019 18:29
To: ovs-dev@openvswitch.org; i.maxim...@samsung.com
Cc: b...@ovn.org; ian.sto...@intel.com; Sriram Vatala

Subject: [PATCH v7] Detailed packet drop statistics per dpdk and vhostuser
ports

OVS may be unable to transmit packets for multiple reasons and today there
is a single counter to track packets dropped due to any of those reasons.
The most common reason is that a VM is unable to read packets fast enough
causing the vhostuser port transmit queue on the OVS side to become full.
This manifests as a problem with VNFs not receiving all packets. Having a
separate drop counter to track packets dropped because the transmit queue is
full will clearly indicate that the problem is on the VM side and not in
OVS. Similarly maintaining separate counters for all possible drops helps in
indicating sensible cause for packet drops.

This patch adds custom stats counters to track packets dropped at port level
and these counters are displayed along with other stats in "ovs-vsctl get
interface  statistics"
command. The detailed stats will be available for both dpdk and vhostuser
ports.

Signed-off-by: Sriram Vatala 
---
 lib/netdev-dpdk.c  | 99
+++---
 utilities/bugtool/automake.mk  |  3 +-
 utilities/bugtool/ovs-bugtool-get-iface-stats  | 25 ++
 .../bugtool/plugins/network-status/openvswitch.xml |  1 +
 vswitchd/vswitch.xml   | 24 ++
 5 files changed, 139 insertions(+), 13 deletions(-)  create mode 100755
utilities/bugtool/ovs-bugtool-get-iface-stats

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index bc20d68..a679c5b
100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -413,8 +413,17 @@ struct netdev_dpdk {
 
 PADDED_MEMBERS(CACHE_LINE_SIZE,
 struct netdev_stats stats;
-/* Custom stat for retries when unable to transmit. */
+/* Custom device stats */
+/* No. of retries when unable to transmit. */
 uint64_t tx_retries;
+/* Pkts dropped when unable to transmit; Probably Tx queue is full */
+uint64_t tx_failure_drops;
+/* Pkts len greater than max dev MTU */
+uint64_t tx_mtu_exceeded_drops;
+/* Pkt drops in egress policer processing */
+uint64_t tx_qos_drops;
+/* Pkt drops in ingress policer processing */
+uint64_t rx_qos_drops;
 /* Protects stats */
 rte_spinlock_t stats_lock;
 /* 4 pad bytes here. */
@@ -2171,6 +2180,7 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
 struct ingress_policer *policer = netdev_dpdk_get_ingress_policer(dev);
 uint16_t nb_rx = 0;
 uint16_t dropped = 0;
+uint16_t qos_drops = 0;
 int qid = rxq->queue_id * VIRTIO_QNUM + VIRTIO_TXQ;
 int vid = netdev_dpdk_get_vid(dev);
 
@@ -2202,11 +2212,13 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
 (struct rte_mbuf **) batch->packets,
 nb_rx, true);
 dropped -= nb_rx;
+qos_drops = dropped;
 }
 
 rte_spinlock_lock(&dev->stats_lock);
 netdev_dpdk_vhost_update_rx_counters(&dev->stats, batch->packets,
  nb_rx, dropped);
+dev->rx_qos_drops += qos_drops;
 rte_spinlock_unlock(&dev->stats_lock);
 
 batch->count = nb_rx;
@@ -2232,6 +2244,7 @@ netdev_dpdk_rxq_recv(struct netdev_rxq *rxq, struct
dp_packet_batch *batch,
 struct ingress_policer *policer = netdev_dpdk_get_ingress_policer(dev);
 int nb_rx;
 int dropped = 0;
+int qos_drops = 0;
 
 if (OVS_UNLIKELY(!(dev->flags & NETDEV_UP))) {
 return EAGAIN;
@@ -2250,12 +2263,14 @@ netdev_dpdk_rxq_recv(struct netdev_rxq *rxq, struct
dp_packet_batch *batch,
 (struct rte_mbuf **) batch->packets,
 nb_rx, true);
 dropped -= nb_rx;
+qos_drops = dropped;
 }
 
 /* Update stats to reflect dropped packets */
 if (OVS_UNLIKELY(dropped)) {
 rte_spinlock_lock(&dev->stats_lock);
 dev->stats.rx_dropped += dropped;
+dev->rx_qos_drops += qos_drops;
 rte_spinlock_unlock(&dev->stats_lock);
 }
 
@@ -2339,6 +2354,9 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, int
qid,
 struct rte_mbuf **cur_pkts = (struct rte_mbuf **) pkts;
 unsigned int total_pkts = cnt;
 unsigned int dropped = 0;
+unsigned int tx_failure;
+unsigned int mtu_drops;
+unsigned int qos_drops;
 int i, retries = 0;
 int max_retries = VHOST_ENQ_RETRY_MIN;
int vid = netdev_dpdk_get_vid(dev);
@@ -2356,9 +2374,12 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
 rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
 
 cnt = netdev_dpdk_filter_packet_len(dev, cur_pkts, cnt);
+mtu_drops = total_pkts - cnt;
+   

[ovs-dev] ovs kernel sampling has memleak issue

2019-08-30 Thread Gao Zhenyu
Hi

   Found a potential memleak issue when I enable bridge IPFIX and transmit
traffic through a geneve tunnel.
   The kernel is CentOS 7.5, 3.10.0-862.9.1.el7.x86_64.

memleak shows a lot of malloc call stacks:

unreferenced object 0x93f473e64000 (size 512):
  comm "softirq", pid 0, jiffies 4461721256 (age 5354.645s)
  hex dump (first 32 bytes):
00 3e e6 73 f4 93 ff ff 00 9e 3f 87 ff ff ff ff  .>.s..?.
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
  backtrace:
[] __kmalloc+0xeb/0x260
[] metadata_dst_alloc+0x24/0x90
[] dev_fill_metadata_dst+0x83/0x130
[] do_execute_actions+0x769/0xa40 [openvswitch]
[] clone_execute+0xe1/0x200 [openvswitch]
[] do_execute_actions+0x72c/0xa40 [openvswitch]
[] ovs_execute_actions+0x4c/0x140 [openvswitch]
[] ovs_dp_process_packet+0x84/0x120 [openvswitch]
[] ovs_vport_receive+0x73/0xd0 [openvswitch]
[] netdev_frame_hook+0xde/0x180 [openvswitch]
[] __netif_receive_skb_core+0x1fa/0xa20
[] __netif_receive_skb+0x18/0x60
[] process_backlog+0xae/0x180
[] net_rx_action+0x26f/0x390
[] __do_softirq+0xf5/0x280
[] run_ksoftirqd+0x38/0x50


-- 
Thanks
Zhenyu Gao


Re: [ovs-dev] [PATCH v2] flow: miniflow_extract metadata branchless optimization

2019-08-30 Thread Yanqin Wei (Arm Technology China)
Hi 

I ran a basic nic2nic case for compiler branch stripping. The improvement is
around 0.8%.
It is observed that there is a 0.37% branch miss rate when directly calling
miniflow_extract__ (it is not inlined here) and a 0.27% branch miss rate when
enabling multi-instance for miniflow_extract.

The downside is that the code size gets bigger, which can increase I-cache
misses. In this test that was not observed because only
miniflow_extract_firstpass is invoked. For some cases such as tunnels or
connection tracking, it might have some impact. But in those cases, branch
prediction failures should also increase if branch stripping is not enabled.

So the impact should be platform (cache size, branch prediction, compiler) and
deployment (L3 forwarding/tunnel/...) dependent.
On the whole, the greater the difference between the first pass and
recirculation, the more benefit there is to enabling multi-instance.
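
As a generic illustration of the multi-instance trick under discussion (this
is not the patch itself; the types and names are placeholders): a single
implementation takes a compile-time-constant flag, and two thin wrappers let
the compiler drop the untaken branch from each copy.

#include <stdbool.h>

struct pkt;   /* stand-ins for the real packet/miniflow types */
struct mf;

static inline void
extract__(struct pkt *packet, struct mf *dst, const bool md_is_valid)
{
    if (md_is_valid) {
        /* Metadata handling needed only for recirculated packets; the
         * compiler removes this branch in the first-pass instance. */
    }
    /* Common header parsing shared by both instances would go here. */
    (void) packet;
    (void) dst;
}

/* First-pass packets arriving from an interface. */
static void
extract_firstpass(struct pkt *packet, struct mf *dst)
{
    extract__(packet, dst, false);
}

/* Recirculated packets whose metadata is already populated. */
static void
extract_recirc(struct pkt *packet, struct mf *dst)
{
    extract__(packet, dst, true);
}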

Best Regards,
Wei Yanqin

-Original Message-
From: Ilya Maximets  
Sent: Thursday, August 29, 2019 9:46 PM
To: Yanqin Wei (Arm Technology China) ; Ben Pfaff 

Cc: d...@openvswitch.org; nd ; Gavin Hu (Arm Technology China) 

Subject: Re: [ovs-dev] [PATCH v2] flow: miniflow_extract metadata branchless 
optimization

On 29.08.2019 12:21, Yanqin Wei (Arm Technology China) wrote:
> Hi Ben,
> Thanks for the feedback.  It is indeed related with userspace datapath. 
> 
> Hi Ilya,
> Could you take a look at this patch when you have time?> I knew 
> first-pass and recirculating traffic share the same packet handling. It makes 
> code common and maintainable.
> But if we can introduce some data-oriented and well-defined flag to bypass 
> some time-consuming handling, it can improve performance a lot.

Hi. I had a quick look at the patch.
Few thoughts:
* 'md' is actually always valid inside 'miniflow_extract', so the
   variable 'md_valid' should be renamed to not be misleading.
   Maybe something like 'md_is_full'? I'm not sure about the name.

* How much is the performance benefit of the compiler code stripping?
  I mean, what is the difference between direct call
 miniflow_extract__(packet, dst, md_is_valid);
  where 'md_is_valid == false' and the call to
 miniflow_extract_firstpass(packet, dst);
  ?
  Asking because this complicates dfc_processing() function.
  I'm, actually have a patch locally to combine rss_hash
  calculation functions to reduce code duplication, so I'm trying
  to figure out what are the possibilities here.

Best regards, Ilya Maximets.

> 
> Best Regards,
> Wei Yanqin
> 
> -Original Message-
> From: Ben Pfaff 
> Sent: Thursday, August 29, 2019 6:11 AM
> To: Yanqin Wei (Arm Technology China) ; Ilya 
> Maximets 
> Cc: d...@openvswitch.org; nd ; Gavin Hu (Arm Technology 
> China) 
> Subject: Re: [ovs-dev] [PATCH v2] flow: miniflow_extract metadata 
> branchless optimization
> 
> This fails to apply to current master.
> 
> Whereas most of the time I'd be comfortable with reviewing 'flow'
> patches myself, this one is closed related to the dpif-netdev code and I'd 
> prefer to have someone who understands that code well, as well as the 
> tradeoffs between performance and maintainability, review it.  Ilya (added to 
> the To line) is a good choice.
> 
> On Thu, Aug 22, 2019 at 06:09:16PM +0800, Yanqin Wei wrote:
>> "miniflow_extract" is a branch heavy implementation for packet header 
>> and metadata parsing. There is a lot of meta data handling for all traffic.
>> But this should not be applicable for packets from interface.
>> This patch adds a layer of inline encapsulation to miniflow_extract 
>> and introduces constant "md_valid" input parameter as a branch condition.
>> The new branch will be removed by the compiler at compile time. Two 
>> instances of miniflow_extract with different branches will be generated.
>>
>> This patch is tested on an arm64 platform. It improves more than 3.5% 
>> performance in P2P forwarding cases.
>>
>> Reviewed-by: Gavin Hu 
>> Signed-off-by: Yanqin Wei 
>> ---
>>  lib/dpif-netdev.c |  13 +++---
>>  lib/flow.c| 116 
>> --
>>  lib/flow.h|   2 +
>>  3 files changed, 79 insertions(+), 52 deletions(-)
>>
>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
>> d0a1c58..6686b14 100644
>> --- a/lib/dpif-netdev.c
>> +++ b/lib/dpif-netdev.c
>> @@ -6508,12 +6508,15 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
>>  }
>>  }
>>  
>> -miniflow_extract(packet, &key->mf);
>> +if (!md_is_valid) {
>> +miniflow_extract_firstpass(packet, &key->mf);
>> +key->hash =
>> +dpif_netdev_packet_get_rss_hash_orig_pkt(packet, &key->mf);
>> +} else {
>> +miniflow_extract(packet, &key->mf);
>> +key->hash = dpif_netdev_packet_get_rss_hash(packet, &key->mf);
>> +}
>>  key->len = 0; /* Not computed yet. */
>> -key->hash =
>> -(md_is_valid == false)
>> -? 

Re: [ovs-dev] [ovs-discuss] ovn-controller is taking 100% CPU all the time in one deployment

2019-08-30 Thread Daniel Alvarez Sanchez
On Thu, Aug 29, 2019 at 10:01 PM Mark Michelson  wrote:
>
> On 8/29/19 2:39 PM, Numan Siddique wrote:
> > Hello Everyone,
> >
> > In one of the OVN deployments, we are seeing 100% CPU usage by
> > ovn-controllers all the time.
> >
> > After investigations we found the below
> >
> >   - ovn-controller is taking more than 20 seconds to complete full loop
> > (mainly in lflow_run() function)
> >
> >   - The physical switch is sending GARPs periodically every 10 seconds.
> >
> >   - There is ovn-bridge-mappings configured and these GARP packets
> > reaches br-int via the patch port.
> >
> >   - We have a flow in router pipeline which applies the action - put_arp
> > if it is arp packet.
> >
> >   - ovn-controller pinctrl thread receives these garps, stores the
> > learnt mac-ips in the 'put_mac_bindings' hmap and notifies the
> > ovn-controller main thread by incrementing the seq no.
> >
> >   - In the ovn-controller main thread, after lflow_run() finishes,
> > pinctrl_wait() is called. This function calls - poll_immediate_wake() as
> > 'put_mac_bindings' hmap is not empty.
> >
> > - This causes the ovn-controller poll_block() to not sleep at all and
> > this repeats all the time resulting in 100% cpu usage.
> >
> > The deployment has OVS/OVN 2.9.  We have back ported the pinctrl_thread
> > patch.
> >
> > Some time back I had reported an issue about lflow_run() taking lot of
> > time - https://mail.openvswitch.org/pipermail/ovs-dev/2019-July/360414.html
> >
> > I think we need to improve the logical processing sooner or later.
>
> I agree that this is very important. I know that logical flow processing
> is the biggest bottleneck for ovn-controller, but 20 seconds is just
> ridiculous. In your scale testing, you found that lflow_run() was taking
> 10 seconds to complete.
I support this statement 100% (20 seconds is just ridiculous). To be
precise, in this deployment we see over 23 seconds for the main loop
to process, and I've seen even 30 seconds sometimes. I've been talking
to Numan these days about this issue and I support profiling this
actual deployment so that we can figure out how incremental processing
would help.

>
> I'm curious if there are any factors in this particular deployment's
> configuration that might contribute to this. For instance, does this
> deployment have a glut of ACLs? Are they not using port groups?
They're not using port groups because they are not available in 2.9.
However, I don't think port groups would make a big difference in
terms of ovn-controller computation. I might be wrong, but Port Groups
help reduce the number of ACLs in the NB database while the # of
Logical Flows would still remain the same. We'll try to get the
contents of the NB database and figure out what's killing it.

>
> This particular deployment's configuration may give us a good scenario
> for our testing to improve lflow processing time.
Absolutely!
>
> >
> > But to fix this issue urgently, we are thinking of the below approach.
> >
> >   - pinctrl_thread will locally cache the mac_binding entries (just like
> > it caches the dns entries). (Please note pinctrl_thread can not access
> > the SB DB IDL).
>
> >
> > - Upon receiving any arp packet (via the put_arp action), pinctrl_thread
> > will check the local mac_binding cache and will only wake up the main
> > ovn-controller thread only if the mac_binding update is required.
> >
> > This approach will solve the issue since the MAC sent by the physical
> > switches will not change. So there is no need to wake up ovn-controller
> > main thread.
>
> I think this can work well. We have a lot of what's needed already in
> pinctrl at this point. We have the hash table of mac bindings already.
> Currently, we flush this table after we write the data to the southbound
> database. Instead, we would keep the bindings in memory. We would need
> to ensure that the in-memory MAC bindings eventually get deleted if they
> become stale.
>
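
One way the staleness concern above could be handled (a rough, standalone
sketch with made-up names and sizes, not actual pinctrl code) is to timestamp
each cached binding when it is learnt or refreshed and sweep the cache
periodically:

#include <stdbool.h>
#include <stdio.h>
#include <time.h>

#define CACHE_SIZE 1024
#define BINDING_MAX_AGE_S 300           /* assumed idle timeout, 5 minutes */

struct cached_binding {
    bool in_use;
    time_t last_seen;                   /* stamped on learn/refresh */
    /* ... datapath/ip/mac key fields would go here ... */
};

static struct cached_binding cache[CACHE_SIZE];

/* Drop entries that have not been refreshed recently, so the in-memory
 * cache can neither grow without bound nor keep a moved MAC forever. */
static void
cache_expire(time_t now)
{
    for (int i = 0; i < CACHE_SIZE; i++) {
        if (cache[i].in_use
            && now - cache[i].last_seen > BINDING_MAX_AGE_S) {
            cache[i].in_use = false;
        }
    }
}

int main(void)
{
    cache[0].in_use = true;
    cache[0].last_seen = time(NULL) - 2 * BINDING_MAX_AGE_S;
    cache_expire(time(NULL));
    printf("entry 0 still cached: %d\n", cache[0].in_use);
    return 0;
}

Presumably such a sweep could piggyback on pinctrl's existing periodic work,
so no extra timer would be needed.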
> >
> > In the present master/2.12 these GARPs will not cause this 100% cpu loop
> > issue because incremental processing will not recompute flows.
>
> Another mitigating factor for master is something I'm currently working
> on. I've got the beginnings of a patch series going where I am
> separating pinctrl into a separate process from ovn-controller:
> https://github.com/putnopvut/ovn/tree/pinctrl_process
>
> It's in the early stages right now, so please don't judge :)
>
> Separating pinctrl to its own process means that it cannot directly
> cause ovn-controller to wake up like it currently might.
>
> >
> > Even though the above approach is not really required for master/2.12, I
> > think it is still Ok to have this as there is no harm.
> >
> > I would like to know your comments and any concerns if any.
>
> Hm, I don't really understand why we'd want to put this in master/2.12
> if the problem doesn't exist there. The main concern I have is with
> regards to cache lifetime. I don't want to introduce potential memory
> growth concerns 

Re: [ovs-dev] ovn-controller is taking 100% CPU all the time in one deployment

2019-08-30 Thread Numan Siddique
On Fri, Aug 30, 2019 at 1:04 AM Han Zhou  wrote:

>
>
> On Thu, Aug 29, 2019 at 12:16 PM Numan Siddique 
> wrote:
>
>>
>>
>> On Fri, Aug 30, 2019 at 12:37 AM Han Zhou  wrote:
>>
>>>
>>>
>>> On Thu, Aug 29, 2019 at 11:40 AM Numan Siddique 
>>> wrote:
>>> >
>>> > Hello Everyone,
>>> >
>>> > In one of the OVN deployments, we are seeing 100% CPU usage by
>>> ovn-controllers all the time.
>>> >
>>> > After investigations we found the below
>>> >
>>> >  - ovn-controller is taking more than 20 seconds to complete full loop
>>> (mainly in lflow_run() function)
>>> >
>>> >  - The physical switch is sending GARPs periodically every 10 seconds.
>>> >
>>> >  - There is ovn-bridge-mappings configured and these GARP packets
>>> reaches br-int via the patch port.
>>> >
>>> >  - We have a flow in router pipeline which applies the action -
>>> put_arp
>>> > if it is arp packet.
>>> >
>>> >  - ovn-controller pinctrl thread receives these garps, stores the
>>> learnt mac-ips in the 'put_mac_bindings' hmap and notifies the
>>> ovn-controller main thread by incrementing the seq no.
>>> >
>>> >  - In the ovn-controller main thread, after lflow_run() finishes,
>>> pinctrl_wait() is called. This function calls - poll_immediate_wake() as
>>> 'put_mac_bindings' hmap is not empty.
>>> >
>>> > - This causes the ovn-controller poll_block() to not sleep at all and
>>> this repeats all the time resulting in 100% cpu usage.
>>> >
>>> > The deployment has OVS/OVN 2.9.  We have back ported the
>>> pinctrl_thread patch.
>>> >
>>> > Some time back I had reported an issue about lflow_run() taking lot of
>>> time -
>>> https://mail.openvswitch.org/pipermail/ovs-dev/2019-July/360414.html
>>> >
>>> > I think we need to improve the logical processing sooner or later.
>>> >
>>> > But to fix this issue urgently, we are thinking of the below approach.
>>> >
>>> >  - pinctrl_thread will locally cache the mac_binding entries (just
>>> like it caches the dns entries). (Please note pinctrl_thread can not access
>>> the SB DB IDL).
>>> >
>>> > - Upon receiving any arp packet (via the put_arp action),
>>> pinctrl_thread will check the local mac_binding cache and will only wake up
>>> the main ovn-controller thread only if the mac_binding update is required.
>>> >
>>> > This approach will solve the issue since the MAC sent by the physical
>>> switches will not change. So there is no need to wake up ovn-controller
>>> main thread.
>>> >
>>> > In the present master/2.12 these GARPs will not cause this 100% cpu
>>> loop issue because incremental processing will not recompute flows.
>>> >
>>> > Even though the above approach is not really required for master/2.12,
>>> I think it is still Ok to have this as there is no harm.
>>> >
>>> > I would like to know your comments and any concerns if any.
>>> >
>>> > Thanks
>>> > Numan
>>> >
>>>
>>> Hi Numan,
>>>
>>> I think this approach should work. Just to make sure, to update the
>>> cache efficiently (to avoid another kind of recompute), it should use ovsdb
>>> change-tracking to update it incrementally.
>>>
>>> Regarding master/2.12, it is not harmful except that it will add some
>>> more code and increase memory footprint. For our current use cases, there
>>> can be easily 10,000s mac_bindings, but it may still be ok because each
>>> entry is very small. However, is there any benefit for doing this in
>>> master/2.12?
>>>
>>
>> I don't see much benefit. But I can't submit a patch to branch 2.9
>> without the fix getting merged in master first right ?
>> May be once it is merged in branch 2.9, we can consider to delete it ?
>>
>> I think it is just about how would you maintain a downstream branch.
> Since it is downstream, I don't think you need a change to be in upstream
> before fixing a problem. In this case it may be *no harm*, but what if the
> upstream is completely changed and incompatible for such a fix any more? It
> shouldn't prevent you from fixing your downstream. (Of course it is better
> to not have downstream at all, but sometimes it is useful to have it for a
> temporary period, and since you (and us, too) are already there ... :)
>

The downstream 2.9 we have is OVS 2.9.0 plus a bunch of patches (to fix
issues) which are already merged upstream (preferably in an upstream branch or
at least in upstream master). Any downstream-only patch is frowned upon. When
we upgrade to 2.10 or higher versions there is a risk of functional changes
if the patch is not upstream.

If we apply the approach I described above only to downstream 2.9, then
there is definitely a functional difference: when such GARPs are received,
our downstream 2.9 will not wake up the ovn-controller main thread,
but 2.12/master will wake up the ovn-controller main thread.

I still see no harm in having this in upstream master. Maybe instead of
keeping a complete clone of the mac_bindings table, we can cache only the
subset of mac_bindings that were learnt by this ovn-controller.
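
For illustration, here is a very rough, standalone sketch of that "wake only
when something changed" check, with hypothetical names, an IPv4-only key, and
a fixed-size table instead of an hmap (not the actual pinctrl code):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct mac_binding_entry {
    bool in_use;
    uint32_t ip;                 /* simplified: IPv4 key only */
    uint8_t mac[6];
};

#define CACHE_SIZE 64
static struct mac_binding_entry cache[CACHE_SIZE];

/* Returns true only when the main ovn-controller loop needs to be woken. */
static bool
mac_binding_learn(uint32_t ip, const uint8_t mac[6])
{
    struct mac_binding_entry *free_slot = NULL;

    for (int i = 0; i < CACHE_SIZE; i++) {
        struct mac_binding_entry *e = &cache[i];
        if (!e->in_use) {
            if (!free_slot) {
                free_slot = e;
            }
        } else if (e->ip == ip) {
            if (!memcmp(e->mac, mac, 6)) {
                return false;            /* same MAC re-announced: no wake */
            }
            memcpy(e->mac, mac, 6);      /* MAC changed: update and wake */
            return true;
        }
    }
    if (free_slot) {                     /* new binding: cache it if we can */
        free_slot->in_use = true;
        free_slot->ip = ip;
        memcpy(free_slot->mac, mac, 6);
    }
    return true;                         /* new binding: wake main thread */
}

int main(void)
{
    const uint8_t mac[6] = { 0x52, 0x54, 0x00, 0x12, 0x34, 0x56 };

    printf("%d\n", mac_binding_learn(0x0a000001, mac));   /* 1: new binding */
    printf("%d\n", mac_binding_learn(0x0a000001, mac));   /* 0: GARP repeat */
    return 0;
}

With something like this, the periodic GARPs from the physical switch would
keep hitting the "same MAC re-announced" path and never wake the main thread.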

I will explore more.

Thanks
Numan