Re: [ovs-dev] [PATCH v4] datapath: Add a new action dec_ttl
On Wed, May 26, 2021 at 2:50 PM Eelco Chaudron wrote:
>
> On 26 May 2021, at 14:46, Matteo Croce wrote:
>
> > On Wed, May 26, 2021 at 2:34 PM Eelco Chaudron wrote:
> >> On 18 May 2021, at 20:15, Ilya Maximets wrote:
> >>>> +/* Tests whether 'dpif' datapath supports decrement of the IP TTL via
> >>>> + * OVS_ACTION_DEC_TTL. */
> >>>> +static bool
> >>>> +check_dec_ttl_action(struct dpif *dpif)
> >>>> +{
> >>>> +    struct odputil_keybuf keybuf;
> >>>> +    struct flow flow = { 0 };
> >>>
> >>> It's probably better to just memset it as in other similar functions
> >>> to avoid compiler complaints.
> >>
> >> ACK, will use a memset here.
> >>
> >
> > Which complaint? memset is a bit slower because it also clears the
> > struct padding.
>
> See the robot’s responses to this patchset:
>
>   ofproto/ofproto-dpif.c:1195:12: error: missing braces around initializer
>   [-Werror=missing-braces]
>       struct flow flow = { 0 };
>
> But I guess memset is ok, as other similar functions use it.
>
> > --
> > per aspera ad upstream

With GCC (empty braces are a GNU extension, standardized only in C23) you
can just do:

    struct flow flow = { };

--
per aspera ad upstream

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] [PATCH v4] datapath: Add a new action dec_ttl
On Wed, May 26, 2021 at 2:34 PM Eelco Chaudron wrote:
> On 18 May 2021, at 20:15, Ilya Maximets wrote:
> >> +/* Tests whether 'dpif' datapath supports decrement of the IP TTL via
> >> + * OVS_ACTION_DEC_TTL. */
> >> +static bool
> >> +check_dec_ttl_action(struct dpif *dpif)
> >> +{
> >> +    struct odputil_keybuf keybuf;
> >> +    struct flow flow = { 0 };
> >
> > It's probably better to just memset it as in other similar functions
> > to avoid compiler complaints.
>
> ACK, will use a memset here.

Which complaint? memset is a bit slower because it also clears the
struct padding.

--
per aspera ad upstream
Re: [ovs-dev] [PATCH v3] datapath: Add a new action dec_ttl
On Tue, Nov 24, 2020 at 11:43 AM Eelco Chaudron wrote:
>
> Add support for the dec_ttl action. Instead of programming the datapath with
> a flow that matches the packet TTL and an IP set, use a single dec_ttl action.
>
> The old behavior is kept if the new action is not supported by the datapath.
>
>   # ovs-ofctl dump-flows br0
>    cookie=0x0, duration=12.538s, table=0, n_packets=4, n_bytes=392, ip actions=dec_ttl,NORMAL
>    cookie=0x0, duration=12.536s, table=0, n_packets=4, n_bytes=168, actions=NORMAL
>
>   # ping -c1 -t 20 192.168.0.2
>   PING 192.168.0.2 (192.168.0.2) 56(84) bytes of data.
>   IP (tos 0x0, ttl 19, id 45336, offset 0, flags [DF], proto ICMP (1), length 84)
>       192.168.0.1 > 192.168.0.2: ICMP echo request, id 8865, seq 1, length 64
>
> Linux netlink datapath support depends on upstream Linux commit:
> 744676e77720 ("openvswitch: add TTL decrement action")
>
> Note that in the Linux kernel tree the OVS_ACTION_ATTR_ADD_MPLS has been
> defined, and to make sure the IDs are in sync, it had to be added to the
> OVS source tree. This required some additional case statements, which
> should be revisited once the OVS implementation is added.
>
> Co-developed-by: Matteo Croce
> Co-developed-by: Bindiya Kurle
> Signed-off-by: Eelco Chaudron
> Acked-by: Matteo Croce

--
per aspera ad upstream
Re: [ovs-dev] [PATCH v10] dpif-netlink: distribute polling to discreet handlers
On Thu, Nov 26, 2020 at 9:25 PM Paolo Valerio wrote:
>
> Hi Mark,
>
> Mark Gray writes:
>
> > From: Aaron Conole
> >
> > Currently, the channel handlers are polled globally. On some
> > systems, this causes a thundering herd issue where multiple
> > handler threads become active, only to do no work and immediately
> > sleep.
> >
> > The approach here is to push the netlink socket channels to discreet
> > handler threads to process, rather than polling on every thread.
> > This will eliminate the need to wake multiple threads.
> >
> > To check:
> >
> >   ip netns add left
> >   ip netns add right
> >   ip link add center-left type veth peer name left0 netns left
> >   ip link add center-right type veth peer name right0 netns right
> >   ip link set center-left up
> >   ip link set center-right up
> >   ip -n left ip link set left0 up
> >   ip -n left ip addr add 172.31.110.10/24 dev left0
> >   ip -n right ip link set right0 up
> >   ip -n right ip addr add 172.31.110.11/24 dev right0
>
> I think the "ip -n" sequence should be like:
>
>   ip -n left link set left0 up
>
> Paolo

Right, good catch!

--
Matteo Croce
perl -e 'for($t=0;;$t++){print chr($t*($t>>8|$t>>13)&255)}' |aplay
Re: [ovs-dev] [PATCH net] net: openvswitch: fix TTL decrement action netlink message format
On Fri, Nov 20, 2020 at 10:12 PM Jakub Kicinski wrote:
>
> On Thu, 19 Nov 2020 04:04:04 -0500 Eelco Chaudron wrote:
> > Currently, the openvswitch module is not accepting the correctly formatted
> > netlink message for the TTL decrement action. For both setting and getting
> > the dec_ttl action, the actions should be nested in the
> > OVS_DEC_TTL_ATTR_ACTION attribute as mentioned in the openvswitch.h uapi.
>
> IOW this change will not break any known user space, correct?
>
> But existing OvS user space already expects it to work like you
> make it work now?
>
> What's the harm in leaving it as is?
>
> > Fixes: 744676e77720 ("openvswitch: add TTL decrement action")
> > Signed-off-by: Eelco Chaudron
>
> Can we get a review from OvS folks? Matteo looks good to you (as the
> original author)?

Hi,

I think that the userspace still has to implement the dec_ttl action;
for now dec_ttl is implemented with set_ttl(), so there is no breakage yet.

Eelco, with this fix, will we encode the netlink attribute in the same way
for the kernel and netdev datapaths? If so, go for it.

> > -     err = __ovs_nla_copy_actions(net, attr, key, sfa, eth_type,
> > +     err = __ovs_nla_copy_actions(net, actions, key, sfa, eth_type,
> >                                    vlan_tci, mpls_label_count, log);
> >       if (err)
> >               return err;
>
> You're not canceling any nests on error, I assume this is normal.
>
> > +     add_nested_action_end(*sfa, action_start);
> >       add_nested_action_end(*sfa, start);
> >       return 0;
> >  }

--
per aspera ad upstream
Re: [ovs-dev] [PATCH v2] datapath: Add a new action dec_ttl
On Fri, Nov 13, 2020 at 3:28 PM Eelco Chaudron wrote:
>
> It is already in the mainline kernel, so changing it now would break the
> UAPI. I don't think this is allowed from the kernel side.
>
> > What I'm suggesting is to send a bugfix to the kernel
> > to accept only the format with a nested OVS_DEC_TTL_ATTR_ACTION. Since
> > this feature was never implemented in userspace OVS, this change will
> > not break anything. On the OVS side we should always format netlink
> > messages in a correct way. We have code that checks feature existence
> > in the kernel, and it should fail if the kernel is broken (as it is
> > right now). In this case OVS will continue to use the old implementation
> > that sets the ttl field.
> >
> > Since the patch for the kernel is a bug fix, it should likely be
> > backported to stable versions, and distributions will pick it up too.
>
> If the kernel UAPI breakage is not a problem, I think this would work.
>
> > Thoughts?
> >
> > Best regards, Ilya Maximets.

Who is the only user of the kernel dec_ttl action now? I guess that only
a patched vswitchd is, unless someone tries to program the datapath by hand.

Cheers,

--
per aspera ad upstream
Re: [ovs-dev] [PATCH] datapath: Add a new action dec_ttl
On Wed, Nov 11, 2020 at 4:13 PM Eelco Chaudron wrote: > > Add support for the dec_ttl action. Instead of programming the datapath with > a flow that matches the packet TTL and an IP set, use a single dec_ttl action. > > The old behavior is kept if the new action is not supported by the datapath. > > # ovs-ofctl dump-flows br0 >cookie=0x0, duration=12.538s, table=0, n_packets=4, n_bytes=392, ip > actions=dec_ttl,NORMAL >cookie=0x0, duration=12.536s, table=0, n_packets=4, n_bytes=168, > actions=NORMAL > > # ping -c1 -t 20 192.168.0.2 > PING 192.168.0.2 (192.168.0.2) 56(84) bytes of data. > IP (tos 0x0, ttl 19, id 45336, offset 0, flags [DF], proto ICMP (1), length > 84) > 192.168.0.1 > 192.168.0.2: ICMP echo request, id 8865, seq 1, length 64 > > Linux netlink datapath support depends on upstream Linux commit: > 744676e77720 ("openvswitch: add TTL decrement action") > > > Note that in the Linux kernel tree the OVS_ACTION_ATTR_ADD_MPLS has been > defined, and to make sure the IDs are in sync, it had to be added to the > OVS source tree. This required some additional case statements, which > should be revisited once the OVS implementation is added. 
> > > Co-developed-by: Matteo Croce > Co-developed-by: Bindiya Kurle > Signed-off-by: Eelco Chaudron > --- > datapath/linux/compat/include/linux/openvswitch.h |8 ++ > lib/dpif-netdev.c |4 + > lib/dpif.c|4 + > lib/odp-execute.c | 102 > - > lib/odp-execute.h |2 > lib/odp-util.c| 55 +++ > lib/packets.h | 13 ++- > ofproto/ofproto-dpif-ipfix.c |2 > ofproto/ofproto-dpif-sflow.c |2 > ofproto/ofproto-dpif-xlate.c | 54 +-- > ofproto/ofproto-dpif.c| 37 > ofproto/ofproto-dpif.h|6 + > 12 files changed, 266 insertions(+), 23 deletions(-) > > diff --git a/datapath/linux/compat/include/linux/openvswitch.h > b/datapath/linux/compat/include/linux/openvswitch.h > index 2d884312f..66c386462 100644 > --- a/datapath/linux/compat/include/linux/openvswitch.h > +++ b/datapath/linux/compat/include/linux/openvswitch.h > @@ -1021,6 +1021,8 @@ enum ovs_action_attr { > OVS_ACTION_ATTR_METER,/* u32 meter number. */ > OVS_ACTION_ATTR_CLONE,/* Nested OVS_CLONE_ATTR_*. */ > OVS_ACTION_ATTR_CHECK_PKT_LEN, /* Nested OVS_CHECK_PKT_LEN_ATTR_*. */ > + OVS_ACTION_ATTR_ADD_MPLS, /* struct ovs_action_add_mpls. */ > + OVS_ACTION_ATTR_DEC_TTL, /* Nested OVS_DEC_TTL_ATTR_*. */ > > #ifndef __KERNEL__ > OVS_ACTION_ATTR_TUNNEL_PUSH, /* struct ovs_action_push_tnl*/ > @@ -1040,6 +1042,12 @@ enum ovs_action_attr { > > #define OVS_ACTION_ATTR_MAX (__OVS_ACTION_ATTR_MAX - 1) > > +enum ovs_dec_ttl_attr { > + OVS_DEC_TTL_ATTR_UNSPEC, > + OVS_DEC_TTL_ATTR_ACTION,/* Nested struct nlattr. */ > + __OVS_DEC_TTL_ATTR_MAX > +}; > + > /* Meters. 
*/ > #define OVS_METER_FAMILY "ovs_meter" > #define OVS_METER_MCGROUP "ovs_meter" > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c > index 300861ca5..35c6560c3 100644 > --- a/lib/dpif-netdev.c > +++ b/lib/dpif-netdev.c > @@ -7975,6 +7975,8 @@ dp_execute_cb(void *aux_, struct dp_packet_batch > *packets_, > case OVS_ACTION_ATTR_CT_CLEAR: > case OVS_ACTION_ATTR_CHECK_PKT_LEN: > case OVS_ACTION_ATTR_DROP: > +case OVS_ACTION_ATTR_DEC_TTL: > +case OVS_ACTION_ATTR_ADD_MPLS: > case __OVS_ACTION_ATTR_MAX: > OVS_NOT_REACHED(); > } > @@ -7991,7 +7993,7 @@ dp_netdev_execute_actions(struct dp_netdev_pmd_thread > *pmd, > struct dp_netdev_execute_aux aux = { pmd, flow }; > > odp_execute_actions(, packets, should_steal, actions, > -actions_len, dp_execute_cb); > +actions_len, dp_execute_cb, false); > } > > struct dp_netdev_ct_dump { > diff --git a/lib/dpif.c b/lib/dpif.c > index ac2860764..644d32d47 100644 > --- a/lib/dpif.c > +++ b/lib/dpif.c > @@ -1273,6 +1273,8 @@ dpif_execute_helper_cb(void *aux_, struct > dp_packet_batch *packets_, > case OVS_ACTION_ATTR_UNSPEC: > case OVS_ACTION_ATTR_CHECK_PKT_LEN: > case OVS_ACTION_ATTR_DROP: > +case OVS_ACTION_ATTR_ADD_MPLS: > +case OVS_ACTION_ATTR_DEC_TTL: > case __OVS_ACTION_ATTR_MAX: >
Re: [ovs-dev] [PATCH v3 2/3] dpif-netlink: distribute polling to discreet handlers
On Fri, Aug 28, 2020 at 7:00 PM Mark Gray wrote:
> v3: one port latency results as per Matteo's comments
>
>   Stock:
>     min/avg/max/mdev = 21.5/36.5/96.5/1.0 us
>   With Patch:
>     min/avg/max/mdev = 5.3/9.7/98.4/0.5 us

Awesome, LGTM!

--
Matteo Croce
perl -e 'for($t=0;;$t++){print chr($t*($t>>8|$t>>13)&255)}' |aplay
Re: [ovs-dev] [PATCH v2] dpif-netlink: distribute polling to discreet handlers
On Wed, Jul 22, 2020 at 1:27 AM Aaron Conole wrote:

> To check:
>
>   ip netns add left
>   ip netns add right
>   ip link add center-left type veth peer name left0
>   ip link add center-right type veth peer name right0
>   ip link set left0 netns left
>   ip link set right0 netns right

Nit: ip can set the peer netns upon veth creation:

  ip link add center-left type veth peer name left0 netns left
  ip link add center-right type veth peer name right0 netns right

> +static uint32_t
> +dpif_handler_port_idx_max(const struct dpif_handler *handler,
> +                          struct dpif_channel **out_chn)
> +{
> +    uint32_t max = 0;
> +    size_t i;
> +
> +    for (i = 0; i < handler->n_channels; ++i) {

Any reason for using the prefix increment operator everywhere? The two
forms should generate equal code nowadays.

> --
> 2.25.4

Good job! Did you have the chance to run the same tests I did in
69c51582ff78? I'm curious to see how much it improves.

Bye,

--
Matteo Croce
perl -e 'for($t=0;;$t++){print chr($t*($t>>8|$t>>13)&255)}' |aplay
Re: [ovs-dev] [PATCH RFC] dpif-netlink: distribute polling to discreet handlers
On Wed, Jun 24, 2020 at 3:48 PM Aaron Conole wrote:
>
> Currently, the channel handlers are polled globally. On some
> systems, this causes a thundering herd issue where multiple
> handler threads become active, only to do no work and immediately
> sleep.
>
> The approach here is to push the netlink socket channels to discreet
> handler threads to process, rather than polling on every thread.
> This will eliminate the need to wake multiple threads.
>
> To check:
>
>   ip netns add left
>   ip netns add right
>   ip link add center-left type veth peer name left0
>   ip link add center-right type veth peer name right0
>   ip link set left0 netns left
>   ip link set right0 netns right
>   ip link set center-left up
>   ip link set center-right up
>   ip netns exec left ip link set left0 up
>   ip netns exec left ip addr add 172.31.110.10/24 dev left0
>   ip netns exec right ip link set right0 up
>   ip netns exec right ip addr add 172.31.110.11/24 dev right0
>
>   ovs-vsctl add-br br0
>   ovs-vsctl add-port br0 center-right
>   ovs-vsctl add-port br0 center-left
>
>   # in one terminal
>   perf record -e sched:sched_wakeup,irq:softirq_entry -ag
>
>   # in a separate terminal
>   ip netns exec left arping -I left0 -c 1 172.31.110.11
>
>   # in the perf terminal after exiting
>   perf script
>
> Look for the number of 'handler' threads which were made active.
>
> Reported-by: David Ahern
> Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2019-December/365857.html
> Cc: Matteo Croce
> Cc: Flavio Leitner
> Signed-off-by: Aaron Conole

Great work!

Just to understand: does this revert some logic of my patch, and transpose
the threads and sockets? Maybe it's worth adding a Fixes tag pointing to
69c51582f.

Nitpick: use `ip -n` instead of `ip netns exec` where possible, e.g.

  ip netns exec left ip link set left0 up

becomes

  ip -n left link set left0 up

Cheers,

--
Matteo Croce
perl -e 'for($t=0;;$t++){print chr($t*($t>>8|$t>>13)&255)}' |aplay
[ovs-dev] [PATCH] tests: add unit test for lb-output-action
Extend the balance-tcp one so it tests lb-output-action too. The test checks that that the option is shown in bond/show, and that the lb_output action is programmed in the datapath. Signed-off-by: Matteo Croce --- tests/ofproto-dpif.at | 31 --- 1 file changed, 28 insertions(+), 3 deletions(-) diff --git a/tests/ofproto-dpif.at b/tests/ofproto-dpif.at index 41164d735..dfcfc7161 100644 --- a/tests/ofproto-dpif.at +++ b/tests/ofproto-dpif.at @@ -152,6 +152,8 @@ ovs-appctl time/stop ovs-appctl time/warp 100 ovs-appctl lacp/show > lacp.txt ovs-appctl bond/show > bond.txt +# Check that lb_output is not enabled by default +AT_CHECK([grep -q '^lb-output-action: disabled' bond.txt]) ( for i in `seq 0 255` ; do @@ -164,9 +166,32 @@ AT_CHECK([ovs-appctl dpif/dump-flows br0 |grep tcp > br0_flows.txt]) AT_CHECK([ovs-appctl dpif/dump-flows br1 |grep tcp > br1_flows.txt]) # Make sure there is resonable distribution to all three ports. # We don't want to make this check precise, in case hash function changes. 
-AT_CHECK([test `grep in_port.4 br1_flows.txt |wc -l` -gt 24]) -AT_CHECK([test `grep in_port.5 br1_flows.txt |wc -l` -gt 24]) -AT_CHECK([test `grep in_port.6 br1_flows.txt |wc -l` -gt 24]) +AT_CHECK([test $(grep -c in_port.4 br1_flows.txt) -gt 24]) +AT_CHECK([test $(grep -c in_port.5 br1_flows.txt) -gt 24]) +AT_CHECK([test $(grep -c in_port.6 br1_flows.txt) -gt 24]) +# Check that bonding is doing dp_hash +AT_CHECK([grep -q dp_hash br0_flows.txt]) +# Enabling lb_output +AT_CHECK([ovs-vsctl set Port bond0 other_config:lb-output-action=true]) +OVS_WAIT_UNTIL([ovs-appctl bond/show | grep -q '^lb-output-action: enabled']) +ovs-appctl time/warp 1 500 +ovs-appctl revalidator/wait +OVS_WAIT_WHILE([ovs-appctl dpif/dump-flows br1 | grep -q tcp]) +( +for i in $(seq 256) ; +do + pkt="in_port(7),eth(src=50:54:00:00:00:05,dst=50:54:00:00:01:00),eth_type(0x0800),ipv4(src=10.0.0.2,dst=10.0.0.1,proto=6,tos=0,ttl=64,frag=no),tcp(src=8,dst=$i),tcp_flags(ack)" +AT_CHECK([ovs-appctl netdev-dummy/receive p7 $pkt]) +done +) +ovs-appctl time/warp 300 100 +AT_CHECK([ovs-appctl dpif/dump-flows br0 | grep tcp > br0_flows.txt]) +AT_CHECK([ovs-appctl dpif/dump-flows br1 | grep tcp > br1_flows.txt]) +# Make sure there is resonable distribution to all three ports, again. +AT_CHECK([test $(grep -c in_port.4 br1_flows.txt) -gt 24]) +AT_CHECK([test $(grep -c in_port.5 br1_flows.txt) -gt 24]) +AT_CHECK([test $(grep -c in_port.6 br1_flows.txt) -gt 24]) +AT_CHECK([grep -q lb_output br0_flows.txt]) OVS_VSWITCHD_STOP() AT_CLEANUP -- 2.26.2 ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] [PATCH v12 0/1] Balance-tcp bond mode optimization
On Wed, Mar 11, 2020 at 3:26 PM Vishal Deep Ajmera wrote: > > v11->v12: > Addressed most of comments from Ilya and Eelco. > https://mail.openvswitch.org/pipermail/ovs-dev/2020-February/367832.html > https://mail.openvswitch.org/pipermail/ovs-dev/2020-February/367842.html > Rebased to OVS master. > Hi Vishal, I tested your patch, but it seems there is a memory leak which depletes all the buffers in a very short time. I've found that reverting a chunk fixed it, this is the change I made: --- a/lib/dpif-netdev.c +++ b/lib/dpif-netdev.c @@ -7354,15 +7354,17 @@ dp_execute_lb_output_action(struct dp_netdev_pmd_thread *pmd, struct dp_packet_batch *packets_, bool should_steal, uint32_t bond) { -struct dp_packet *packet; -uint32_t i; -const uint32_t cnt = dp_packet_batch_size(packets_); struct tx_bond *p_bond = tx_bond_lookup(>tx_bonds, bond); if (!p_bond) { return false; } -DP_PACKET_BATCH_REFILL_FOR_EACH (i, cnt, packet, packets_) { + +struct dp_packet_batch del_pkts; +dp_packet_batch_init(_pkts); + +struct dp_packet *packet; +DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) { /* * Lookup the bond-hash table using hash to get the slave. */ @@ -7379,11 +7381,16 @@ dp_execute_lb_output_action(struct dp_netdev_pmd_thread *pmd, non_atomic_ullong_add(_entry->n_packets, 1); non_atomic_ullong_add(_entry->n_bytes, size); } else { -dp_packet_batch_refill(packets_, packet, i); +dp_packet_batch_add(_pkts, packet); } } -return dp_packet_batch_is_empty(packets_); +/* Delete packets that failed OUTPUT action */ +COVERAGE_ADD(datapath_drop_invalid_port, + dp_packet_batch_size(_pkts)); +dp_packet_delete_batch(_pkts, should_steal); + +return true; } static void Regards, -- Matteo Croce per aspera ad upstream ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
[ovs-dev] [PATCH net-next v5] openvswitch: add TTL decrement action
New action to decrement TTL instead of setting it to a fixed value. This action will decrement the TTL and, in case of expired TTL, drop it or execute an action passed via a nested attribute. The default TTL expired action is to drop the packet. Supports both IPv4 and IPv6 via the ttl and hop_limit fields, respectively. Tested with a corresponding change in the userspace: # ovs-dpctl dump-flows in_port(2),eth(),eth_type(0x0800), packets:0, bytes:0, used:never, actions:dec_ttl{ttl<=1 action:(drop)},1 in_port(1),eth(),eth_type(0x0800), packets:0, bytes:0, used:never, actions:dec_ttl{ttl<=1 action:(drop)},2 in_port(1),eth(),eth_type(0x0806), packets:0, bytes:0, used:never, actions:2 in_port(2),eth(),eth_type(0x0806), packets:0, bytes:0, used:never, actions:1 # ping -c1 192.168.0.2 -t 42 IP (tos 0x0, ttl 41, id 61647, offset 0, flags [DF], proto ICMP (1), length 84) 192.168.0.1 > 192.168.0.2: ICMP echo request, id 386, seq 1, length 64 # ping -c1 192.168.0.2 -t 120 IP (tos 0x0, ttl 119, id 62070, offset 0, flags [DF], proto ICMP (1), length 84) 192.168.0.1 > 192.168.0.2: ICMP echo request, id 388, seq 1, length 64 # ping -c1 192.168.0.2 -t 1 # Co-developed-by: Bindiya Kurle Signed-off-by: Bindiya Kurle Signed-off-by: Matteo Croce --- include/uapi/linux/openvswitch.h | 7 net/openvswitch/actions.c| 67 ++ net/openvswitch/flow_netlink.c | 70 3 files changed, 144 insertions(+) diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h index ae2bff14e7e1..9b14519e74d9 100644 --- a/include/uapi/linux/openvswitch.h +++ b/include/uapi/linux/openvswitch.h @@ -958,6 +958,7 @@ enum ovs_action_attr { OVS_ACTION_ATTR_CLONE,/* Nested OVS_CLONE_ATTR_*. */ OVS_ACTION_ATTR_CHECK_PKT_LEN, /* Nested OVS_CHECK_PKT_LEN_ATTR_*. */ OVS_ACTION_ATTR_ADD_MPLS, /* struct ovs_action_add_mpls. */ + OVS_ACTION_ATTR_DEC_TTL, /* Nested OVS_DEC_TTL_ATTR_*. */ __OVS_ACTION_ATTR_MAX,/* Nothing past this will be accepted * from userspace. 
*/ @@ -1050,4 +1051,10 @@ struct ovs_zone_limit { __u32 count; }; +enum ovs_dec_ttl_attr { + OVS_DEC_TTL_ATTR_UNSPEC, + OVS_DEC_TTL_ATTR_ACTION,/* Nested struct nlattr */ + __OVS_DEC_TTL_ATTR_MAX +}; + #endif /* _LINUX_OPENVSWITCH_H */ diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c index 7fbfe2adfffa..fc0efd8833c8 100644 --- a/net/openvswitch/actions.c +++ b/net/openvswitch/actions.c @@ -964,6 +964,25 @@ static int output_userspace(struct datapath *dp, struct sk_buff *skb, return ovs_dp_upcall(dp, skb, key, , cutlen); } +static int dec_ttl_exception_handler(struct datapath *dp, struct sk_buff *skb, +struct sw_flow_key *key, +const struct nlattr *attr, bool last) +{ + /* The first action is always 'OVS_DEC_TTL_ATTR_ARG'. */ + struct nlattr *dec_ttl_arg = nla_data(attr); + int rem = nla_len(attr); + + if (nla_len(dec_ttl_arg)) { + struct nlattr *actions = nla_next(dec_ttl_arg, ); + + if (actions) + return clone_execute(dp, skb, key, 0, actions, rem, +last, false); + } + consume_skb(skb); + return 0; +} + /* When 'last' is true, sample() should always consume the 'skb'. * Otherwise, sample() should keep 'skb' intact regardless what * actions are executed within sample(). 
@@ -1180,6 +1199,45 @@ static int execute_check_pkt_len(struct datapath *dp, struct sk_buff *skb, nla_len(actions), last, clone_flow_key); } +static int execute_dec_ttl(struct sk_buff *skb, struct sw_flow_key *key) +{ + int err; + + if (skb->protocol == htons(ETH_P_IPV6)) { + struct ipv6hdr *nh; + + err = skb_ensure_writable(skb, skb_network_offset(skb) + + sizeof(*nh)); + if (unlikely(err)) + return err; + + nh = ipv6_hdr(skb); + + if (nh->hop_limit <= 1) + return -EHOSTUNREACH; + + key->ip.ttl = --nh->hop_limit; + } else { + struct iphdr *nh; + u8 old_ttl; + + err = skb_ensure_writable(skb, skb_network_offset(skb) + + sizeof(*nh)); + if (unlikely(err)) + return err; + + nh = ip_hdr(skb); + if (nh->ttl <= 1) + return -EHOSTUNREACH; + + old_ttl = nh->ttl--; + csum_replace2(>check, htons(old_ttl << 8), + htons(nh-
[ovs-dev] [PATCH net-next v4] openvswitch: add TTL decrement action
New action to decrement TTL instead of setting it to a fixed value. This action will decrement the TTL and, in case of expired TTL, drop it or execute an action passed via a nested attribute. The default TTL expired action is to drop the packet. Supports both IPv4 and IPv6 via the ttl and hop_limit fields, respectively. Tested with a corresponding change in the userspace: # ovs-dpctl dump-flows in_port(2),eth(),eth_type(0x0800), packets:0, bytes:0, used:never, actions:dec_ttl{ttl<=1 action:(drop)},1 in_port(1),eth(),eth_type(0x0800), packets:0, bytes:0, used:never, actions:dec_ttl{ttl<=1 action:(drop)},2 in_port(1),eth(),eth_type(0x0806), packets:0, bytes:0, used:never, actions:2 in_port(2),eth(),eth_type(0x0806), packets:0, bytes:0, used:never, actions:1 # ping -c1 192.168.0.2 -t 42 IP (tos 0x0, ttl 41, id 61647, offset 0, flags [DF], proto ICMP (1), length 84) 192.168.0.1 > 192.168.0.2: ICMP echo request, id 386, seq 1, length 64 # ping -c1 192.168.0.2 -t 120 IP (tos 0x0, ttl 119, id 62070, offset 0, flags [DF], proto ICMP (1), length 84) 192.168.0.1 > 192.168.0.2: ICMP echo request, id 388, seq 1, length 64 # ping -c1 192.168.0.2 -t 1 # Co-developed-by: Bindiya Kurle Signed-off-by: Bindiya Kurle Signed-off-by: Matteo Croce --- include/uapi/linux/openvswitch.h | 7 net/openvswitch/actions.c| 67 ++ net/openvswitch/flow_netlink.c | 70 3 files changed, 144 insertions(+) diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h index ae2bff14e7e1..9b14519e74d9 100644 --- a/include/uapi/linux/openvswitch.h +++ b/include/uapi/linux/openvswitch.h @@ -958,6 +958,7 @@ enum ovs_action_attr { OVS_ACTION_ATTR_CLONE,/* Nested OVS_CLONE_ATTR_*. */ OVS_ACTION_ATTR_CHECK_PKT_LEN, /* Nested OVS_CHECK_PKT_LEN_ATTR_*. */ OVS_ACTION_ATTR_ADD_MPLS, /* struct ovs_action_add_mpls. */ + OVS_ACTION_ATTR_DEC_TTL, /* Nested OVS_DEC_TTL_ATTR_*. */ __OVS_ACTION_ATTR_MAX,/* Nothing past this will be accepted * from userspace. 
*/ @@ -1050,4 +1051,10 @@ struct ovs_zone_limit { __u32 count; }; +enum ovs_dec_ttl_attr { + OVS_DEC_TTL_ATTR_UNSPEC, + OVS_DEC_TTL_ATTR_ACTION,/* Nested struct nlattr */ + __OVS_DEC_TTL_ATTR_MAX +}; + #endif /* _LINUX_OPENVSWITCH_H */ diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c index 7fbfe2adfffa..fc0efd8833c8 100644 --- a/net/openvswitch/actions.c +++ b/net/openvswitch/actions.c @@ -964,6 +964,25 @@ static int output_userspace(struct datapath *dp, struct sk_buff *skb, return ovs_dp_upcall(dp, skb, key, , cutlen); } +static int dec_ttl_exception_handler(struct datapath *dp, struct sk_buff *skb, +struct sw_flow_key *key, +const struct nlattr *attr, bool last) +{ + /* The first action is always 'OVS_DEC_TTL_ATTR_ARG'. */ + struct nlattr *dec_ttl_arg = nla_data(attr); + int rem = nla_len(attr); + + if (nla_len(dec_ttl_arg)) { + struct nlattr *actions = nla_next(dec_ttl_arg, ); + + if (actions) + return clone_execute(dp, skb, key, 0, actions, rem, +last, false); + } + consume_skb(skb); + return 0; +} + /* When 'last' is true, sample() should always consume the 'skb'. * Otherwise, sample() should keep 'skb' intact regardless what * actions are executed within sample(). 
@@ -1180,6 +1199,45 @@ static int execute_check_pkt_len(struct datapath *dp, struct sk_buff *skb, nla_len(actions), last, clone_flow_key); } +static int execute_dec_ttl(struct sk_buff *skb, struct sw_flow_key *key) +{ + int err; + + if (skb->protocol == htons(ETH_P_IPV6)) { + struct ipv6hdr *nh; + + err = skb_ensure_writable(skb, skb_network_offset(skb) + + sizeof(*nh)); + if (unlikely(err)) + return err; + + nh = ipv6_hdr(skb); + + if (nh->hop_limit <= 1) + return -EHOSTUNREACH; + + key->ip.ttl = --nh->hop_limit; + } else { + struct iphdr *nh; + u8 old_ttl; + + err = skb_ensure_writable(skb, skb_network_offset(skb) + + sizeof(*nh)); + if (unlikely(err)) + return err; + + nh = ip_hdr(skb); + if (nh->ttl <= 1) + return -EHOSTUNREACH; + + old_ttl = nh->ttl--; + csum_replace2(>check, htons(old_ttl << 8), + htons(nh-
[ovs-dev] [PATCH net-next v3] openvswitch: add TTL decrement action
New action to decrement TTL instead of setting it to a fixed value. This action will decrement the TTL and, in case of expired TTL, drop it or execute an action passed via a nested attribute. The default TTL expired action is to drop the packet. Supports both IPv4 and IPv6 via the ttl and hop_limit fields, respectively. Tested with a corresponding change in the userspace: # ovs-dpctl dump-flows in_port(2),eth(),eth_type(0x0800), packets:0, bytes:0, used:never, actions:dec_ttl{ttl<=1 action:(drop)},1 in_port(1),eth(),eth_type(0x0800), packets:0, bytes:0, used:never, actions:dec_ttl{ttl<=1 action:(drop)},2 in_port(1),eth(),eth_type(0x0806), packets:0, bytes:0, used:never, actions:2 in_port(2),eth(),eth_type(0x0806), packets:0, bytes:0, used:never, actions:1 # ping -c1 192.168.0.2 -t 42 IP (tos 0x0, ttl 41, id 61647, offset 0, flags [DF], proto ICMP (1), length 84) 192.168.0.1 > 192.168.0.2: ICMP echo request, id 386, seq 1, length 64 # ping -c1 192.168.0.2 -t 120 IP (tos 0x0, ttl 119, id 62070, offset 0, flags [DF], proto ICMP (1), length 84) 192.168.0.1 > 192.168.0.2: ICMP echo request, id 388, seq 1, length 64 # ping -c1 192.168.0.2 -t 1 # Co-developed-by: Bindiya Kurle Signed-off-by: Bindiya Kurle Signed-off-by: Matteo Croce --- include/uapi/linux/openvswitch.h | 2 + net/openvswitch/actions.c| 67 ++ net/openvswitch/flow_netlink.c | 71 3 files changed, 140 insertions(+) diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h index ae2bff14e7e1..9d3f040847af 100644 --- a/include/uapi/linux/openvswitch.h +++ b/include/uapi/linux/openvswitch.h @@ -958,6 +958,7 @@ enum ovs_action_attr { OVS_ACTION_ATTR_CLONE,/* Nested OVS_CLONE_ATTR_*. */ OVS_ACTION_ATTR_CHECK_PKT_LEN, /* Nested OVS_CHECK_PKT_LEN_ATTR_*. */ OVS_ACTION_ATTR_ADD_MPLS, /* struct ovs_action_add_mpls. */ + OVS_ACTION_ATTR_DEC_TTL, /* Nested OVS_DEC_TTL_ATTR_*. */ __OVS_ACTION_ATTR_MAX,/* Nothing past this will be accepted * from userspace. 
*/ @@ -1050,4 +1051,5 @@ struct ovs_zone_limit { __u32 count; }; +#define OVS_DEC_TTL_ATTR_EXEC 0 #endif /* _LINUX_OPENVSWITCH_H */ diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c index 7fbfe2adfffa..1b0afc9bf1ad 100644 --- a/net/openvswitch/actions.c +++ b/net/openvswitch/actions.c @@ -964,6 +964,26 @@ static int output_userspace(struct datapath *dp, struct sk_buff *skb, return ovs_dp_upcall(dp, skb, key, , cutlen); } +static int dec_ttl(struct datapath *dp, struct sk_buff *skb, + struct sw_flow_key *key, + const struct nlattr *attr, bool last) +{ + /* The first action is always 'OVS_DEC_TTL_ATTR_ARG'. */ + struct nlattr *dec_ttl_arg = nla_data(attr); + u32 nested = nla_get_u32(dec_ttl_arg); + int rem = nla_len(attr); + + if (nested) { + struct nlattr *actions = nla_next(dec_ttl_arg, ); + + if (actions) + return clone_execute(dp, skb, key, 0, actions, rem, +last, false); + } + consume_skb(skb); + return 0; +} + /* When 'last' is true, sample() should always consume the 'skb'. * Otherwise, sample() should keep 'skb' intact regardless what * actions are executed within sample(). 
@@ -1180,6 +1200,45 @@ static int execute_check_pkt_len(struct datapath *dp, struct sk_buff *skb, nla_len(actions), last, clone_flow_key); } +static int execute_dec_ttl(struct sk_buff *skb, struct sw_flow_key *key) +{ + int err; + + if (skb->protocol == htons(ETH_P_IPV6)) { + struct ipv6hdr *nh; + + err = skb_ensure_writable(skb, skb_network_offset(skb) + + sizeof(*nh)); + if (unlikely(err)) + return err; + + nh = ipv6_hdr(skb); + + if (nh->hop_limit <= 1) + return -EHOSTUNREACH; + + key->ip.ttl = --nh->hop_limit; + } else { + struct iphdr *nh; + u8 old_ttl; + + err = skb_ensure_writable(skb, skb_network_offset(skb) + + sizeof(*nh)); + if (unlikely(err)) + return err; + + nh = ip_hdr(skb); + if (nh->ttl <= 1) + return -EHOSTUNREACH; + + old_ttl = nh->ttl--; + csum_replace2(>check, htons(old_ttl << 8), + htons(nh->ttl << 8)); + key->ip.ttl = nh->ttl; + } + return 0; +} + /* Execute a list of actions against 'skb'. *
Re: [ovs-dev] [PATCH net-next v2] openvswitch: add TTL decrement action
On Tue, Dec 17, 2019 at 5:30 PM Nikolay Aleksandrov wrote: > > On 17/12/2019 17:51, Matteo Croce wrote: > > New action to decrement TTL instead of setting it to a fixed value. > > This action will decrement the TTL and, in case of expired TTL, drop it > > or execute an action passed via a nested attribute. > > The default TTL expired action is to drop the packet. > > > > Supports both IPv4 and IPv6 via the ttl and hop_limit fields, respectively. > > > > Tested with a corresponding change in the userspace: > > > > # ovs-dpctl dump-flows > > in_port(2),eth(),eth_type(0x0800), packets:0, bytes:0, used:never, > > actions:dec_ttl{ttl<=1 action:(drop)},1,1 > > in_port(1),eth(),eth_type(0x0800), packets:0, bytes:0, used:never, > > actions:dec_ttl{ttl<=1 action:(drop)},1,2 > > in_port(1),eth(),eth_type(0x0806), packets:0, bytes:0, used:never, > > actions:2 > > in_port(2),eth(),eth_type(0x0806), packets:0, bytes:0, used:never, > > actions:1 > > > > # ping -c1 192.168.0.2 -t 42 > > IP (tos 0x0, ttl 41, id 61647, offset 0, flags [DF], proto ICMP (1), > > length 84) > > 192.168.0.1 > 192.168.0.2: ICMP echo request, id 386, seq 1, length > > 64 > > # ping -c1 192.168.0.2 -t 120 > > IP (tos 0x0, ttl 119, id 62070, offset 0, flags [DF], proto ICMP (1), > > length 84) > > 192.168.0.1 > 192.168.0.2: ICMP echo request, id 388, seq 1, length > > 64 > > # ping -c1 192.168.0.2 -t 1 > > # > > > > Co-authored-by: Bindiya Kurle > > Signed-off-by: Bindiya Kurle > > Signed-off-by: Matteo Croce > > --- > > include/uapi/linux/openvswitch.h | 22 +++ > > net/openvswitch/actions.c| 71 + > > net/openvswitch/flow_netlink.c | 105 +++ > > 3 files changed, 198 insertions(+) > > > > Hi Matteo, > > [snip] > > +} > > + > > /* When 'last' is true, sample() should always consume the 'skb'. > > * Otherwise, sample() should keep 'skb' intact regardless what > > * actions are executed within sample(). 
> > @@ -1176,6 +1201,44 @@ static int execute_check_pkt_len(struct datapath > > *dp, struct sk_buff *skb, > >nla_len(actions), last, clone_flow_key); > > } > > > > +static int execute_dec_ttl(struct sk_buff *skb, struct sw_flow_key *key) > > +{ > > + int err; > > + > > + if (skb->protocol == htons(ETH_P_IPV6)) { > > + struct ipv6hdr *nh = ipv6_hdr(skb); > > + > > + err = skb_ensure_writable(skb, skb_network_offset(skb) + > > + sizeof(*nh)); > > skb_ensure_writable() calls pskb_may_pull() which may reallocate so nh might > become invalid. > It seems the IPv4 version below is ok as the ptr is reloaded. > Right > One q as I don't know ovs that much - can this action be called only with > skb->protocol == ETH_P_IP/IPV6 ? I.e. Are we sure that if it's not v6, then > it must be v4 ? > I'm adding a check in validate_and_copy_dec_ttl() so only ipv4/ipv6 packet will pass. Thanks, -- Matteo Croce per aspera ad upstream ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
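The reallocation hazard Nikolay points out is easy to reproduce in userspace. Below is a hedged sketch of the bug class: struct buf, buf_ensure_writable() and writable_header() are hypothetical stand-ins for sk_buff, skb_ensure_writable() and the header accessors, not kernel APIs. The safe pattern is the one the fixed patch uses: call the helper first, then load the pointer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for an sk_buff: a heap buffer that a helper may move. */
struct buf {
    unsigned char *data;
    size_t len;
};

/* Like skb_ensure_writable() calling pskb_may_pull(): may reallocate
 * (and therefore move) b->data, invalidating earlier pointers into it. */
static int buf_ensure_writable(struct buf *b, size_t need)
{
    if (b->len < need) {
        unsigned char *p = realloc(b->data, need);
        if (!p)
            return -1;
        memset(p + b->len, 0, need - b->len);
        b->data = p;
        b->len = need;
    }
    return 0;
}

/* CORRECT pattern: ensure writability first, then derive the header
 * pointer from the (possibly updated) buffer.  A pointer saved before
 * the call, as in the buggy IPv6 branch, may dangle after a move. */
static unsigned char *writable_header(struct buf *b, size_t need)
{
    if (buf_ensure_writable(b, need) < 0)
        return NULL;
    return b->data;             /* fresh pointer, never a saved one */
}
```

The IPv4 branch of the quoted patch already follows this order; the v2 fix applies the same order to the IPv6 branch.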
Re: [ovs-dev] [PATCH net-next v2] openvswitch: add TTL decrement action
On Wed, Dec 18, 2019 at 4:06 AM Pravin Shelar wrote: > > On Tue, Dec 17, 2019 at 7:51 AM Matteo Croce wrote: > > > > New action to decrement TTL instead of setting it to a fixed value. > > This action will decrement the TTL and, in case of expired TTL, drop it > > or execute an action passed via a nested attribute. > > The default TTL expired action is to drop the packet. > > > > Supports both IPv4 and IPv6 via the ttl and hop_limit fields, respectively. > > > > Tested with a corresponding change in the userspace: > > > > # ovs-dpctl dump-flows > > in_port(2),eth(),eth_type(0x0800), packets:0, bytes:0, used:never, > > actions:dec_ttl{ttl<=1 action:(drop)},1,1 > > in_port(1),eth(),eth_type(0x0800), packets:0, bytes:0, used:never, > > actions:dec_ttl{ttl<=1 action:(drop)},1,2 > > in_port(1),eth(),eth_type(0x0806), packets:0, bytes:0, used:never, > > actions:2 > > in_port(2),eth(),eth_type(0x0806), packets:0, bytes:0, used:never, > > actions:1 > > > > # ping -c1 192.168.0.2 -t 42 > > IP (tos 0x0, ttl 41, id 61647, offset 0, flags [DF], proto ICMP (1), > > length 84) > > 192.168.0.1 > 192.168.0.2: ICMP echo request, id 386, seq 1, length > > 64 > > # ping -c1 192.168.0.2 -t 120 > > IP (tos 0x0, ttl 119, id 62070, offset 0, flags [DF], proto ICMP (1), > > length 84) > > 192.168.0.1 > 192.168.0.2: ICMP echo request, id 388, seq 1, length > > 64 > > # ping -c1 192.168.0.2 -t 1 > > # > > > > Co-authored-by: Bindiya Kurle > > Signed-off-by: Bindiya Kurle > > Signed-off-by: Matteo Croce > > --- > > include/uapi/linux/openvswitch.h | 22 +++ > > net/openvswitch/actions.c| 71 + > > net/openvswitch/flow_netlink.c | 105 +++ > > 3 files changed, 198 insertions(+) > > > > diff --git a/include/uapi/linux/openvswitch.h > > b/include/uapi/linux/openvswitch.h > > index a87b44cd5590..b6684bc04883 100644 > > --- a/include/uapi/linux/openvswitch.h > > +++ b/include/uapi/linux/openvswitch.h > > @@ -927,6 +927,7 @@ enum ovs_action_attr { > > OVS_ACTION_ATTR_METER,/* u32 meter ID. 
*/ > > OVS_ACTION_ATTR_CLONE,/* Nested OVS_CLONE_ATTR_*. */ > > OVS_ACTION_ATTR_CHECK_PKT_LEN, /* Nested OVS_CHECK_PKT_LEN_ATTR_*. > > */ > > + OVS_ACTION_ATTR_DEC_TTL, /* Nested OVS_DEC_TTL_ATTR_*. */ > > > > __OVS_ACTION_ATTR_MAX,/* Nothing past this will be accepted > >* from userspace. */ > > @@ -939,6 +940,23 @@ enum ovs_action_attr { > > }; > > > > #define OVS_ACTION_ATTR_MAX (__OVS_ACTION_ATTR_MAX - 1) > > +enum ovs_dec_ttl_attr { > > + OVS_DEC_TTL_ATTR_UNSPEC, > > + OVS_DEC_TTL_ATTR_ACTION_TYPE,/* Action Type u32 */ > > + OVS_DEC_TTL_ATTR_ACTION, /* nested action */ > > + __OVS_DEC_TTL_ATTR_MAX, > > +#ifdef __KERNEL__ > > + OVS_DEC_TTL_ATTR_ARG /* struct sample_arg */ > > +#endif > > +}; > > + > > I do not see need for type or OVS_DEC_TTL_ACTION_DROP, if there are no > nested action the datapath can drop the packet. > > > +#ifdef __KERNEL__ > > +struct dec_ttl_arg { > > + u32 action_type;/* dec_ttl action type.*/ > > +}; > > +#endif > > + > > +#define OVS_DEC_TTL_ATTR_MAX (__OVS_DEC_TTL_ATTR_MAX - 1) > > > > /* Meters. */ > > #define OVS_METER_FAMILY "ovs_meter" > > @@ -1009,6 +1027,10 @@ enum ovs_ct_limit_attr { > > __OVS_CT_LIMIT_ATTR_MAX > > }; > > > > +enum ovs_dec_ttl_action {/*Actions supported by dec_ttl */ > > + OVS_DEC_TTL_ACTION_DROP, > > + OVS_DEC_TTL_ACTION_USER_SPACE > > +}; > > #define OVS_CT_LIMIT_ATTR_MAX (__OVS_CT_LIMIT_ATTR_MAX - 1) > > > > #define OVS_ZONE_LIMIT_DEFAULT_ZONE -1 > > diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c > > index 4c8395462303..5329668732b1 100644 > > --- a/net/openvswitch/actions.c > > +++ b/net/openvswitch/actions.c > > @@ -960,6 +960,31 @@ static int output_userspace(struct datapath *dp, > > struct sk_buff *skb, > > return ovs_dp_upcall(dp, skb, key, &upcall, cutlen); > > } > > > > +static int dec_ttl(struct datapath *dp, struct sk_buff *skb, > > + struct sw_flow_key *fk, const struct nlattr *attr, bool > > last) > > +{ > > + struct nlattr
Re: [ovs-dev] [PATCH v8 1/1] Avoid dp_hash recirculation for balance-tcp bond selection mode
On Tue, Dec 17, 2019 at 11:03 AM Vishal Deep Ajmera wrote: > > Problem: > > In OVS-DPDK, flows with output over a bond interface of type “balance-tcp” > (using a hash on TCP/UDP 5-tuple) get translated by the ofproto layer into > "HASH" and "RECIRC" datapath actions. After recirculation, the packet is > forwarded to the bond member port based on 8-bits of the datapath hash > value computed through dp_hash. This causes performance degradation in the > following ways: > ACK. Just a nitpick: can you remove the non-ASCII double quotes from the commit message? They show in git log as: <80><9C>balance-tcp<80><9D> Regards, -- Matteo Croce per aspera ad upstream
[ovs-dev] [PATCH net-next v2] openvswitch: add TTL decrement action
New action to decrement TTL instead of setting it to a fixed value. This action will decrement the TTL and, in case of expired TTL, drop it or execute an action passed via a nested attribute. The default TTL expired action is to drop the packet. Supports both IPv4 and IPv6 via the ttl and hop_limit fields, respectively. Tested with a corresponding change in the userspace: # ovs-dpctl dump-flows in_port(2),eth(),eth_type(0x0800), packets:0, bytes:0, used:never, actions:dec_ttl{ttl<=1 action:(drop)},1,1 in_port(1),eth(),eth_type(0x0800), packets:0, bytes:0, used:never, actions:dec_ttl{ttl<=1 action:(drop)},1,2 in_port(1),eth(),eth_type(0x0806), packets:0, bytes:0, used:never, actions:2 in_port(2),eth(),eth_type(0x0806), packets:0, bytes:0, used:never, actions:1 # ping -c1 192.168.0.2 -t 42 IP (tos 0x0, ttl 41, id 61647, offset 0, flags [DF], proto ICMP (1), length 84) 192.168.0.1 > 192.168.0.2: ICMP echo request, id 386, seq 1, length 64 # ping -c1 192.168.0.2 -t 120 IP (tos 0x0, ttl 119, id 62070, offset 0, flags [DF], proto ICMP (1), length 84) 192.168.0.1 > 192.168.0.2: ICMP echo request, id 388, seq 1, length 64 # ping -c1 192.168.0.2 -t 1 # Co-authored-by: Bindiya Kurle Signed-off-by: Bindiya Kurle Signed-off-by: Matteo Croce --- include/uapi/linux/openvswitch.h | 22 +++ net/openvswitch/actions.c| 71 + net/openvswitch/flow_netlink.c | 105 +++ 3 files changed, 198 insertions(+) diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h index a87b44cd5590..b6684bc04883 100644 --- a/include/uapi/linux/openvswitch.h +++ b/include/uapi/linux/openvswitch.h @@ -927,6 +927,7 @@ enum ovs_action_attr { OVS_ACTION_ATTR_METER,/* u32 meter ID. */ OVS_ACTION_ATTR_CLONE,/* Nested OVS_CLONE_ATTR_*. */ OVS_ACTION_ATTR_CHECK_PKT_LEN, /* Nested OVS_CHECK_PKT_LEN_ATTR_*. */ + OVS_ACTION_ATTR_DEC_TTL, /* Nested OVS_DEC_TTL_ATTR_*. */ __OVS_ACTION_ATTR_MAX,/* Nothing past this will be accepted * from userspace. 
*/ @@ -939,6 +940,23 @@ enum ovs_action_attr { }; #define OVS_ACTION_ATTR_MAX (__OVS_ACTION_ATTR_MAX - 1) +enum ovs_dec_ttl_attr { + OVS_DEC_TTL_ATTR_UNSPEC, + OVS_DEC_TTL_ATTR_ACTION_TYPE,/* Action Type u32 */ + OVS_DEC_TTL_ATTR_ACTION, /* nested action */ + __OVS_DEC_TTL_ATTR_MAX, +#ifdef __KERNEL__ + OVS_DEC_TTL_ATTR_ARG /* struct sample_arg */ +#endif +}; + +#ifdef __KERNEL__ +struct dec_ttl_arg { + u32 action_type;/* dec_ttl action type.*/ +}; +#endif + +#define OVS_DEC_TTL_ATTR_MAX (__OVS_DEC_TTL_ATTR_MAX - 1) /* Meters. */ #define OVS_METER_FAMILY "ovs_meter" @@ -1009,6 +1027,10 @@ enum ovs_ct_limit_attr { __OVS_CT_LIMIT_ATTR_MAX }; +enum ovs_dec_ttl_action {/*Actions supported by dec_ttl */ + OVS_DEC_TTL_ACTION_DROP, + OVS_DEC_TTL_ACTION_USER_SPACE +}; #define OVS_CT_LIMIT_ATTR_MAX (__OVS_CT_LIMIT_ATTR_MAX - 1) #define OVS_ZONE_LIMIT_DEFAULT_ZONE -1 diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c index 4c8395462303..5329668732b1 100644 --- a/net/openvswitch/actions.c +++ b/net/openvswitch/actions.c @@ -960,6 +960,31 @@ static int output_userspace(struct datapath *dp, struct sk_buff *skb, return ovs_dp_upcall(dp, skb, key, &upcall, cutlen); } +static int dec_ttl(struct datapath *dp, struct sk_buff *skb, + struct sw_flow_key *fk, const struct nlattr *attr, bool last) +{ + struct nlattr *actions; + struct nlattr *dec_ttl_arg; + int rem = nla_len(attr); + const struct dec_ttl_arg *arg; + + /* The first action is always OVS_DEC_TTL_ATTR_ARG. */ + dec_ttl_arg = nla_data(attr); + arg = nla_data(dec_ttl_arg); + actions = nla_next(dec_ttl_arg, &rem); + + switch (arg->action_type) { + case OVS_DEC_TTL_ACTION_DROP: + consume_skb(skb); + break; + + case OVS_DEC_TTL_ACTION_USER_SPACE: + return clone_execute(dp, skb, fk, 0, actions, rem, last, false); + } + + return 0; +} + /* When 'last' is true, sample() should always consume the 'skb'. * Otherwise, sample() should keep 'skb' intact regardless what * actions are executed within sample(). 
@@ -1176,6 +1201,44 @@ static int execute_check_pkt_len(struct datapath *dp, struct sk_buff *skb, nla_len(actions), last, clone_flow_key); } +static int execute_dec_ttl(struct sk_buff *skb, struct sw_flow_key *key) +{ + int err; + + if (skb->protocol == htons(ETH_P_IPV6)) { + struct ipv6hdr *nh = ipv6_hdr(skb); + + err = skb_ensure_writable(skb, skb_network_offset(skb) + + sizeof(*nh)); +
Re: [ovs-dev] thundering herd wakeup of handler threads
On Tue, Dec 10, 2019 at 10:00 PM David Ahern wrote: > > [ adding Jason as author of the patch that added the epoll exclusive flag ] > > On 12/10/19 12:37 PM, Matteo Croce wrote: > > On Tue, Dec 10, 2019 at 8:13 PM David Ahern wrote: > >> > >> Hi Matteo: > >> > >> On a hypervisor running a 4.14.91 kernel and OVS 2.11 I am seeing a > >> thundering herd wake up problem. Every packet punted to userspace wakes > >> up every one of the handler threads. On a box with 96 cpus, there are 71 > >> handler threads which means 71 process wakeups for every packet punted. > >> > >> This is really easy to see, just watch sched:sched_wakeup tracepoints. > >> With a few extra probes: > >> > >> perf probe sock_def_readable sk=%di > >> perf probe ep_poll_callback wait=%di mode=%si sync=%dx key=%cx > >> perf probe __wake_up_common wq_head=%di mode=%si nr_exclusive=%dx > >> wake_flags=%cx key=%8 > >> > >> you can see there is a single netlink socket and its wait queue contains > >> an entry for every handler thread. > >> > >> This does not happen with the 2.7.3 version. Roaming commits it appears > >> that the change in behavior comes from this commit: > >> > >> commit 69c51582ff786a68fc325c1c50624715482bc460 > >> Author: Matteo Croce > >> Date: Tue Sep 25 10:51:05 2018 +0200 > >> > >> dpif-netlink: don't allocate per thread netlink sockets > >> > >> > >> Is this a known problem? > >> > >> David > >> > > > > Hi David, > > > > before my patch, vswitchd created NxM sockets, being N the ports and M > > the active cores, > > because every thread opens a netlink socket per port. > > > > With my patch, a pool is created with N socket, one per port, and all > > the threads polls the same list > > with the EPOLLEXCLUSIVE flag. > > As the name suggests, EPOLLEXCLUSIVE lets the kernel wakeup only one > > of the waiting threads. > > > > I'm not aware of this problem, but it goes against the intended > > behaviour of EPOLLEXCLUSIVE. 
> > Such flag exists since Linux 4.5, can you check that it's passed > > correctly to epoll()? > > > > This is the commit that added the EXCLUSIVE flag: > > commit df0108c5da561c66c333bb46bfe3c1fc65905898 > Author: Jason Baron > Date: Wed Jan 20 14:59:24 2016 -0800 > > epoll: add EPOLLEXCLUSIVE flag > > > The commit message acknowledges that multiple threads can still be awakened: > > "The implementation walks the list of exclusive waiters, and queues an > event to each epfd, until it finds the first waiter that has threads > blocked on it via epoll_wait(). The idea is to search for threads which > are idle and ready to process the wakeup events. Thus, we queue an > event to at least 1 epfd, but may still potentially queue an event to > all epfds that are attached to the shared fd source." > > To me that means all idle handler threads are going to be awakened on > each upcall message even though only 1 is needed to handle the message. > > Jason: What was the rationale behind the exclusive flag that still wakes > up more than 1 waiter? In the case of OVS and vswitchd I am seeing all N > handler threads awakened on every single event which is a horrible > scaling property. > Actually, I didn't look at that commit message, but I read the epoll_ctl manpage which says: "When a wakeup event occurs and multiple epoll file descriptors are attached to the same target file using EPOLLEXCLUSIVE, one or more of the epoll file descriptors will receive an event with epoll_wait(2). The default in this scenario (when EPOLLEXCLUSIVE is not set) is for all epoll file descriptors to receive an event. EPOLLEXCLUSIVE is thus useful for avoiding thundering herd problems in certain scenarios." I'd expect "one or more" to be probably greater than 1, but still much lower than all. Before this patch (which unfortunately is needed to avoid -EMFILE errors with many ports), how many sockets are awakened when an ARP is received? 
Regards, -- Matteo Croce per aspera ad upstream
Re: [ovs-dev] thundering herd wakeup of handler threads
On Tue, Dec 10, 2019 at 8:41 PM David Ahern wrote: > > On 12/10/19 12:37 PM, Matteo Croce wrote: > > On Tue, Dec 10, 2019 at 8:13 PM David Ahern wrote: > >> > >> Hi Matteo: > >> > >> On a hypervisor running a 4.14.91 kernel and OVS 2.11 I am seeing a > >> thundering herd wake up problem. Every packet punted to userspace wakes > >> up every one of the handler threads. On a box with 96 cpus, there are 71 > >> handler threads which means 71 process wakeups for every packet punted. > >> > >> This is really easy to see, just watch sched:sched_wakeup tracepoints. > >> With a few extra probes: > >> > >> perf probe sock_def_readable sk=%di > >> perf probe ep_poll_callback wait=%di mode=%si sync=%dx key=%cx > >> perf probe __wake_up_common wq_head=%di mode=%si nr_exclusive=%dx > >> wake_flags=%cx key=%8 > >> > >> you can see there is a single netlink socket and its wait queue contains > >> an entry for every handler thread. > >> > >> This does not happen with the 2.7.3 version. Roaming commits it appears > >> that the change in behavior comes from this commit: > >> > >> commit 69c51582ff786a68fc325c1c50624715482bc460 > >> Author: Matteo Croce > >> Date: Tue Sep 25 10:51:05 2018 +0200 > >> > >> dpif-netlink: don't allocate per thread netlink sockets > >> > >> > >> Is this a known problem? > >> > >> David > >> > > > > Hi David, > > > > before my patch, vswitchd created NxM sockets, being N the ports and M > > the active cores, > > because every thread opens a netlink socket per port. > > > > With my patch, a pool is created with N socket, one per port, and all > > the threads polls the same list > > with the EPOLLEXCLUSIVE flag. > > As the name suggests, EPOLLEXCLUSIVE lets the kernel wakeup only one > > of the waiting threads. > > > > I'm not aware of this problem, but it goes against the intended > > behaviour of EPOLLEXCLUSIVE. > > Such flag exists since Linux 4.5, can you check that it's passed > > correctly to epoll()? 
> > > > I get the theory, but the reality is that all threads are awakened. > Also, it is not limited to the 4.14 kernel; I see the same behavior with > 5.4. > So all threads are awakened, even if there is only an upcall packet to read? This is not good, epoll() should wake just one thread at time, as the manpage says. I have to check this, do you have a minimal setup? How do you generate upcalls, it's BUM traffic or via action=userspace? Bye, -- Matteo Croce per aspera ad upstream ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] thundering herd wakeup of handler threads
On Tue, Dec 10, 2019 at 8:13 PM David Ahern wrote: > > Hi Matteo: > > On a hypervisor running a 4.14.91 kernel and OVS 2.11 I am seeing a > thundering herd wake up problem. Every packet punted to userspace wakes > up every one of the handler threads. On a box with 96 cpus, there are 71 > handler threads which means 71 process wakeups for every packet punted. > > This is really easy to see, just watch sched:sched_wakeup tracepoints. > With a few extra probes: > > perf probe sock_def_readable sk=%di > perf probe ep_poll_callback wait=%di mode=%si sync=%dx key=%cx > perf probe __wake_up_common wq_head=%di mode=%si nr_exclusive=%dx > wake_flags=%cx key=%8 > > you can see there is a single netlink socket and its wait queue contains > an entry for every handler thread. > > This does not happen with the 2.7.3 version. Roaming commits it appears > that the change in behavior comes from this commit: > > commit 69c51582ff786a68fc325c1c50624715482bc460 > Author: Matteo Croce > Date: Tue Sep 25 10:51:05 2018 +0200 > > dpif-netlink: don't allocate per thread netlink sockets > > > Is this a known problem? > > David > Hi David, before my patch, vswitchd created NxM sockets, being N the ports and M the active cores, because every thread opens a netlink socket per port. With my patch, a pool is created with N socket, one per port, and all the threads polls the same list with the EPOLLEXCLUSIVE flag. As the name suggests, EPOLLEXCLUSIVE lets the kernel wakeup only one of the waiting threads. I'm not aware of this problem, but it goes against the intended behaviour of EPOLLEXCLUSIVE. Such flag exists since Linux 4.5, can you check that it's passed correctly to epoll()? Bye, -- Matteo Croce per aspera ad upstream ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
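For reference, registering a shared fd with EPOLLEXCLUSIVE as described above looks roughly like the sketch below (plain C, Linux >= 4.5). epoll_add_exclusive() is an illustrative helper, not the actual dpif-netlink code:

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Register 'fd' for input on a fresh epoll instance with EPOLLEXCLUSIVE,
 * the flag dpif-netlink relies on so that a wakeup on a shared per-port
 * socket is delivered to a subset of the blocked handler threads rather
 * than all of them.  Returns the epoll fd, or -1 on error. */
static int epoll_add_exclusive(int fd)
{
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN | EPOLLEXCLUSIVE,
                              .data.fd = fd };

    if (epfd < 0)
        return -1;
    /* EPOLLEXCLUSIVE must be set at EPOLL_CTL_ADD time; epoll_ctl()
     * rejects it for EPOLL_CTL_MOD. */
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
        close(epfd);
        return -1;
    }
    return epfd;
}
```

Note that the flag constrains wakeups of threads blocked in epoll_wait() at the moment the event fires; per the commit message quoted elsewhere in this thread, "one or more" epoll instances may still be notified.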
Re: [ovs-dev] [PATCH net-next] openvswitch: add TTL decrement action
On Mon, Nov 18, 2019 at 5:20 PM Ben Pfaff wrote: > > On Tue, Nov 12, 2019 at 04:46:12PM +0100, Matteo Croce wrote: > > On Tue, Nov 12, 2019 at 4:00 PM Simon Horman > > wrote: > > > > > > On Tue, Nov 12, 2019 at 11:25:18AM +0100, Matteo Croce wrote: > > > > New action to decrement TTL instead of setting it to a fixed value. > > > > This action will decrement the TTL and, in case of expired TTL, send the > > > > packet to userspace via output_userspace() to take care of it. > > > > > > > > Supports both IPv4 and IPv6 via the ttl and hop_limit fields, > > > > respectively. > > > > > > > > > > Usually OVS achieves this behaviour by matching on the TTL and > > > setting it to the desired value, pre-calculated as TTL -1. > > > With that in mind could you explain the motivation for this > > > change? > > > > > > > Hi, > > > > the problem is that OVS creates a flow for each ttl it sees. I can let > > vswitchd create 255 flows like this: > > > > $ for i in {2..255}; do ping 192.168.0.2 -t $i -c1 -w1 &>/dev/null & done > > $ ovs-dpctl dump-flows |fgrep -c 'set(ipv4(ttl' > > 255 > > Sure, you can easily invent a situation. In real traffic there's not > usually such a variety of TTLs for a flow that matches on the number of > fields that OVS usually needs to match. Do you see a real problem given > actual traffic in practice? > Hi Ben, yes, my situation was a bit artificial, but you can get a similar situation in practice. Imagine a router with some subnetworks behind it on N levels, with some nodes hosting virtual machines. Windows and Linux have different default TTL values, 128 and 64 respectively, so you could see N*2 different TTL values. Bye, -- Matteo Croce per aspera ad upstream
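A toy model of the megaflow growth being argued about (illustrative helpers only, not OVS code): when the action is set(ipv4(ttl=X-1)) the flow must also match on ttl=X, so every distinct incoming TTL installs a distinct datapath flow, whereas a dec_ttl action never matches on the TTL and one flow covers them all:

```c
#include <stdbool.h>

/* Count datapath flows installed when the TTL is part of the match,
 * as with set(ipv4(ttl=X-1)): one megaflow per distinct observed TTL. */
static int flows_with_set_ttl(const unsigned char *ttls, int n)
{
    bool seen[256] = { false };
    int flows = 0;

    for (int i = 0; i < n; i++) {
        if (!seen[ttls[i]]) {
            seen[ttls[i]] = true;
            flows++;            /* new TTL value => new megaflow */
        }
    }
    return flows;
}

/* With a dec_ttl action the TTL is rewritten, never matched, so a
 * single flow handles every TTL. */
static int flows_with_dec_ttl(const unsigned char *ttls, int n)
{
    (void)ttls;
    return n ? 1 : 0;
}
```

For example, traffic from Linux (TTL 64) and Windows (TTL 128) guests sitting 1 to 3 hops away presents six distinct TTLs, hence six flows in the first model and one in the second.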
Re: [ovs-dev] [PATCH net-next] openvswitch: add TTL decrement action
On Tue, Nov 12, 2019 at 4:00 PM Simon Horman wrote: > > On Tue, Nov 12, 2019 at 11:25:18AM +0100, Matteo Croce wrote: > > New action to decrement TTL instead of setting it to a fixed value. > > This action will decrement the TTL and, in case of expired TTL, send the > > packet to userspace via output_userspace() to take care of it. > > > > Supports both IPv4 and IPv6 via the ttl and hop_limit fields, respectively. > > > > Usually OVS achieves this behaviour by matching on the TTL and > setting it to the desired value, pre-calculated as TTL -1. > With that in mind could you explain the motivation for this > change? > Hi, the problem is that OVS creates a flow for each ttl it sees. I can let vswitchd create 255 flows like this: $ for i in {2..255}; do ping 192.168.0.2 -t $i -c1 -w1 &>/dev/null & done $ ovs-dpctl dump-flows |fgrep -c 'set(ipv4(ttl' 255 > > @@ -1174,6 +1174,43 @@ static int execute_check_pkt_len(struct datapath > > *dp, struct sk_buff *skb, > >nla_len(actions), last, clone_flow_key); > > } > > > > +static int execute_dec_ttl(struct sk_buff *skb, struct sw_flow_key *key) > > +{ > > + int err; > > + > > + if (skb->protocol == htons(ETH_P_IPV6)) { > > + struct ipv6hdr *nh = ipv6_hdr(skb); > > + > > + err = skb_ensure_writable(skb, skb_network_offset(skb) + > > + sizeof(*nh)); > > + if (unlikely(err)) > > + return err; > > + > > + if (nh->hop_limit <= 1) > > + return -EHOSTUNREACH; > > + > > + key->ip.ttl = --nh->hop_limit; > > + } else { > > + struct iphdr *nh = ip_hdr(skb); > > + u8 old_ttl; > > + > > + err = skb_ensure_writable(skb, skb_network_offset(skb) + > > + sizeof(*nh)); > > + if (unlikely(err)) > > + return err; > > + > > + if (nh->ttl <= 1) > > + return -EHOSTUNREACH; > > + > > + old_ttl = nh->ttl--; > > + csum_replace2(&nh->check, htons(old_ttl << 8), > > + htons(nh->ttl << 8)); > > + key->ip.ttl = nh->ttl; > > + } > > The above may send packets with TTL = 0, is that desired? 
> If TTL is 1 or 0, execute_dec_ttl() returns -EHOSTUNREACH, and the caller will just send the packet to userspace and then free it. I think this is enough, am I missing something? Bye, -- Matteo Croce per aspera ad upstream
[ovs-dev] [PATCH net-next] openvswitch: add TTL decrement action
New action to decrement TTL instead of setting it to a fixed value. This action will decrement the TTL and, in case of expired TTL, send the packet to userspace via output_userspace() to take care of it. Supports both IPv4 and IPv6 via the ttl and hop_limit fields, respectively. Tested with a corresponding change in the userspace: # ovs-dpctl dump-flows in_port(2),eth(),eth_type(0x0800), packets:0, bytes:0, used:never, actions:dec_ttl,1 in_port(1),eth(),eth_type(0x0800), packets:0, bytes:0, used:never, actions:dec_ttl,2 in_port(1),eth(),eth_type(0x0806), packets:0, bytes:0, used:never, actions:2 in_port(2),eth(),eth_type(0x0806), packets:0, bytes:0, used:never, actions:1 # ping -c1 192.168.0.2 -t 42 IP (tos 0x0, ttl 41, id 61647, offset 0, flags [DF], proto ICMP (1), length 84) 192.168.0.1 > 192.168.0.2: ICMP echo request, id 386, seq 1, length 64 # ping -c1 192.168.0.2 -t 120 IP (tos 0x0, ttl 119, id 62070, offset 0, flags [DF], proto ICMP (1), length 84) 192.168.0.1 > 192.168.0.2: ICMP echo request, id 388, seq 1, length 64 # ping -c1 192.168.0.2 -t 1 # Co-authored-by: Bindiya Kurle Signed-off-by: Bindiya Kurle Signed-off-by: Matteo Croce --- include/uapi/linux/openvswitch.h | 2 ++ net/openvswitch/actions.c| 46 net/openvswitch/flow_netlink.c | 6 + 3 files changed, 54 insertions(+) diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h index 1887a451c388..a3bdb1ecd1e7 100644 --- a/include/uapi/linux/openvswitch.h +++ b/include/uapi/linux/openvswitch.h @@ -890,6 +890,7 @@ struct check_pkt_len_arg { * @OVS_ACTION_ATTR_CHECK_PKT_LEN: Check the packet length and execute a set * of actions if greater than the specified packet length, else execute * another set of actions. + * @OVS_ACTION_ATTR_DEC_TTL: Decrement the IP TTL. * * Only a single header can be set with a single %OVS_ACTION_ATTR_SET. Not all * fields within a header are modifiable, e.g. 
the IPv4 protocol and fragment @@ -925,6 +926,7 @@ enum ovs_action_attr { OVS_ACTION_ATTR_METER,/* u32 meter ID. */ OVS_ACTION_ATTR_CLONE,/* Nested OVS_CLONE_ATTR_*. */ OVS_ACTION_ATTR_CHECK_PKT_LEN, /* Nested OVS_CHECK_PKT_LEN_ATTR_*. */ + OVS_ACTION_ATTR_DEC_TTL, /* Decrement ttl action */ __OVS_ACTION_ATTR_MAX,/* Nothing past this will be accepted * from userspace. */ diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c index 12936c151cc0..077b7f309c93 100644 --- a/net/openvswitch/actions.c +++ b/net/openvswitch/actions.c @@ -1174,6 +1174,43 @@ static int execute_check_pkt_len(struct datapath *dp, struct sk_buff *skb, nla_len(actions), last, clone_flow_key); } +static int execute_dec_ttl(struct sk_buff *skb, struct sw_flow_key *key) +{ + int err; + + if (skb->protocol == htons(ETH_P_IPV6)) { + struct ipv6hdr *nh = ipv6_hdr(skb); + + err = skb_ensure_writable(skb, skb_network_offset(skb) + + sizeof(*nh)); + if (unlikely(err)) + return err; + + if (nh->hop_limit <= 1) + return -EHOSTUNREACH; + + key->ip.ttl = --nh->hop_limit; + } else { + struct iphdr *nh = ip_hdr(skb); + u8 old_ttl; + + err = skb_ensure_writable(skb, skb_network_offset(skb) + + sizeof(*nh)); + if (unlikely(err)) + return err; + + if (nh->ttl <= 1) + return -EHOSTUNREACH; + + old_ttl = nh->ttl--; + csum_replace2(&nh->check, htons(old_ttl << 8), + htons(nh->ttl << 8)); + key->ip.ttl = nh->ttl; + } + + return 0; +} + /* Execute a list of actions against 'skb'. 
*/ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb, struct sw_flow_key *key, @@ -1345,6 +1382,15 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb, break; } + + case OVS_ACTION_ATTR_DEC_TTL: + err = execute_dec_ttl(skb, key); + if (err == -EHOSTUNREACH) { + output_userspace(dp, skb, key, a, attr, +len, OVS_CB(skb)->cutlen); + OVS_CB(skb)->cutlen = 0; + } + break; } if (unlikely(err)) { diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c index 65c2e3458ff5..d17f2d4b420f 100644 --- a/net/ope
Re: [ovs-dev] Extending ovs_action_attr to add a new action
On Fri, Nov 8, 2019 at 11:04 PM William Tu wrote: > > On Fri, Nov 08, 2019 at 05:12:55PM +0100, Matteo Croce wrote: > > Hi, > > > > I need to add a field to enum ovs_action_attr, but I see that the > > definition between the upstream header[1] and the one in compat[2] > > differs. > > Upstream enum stops at OVS_ACTION_ATTR_CHECK_PKT_LEN, with an extra > > "hidden" element after __OVS_ACTION_ATTR_MAX (22) > > Our compat version instead, has OVS_ACTION_ATTR_TUNNEL_{PUSH,POP} > > defined only #ifndef __KERNEL__, with __OVS_ACTION_ATTR_MAX being 22 > > for the kernel and 24 for userspace. > > > > If I add a field OVS_ACTION_ATTR_WHATEVER just before > > __OVS_ACTION_ATTR_MAX in the kernel, older userspace will incorrectly > > see the new action as OVS_ACTION_ATTR_TUNNEL_PUSH. > > "older userspace" means you're using userspace datapath (dpif-netdev)? > If true, then it's not using kernel module. > > if "older userspace" means just ovs-vswitchd using kernel module, > and you want to upgrade ovs kernel module with your new action > and without upgrade ovs-vswitchd? > Yes, I mean older vswitchd with a new kernel module. If I add a field after OVS_ACTION_ATTR_TUNNEL_POP, and then I downgrade the userspace utils I end up in this situation: # ovs-dpctl dump-flows in_port(1),eth(),eth_type(0x0800), packets:1, bytes:98, used:605.381s, actions:tnl_push(tnl_port(65544),header(size=252,type=131097,eth(dst=14:00:00:00:84:03,src=00:00:03:01:00:00,dl_type=0x0500),ipv6(src=1400::800:1300:0:0:800,dst=200::800:300:200:0:800,label=21504,proto=8,tclass=0x0,hlimit=0),),out_port(2)),2 while adding it before OVS_ACTION_ATTR_TUNNEL_POP I get this: # ovs-dpctl dump-flows in_port(1),eth(),eth_type(0x0800), packets:1, bytes:98, used:3.661s, actions:bad length 0, expected 4 for: action22,2 > Usually we also upgrade ovs-vswitchd, so I don't know how this can be done. > So it's not a problem, we can assume that the userspace can't be older than the kernel datapath. 
Regards, -- Matteo Croce per aspera ad upstream
[ovs-dev] Extending ovs_action_attr to add a new action
Hi, I need to add a field to enum ovs_action_attr, but I see that the definition between the upstream header[1] and the one in compat[2] differs. Upstream enum stops at OVS_ACTION_ATTR_CHECK_PKT_LEN, with an extra "hidden" element after __OVS_ACTION_ATTR_MAX (22) Our compat version instead, has OVS_ACTION_ATTR_TUNNEL_{PUSH,POP} defined only #ifndef __KERNEL__, with __OVS_ACTION_ATTR_MAX being 22 for the kernel and 24 for userspace. If I add a field OVS_ACTION_ATTR_WHATEVER just before __OVS_ACTION_ATTR_MAX in the kernel, older userspace will incorrectly see the new action as OVS_ACTION_ATTR_TUNNEL_PUSH. How can we extend this enum without breaking compatibility? If OVS_ACTION_ATTR_TUNNEL_* really is unused in the kernel datapath, what if we pad the kernel header with two padding fields? -%<- --- a/include/uapi/linux/openvswitch.h +++ b/include/uapi/linux/openvswitch.h @@ -925,6 +926,8 @@ enum ovs_action_attr { OVS_ACTION_ATTR_METER,/* u32 meter ID. */ OVS_ACTION_ATTR_CLONE,/* Nested OVS_CLONE_ATTR_*. */ OVS_ACTION_ATTR_CHECK_PKT_LEN, /* Nested OVS_CHECK_PKT_LEN_ATTR_*. */ + _OVS_ACTION_ATTR_TUNNEL_PUSH, /* unused in kernel datapath. */ + _OVS_ACTION_ATTR_TUNNEL_POP, /* unused in kernel datapath. */ __OVS_ACTION_ATTR_MAX, /* Nothing past this will be accepted ->%- [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/openvswitch.h?h=v5.3#n899 [2] https://github.com/openvswitch/ovs/blob/v2.12.0/datapath/linux/compat/include/linux/openvswitch.h#L962 Regards, -- Matteo Croce per aspera ad upstream ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
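A compile-time sketch of the clash described above, with abbreviated, hypothetical enumerators (the real enums carry many more values; only the numbering is taken from the message: __OVS_ACTION_ATTR_MAX is 22 in the kernel header and 24 in the compat one):

```c
/* Abbreviated model of include/uapi/linux/openvswitch.h: the kernel
 * enum ends right after CHECK_PKT_LEN (value 21), so a new action
 * lands on 22. */
enum kernel_action {
    K_ATTR_CHECK_PKT_LEN = 21,
    K_ATTR_NEW_ACTION,          /* the proposed new action: value 22 */
    __K_ATTR_MAX                /* 23 */
};

/* Abbreviated model of the compat header's #ifndef __KERNEL__ branch:
 * userspace appends TUNNEL_PUSH/TUNNEL_POP after CHECK_PKT_LEN, so an
 * old vswitchd decodes value 22 as TUNNEL_PUSH. */
enum userspace_action {
    U_ATTR_CHECK_PKT_LEN = 21,
    U_ATTR_TUNNEL_PUSH,         /* value 22: collides with the new action */
    U_ATTR_TUNNEL_POP,          /* value 23 */
    __U_ATTR_MAX                /* 24 */
};
```

This is exactly the "bad length 0, expected 4 for: action22" confusion quoted above: the numeric attribute value is shared, but the two sides disagree on its meaning, which is why the proposal pads the kernel enum past the userspace-only values.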
Re: [ovs-dev] [PATCH v7 0/1] Balance-tcp bond mode optimization
On Tue, Sep 17, 2019 at 10:57 AM Vishal Deep Ajmera wrote: > > v6->v7: > Fixed issue reported by Matteo for bond/show. > > v5->v6: > Addressed comments from Ilya Maximets. > https://mail.openvswitch.org/pipermail/ovs-dev/2019-August/362001.html > Rebased to OVS master. > > v4->v5: > Support for stats per hash bucket. > Support for dynamic load balancing. > Rebased to OVS Master. > > v3->v4: > Addressed Ilya Maximets comments. > https://mail.openvswitch.org/pipermail/ovs-dev/2019-July/360452.html > > v2->v3: > Rebased to OVS master. > Fixed git merge issue. > > v1->v2: > Updated datapath action to hash + lb-output. > Updated throughput test observations. > Rebased to OVS master. > All unit test passed, plus others I did. Tested-by: Matteo Croce -- Matteo Croce per aspera ad upstream ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] [PATCH v6 0/1] Balance-tcp bond mode optimization
On Wed, Sep 11, 2019 at 8:37 PM Matteo Croce wrote: > > On Tue, Sep 10, 2019 at 11:53 AM Vishal Deep Ajmera > wrote: > > > > v5->v6: > > Addressed comments from Ilya Maximets. > > https://mail.openvswitch.org/pipermail/ovs-dev/2019-August/362001.html > > Rebased to OVS master. > > > > v4->v5: > > Support for stats per hash bucket. > > Support for dynamic load balancing. > > Rebased to OVS Master. > > > > v3->v4: > > Addressed Ilya Maximets comments. > > https://mail.openvswitch.org/pipermail/ovs-dev/2019-July/360452.html > > > > v2->v3: > > Rebased to OVS master. > > Fixed git merge issue. > > > > v1->v2: > > Updated datapath action to hash + lb-output. > > Updated throughput test observations. > > Rebased to OVS master. > > > > Vishal Deep Ajmera (1): > > Avoid dp_hash recirculation for balance-tcp bond selection mode > > > > datapath/linux/compat/include/linux/openvswitch.h | 2 + > > lib/dpif-netdev.c | 515 > > -- > > lib/dpif-netlink.c| 3 + > > lib/dpif-provider.h | 8 + > > lib/dpif.c| 48 ++ > > lib/dpif.h| 7 + > > lib/odp-execute.c | 2 + > > lib/odp-util.c| 4 + > > ofproto/bond.c| 52 ++- > > ofproto/bond.h| 9 + > > ofproto/ofproto-dpif-ipfix.c | 1 + > > ofproto/ofproto-dpif-sflow.c | 1 + > > ofproto/ofproto-dpif-xlate.c | 39 +- > > ofproto/ofproto-dpif.c| 32 ++ > > ofproto/ofproto-dpif.h| 12 +- > > tests/lacp.at | 9 + > > vswitchd/bridge.c | 4 + > > vswitchd/vswitch.xml | 10 + > > 18 files changed, 698 insertions(+), 60 deletions(-) > > > > -- > > 1.9.1 > > > > Hi, > > I confirm a decent performance improvement with DPDK and balance-tcp bonding: > > lb-output-action=false > > rx: 740 Mbps 1446 kpps > > lb-output-action=true > > rx: 860 Mbps 1680 kpps > > I'm running a very simple test with a tweaked version of testpmd which > generates 256 L4 flows, I guess that with much flows the improvement > is way higher. > > Tested-by: Matteo Croce > > -- > Matteo Croce > per aspera ad upstream Hi, I found an issue with the patch. 
It's not 100% reproducible, but sometimes the option gets enabled regardless of the bonding type and the configuration. This breaks the unit tests, as the bond/show output is wrong: # ovs-vsctl add-br br0 [63129.589500] device ovs-system entered promiscuous mode [63129.591802] device br0 entered promiscuous mode # ovs-vsctl add-bond br0 bond dummy0 dummy1 -- set Port bond lacp=active bond-mode=active-backup [63150.700542] device dummy1 entered promiscuous mode [63150.700892] device dummy0 entered promiscuous mode # ovs-vsctl get port bond other_config {} # ovs-appctl bond/show bond bond_mode: active-backup bond may use recirculation: no, Recirc-ID : -1 bond-hash-basis: 0 lb-output-action: enabled updelay: 0 ms downdelay: 0 ms lacp_status: negotiated lacp_fallback_ab: false active slave mac: 00:00:00:00:00:00(none) slave dummy0: disabled may_enable: false slave dummy1: disabled may_enable: false Regards, -- Matteo Croce per aspera ad upstream ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] [PATCH v6 0/1] Balance-tcp bond mode optimization
On Tue, Sep 10, 2019 at 11:53 AM Vishal Deep Ajmera wrote: > > v5->v6: > Addressed comments from Ilya Maximets. > https://mail.openvswitch.org/pipermail/ovs-dev/2019-August/362001.html > Rebased to OVS master. > > v4->v5: > Support for stats per hash bucket. > Support for dynamic load balancing. > Rebased to OVS Master. > > v3->v4: > Addressed Ilya Maximets comments. > https://mail.openvswitch.org/pipermail/ovs-dev/2019-July/360452.html > > v2->v3: > Rebased to OVS master. > Fixed git merge issue. > > v1->v2: > Updated datapath action to hash + lb-output. > Updated throughput test observations. > Rebased to OVS master. > > Vishal Deep Ajmera (1): > Avoid dp_hash recirculation for balance-tcp bond selection mode > > datapath/linux/compat/include/linux/openvswitch.h | 2 + > lib/dpif-netdev.c | 515 > -- > lib/dpif-netlink.c| 3 + > lib/dpif-provider.h | 8 + > lib/dpif.c| 48 ++ > lib/dpif.h| 7 + > lib/odp-execute.c | 2 + > lib/odp-util.c| 4 + > ofproto/bond.c| 52 ++- > ofproto/bond.h| 9 + > ofproto/ofproto-dpif-ipfix.c | 1 + > ofproto/ofproto-dpif-sflow.c | 1 + > ofproto/ofproto-dpif-xlate.c | 39 +- > ofproto/ofproto-dpif.c| 32 ++ > ofproto/ofproto-dpif.h| 12 +- > tests/lacp.at | 9 + > vswitchd/bridge.c | 4 + > vswitchd/vswitch.xml | 10 + > 18 files changed, 698 insertions(+), 60 deletions(-) > > -- > 1.9.1 > Hi, I confirm a decent performance improvement with DPDK and balance-tcp bonding: lb-output-action=false rx: 740 Mbps 1446 kpps lb-output-action=true rx: 860 Mbps 1680 kpps I'm running a very simple test with a tweaked version of testpmd which generates 256 L4 flows, I guess that with much flows the improvement is way higher. Tested-by: Matteo Croce -- Matteo Croce per aspera ad upstream ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] [PATCH v5] Avoid dp_hash recirculation for balance-tcp bond selection mode
36576, used:0.000s, > actions:hash(l4(0)),lb_output(bond,1)
>
> A new CLI has been added to dump the per-PMD bond cache as given below.
>
> "sudo ovs-appctl dpif-netdev/pmd-bond-show"
>
> root@ubuntu-190:performance_scripts # sudo ovs-appctl dpif-netdev/pmd-bond-show
> pmd thread numa_id 0 core_id 4:
> Bond cache:
>   bond-id 1 :
>     bucket 0 - slave 2
>     bucket 1 - slave 1
>     bucket 2 - slave 2
>     bucket 3 - slave 1
>
> Performance improvement:
>
> With a prototype of the proposed idea, the following perf improvement is seen
> with Phy-VM-Phy UDP traffic, single flow. With multiple flows, the improvement
> is even more enhanced (due to reduced number of flows).
>
> 1 VM:
> +---------------------------------------+
> |                 mpps                  |
> +---------------------------------------+
> | Flows   master   with-opt.   %delta   |
> +---------------------------------------+
> |     1    4.53      5.89      29.96    |
> |    10    4.16      5.89      41.51    |
> |   400    3.55      5.55      56.22    |
> |    1k    3.44      5.45      58.30    |
> |   10k    2.50      4.63      85.34    |
> |  100k    2.29      4.27      86.15    |
> |  500k    2.25      4.27      89.23    |
> +---------------------------------------+
> mpps: million packets per second.
>
> Signed-off-by: Manohar Krishnappa Chidambaraswamy
> Co-authored-by: Manohar Krishnappa Chidambaraswamy
> Signed-off-by: Vishal Deep Ajmera
>
> CC: Jan Scheurich
> CC: Venkatesan Pradeep
> CC: Nitin Katiyar

Hi Vishal, I quickly tested your patch on two servers with 2 ixgbe cards each, linked via a Juniper switch with LACP. With testpmd using a single core, the switching rate rose from ~2.0 Mpps to ~2.4+ Mpps, so I read at least a +20% gain. Please add to the commit message an example command showing how to enable it, e.g. ovs-vsctl set port bond0 other_config:opt-bond-tcp=true Thanks, -- Matteo Croce per aspera ad upstream ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] [PATCH v5] Avoid dp_hash recirculation for balance-tcp bond selection mode
On Thu, Aug 8, 2019 at 10:57 AM Vishal Deep Ajmera wrote: > --- a/ofproto/bond.c > +++ b/ofproto/bond.c > @@ -1939,3 +1959,9 @@ bond_get_changed_active_slave(const char *name, struct > eth_addr *mac, > > return false; > } > + > +bool > +bond_get_cache_mode(const struct bond *bond) > +{ > +return bond->use_bond_cache; > +} Hi, why not a static function in the header file? So it gets inlined. Regards, -- Matteo Croce per aspera ad upstream ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
[ovs-dev] [PATCH v2] dpif-netlink: fix null pointer
In dpif_netlink_port_add__(), socksp could be NULL, because vport_socksp_to_pids() would allocate a new array and return a single zero element. Following vport_socksp_to_pids() removal, a NULL pointer dereference can happen when dpif_netlink_port_add__() is called and dpif->handlers is 0. Restore the old behaviour of using a zero pid when dpif->handlers is 0.

Fixes: 69c51582f ("dpif-netlink: don't allocate per thread netlink sockets")
Reported-by: Flavio Leitner
Reported-by: Guru Shetty
Tested-by: Guru Shetty
Signed-off-by: Matteo Croce
---
v2: fix checkpatch.py error about coding style

 lib/dpif-netlink.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
index 21315033c..ac3d2edeb 100644
--- a/lib/dpif-netlink.c
+++ b/lib/dpif-netlink.c
@@ -712,7 +712,7 @@ dpif_netlink_port_add__(struct dpif_netlink *dpif, const char *name,
     struct dpif_netlink_vport request, reply;
     struct ofpbuf *buf;
     struct nl_sock *socksp = NULL;
-    uint32_t upcall_pids;
+    uint32_t upcall_pids = 0;
     int error = 0;
 
     if (dpif->handlers) {
@@ -728,7 +728,9 @@ dpif_netlink_port_add__(struct dpif_netlink *dpif, const char *name,
     request.name = name;
 
     request.port_no = *port_nop;
-    upcall_pids = nl_sock_pid(socksp);
+    if (socksp) {
+        upcall_pids = nl_sock_pid(socksp);
+    }
     request.n_upcall_pids = 1;
     request.upcall_pids = &upcall_pids;
-- 
2.17.1
___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] [PATCH] dpif-netlink: fix null pointer
On Mon, Oct 8, 2018 at 3:11 PM Guru Shetty wrote: > > > > On Sat, 6 Oct 2018 at 09:20, Matteo Croce wrote: >> >> In dpif_netlink_port_add__(), socksp could be NULL, because >> vport_socksp_to_pids() would allocate a new array and return a single >> zero element. >> Following vport_socksp_to_pids() removal, a NULL pointer can happen when >> dpif_netlink_port_add__() is called and dpif->handlers is 0. >> >> Restore the old behaviour of using a zero pid when dpif->handlers is 0. >> >> Fixes: 69c51582f ("dpif-netlink: don't allocate per thread netlink sockets") >> Reported-by: Flavio Leitner >> Reported-by: Guru Shetty >> Signed-off-by: Matteo Croce >> --- > > Not a review of the code. But I can confirm that the patch does fix the > segmentation fault that I was facing. > >> >> lib/dpif-netlink.c | 5 +++-- >> 1 file changed, 3 insertions(+), 2 deletions(-) >> >> diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c >> index 21315033c..310bc947d 100644 >> --- a/lib/dpif-netlink.c >> +++ b/lib/dpif-netlink.c >> @@ -712,7 +712,7 @@ dpif_netlink_port_add__(struct dpif_netlink *dpif, const >> char *name, >> struct dpif_netlink_vport request, reply; >> struct ofpbuf *buf; >> struct nl_sock *socksp = NULL; >> -uint32_t upcall_pids; >> +uint32_t upcall_pids = 0; >> int error = 0; >> >> if (dpif->handlers) { >> @@ -728,7 +728,8 @@ dpif_netlink_port_add__(struct dpif_netlink *dpif, const >> char *name, >> request.name = name; >> >> request.port_no = *port_nop; >> -upcall_pids = nl_sock_pid(socksp); >> +if (socksp) >> +upcall_pids = nl_sock_pid(socksp); >> request.n_upcall_pids = 1; >> request.upcall_pids = &upcall_pids; >> >> -- >> 2.17.1 >> Ok thanks. I'm sending a v2 with the checkpatch.py warning found by Aaron's bot fixed. -- Matteo Croce per aspera ad upstream ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] [PATCH v2] dpif-netlink: don't allocate per thread netlink sockets
On Fri, Oct 5, 2018 at 9:58 AM Markos Chandras wrote: > > On 05/10/2018 09:55, Matteo Croce wrote: > > On Fri, Oct 5, 2018 at 8:32 AM Markos Chandras wrote: > >> > >> On 25/09/2018 22:14, Ben Pfaff wrote: > >>> > >>> Applied to master thanks! > >>> > >>> I sent a patch to remove support for multiple queues in the > >>> infrastructure layer: > >>> https://patchwork.ozlabs.org/patch/974755/ > >>> ___ > >>> dev mailing list > >>> d...@openvswitch.org > >>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev > >>> > >> > >> Hello Ben and Matteo, > >> > >> This patch applies cleanly to branch-2.10, branch-2.9, and branch-2.8. > >> Would it make sense to backport it to these as well? > >> > >> -- > >> markos > >> > >> SUSE LINUX GmbH | GF: Felix Imendörffer, Jane Smithard, Graham Norton > >> HRB 21284 (AG Nürnberg) Maxfeldstr. 5, D-90409, Nürnberg > > > > Hi Markos, > > > > the file descriptor limit issue was first reported on OVS 2.8, so it's > > worth it. > > If you backport it, please consider backporting the following commit from > > Ben, > > which removes some code which become unused after my patch: > > > > commit 769b50349f28c5f9e4bff102bc61dadcb9b99c37 > > Author: Ben Pfaff > > Date: Tue Sep 25 15:14:13 2018 -0700 > > > > dpif: Remove support for multiple queues per port. > > > > Regards, > > > > Thank you Matteo, > > Ben, please let me know if you can backport these up to branch-2.8 > otherwise I can send patches for each branch myself. > > -- > markos > > SUSE LINUX GmbH | GF: Felix Imendörffer, Jane Smithard, Graham Norton > HRB 21284 (AG Nürnberg) Maxfeldstr. 5, D-90409, Nürnberg Hi Markos, there is a possible null pointer in my patch, please wait until the possible fix is merged into master and backport it as well. Here is the fix: https://patchwork.ozlabs.org/patch/979946/ Regards, -- Matteo Croce per aspera ad upstream ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
[ovs-dev] [PATCH] dpif-netlink: fix null pointer
In dpif_netlink_port_add__(), socksp could be NULL, because vport_socksp_to_pids() would allocate a new array and return a single zero element. Following vport_socksp_to_pids() removal, a NULL pointer can happen when dpif_netlink_port_add__() is called and dpif->handlers is 0. Restore the old behaviour of using a zero pid when dpif->handlers is 0.

Fixes: 69c51582f ("dpif-netlink: don't allocate per thread netlink sockets")
Reported-by: Flavio Leitner
Reported-by: Guru Shetty
Signed-off-by: Matteo Croce
---
 lib/dpif-netlink.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
index 21315033c..310bc947d 100644
--- a/lib/dpif-netlink.c
+++ b/lib/dpif-netlink.c
@@ -712,7 +712,7 @@ dpif_netlink_port_add__(struct dpif_netlink *dpif, const char *name,
     struct dpif_netlink_vport request, reply;
     struct ofpbuf *buf;
     struct nl_sock *socksp = NULL;
-    uint32_t upcall_pids;
+    uint32_t upcall_pids = 0;
     int error = 0;
 
     if (dpif->handlers) {
@@ -728,7 +728,8 @@ dpif_netlink_port_add__(struct dpif_netlink *dpif, const char *name,
     request.name = name;
 
     request.port_no = *port_nop;
-    upcall_pids = nl_sock_pid(socksp);
+    if (socksp)
+        upcall_pids = nl_sock_pid(socksp);
     request.n_upcall_pids = 1;
     request.upcall_pids = &upcall_pids;
-- 
2.17.1
___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] [PATCH v2] dpif-netlink: don't allocate per thread netlink sockets
On Fri, Oct 5, 2018 at 11:27 PM Guru Shetty wrote: > I get a segfault with this patch with the following backtrace. I have not > investigated. > > Program received signal SIGSEGV, Segmentation fault. > nl_sock_pid (sock=0x0) at lib/netlink-socket.c:1424 > 1424 return sock->pid; > (gdb) where > #0 nl_sock_pid (sock=0x0) at lib/netlink-socket.c:1424 > #1 0x0052fb79 in dpif_netlink_port_add__ (dpif=dpif@entry=0x89de90, > name=name@entry=0x7fffdf20 "genev_sys_6081", > type=type@entry=OVS_VPORT_TYPE_GENEVE, > options=0x7fffdee0, port_nop=0x7fffdf9c) at lib/dpif-netlink.c:731 > #2 0x0052fdc7 in dpif_netlink_port_add_compat > (port_nop=0x7fffdf9c, netdev=, dpif=0x89de90) at > lib/dpif-netlink.c:836 > #3 dpif_netlink_port_add (dpif_=0x89de90, netdev=, > port_nop=0x7fffdf9c) at lib/dpif-netlink.c:882 > #4 0x004778be in dpif_port_add (dpif=0x89de90, > netdev=netdev@entry=0x911830, port_nop=port_nop@entry=0x7fffdffc) at > lib/dpif.c:579 > #5 0x00426efe in port_add (ofproto_=0x8a97f0, netdev=0x911830) at > ofproto/ofproto-dpif.c:3689 > #6 0x0041d951 in ofproto_port_add (ofproto=0x8a97f0, > netdev=0x911830, ofp_portp=ofp_portp@entry=0x7fffe0d8) at > ofproto/ofproto.c:2008 > #7 0x0040c709 in iface_do_create (errp=0x7fffe0e8, > netdevp=0x7fffe0e0, ofp_portp=0x7fffe0d8, iface_cfg=0x8a0620, > br=0x8a8bb0) at vswitchd/bridge.c:1803 > #8 iface_create (port_cfg=0x8a4e60, iface_cfg=0x8a0620, br=0x8a8bb0) at > vswitchd/bridge.c:1841 > #9 bridge_add_ports__ (br=br@entry=0x8a8bb0, > wanted_ports=wanted_ports@entry=0x8a8c90, > with_requested_port=with_requested_port@entry=false) at vswitchd/bridge.c:935 > #10 0x0040e02f in bridge_add_ports (wanted_ports=0x8a8c90, > br=0x8a8bb0) at vswitchd/bridge.c:951 > #11 bridge_reconfigure (ovs_cfg=ovs_cfg@entry=0x8a7fc0) at > vswitchd/bridge.c:665 > #12 0x004114b9 in bridge_run () at vswitchd/bridge.c:3023 > #13 0x004080c5 in main (argc=2, argv=0x7fffe5e8) at > vswitchd/ovs-vswitchd.c:125 > > Hi Guru, I received a similar report on IRC and I have 
already investigated it. Can you try if the following patch fixes it?

Fixes: 69c51582f ("dpif-netlink: don't allocate per thread netlink sockets")
Signed-off-by: Matteo Croce
---
 lib/dpif-netlink.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
index 21315033cc..310bc947db 100644
--- a/lib/dpif-netlink.c
+++ b/lib/dpif-netlink.c
@@ -712,7 +712,7 @@ dpif_netlink_port_add__(struct dpif_netlink *dpif, const char *name,
     struct dpif_netlink_vport request, reply;
     struct ofpbuf *buf;
     struct nl_sock *socksp = NULL;
-    uint32_t upcall_pids;
+    uint32_t upcall_pids = 0;
     int error = 0;
 
     if (dpif->handlers) {
@@ -728,7 +728,8 @@ dpif_netlink_port_add__(struct dpif_netlink *dpif, const char *name,
     request.name = name;
 
     request.port_no = *port_nop;
-    upcall_pids = nl_sock_pid(socksp);
+    if (socksp)
+        upcall_pids = nl_sock_pid(socksp);
     request.n_upcall_pids = 1;
     request.upcall_pids = &upcall_pids;

-- Matteo Croce per aspera ad upstream ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] [PATCH v2] dpif-netlink: don't allocate per thread netlink sockets
On Fri, Oct 5, 2018 at 8:32 AM Markos Chandras wrote: > > On 25/09/2018 22:14, Ben Pfaff wrote: > > > > Applied to master thanks! > > > > I sent a patch to remove support for multiple queues in the > > infrastructure layer: > > https://patchwork.ozlabs.org/patch/974755/ > > ___ > > dev mailing list > > d...@openvswitch.org > > https://mail.openvswitch.org/mailman/listinfo/ovs-dev > > > > Hello Ben and Matteo, > > This patch applies cleanly to branch-2.10, branch-2.9, and branch-2.8. > Would it make sense to backport it to these as well? > > -- > markos > > SUSE LINUX GmbH | GF: Felix Imendörffer, Jane Smithard, Graham Norton > HRB 21284 (AG Nürnberg) Maxfeldstr. 5, D-90409, Nürnberg Hi Markos, the file descriptor limit issue was first reported on OVS 2.8, so it's worth it. If you backport it, please consider backporting the following commit from Ben, which removes some code that became unused after my patch: commit 769b50349f28c5f9e4bff102bc61dadcb9b99c37 Author: Ben Pfaff Date: Tue Sep 25 15:14:13 2018 -0700 dpif: Remove support for multiple queues per port. Regards, -- Matteo Croce per aspera ad upstream ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] [PATCH] dpif: Remove support for multiple queues per port.
On Wed, Sep 26, 2018 at 10:29 PM Yifeng Sun wrote: > > It looks good to me, and testing shows no problem. Thanks. > > Tested-by: Yifeng Sun > Reviewed-by: Yifeng Sun > Works fine here too. Regards, -- Matteo Croce per aspera ad upstream ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
[ovs-dev] [PATCH v2] dpif-netlink: don't allocate per thread netlink sockets
When using the kernel datapath, OVS allocates a pool of sockets to handle netlink events. The number of sockets is ports * n-handler-threads, where n-handler-threads is user configurable and defaults to 3/4 * number of cores. This is because vswitchd starts n-handler-threads threads, each one with a netlink socket for every port of the switch. Each thread then starts listening for events on its set of sockets with epoll(). On setups with many CPUs and ports, the number of sockets easily hits the process file descriptor limit, and ovs-vswitchd will exit with -EMFILE. Change the number of allocated sockets to just one per port by moving the socket array from a per-handler structure to a per-datapath one, and let all the handlers share the same sockets by using the EPOLLEXCLUSIVE epoll flag, which avoids duplicate events, on systems that support it. The patch was tested on a 56 core machine running Linux 4.18 and the latest Open vSwitch. A bridge was created with 2000+ ports, some of them being veth interfaces with the peer outside the bridge. The latency of the upcall is measured by setting a single 'action=controller,local' OpenFlow rule to force all the packets going to the slow path and then to the local port. A tool[1] injects some packets into the veth outside the bridge, and measures the delay until the packet is captured on the local port. The rx timestamp is taken from the socket ancillary data in the attribute SO_TIMESTAMPNS, to keep the scheduler delay out of the measured time. The first test measures the average latency for an upcall generated from a single port. To measure it, 100k packets, one every msec, are sent to a single port and the latencies are measured. The second test is meant to check latency fairness among ports, namely whether latency is equal between ports or some ports have lower priority. The previous test is repeated for every port, and the average of the average latencies and the standard deviation between averages are measured.
The third test measures responsiveness under load: heavy traffic is sent through all ports, and latency and packet loss are measured on a single idle port. The fourth test is all about fairness: heavy traffic is injected into all ports but one, and latency and packet loss are measured on the single idle port.

This is the test setup:

# nproc
56
# ovs-vsctl show |grep -c Port
2223
# ovs-ofctl dump-flows ovs_upc_br
cookie=0x0, duration=4.827s, table=0, n_packets=0, n_bytes=0, actions=CONTROLLER:65535,LOCAL
# uname -a
Linux fc28 4.18.7-200.fc28.x86_64 #1 SMP Mon Sep 10 15:44:45 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

And these are the results of the tests:

                                   Stock OVS           Patched
netlink sockets
in use by vswitchd
lsof -p $(pidof ovs-vswitchd) \
    |grep -c GENERIC               91187               2227

Test 1
one port latency
min/avg/max/mdev (us)              2.7/6.6/238.7/1.8   1.6/6.8/160.6/1.7

Test 2
all port avg latency/mdev (us)     6.51/0.97           6.86/0.17

Test 3
single port latency
under load avg/mdev (us)           7.5/5.9             3.8/4.8
packet loss                        95 %                62 %

Test 4
idle port latency
under load min/avg/max/mdev (us)   0.8/1.5/210.5/0.9   1.0/2.1/344.5/1.2
packet loss                        94 %                4 %

CPU and RAM usage seem not to be affected; the resource usage of vswitchd idle with 2000+ ports is unchanged:

# ps u $(pidof ovs-vswitchd)
USER       PID %CPU %MEM     VSZ    RSS TTY   STAT START TIME COMMAND
openvsw+  5430 54.3  0.3 4263964 510968 pts/1 RLl+  16:20 0:50 ovs-vswitchd

Additionally, to check whether vswitchd is thread safe with this patch, the following test was run for circa 48 hours: on a 56 core machine, a bridge with the kernel datapath was filled with 2200 dummy interfaces and 22 veths, then 22 traffic generators were run in parallel, piping traffic into the veth peers outside the bridge. To generate as many upcalls as possible, all packets were forced to the slowpath with an OpenFlow rule like 'action=controller,local' and packet size was set to 64 bytes.
Also, to avoid overflowing the FDB early and slowing down the upcall processing, generated mac addresses were restricted to a small interval. vswitchd ran without problems for 48+ hours, obviously with all the handler threads with almost 99% CPU usage. [1] https://github.com/teknoraver/network-tools/blob/master/weed.c Signed-off-by: Matteo Croce --- v1 -> v2: - define EPOLLEXCLUSIVE on systems with older kernel headers - explain the thread safety test in the commit message lib/dpif-netlink.c | 311 - 1 file changed, 82 insertions(+), 229 deletions(-) diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c index e6d5a6ec5..bb565ffee 100644 --- a/lib/dpif-netlink.c ++
Re: [ovs-dev] [PATCH] dpif-netlink: don't allocate per thread netlink sockets
On Fri, Sep 21, 2018 at 8:41 PM Ben Pfaff wrote: > > On Wed, Sep 19, 2018 at 02:47:03PM +0200, Matteo Croce wrote: > > When using the kernel datapath, OVS allocates a pool of sockets to handle > > netlink events. The number of sockets is: ports * n-handler-threads, where > > n-handler-threads is user configurable and defaults to 3/4*number of cores. > > > > This because vswitchd starts n-handler-threads threads, each one with a > > netlink socket for every port of the switch. Every thread then, starts > > listening on events on its set of sockets with epoll(). > > > > On setup with lot of CPUs and ports, the number of sockets easily hits > > the process file descriptor limit, and ovs-vswitchd will exit with -EMFILE. > > > > Change the number of allocated sockets to just one per port by moving > > the socket array from a per handler structure to a per datapath one, > > and let all the handlers share the same sockets by using EPOLLEXCLUSIVE > > epoll flag which avoids duplicate events, on systems that support it. > > Thanks a lot for working on this. OVS clearly uses too many fds in some > situations and we need to fix it. > > I noticed that the check for EPOLLEXCLUSIVE is only a build-time check, > using #ifdef. This will only be effective if the build system is where > OVS runs and only if the build system uses its own kernel headers for > the build. Usually, in OVS, we instead try to always enable a feature > at build time, e.g. through something like: > > #ifndef EPOLLEXCLUSIVE > #define EPOLLEXCLUSIVE (whatever) > #endif > > and then at runtime test whether the feature is actually available and > try to use it. > Hi Ben, Thanks for the review. Got it, will do it in a v2 BTW, one of my tests was to run the binary compiled against the latest kernel header on: - RHEL 7.2, which has a 3.10 kernel without EPOLLEXCLUSIVE backported (it was backported in 7.3) - Devuan Jessie, which has a 3.16 kernel. 
I ran strace on vswitchd and clearly saw this call to epoll_ctl() with an unknown flag: epoll_ctl(22, EPOLL_CTL_ADD, 23, {EPOLLIN|0x10000000, {u32=0, u64=0}}) = 0 which means that the call doesn't fail and the extra flag is ignored. > Is EPOLLEXCLUSIVE safe, that is, can we miss events or end up sleeping > before processing them? If it is safe, then what invariants does OVS > need to maintain to assure them? > What do you mean by safe? I've skimmed the kernel code: the flag just makes the kernel use add_wait_queue_exclusive() instead of add_wait_queue(), so the events raised (or possibly missed) should be the same, just notified to only one handler. > With this commit, the "hash" argument to dpif_port_get_pid() is never > used. The overall design of the function can be simplified. That could > be folded into this patch or go in a later patch. > Yes, I noticed it, but didn't want to mess with generic code, so I simply used OVS_UNUSED to suppress compiler warnings. Such a refactor can be done as a second step. > Does this patch maintain the same level of thread safety as before? > I thought a lot about it. Reading from a socket is thread safe, and I highly doubt that a netlink datagram can be split and received half and half by two handlers. Also, the code which adds or removes sockets is protected with OVS_REQ_WRLOCK(dpif->upcall_lock), so everything should be as safe as before. I ran this test: on a 56 core machine, I created a bridge with 2200 dummy interfaces and 22 veths, then started 22 mausezahn instances on every veth peer outside the bridge. All packets were forced to the slowpath with action=controller, and the whole test lasted 48 hours; no errors found so far. As always, four eyes are better than two: if you find something suspicious, I'll try to sort it out. Regards, -- Matteo Croce per aspera ad upstream ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
[ovs-dev] [PATCH] dpif-netlink: don't allocate per thread netlink sockets
When using the kernel datapath, OVS allocates a pool of sockets to handle netlink events. The number of sockets is ports * n-handler-threads, where n-handler-threads is user configurable and defaults to 3/4 * number of cores. This is because vswitchd starts n-handler-threads threads, each one with a netlink socket for every port of the switch. Each thread then starts listening for events on its set of sockets with epoll(). On setups with many CPUs and ports, the number of sockets easily hits the process file descriptor limit, and ovs-vswitchd will exit with -EMFILE. Change the number of allocated sockets to just one per port by moving the socket array from a per-handler structure to a per-datapath one, and let all the handlers share the same sockets by using the EPOLLEXCLUSIVE epoll flag, which avoids duplicate events, on systems that support it. The patch was tested on a 56 core machine running Linux 4.18 and the latest Open vSwitch. A bridge was created with 2000+ ports, some of them being veth interfaces with the peer outside the bridge. The latency of the upcall is measured by setting a single 'action=controller,local' OpenFlow rule to force all the packets going to the slow path and then to the local port. A tool[1] injects some packets into the veth outside the bridge, and measures the delay until the packet is captured on the local port. The rx timestamp is taken from the socket ancillary data in the attribute SO_TIMESTAMPNS, to keep the scheduler delay out of the measured time. The first test measures the average latency for an upcall generated from a single port. To measure it, 100k packets, one every msec, are sent to a single port and the latencies are measured. The second test is meant to check latency fairness among ports, namely whether latency is equal between ports or some ports have lower priority. The previous test is repeated for every port, and the average of the average latencies and the standard deviation between averages are measured.
The third test measures responsiveness under load: heavy traffic is sent through all ports, and latency and packet loss are measured on a single idle port. The fourth test is all about fairness: heavy traffic is injected into all ports but one, and latency and packet loss are measured on the single idle port.

This is the test setup:

# nproc
56
# ovs-vsctl show |grep -c Port
2223
# ovs-ofctl dump-flows ovs_upc_br
cookie=0x0, duration=4.827s, table=0, n_packets=0, n_bytes=0, actions=CONTROLLER:65535,LOCAL
# uname -a
Linux fc28 4.18.7-200.fc28.x86_64 #1 SMP Mon Sep 10 15:44:45 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

And these are the results of the tests:

                                   Stock OVS           Patched
netlink sockets
in use by vswitchd
lsof -p $(pidof ovs-vswitchd) \
    |grep -c GENERIC               91187               2227

Test 1
one port latency
min/avg/max/mdev (us)              2.7/6.6/238.7/1.8   1.6/6.8/160.6/1.7

Test 2
all port avg latency/mdev (us)     6.51/0.97           6.86/0.17

Test 3
single port latency
under load avg/mdev (us)           7.5/5.9             3.8/4.8
packet loss                        95 %                62 %

Test 4
idle port latency
under load min/avg/max/mdev (us)   0.8/1.5/210.5/0.9   1.0/2.1/344.5/1.2
packet loss                        94 %                4 %

CPU and RAM usage seem not to be affected; the resource usage of vswitchd idle with 2000+ ports is unchanged:

# ps u $(pidof ovs-vswitchd)
USER       PID %CPU %MEM     VSZ    RSS TTY   STAT START TIME COMMAND
openvsw+  5430 54.3  0.3 4263964 510968 pts/1 RLl+  16:20 0:50 ovs-vswitchd

[1] https://github.com/teknoraver/network-tools/blob/master/weed.c Signed-off-by: Matteo Croce --- lib/dpif-netlink.c | 308 - 1 file changed, 80 insertions(+), 228 deletions(-) diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c index e6d5a6ec5..79036bf0c 100644 --- a/lib/dpif-netlink.c +++ b/lib/dpif-netlink.c @@ -170,7 +170,6 @@ struct dpif_windows_vport_sock { #endif struct dpif_handler { -struct dpif_channel *channels;/* Array of channels for each handler. */ struct epoll_event *epoll_events; int epoll_fd; /* epoll fd that includes channel socks. */ int n_events; /* Num events returned by epoll_wait().
*/ @@ -193,6 +192,7 @@ struct dpif_netlink { struct fat_rwlock upcall_lock; struct dpif_handler *handlers; uint32_t n_handlers; /* Num of upcall handlers. */ +struct dpif_channel *channels; /* Array of channels for each port. */ int uc_array_size; /* Size of 'handler->channels' and */ /* 'handler->epoll_events'. */ @@ -331,43 +331,6 @@ open_dpif(const struct dpif_netlink_dp *dp, struct dpif *
Re: [ovs-dev] [PATCH RFC net-next] openvswitch: Queue upcalls to userspace in per-port round-robin order
On Mon, Jul 16, 2018 at 4:54 PM Matteo Croce wrote: > > On Tue, Jul 10, 2018 at 6:31 PM Pravin Shelar wrote: > > > > On Wed, Jul 4, 2018 at 7:23 AM, Matteo Croce wrote: > > > From: Stefano Brivio > > > > > > Open vSwitch sends to userspace all received packets that have > > > no associated flow (thus doing an "upcall"). Then the userspace > > > program creates a new flow and determines the actions to apply > > > based on its configuration. > > > > > > When a single port generates a high rate of upcalls, it can > > > prevent other ports from dispatching their own upcalls. vswitchd > > > overcomes this problem by creating many netlink sockets for each > > > port, but it quickly exceeds any reasonable maximum number of > > > open files when dealing with huge amounts of ports. > > > > > > This patch queues all the upcalls into a list, ordering them in > > > a per-port round-robin fashion, and schedules a deferred work to > > > queue them to userspace. > > > > > > The algorithm to queue upcalls in a round-robin fashion, > > > provided by Stefano, is based on these two rules: > > > - upcalls for a given port must be inserted after all the other > > >occurrences of upcalls for the same port already in the queue, > > >in order to avoid out-of-order upcalls for a given port > > > - insertion happens once the highest upcall count for any given > > >port (excluding the one currently at hand) is greater than the > > >count for the port we're queuing to -- if this condition is > > >never true, upcall is queued at the tail. This results in a > > >per-port round-robin order. > > > > > > In order to implement a fair round-robin behaviour, a variable > > > queueing delay is introduced. This will be zero if the upcalls > > > rate is below a given threshold, and grows linearly with the > > > queue utilisation (i.e. upcalls rate) otherwise. > > > > > > This ensures fairness among ports under load and with few > > > netlink sockets. > > > > > Thanks for the patch. 
> > This patch adds the following overhead to upcall handling: > > 1. kmalloc. > > 2. global spin-lock. > > 3. context switch to a single worker thread. > > I think this could become a bottleneck on most multi-core systems. > > You have mentioned issues with the existing fairness mechanism. Can you > > elaborate on those? I think we could improve that before implementing > > heavyweight fairness in upcall handling. > > Hi Pravin, > > vswitchd allocates N * P netlink sockets, where N is the number of > online CPU cores and P the number of ports. > With some setups this number grows quite fast, even exceeding the > system's maximum file descriptor limit. > I've seen a 48-core server fail with -EMFILE when trying to create the > more than 65535 netlink sockets needed to handle 1800+ ports. > > I made a previous attempt to reduce the sockets to one per CPU, but > it was discussed and rejected on ovs-dev because it would remove > fairness among ports[1]. > I think the current approach of opening a huge number of sockets > doesn't really work (it certainly doesn't scale): some queueing > logic (either in kernel or user space) is still needed if we really > want to be sure that low-traffic ports get their upcall quota when > other ports are doing way more traffic. > > If you are concerned about the kmalloc or spinlock, we can address them > with kmem_cache or two copies of the list and RCU. I'll be happy to > discuss the implementation details, as long as we all agree that the > current implementation doesn't scale well and has an issue. > > [1] https://mail.openvswitch.org/pipermail/ovs-dev/2018-February/344279.html > > -- > Matteo Croce > per aspera ad upstream Hi all, any idea on how to solve the file descriptor limit hit by the netlink sockets? I see this issue happen very often, and raising the FD limit to 400k doesn't seem like the right way to solve it. Any other suggestions on how to improve the patch, or solve the problem in a different way? 
Regards, -- Matteo Croce per aspera ad upstream ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] [PATCH RFC net-next] openvswitch: Queue upcalls to userspace in per-port round-robin order
On Tue, Jul 10, 2018 at 6:31 PM Pravin Shelar wrote: > > On Wed, Jul 4, 2018 at 7:23 AM, Matteo Croce wrote: > > From: Stefano Brivio > > > > Open vSwitch sends to userspace all received packets that have > > no associated flow (thus doing an "upcall"). Then the userspace > > program creates a new flow and determines the actions to apply > > based on its configuration. > > > > When a single port generates a high rate of upcalls, it can > > prevent other ports from dispatching their own upcalls. vswitchd > > overcomes this problem by creating many netlink sockets for each > > port, but it quickly exceeds any reasonable maximum number of > > open files when dealing with huge amounts of ports. > > > > This patch queues all the upcalls into a list, ordering them in > > a per-port round-robin fashion, and schedules a deferred work to > > queue them to userspace. > > > > The algorithm to queue upcalls in a round-robin fashion, > > provided by Stefano, is based on these two rules: > > - upcalls for a given port must be inserted after all the other > >occurrences of upcalls for the same port already in the queue, > >in order to avoid out-of-order upcalls for a given port > > - insertion happens once the highest upcall count for any given > >port (excluding the one currently at hand) is greater than the > >count for the port we're queuing to -- if this condition is > >never true, upcall is queued at the tail. This results in a > >per-port round-robin order. > > > > In order to implement a fair round-robin behaviour, a variable > > queueing delay is introduced. This will be zero if the upcalls > > rate is below a given threshold, and grows linearly with the > > queue utilisation (i.e. upcalls rate) otherwise. > > > > This ensures fairness among ports under load and with few > > netlink sockets. > > > Thanks for the patch. > This patch is adding following overhead for upcall handling: > 1. kmalloc. > 2. global spin-lock. > 3. context switch to single worker thread. 
> I think this could become a bottleneck on most multi-core systems. > You have mentioned issues with the existing fairness mechanism. Can you > elaborate on those? I think we could improve that before implementing > heavyweight fairness in upcall handling. Hi Pravin, vswitchd allocates N * P netlink sockets, where N is the number of online CPU cores and P the number of ports. With some setups this number grows quite fast, even exceeding the system's maximum file descriptor limit. I've seen a 48-core server fail with -EMFILE when trying to create the more than 65535 netlink sockets needed to handle 1800+ ports. I made a previous attempt to reduce the sockets to one per CPU, but it was discussed and rejected on ovs-dev because it would remove fairness among ports[1]. I think the current approach of opening a huge number of sockets doesn't really work (it certainly doesn't scale): some queueing logic (either in kernel or user space) is still needed if we really want to be sure that low-traffic ports get their upcall quota when other ports are doing way more traffic. If you are concerned about the kmalloc or spinlock, we can address them with kmem_cache or two copies of the list and RCU. I'll be happy to discuss the implementation details, as long as we all agree that the current implementation doesn't scale well and has an issue. [1] https://mail.openvswitch.org/pipermail/ovs-dev/2018-February/344279.html -- Matteo Croce per aspera ad upstream ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
[ovs-dev] [PATCH RFC net-next] openvswitch: Queue upcalls to userspace in per-port round-robin order
From: Stefano Brivio Open vSwitch sends to userspace all received packets that have no associated flow (thus doing an "upcall"). Then the userspace program creates a new flow and determines the actions to apply based on its configuration. When a single port generates a high rate of upcalls, it can prevent other ports from dispatching their own upcalls. vswitchd overcomes this problem by creating many netlink sockets for each port, but it quickly exceeds any reasonable maximum number of open files when dealing with huge amounts of ports. This patch queues all the upcalls into a list, ordering them in a per-port round-robin fashion, and schedules a deferred work to queue them to userspace. The algorithm to queue upcalls in a round-robin fashion, provided by Stefano, is based on these two rules: - upcalls for a given port must be inserted after all the other occurrences of upcalls for the same port already in the queue, in order to avoid out-of-order upcalls for a given port - insertion happens once the highest upcall count for any given port (excluding the one currently at hand) is greater than the count for the port we're queuing to -- if this condition is never true, upcall is queued at the tail. This results in a per-port round-robin order. In order to implement a fair round-robin behaviour, a variable queueing delay is introduced. This will be zero if the upcalls rate is below a given threshold, and grows linearly with the queue utilisation (i.e. upcalls rate) otherwise. This ensures fairness among ports under load and with few netlink sockets. 
Signed-off-by: Matteo Croce Co-authored-by: Stefano Brivio --- net/openvswitch/datapath.c | 143 ++--- net/openvswitch/datapath.h | 27 ++- 2 files changed, 161 insertions(+), 9 deletions(-) diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c index 0f5ce77460d4..2cfd504562d8 100644 --- a/net/openvswitch/datapath.c +++ b/net/openvswitch/datapath.c @@ -59,6 +59,10 @@ #include "vport-internal_dev.h" #include "vport-netdev.h" +#define UPCALL_QUEUE_TIMEOUT msecs_to_jiffies(10) +#define UPCALL_QUEUE_MAX_DELAY msecs_to_jiffies(10) +#define UPCALL_QUEUE_MAX_LEN 200 + unsigned int ovs_net_id __read_mostly; static struct genl_family dp_packet_genl_family; @@ -225,6 +229,116 @@ void ovs_dp_detach_port(struct vport *p) ovs_vport_del(p); } +static void ovs_dp_upcall_dequeue(struct work_struct *work) +{ + struct datapath *dp = container_of(work, struct datapath, + upcalls.work.work); + struct dp_upcall_info *u, *n; + + spin_lock_bh(>upcalls.lock); + list_for_each_entry_safe(u, n, >upcalls.list, list) { + if (unlikely(ovs_dp_upcall(dp, u->skb, >key, u, 0))) + kfree_skb(u->skb); + else + consume_skb(u->skb); + kfree(u); + } + dp->upcalls.len = 0; + INIT_LIST_HEAD(>upcalls.list); + spin_unlock_bh(>upcalls.lock); +} + +/* Calculate the delay of the deferred work which sends the upcalls. If it ran + * more than UPCALL_QUEUE_TIMEOUT ago, schedule the work immediately. Otherwise + * return a time between 0 and UPCALL_QUEUE_MAX_DELAY, depending linearly on the + * queue utilisation. 
+ */ +static unsigned long ovs_dp_upcall_delay(int queue_len, unsigned long last_run) +{ + if (jiffies - last_run >= UPCALL_QUEUE_TIMEOUT) + return 0; + + return UPCALL_QUEUE_MAX_DELAY - + UPCALL_QUEUE_MAX_DELAY * queue_len / UPCALL_QUEUE_MAX_LEN; +} + +static int ovs_dp_upcall_queue_roundrobin(struct datapath *dp, + struct dp_upcall_info *upcall) +{ + struct list_head *head = >upcalls.list; + struct dp_upcall_info *here = NULL, *pos; + bool find_next = true; + unsigned long delay; + int err = 0; + u8 count; + + spin_lock_bh(>upcalls.lock); + if (dp->upcalls.len > UPCALL_QUEUE_MAX_LEN) { + err = -ENOSPC; + goto out; + } + + /* Insert upcalls in the list in a per-port round-robin fashion, look +* for insertion point: +* - to avoid out-of-order per-port upcalls, we can insert only after +* the last occurrence of upcalls for the same port +* - insert upcall only after we reach a count of occurrences for a +* given port greater than the one we're inserting this upcall for +*/ + list_for_each_entry(pos, head, list) { + /* Count per-port upcalls. */ + if (dp->upcalls.count[pos->port_no] == U8_MAX - 1) { + err = -ENOSPC; + goto out_clear; + } + dp->upcalls.count[pos->port_no]++; + + if (pos->port_no == upcall-
[ovs-dev] [PATCH] docs: fix the documentation of n-handler-threads and n-revalidator-threads
Documentation for both n-handler-threads and n-revalidator-threads was wrong, the spawned threads are per port instead of per datapath. Update the docs according to vswitchd/ovs-vswitchd.8.in and code behaviour. Signed-off-by: Matteo Croce <mcr...@redhat.com> --- vswitchd/vswitch.xml | 16 +--- 1 file changed, 5 insertions(+), 11 deletions(-) diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml index 0c6a43d60..4b3d7610a 100644 --- a/vswitchd/vswitch.xml +++ b/vswitchd/vswitch.xml @@ -379,15 +379,12 @@ type='{"type": "integer", "minInteger": 1}'> Specifies the number of threads for software datapaths to use for - handling new flows. The default the number of online CPU cores minus + handling new flows. The default is the number of online CPU cores minus the number of revalidators. - This configuration is per datapath. If you have more than one - software datapath (e.g. some system bridges and some - netdev bridges), then the total number of threads is - n-handler-threads times the number of software - datapaths. + This configuration is per port, the total number of threads is + n-handler-threads times the number of ports. @@ -403,11 +400,8 @@ of handler threads. - This configuration is per datapath. If you have more than one - software datapath (e.g. some system bridges and some - netdev bridges), then the total number of threads is - n-handler-threads times the number of software - datapaths. + This configuration is per port: the total number of threads is + n-revalidator-threads times the number of ports. -- 2.14.3 ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] [PATCH v4] vswitchd: show DPDK version
On Mon, Jan 15, 2018 at 7:21 PM, Matteo Croce <mcr...@redhat.com> wrote: > Show DPDK version if Open vSwitch is compiled with DPDK support. > Version can be retrieved with `ovs-switchd --version` or from OVS logs. > Small change in ovs-ctl to avoid breakage on output change. > > Signed-off-by: Matteo Croce <mcr...@redhat.com> > --- > v3: > print version in OVS logs too > don't save the version in the DB > use dpdk-stub.c instead of #ifdef > > v4: > fix only patch subject, I left ovs-vsctl from the older v1/v2, sorry > > lib/dpdk-stub.c | 5 + > lib/dpdk.c | 8 > lib/dpdk.h | 1 + > utilities/ovs-ctl.in| 2 +- > vswitchd/ovs-vswitchd.c | 1 + > 5 files changed, 16 insertions(+), 1 deletion(-) > > diff --git a/lib/dpdk-stub.c b/lib/dpdk-stub.c > index 36021807c..041cd0cbb 100644 > --- a/lib/dpdk-stub.c > +++ b/lib/dpdk-stub.c > @@ -54,3 +54,8 @@ dpdk_vhost_iommu_enabled(void) > { > return false; > } > + > +void > +print_dpdk_version(void) > +{ > +} > diff --git a/lib/dpdk.c b/lib/dpdk.c > index 6710d10fc..3f5a55fc1 100644 > --- a/lib/dpdk.c > +++ b/lib/dpdk.c > @@ -24,6 +24,7 @@ > > #include > #include > +#include > #ifdef DPDK_PDUMP > #include > #include > @@ -471,6 +472,7 @@ dpdk_init(const struct smap *ovs_other_config) > static struct ovsthread_once once_enable = > OVSTHREAD_ONCE_INITIALIZER; > > if (ovsthread_once_start(_enable)) { > +VLOG_INFO("Using %s", rte_version()); > VLOG_INFO("DPDK Enabled - initializing..."); > dpdk_init__(ovs_other_config); > enabled = true; > @@ -501,3 +503,9 @@ dpdk_set_lcore_id(unsigned cpu) > ovs_assert(cpu != NON_PMD_CORE_ID); > RTE_PER_LCORE(_lcore_id) = cpu; > } > + > +void > +print_dpdk_version(void) > +{ > +puts(rte_version()); > +} > diff --git a/lib/dpdk.h b/lib/dpdk.h > index dc58d968a..b04153591 100644 > --- a/lib/dpdk.h > +++ b/lib/dpdk.h > @@ -38,5 +38,6 @@ void dpdk_init(const struct smap *ovs_other_config); > void dpdk_set_lcore_id(unsigned cpu); > const char *dpdk_get_vhost_sock_dir(void); > bool 
dpdk_vhost_iommu_enabled(void); > +void print_dpdk_version(void); > > #endif /* dpdk.h */ > diff --git a/utilities/ovs-ctl.in b/utilities/ovs-ctl.in > index 1df56c4a5..ef06dd967 100755 > --- a/utilities/ovs-ctl.in > +++ b/utilities/ovs-ctl.in > @@ -72,7 +72,7 @@ set_hostname () { > set_system_ids () { > set ovs_vsctl set Open_vSwitch . > > -OVS_VERSION=`ovs-vswitchd --version | sed 's/.*) //;1q'` > +OVS_VERSION=`ovs-vswitchd --version | awk '/Open vSwitch/{print $NF}'` > set "$@" ovs-version="$OVS_VERSION" > > case $SYSTEM_ID in > diff --git a/vswitchd/ovs-vswitchd.c b/vswitchd/ovs-vswitchd.c > index d5e07c037..a5e8f4d39 100644 > --- a/vswitchd/ovs-vswitchd.c > +++ b/vswitchd/ovs-vswitchd.c > @@ -187,6 +187,7 @@ parse_options(int argc, char *argv[], char > **unixctl_pathp) > > case 'V': > ovs_print_version(0, 0); > +print_dpdk_version(); > exit(EXIT_SUCCESS); > > case OPT_MLOCKALL: > -- > 2.14.3 > If you consider applying it, please fix the typo in the commit message: ovs-vswitchd instead of ovs-switchd Thanks -- Matteo Croce per aspera ad upstream ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
[ovs-dev] [RFC] dpif-netlink: don't allocate per port netlink sockets
When using the kernel datapath, OVS allocates a pool of sockets to handle netlink events. The number of sockets is ports * n-handler-threads; n-handler-threads is user configurable but defaults to the number of cores. On machines with many CPUs and ports, the number of sockets easily hits the process file descriptor limit, which is 65536, causing ovs-vswitchd to abort. Change the number of allocated sockets to just n-handler-threads, which, if set slightly greater than the number of cores, is enough to handle the netlink events. Replace the struct dpif_channel array with a single dpif_channel instance and adjust the code accordingly. By doing so, the port index information is lost in the upcall event code, so the call to report_loss() must be commented out, as it already is in the Windows code path. Signed-off-by: Matteo Croce <mcr...@redhat.com> --- lib/dpif-netlink.c | 167 ++--- 1 file changed, 55 insertions(+), 112 deletions(-) diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c index f8d75eb45..c13d503f8 100644 --- a/lib/dpif-netlink.c +++ b/lib/dpif-netlink.c @@ -167,11 +167,10 @@ struct dpif_windows_vport_sock { #endif struct dpif_handler { -struct dpif_channel *channels;/* Array of channels for each handler. */ -struct epoll_event *epoll_events; +struct dpif_channel channels; +struct epoll_event epoll_events; int epoll_fd; /* epoll fd that includes channel socks. */ int n_events; /* Num events returned by epoll_wait(). */ -int event_offset; /* Offset into 'epoll_events'. */ #ifdef _WIN32 /* Pool of sockets. */ @@ -190,16 +189,16 @@ struct dpif_netlink { struct fat_rwlock upcall_lock; struct dpif_handler *handlers; uint32_t n_handlers; /* Num of upcall handlers. */ -int uc_array_size; /* Size of 'handler->channels' and */ - /* 'handler->epoll_events'. */ /* Change notification. */ struct nl_sock *port_notifier; /* vport multicast group subscriber. 
*/ bool refresh_channels; }; +/* static void report_loss(struct dpif_netlink *, struct dpif_channel *, uint32_t ch_idx, uint32_t handler_id); +*/ static struct vlog_rate_limit error_rl = VLOG_RATE_LIMIT_INIT(, 5); @@ -508,8 +507,7 @@ vport_socksp_to_pids(struct nl_sock **socksp, uint32_t n_socks) /* Given the port number 'port_idx', extracts the pids of netlink sockets * associated to the port and assigns it to 'upcall_pids'. */ static bool -vport_get_pids(struct dpif_netlink *dpif, uint32_t port_idx, - uint32_t **upcall_pids) +vport_get_pids(struct dpif_netlink *dpif, uint32_t **upcall_pids) { uint32_t *pids; size_t i; @@ -517,7 +515,7 @@ vport_get_pids(struct dpif_netlink *dpif, uint32_t port_idx, /* Since the nl_sock can only be assigned in either all * or none "dpif->handlers" channels, the following check * would suffice. */ -if (!dpif->handlers[0].channels[port_idx].sock) { +if (!dpif->handlers[0].channels.sock) { return false; } ovs_assert(!WINDOWS || dpif->n_handlers <= 1); @@ -525,7 +523,7 @@ vport_get_pids(struct dpif_netlink *dpif, uint32_t port_idx, pids = xzalloc(dpif->n_handlers * sizeof *pids); for (i = 0; i < dpif->n_handlers; i++) { -pids[i] = nl_sock_pid(dpif->handlers[i].channels[port_idx].sock); +pids[i] = nl_sock_pid(dpif->handlers[i].channels.sock); } *upcall_pids = pids; @@ -542,37 +540,17 @@ vport_add_channels(struct dpif_netlink *dpif, odp_port_t port_no, size_t i, j; int error; -if (dpif->handlers == NULL) { +/* Since the sock can only be assigned in either all or none + * of "dpif->handlers" channels, the following check would + * suffice. */ +if (dpif->handlers == NULL || dpif->handlers[0].channels.sock) { return 0; } -/* We assume that the datapath densely chooses port numbers, which can - * therefore be used as an index into 'channels' and 'epoll_events' of - * 'dpif->handler'. 
*/ -if (port_idx >= dpif->uc_array_size) { -uint32_t new_size = port_idx + 1; - -if (new_size > MAX_PORTS) { -VLOG_WARN_RL(_rl, "%s: datapath port %"PRIu32" too big", - dpif_name(>dpif), port_no); -return EFBIG; -} - -for (i = 0; i < dpif->n_handlers; i++) { -struct dpif_handler *handler = >handlers[i]; - -handler->channels = xrealloc(handler->channels, - new_size * sizeof *handler->channels); - -for (j = dpif->uc_array_size; j < new_size; j++) { -handler->channels[j].sock = NULL; -} - -
[ovs-dev] [PATCH v4] vswitchd: show DPDK version
Show DPDK version if Open vSwitch is compiled with DPDK support. Version can be retrieved with `ovs-switchd --version` or from OVS logs. Small change in ovs-ctl to avoid breakage on output change. Signed-off-by: Matteo Croce <mcr...@redhat.com> --- v3: print version in OVS logs too don't save the version in the DB use dpdk-stub.c instead of #ifdef v4: fix only patch subject, I left ovs-vsctl from the older v1/v2, sorry lib/dpdk-stub.c | 5 + lib/dpdk.c | 8 lib/dpdk.h | 1 + utilities/ovs-ctl.in| 2 +- vswitchd/ovs-vswitchd.c | 1 + 5 files changed, 16 insertions(+), 1 deletion(-) diff --git a/lib/dpdk-stub.c b/lib/dpdk-stub.c index 36021807c..041cd0cbb 100644 --- a/lib/dpdk-stub.c +++ b/lib/dpdk-stub.c @@ -54,3 +54,8 @@ dpdk_vhost_iommu_enabled(void) { return false; } + +void +print_dpdk_version(void) +{ +} diff --git a/lib/dpdk.c b/lib/dpdk.c index 6710d10fc..3f5a55fc1 100644 --- a/lib/dpdk.c +++ b/lib/dpdk.c @@ -24,6 +24,7 @@ #include #include +#include #ifdef DPDK_PDUMP #include #include @@ -471,6 +472,7 @@ dpdk_init(const struct smap *ovs_other_config) static struct ovsthread_once once_enable = OVSTHREAD_ONCE_INITIALIZER; if (ovsthread_once_start(_enable)) { +VLOG_INFO("Using %s", rte_version()); VLOG_INFO("DPDK Enabled - initializing..."); dpdk_init__(ovs_other_config); enabled = true; @@ -501,3 +503,9 @@ dpdk_set_lcore_id(unsigned cpu) ovs_assert(cpu != NON_PMD_CORE_ID); RTE_PER_LCORE(_lcore_id) = cpu; } + +void +print_dpdk_version(void) +{ +puts(rte_version()); +} diff --git a/lib/dpdk.h b/lib/dpdk.h index dc58d968a..b04153591 100644 --- a/lib/dpdk.h +++ b/lib/dpdk.h @@ -38,5 +38,6 @@ void dpdk_init(const struct smap *ovs_other_config); void dpdk_set_lcore_id(unsigned cpu); const char *dpdk_get_vhost_sock_dir(void); bool dpdk_vhost_iommu_enabled(void); +void print_dpdk_version(void); #endif /* dpdk.h */ diff --git a/utilities/ovs-ctl.in b/utilities/ovs-ctl.in index 1df56c4a5..ef06dd967 100755 --- a/utilities/ovs-ctl.in +++ b/utilities/ovs-ctl.in @@ -72,7 +72,7 
@@ set_hostname () { set_system_ids () { set ovs_vsctl set Open_vSwitch . -OVS_VERSION=`ovs-vswitchd --version | sed 's/.*) //;1q'` +OVS_VERSION=`ovs-vswitchd --version | awk '/Open vSwitch/{print $NF}'` set "$@" ovs-version="$OVS_VERSION" case $SYSTEM_ID in diff --git a/vswitchd/ovs-vswitchd.c b/vswitchd/ovs-vswitchd.c index d5e07c037..a5e8f4d39 100644 --- a/vswitchd/ovs-vswitchd.c +++ b/vswitchd/ovs-vswitchd.c @@ -187,6 +187,7 @@ parse_options(int argc, char *argv[], char **unixctl_pathp) case 'V': ovs_print_version(0, 0); +print_dpdk_version(); exit(EXIT_SUCCESS); case OPT_MLOCKALL: -- 2.14.3 ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
[ovs-dev] [PATCH v3] ovs-vsctl: show DPDK version
Show DPDK version if Open vSwitch is compiled with DPDK support. Version can be retrieved with `ovs-switchd --version` or from OVS logs. Small change in ovs-ctl to avoid breakage on output change. Signed-off-by: Matteo Croce <mcr...@redhat.com> --- v3: print version in OVS logs too don't save the version in the DB use dpdk-stub.c instead of #ifdef lib/dpdk-stub.c | 5 + lib/dpdk.c | 8 lib/dpdk.h | 1 + utilities/ovs-ctl.in| 2 +- vswitchd/ovs-vswitchd.c | 1 + 5 files changed, 16 insertions(+), 1 deletion(-) diff --git a/lib/dpdk-stub.c b/lib/dpdk-stub.c index 36021807c..041cd0cbb 100644 --- a/lib/dpdk-stub.c +++ b/lib/dpdk-stub.c @@ -54,3 +54,8 @@ dpdk_vhost_iommu_enabled(void) { return false; } + +void +print_dpdk_version(void) +{ +} diff --git a/lib/dpdk.c b/lib/dpdk.c index 6710d10fc..3f5a55fc1 100644 --- a/lib/dpdk.c +++ b/lib/dpdk.c @@ -24,6 +24,7 @@ #include #include +#include #ifdef DPDK_PDUMP #include #include @@ -471,6 +472,7 @@ dpdk_init(const struct smap *ovs_other_config) static struct ovsthread_once once_enable = OVSTHREAD_ONCE_INITIALIZER; if (ovsthread_once_start(_enable)) { +VLOG_INFO("Using %s", rte_version()); VLOG_INFO("DPDK Enabled - initializing..."); dpdk_init__(ovs_other_config); enabled = true; @@ -501,3 +503,9 @@ dpdk_set_lcore_id(unsigned cpu) ovs_assert(cpu != NON_PMD_CORE_ID); RTE_PER_LCORE(_lcore_id) = cpu; } + +void +print_dpdk_version(void) +{ +puts(rte_version()); +} diff --git a/lib/dpdk.h b/lib/dpdk.h index dc58d968a..b04153591 100644 --- a/lib/dpdk.h +++ b/lib/dpdk.h @@ -38,5 +38,6 @@ void dpdk_init(const struct smap *ovs_other_config); void dpdk_set_lcore_id(unsigned cpu); const char *dpdk_get_vhost_sock_dir(void); bool dpdk_vhost_iommu_enabled(void); +void print_dpdk_version(void); #endif /* dpdk.h */ diff --git a/utilities/ovs-ctl.in b/utilities/ovs-ctl.in index 1df56c4a5..ef06dd967 100755 --- a/utilities/ovs-ctl.in +++ b/utilities/ovs-ctl.in @@ -72,7 +72,7 @@ set_hostname () { set_system_ids () { set ovs_vsctl set Open_vSwitch . 
-OVS_VERSION=`ovs-vswitchd --version | sed 's/.*) //;1q'` +OVS_VERSION=`ovs-vswitchd --version | awk '/Open vSwitch/{print $NF}'` set "$@" ovs-version="$OVS_VERSION" case $SYSTEM_ID in diff --git a/vswitchd/ovs-vswitchd.c b/vswitchd/ovs-vswitchd.c index d5e07c037..a5e8f4d39 100644 --- a/vswitchd/ovs-vswitchd.c +++ b/vswitchd/ovs-vswitchd.c @@ -187,6 +187,7 @@ parse_options(int argc, char *argv[], char **unixctl_pathp) case 'V': ovs_print_version(0, 0); +print_dpdk_version(); exit(EXIT_SUCCESS); case OPT_MLOCKALL: -- 2.14.3 ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] [PATCH v3] ovs-vsctl: show DPDK version
On Thu, Jan 4, 2018 at 11:50 AM, Vishal Deep Ajmera <vishal.deep.ajm...@ericsson.com> wrote: > Hi Matteo, > > Can we also have the DPDK version in the logs (VLOG_INFO) when DPDK is > initialized? Maybe in the dpdk_init() function. This will help in identifying > the correct DPDK version from the log files while debugging issues. > > Warm Regards, > Vishal Ajmera > Sure, makes sense. Thanks, -- Matteo Croce per aspera ad upstream ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] [PATCH v3] ovs-vsctl: show DPDK version
On Tue, Jan 2, 2018 at 6:08 PM, Aaron Conole <acon...@redhat.com> wrote: > Hi Matteo, > > Matteo Croce <mcr...@redhat.com> writes: > >> Show DPDK version if Open vSwitch is compiled with DPDK support. >> Add a `dpdk_version` column in the datamodel, change ovs-ctl and ovs-dev.py >> to populate the field with the right value. >> >> Signed-off-by: Matteo Croce <mcr...@redhat.com> >> --- > > Sorry I missed some of the development for this - I think it's better to > have the dpdk version retrieved via an appctl command. I can't think of > a need for this information to be persistent - if it's for historical > information when debugging, we can put it in with the logs when dpdk is > started. The user cannot influence DPDK version dynamically - it is not > just read-only, but it's read-only from a compile time decision. I > might be misunderstanding the point of putting this in the DB, though? Hi Aaron, I was treating dpdk_version the same way as ovs_version. ovs_version is saved into the DB by ovs-ctl and it isn't read-only either. What about putting it only in `ovs-vswitchd --version` output and not into the DB? -- Matteo Croce per aspera ad upstream ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] [PATCH v3] ovs-vsctl: show DPDK version
On Tue, Dec 26, 2017 at 6:38 PM, Ben Pfaff <b...@ovn.org> wrote: > On Mon, Dec 25, 2017 at 03:31:02PM +0100, Matteo Croce wrote: >> Show DPDK version if Open vSwitch is compiled with DPDK support. >> Add a `dpdk_version` column in the datamodel, change ovs-ctl and ovs-dev.py >> to populate the field with the right value. >> >> Signed-off-by: Matteo Croce <mcr...@redhat.com> > > OK, this is better, but why does ovs-vswitchd need a shell helper for > this? It should just set the key itself. On ovs-vswitchd startup? I think I set dpdk_version the same way ovs_version was set. I can do it in ovs-vswitchd startup but in that case ovs_version should be set in ovs-vswitchd too. Cheers, -- Matteo Croce per aspera ad upstream ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
[ovs-dev] [PATCH v3] ovs-vsctl: show DPDK version
Show DPDK version if Open vSwitch is compiled with DPDK support.
Add a `dpdk_version` column in the datamodel, change ovs-ctl and
ovs-dev.py to populate the field with the right value.

Signed-off-by: Matteo Croce <mcr...@redhat.com>
---
 utilities/ovs-ctl.in       | 9 -
 utilities/ovs-dev.py       | 7 +++
 vswitchd/ovs-vswitchd.c    | 7 +++
 vswitchd/vswitch.ovsschema | 7 +--
 vswitchd/vswitch.xml       | 4
 5 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/utilities/ovs-ctl.in b/utilities/ovs-ctl.in
index f1b01d1d3..1e334de2b 100755
--- a/utilities/ovs-ctl.in
+++ b/utilities/ovs-ctl.in
@@ -76,9 +76,16 @@ set_hostname () {
 set_system_ids () {
     set ovs_vsctl set Open_vSwitch .
-    OVS_VERSION=`ovs-vswitchd --version | sed 's/.*) //;1q'`
+    OVS_VERSION=`ovs-vswitchd --version | awk '/Open vSwitch/{print $NF}'`
     set "$@" ovs-version="$OVS_VERSION"
+    DPDK_VERSION=`ovs-vswitchd --version | awk '/^DPDK/{print $2}'`
+    if [ -n "$DPDK_VERSION" ]; then
+        set "$@" dpdk-version="$DPDK_VERSION"
+    else
+        ovs_vsctl clear Open_vSwitch . dpdk_version
+    fi
+
     case $SYSTEM_ID in
         random)
             id_file=$etcdir/system-id.conf

diff --git a/utilities/ovs-dev.py b/utilities/ovs-dev.py
index 9ce0f04c7..70170e21d 100755
--- a/utilities/ovs-dev.py
+++ b/utilities/ovs-dev.py
@@ -284,9 +284,16 @@ def run():
             "other_config:dpdk-init=true" % root_uuid)
         _sh("ovs-vsctl --no-wait set Open_vSwitch %s other_config:"
             "dpdk-extra=\"%s\"" % (root_uuid, ' '.join(options.dpdk)))
+        dpdk_version = _sh("ovs-vswitchd --version", capture=True)
+        dpdk_version = [s for s in dpdk_version if "DPDK" in s]
+        if dpdk_version:
+            dpdk_version = dpdk_version[0].decode().strip().split()[1]
+            _sh("ovs-vsctl --no-wait set Open_vSwitch %s dpdk_version=%s"
+                % (root_uuid, dpdk_version))
     else:
         _sh("ovs-vsctl --no-wait set Open_vSwitch %s "
             "other_config:dpdk-init=false" % root_uuid)
+        _sh("ovs-vsctl --no-wait clear Open_vSwitch %s dpdk_version" % root_uuid)

     if options.gdb:
         cmd = ["gdb", "--args"] + cmd

diff --git a/vswitchd/ovs-vswitchd.c b/vswitchd/ovs-vswitchd.c
index d5e07c037..e033a7d1f 100644
--- a/vswitchd/ovs-vswitchd.c
+++ b/vswitchd/ovs-vswitchd.c
@@ -25,6 +25,10 @@
 #include
 #endif

+#ifdef DPDK_NETDEV
+#include <rte_version.h>
+#endif
+
 #include "bridge.h"
 #include "command-line.h"
 #include "compiler.h"
@@ -187,6 +191,9 @@ parse_options(int argc, char *argv[], char **unixctl_pathp)
         case 'V':
             ovs_print_version(0, 0);
+#ifdef DPDK_NETDEV
+            puts(rte_version());
+#endif
             exit(EXIT_SUCCESS);

         case OPT_MLOCKALL:

diff --git a/vswitchd/vswitch.ovsschema b/vswitchd/vswitch.ovsschema
index 90e50b626..7015a2687 100644
--- a/vswitchd/vswitch.ovsschema
+++ b/vswitchd/vswitch.ovsschema
@@ -1,6 +1,6 @@
 {"name": "Open_vSwitch",
- "version": "7.15.1",
- "cksum": "3682332033 23608",
+ "version": "7.16.0",
+ "cksum": "654116098 23718",
  "tables": {
    "Open_vSwitch": {
      "columns": {
@@ -36,6 +36,9 @@
        "db_version": {
          "type": {"key": {"type": "string"},
                   "min": 0, "max": 1}},
+       "dpdk_version": {
+         "type": {"key": {"type": "string"},
+                  "min": 0, "max": 1}},
        "system_type": {
          "type": {"key": {"type": "string"},
                   "min": 0, "max": 1}},

diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index 37d04b7cf..74890f72b 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -611,6 +611,10 @@
+
+        DPDK version number, e.g. 17.11.
+
+
       An identifier for the type of system on top of which Open vSwitch
--
2.14.3
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
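The awk one-liners in the ovs-ctl hunk above pull the OVS and DPDK versions out of `ovs-vswitchd --version` output. A minimal Python sketch of the same extraction logic (the sample output string below is an assumption, not captured from a real build):

```python
def parse_versions(output):
    """Return (ovs_version, dpdk_version); dpdk_version is None when
    Open vSwitch was built without DPDK support."""
    ovs = dpdk = None
    for line in output.splitlines():
        if "Open vSwitch" in line:
            # mirrors awk '/Open vSwitch/{print $NF}': take the last field
            ovs = line.split()[-1]
        elif line.startswith("DPDK"):
            # mirrors awk '/^DPDK/{print $2}': take the second field
            dpdk = line.split()[1]
    return ovs, dpdk

sample = "ovs-vswitchd (Open vSwitch) 2.9.90\nDPDK 17.11.0\n"
print(parse_versions(sample))  # ('2.9.90', '17.11.0')
```

Matching on the line content rather than the line position is exactly what makes the v3 approach safer: it no longer requires the OVS version to be on the first line of output.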
Re: [ovs-dev] [PATCH v2] ovs-vsctl: show DPDK version
On Sun, Dec 24, 2017 at 8:02 PM, Ben Pfaff <b...@ovn.org> wrote:
> On Sun, Dec 24, 2017 at 02:58:37PM +0100, Matteo Croce wrote:
>> Show DPDK version if Open vSwitch is compiled with DPDK support.
>> Add a `dpdk_version` column in the datamodel, change ovs-ctl and
>> ovs-dev.py to populate the field with the right value.
>>
>> Signed-off-by: Matteo Croce <mcr...@redhat.com>
>> ---
>> v2: edit ovs-ctl as well
>
> Thank you for the update.
>
> I think that this is likely to cause confusion eventually because it
> reports the version of DPDK that ovs-vsctl is linked against.  That
> doesn't even make sense, because the sole reason that ovs-vsctl is
> linked against DPDK is to report the version.
>
> I think it would be more sensible to make ovs-vswitchd set and report
> the DPDK version.  That can't go wrong in the same way.
>
> Thanks,
>
> Ben.

That's true. Initially I thought of putting the version info in vswitchd,
but I didn't want to break anything relying on 'ovs-vswitchd --version'
output, so I put it in ovs-vsctl, which already reported another field,
the DB version.

I'm making a v3 with the DPDK version in vswitchd and a safer
ovs-ctl/ovs-dev.py which doesn't require the OVS version to be the first
line of output.

Cheers,
--
Matteo Croce
per aspera ad upstream
[ovs-dev] [PATCH v2] ovs-vsctl: show DPDK version
Show DPDK version if Open vSwitch is compiled with DPDK support.
Add a `dpdk_version` column in the datamodel, change ovs-ctl and
ovs-dev.py to populate the field with the right value.

Signed-off-by: Matteo Croce <mcr...@redhat.com>
---
v2: edit ovs-ctl as well

 utilities/ovs-ctl.in       |  7 +++
 utilities/ovs-dev.py       | 15 ---
 utilities/ovs-vsctl.c      |  7 +++
 vswitchd/vswitch.ovsschema |  7 +--
 vswitchd/vswitch.xml       |  4
 5 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/utilities/ovs-ctl.in b/utilities/ovs-ctl.in
index f1b01d1d3..3ce8e1849 100755
--- a/utilities/ovs-ctl.in
+++ b/utilities/ovs-ctl.in
@@ -79,6 +79,13 @@ set_system_ids () {
     OVS_VERSION=`ovs-vswitchd --version | sed 's/.*) //;1q'`
     set "$@" ovs-version="$OVS_VERSION"
+    DPDK_VERSION=`ovs-vsctl --version | awk '/^DPDK/{print $2}'`
+    if [ -n "$DPDK_VERSION" ]; then
+        set "$@" dpdk-version="$DPDK_VERSION"
+    else
+        ovs_vsctl clear Open_vSwitch . dpdk_version
+    fi
+
     case $SYSTEM_ID in
         random)
             id_file=$etcdir/system-id.conf

diff --git a/utilities/ovs-dev.py b/utilities/ovs-dev.py
index 9ce0f04c7..1bc0ba8c7 100755
--- a/utilities/ovs-dev.py
+++ b/utilities/ovs-dev.py
@@ -270,11 +270,20 @@ def run():
             " %s/ovsclient-cert.pem %s/vswitchd.cacert"
             % (pki_dir, pki_dir, pki_dir))
     version = _sh("ovs-vsctl --no-wait --version", capture=True)
-    version = version[0].decode().strip().split()[3]
     root_uuid = _sh("ovs-vsctl --no-wait --bare list Open_vSwitch",
                     capture=True)[0].decode().strip()
-    _sh("ovs-vsctl --no-wait set Open_vSwitch %s ovs_version=%s"
-        % (root_uuid, version))
+
+    ovs_version = [s for s in version if "Open vSwitch" in s]
+    if ovs_version:
+        ovs_version = ovs_version[0].decode().strip().split()[3]
+        _sh("ovs-vsctl --no-wait set Open_vSwitch %s ovs_version=%s"
+            % (root_uuid, ovs_version))
+
+    dpdk_version = [s for s in version if "DPDK" in s]
+    if dpdk_version:
+        dpdk_version = dpdk_version[0].decode().strip().split()[1]
+        _sh("ovs-vsctl --no-wait set Open_vSwitch %s dpdk_version=%s"
+            % (root_uuid, dpdk_version))

     build = BUILD_CLANG if options.clang else BUILD_GCC
     cmd = [build + "/vswitchd/ovs-vswitchd"]

diff --git a/utilities/ovs-vsctl.c b/utilities/ovs-vsctl.c
index 7b909431d..f830120ad 100644
--- a/utilities/ovs-vsctl.c
+++ b/utilities/ovs-vsctl.c
@@ -27,6 +27,10 @@
 #include
 #include

+#ifdef DPDK_NETDEV
+#include <rte_version.h>
+#endif
+
 #include "db-ctl-base.h"
 #include "command-line.h"
@@ -304,6 +308,9 @@ parse_options(int argc, char *argv[], struct shash *local_options)
         case 'V':
             ovs_print_version(0, 0);
             printf("DB Schema %s\n", ovsrec_get_db_version());
+#ifdef DPDK_NETDEV
+            printf("%s\n", rte_version());
+#endif
             exit(EXIT_SUCCESS);

         case 't':

diff --git a/vswitchd/vswitch.ovsschema b/vswitchd/vswitch.ovsschema
index 90e50b626..7015a2687 100644
--- a/vswitchd/vswitch.ovsschema
+++ b/vswitchd/vswitch.ovsschema
@@ -1,6 +1,6 @@
 {"name": "Open_vSwitch",
- "version": "7.15.1",
- "cksum": "3682332033 23608",
+ "version": "7.16.0",
+ "cksum": "654116098 23718",
  "tables": {
    "Open_vSwitch": {
      "columns": {
@@ -36,6 +36,9 @@
        "db_version": {
          "type": {"key": {"type": "string"},
                   "min": 0, "max": 1}},
+       "dpdk_version": {
+         "type": {"key": {"type": "string"},
+                  "min": 0, "max": 1}},
        "system_type": {
          "type": {"key": {"type": "string"},
                   "min": 0, "max": 1}},

diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index 37d04b7cf..74890f72b 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -611,6 +611,10 @@
+
+        DPDK version number, e.g. 17.11.
+
+
       An identifier for the type of system on top of which Open vSwitch
--
2.14.3
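The ovs-dev.py hunk above filters the captured `--version` lines rather than indexing a fixed line for the OVS version. A standalone sketch of that filtering, with assumed sample byte strings in place of the real `_sh(..., capture=True)` output (byte literals are used throughout so the sketch also runs under Python 3):

```python
# Assumed output of `ovs-vsctl --no-wait --version`, one bytes object per line.
version = [b"ovs-vsctl (Open vSwitch) 2.9.90\n",
           b"DB Schema 7.16.0\n",
           b"DPDK 17.11.0\n"]

# Pick the "Open vSwitch" line and take its 4th whitespace-separated field.
ovs_version = [s for s in version if b"Open vSwitch" in s]
if ovs_version:
    ovs_version = ovs_version[0].decode().strip().split()[3]

# Pick the "DPDK" line and take its 2nd field; the list stays empty
# (falsy) when OVS was built without DPDK, so nothing is set.
dpdk_version = [s for s in version if b"DPDK" in s]
if dpdk_version:
    dpdk_version = dpdk_version[0].decode().strip().split()[1]

print(ovs_version, dpdk_version)  # 2.9.90 17.11.0
```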
[ovs-dev] [PATCH] ovs-vsctl: show DPDK version
Show DPDK version if Open vSwitch is compiled with DPDK support.
Add a `dpdk_version` column in the datamodel and also edit ovs-dev.py
to populate the field with the right value.

Signed-off-by: Matteo Croce <mcr...@redhat.com>
---
 utilities/ovs-dev.py       | 15 ---
 utilities/ovs-vsctl.c      |  7 +++
 vswitchd/vswitch.ovsschema |  7 +--
 vswitchd/vswitch.xml       |  4
 4 files changed, 28 insertions(+), 5 deletions(-)

diff --git a/utilities/ovs-dev.py b/utilities/ovs-dev.py
index 9ce0f04c7..1bc0ba8c7 100755
--- a/utilities/ovs-dev.py
+++ b/utilities/ovs-dev.py
@@ -270,11 +270,20 @@ def run():
             " %s/ovsclient-cert.pem %s/vswitchd.cacert"
             % (pki_dir, pki_dir, pki_dir))
     version = _sh("ovs-vsctl --no-wait --version", capture=True)
-    version = version[0].decode().strip().split()[3]
     root_uuid = _sh("ovs-vsctl --no-wait --bare list Open_vSwitch",
                     capture=True)[0].decode().strip()
-    _sh("ovs-vsctl --no-wait set Open_vSwitch %s ovs_version=%s"
-        % (root_uuid, version))
+
+    ovs_version = [s for s in version if "Open vSwitch" in s]
+    if ovs_version:
+        ovs_version = ovs_version[0].decode().strip().split()[3]
+        _sh("ovs-vsctl --no-wait set Open_vSwitch %s ovs_version=%s"
+            % (root_uuid, ovs_version))
+
+    dpdk_version = [s for s in version if "DPDK" in s]
+    if dpdk_version:
+        dpdk_version = dpdk_version[0].decode().strip().split()[1]
+        _sh("ovs-vsctl --no-wait set Open_vSwitch %s dpdk_version=%s"
+            % (root_uuid, dpdk_version))

     build = BUILD_CLANG if options.clang else BUILD_GCC
     cmd = [build + "/vswitchd/ovs-vswitchd"]

diff --git a/utilities/ovs-vsctl.c b/utilities/ovs-vsctl.c
index 7b909431d..f830120ad 100644
--- a/utilities/ovs-vsctl.c
+++ b/utilities/ovs-vsctl.c
@@ -27,6 +27,10 @@
 #include
 #include

+#ifdef DPDK_NETDEV
+#include <rte_version.h>
+#endif
+
 #include "db-ctl-base.h"
 #include "command-line.h"
@@ -304,6 +308,9 @@ parse_options(int argc, char *argv[], struct shash *local_options)
         case 'V':
             ovs_print_version(0, 0);
             printf("DB Schema %s\n", ovsrec_get_db_version());
+#ifdef DPDK_NETDEV
+            printf("%s\n", rte_version());
+#endif
             exit(EXIT_SUCCESS);

         case 't':

diff --git a/vswitchd/vswitch.ovsschema b/vswitchd/vswitch.ovsschema
index 90e50b626..757d8ec17 100644
--- a/vswitchd/vswitch.ovsschema
+++ b/vswitchd/vswitch.ovsschema
@@ -1,6 +1,6 @@
 {"name": "Open_vSwitch",
- "version": "7.15.1",
- "cksum": "3682332033 23608",
+ "version": "7.15.2",
+ "cksum": "3876676959 23718",
  "tables": {
    "Open_vSwitch": {
      "columns": {
@@ -36,6 +36,9 @@
        "db_version": {
          "type": {"key": {"type": "string"},
                   "min": 0, "max": 1}},
+       "dpdk_version": {
+         "type": {"key": {"type": "string"},
+                  "min": 0, "max": 1}},
        "system_type": {
          "type": {"key": {"type": "string"},
                   "min": 0, "max": 1}},

diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index 37d04b7cf..74890f72b 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -611,6 +611,10 @@
+
+        DPDK version number, e.g. 17.11.
+
+
       An identifier for the type of system on top of which Open vSwitch
--
2.14.3
Re: [ovs-dev] [PATCH] netdev-dpdk: defer MTU set after interface start
On Fri, Nov 24, 2017 at 9:26 AM, Stokes, Ian <ian.sto...@intel.com> wrote:
>> Some PMDs assume that the RX buffers are already allocated when setting
>> the device MTU, because the RX buffer size depends on the MTU.
>> This worked until 67fe6d635193 added a call to rte_eth_dev_set_mtu() in
>> the init code, which would set the MTU before the RX buffer allocation,
>> triggering a segmentation fault with some PMDs:
>>
>>     Stack trace of thread 20680:
>>     #0  0x559464396534 qede_set_mtu (ovs-vswitchd)
>>     #1  0x559464323c41 rte_eth_dev_set_mtu (ovs-vswitchd)
>>     #2  0x5594645f5e85 dpdk_eth_dev_queue_setup (ovs-vswitchd)
>>     #3  0x5594645f8ae6 netdev_dpdk_reconfigure (ovs-vswitchd)
>>     #4  0x55946452225c reconfigure_datapath (ovs-vswitchd)
>>     #5  0x559464522d07 do_add_port (ovs-vswitchd)
>>     #6  0x559464522e8d dpif_netdev_port_add (ovs-vswitchd)
>>     #7  0x559464528dde dpif_port_add (ovs-vswitchd)
>>     #8  0x5594644dc0e0 port_add (ovs-vswitchd)
>>     #9  0x5594644d2ab1 ofproto_port_add (ovs-vswitchd)
>>     #10 0x5594644c0a85 bridge_add_ports__ (ovs-vswitchd)
>>     #11 0x5594644c26e8 bridge_reconfigure (ovs-vswitchd)
>>     #12 0x5594644c5c49 bridge_run (ovs-vswitchd)
>>     #13 0x5594642d4155 main (ovs-vswitchd)
>>     #14 0x7f0e1444bc05 __libc_start_main (libc.so.6)
>>     #15 0x5594642d8328 _start (ovs-vswitchd)
>>
>> A possible solution is to move the first call to rte_eth_dev_set_mtu()
>> to just after the device start, instead of in dpdk_eth_dev_queue_setup()
>> which, incidentally, set the MTU multiple times because the call to
>> rte_eth_dev_set_mtu() was inside a loop.
>>
>> CC: Mark Kavanagh <mark.b.kavan...@intel.com>
>> Fixes: 67fe6d635193 ("netdev-dpdk: use rte_eth_dev_set_mtu.")
>> Signed-off-by: Matteo Croce <mcr...@redhat.com>
>> ---
>>  lib/netdev-dpdk.c | 14 +++---
>>  1 file changed, 7 insertions(+), 7 deletions(-)
>>
>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>> index 76e79be25..229aa4a76 100644
>> --- a/lib/netdev-dpdk.c
>> +++ b/lib/netdev-dpdk.c
>> @@ -750,13 +750,6 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int n_rxq, int n_txq)
>>              break;
>>          }
>>
>> -        diag = rte_eth_dev_set_mtu(dev->port_id, dev->mtu);
>> -        if (diag) {
>> -            VLOG_ERR("Interface %s MTU (%d) setup error: %s",
>> -                     dev->up.name, dev->mtu, rte_strerror(-diag));
>> -            break;
>> -        }
>> -
>>          for (i = 0; i < n_txq; i++) {
>>              diag = rte_eth_tx_queue_setup(dev->port_id, i, dev->txq_size,
>>                                            dev->socket_id, NULL);
>> @@ -849,6 +842,13 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
>>          return -diag;
>>      }
>>
>> +    diag = rte_eth_dev_set_mtu(dev->port_id, dev->mtu);
>> +    if (diag) {
>> +        VLOG_ERR("Interface %s MTU (%d) setup error: %s",
>> +                 dev->up.name, dev->mtu, rte_strerror(-diag));
>> +        return -diag;
>> +    }
>> +
>
> In my testing I didn't see any issues here, but would like to flag it to
> our MTU guru (Mark K).
>
> Any opinion on this? From the API description it looks like it is ok?
>
>>      rte_eth_promiscuous_enable(dev->port_id);
>>      rte_eth_allmulticast_enable(dev->port_id);
>>
>> --
>> 2.13.6

The issue only arises with the qede PMD and 67fe6d635193 ("netdev-dpdk:
use rte_eth_dev_set_mtu.") applied.

Regards,
--
Matteo Croce
per aspera ad upstream
[ovs-dev] [PATCH] netdev-dpdk: defer MTU set after interface start
Some PMDs assume that the RX buffers are already allocated when setting
the device MTU, because the RX buffer size depends on the MTU.
This worked until 67fe6d635193 added a call to rte_eth_dev_set_mtu() in
the init code, which would set the MTU before the RX buffer allocation,
triggering a segmentation fault with some PMDs:

    Stack trace of thread 20680:
    #0  0x559464396534 qede_set_mtu (ovs-vswitchd)
    #1  0x559464323c41 rte_eth_dev_set_mtu (ovs-vswitchd)
    #2  0x5594645f5e85 dpdk_eth_dev_queue_setup (ovs-vswitchd)
    #3  0x5594645f8ae6 netdev_dpdk_reconfigure (ovs-vswitchd)
    #4  0x55946452225c reconfigure_datapath (ovs-vswitchd)
    #5  0x559464522d07 do_add_port (ovs-vswitchd)
    #6  0x559464522e8d dpif_netdev_port_add (ovs-vswitchd)
    #7  0x559464528dde dpif_port_add (ovs-vswitchd)
    #8  0x5594644dc0e0 port_add (ovs-vswitchd)
    #9  0x5594644d2ab1 ofproto_port_add (ovs-vswitchd)
    #10 0x5594644c0a85 bridge_add_ports__ (ovs-vswitchd)
    #11 0x5594644c26e8 bridge_reconfigure (ovs-vswitchd)
    #12 0x5594644c5c49 bridge_run (ovs-vswitchd)
    #13 0x5594642d4155 main (ovs-vswitchd)
    #14 0x7f0e1444bc05 __libc_start_main (libc.so.6)
    #15 0x5594642d8328 _start (ovs-vswitchd)

A possible solution is to move the first call to rte_eth_dev_set_mtu()
to just after the device start, instead of in dpdk_eth_dev_queue_setup()
which, incidentally, set the MTU multiple times because the call to
rte_eth_dev_set_mtu() was inside a loop.

CC: Mark Kavanagh <mark.b.kavan...@intel.com>
Fixes: 67fe6d635193 ("netdev-dpdk: use rte_eth_dev_set_mtu.")
Signed-off-by: Matteo Croce <mcr...@redhat.com>
---
 lib/netdev-dpdk.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 76e79be25..229aa4a76 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -750,13 +750,6 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int n_rxq, int n_txq)
             break;
         }

-        diag = rte_eth_dev_set_mtu(dev->port_id, dev->mtu);
-        if (diag) {
-            VLOG_ERR("Interface %s MTU (%d) setup error: %s",
-                     dev->up.name, dev->mtu, rte_strerror(-diag));
-            break;
-        }
-
         for (i = 0; i < n_txq; i++) {
             diag = rte_eth_tx_queue_setup(dev->port_id, i, dev->txq_size,
                                           dev->socket_id, NULL);
@@ -849,6 +842,13 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
         return -diag;
     }

+    diag = rte_eth_dev_set_mtu(dev->port_id, dev->mtu);
+    if (diag) {
+        VLOG_ERR("Interface %s MTU (%d) setup error: %s",
+                 dev->up.name, dev->mtu, rte_strerror(-diag));
+        return -diag;
+    }
+
     rte_eth_promiscuous_enable(dev->port_id);
     rte_eth_allmulticast_enable(dev->port_id);

--
2.13.6
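A toy model of the ordering bug this patch fixes (not DPDK code; all names below are illustrative): a PMD such as qede recomputes RX buffer sizes inside its set-MTU hook, so calling it before the device has been started, and the RX buffers allocated, touches state that does not exist yet.

```python
class ToyPort:
    """Illustrative stand-in for a PMD whose set_mtu() resizes RX buffers."""

    def __init__(self):
        self.rx_buffers = None        # allocated only by start()

    def start(self):
        self.rx_buffers = [2048] * 4  # pretend RX buffer pool

    def set_mtu(self, mtu):
        # Like a PMD that derives RX buffer size from the MTU: this
        # iterates the buffer pool, which must already be allocated.
        self.rx_buffers = [mtu + 128 for _ in self.rx_buffers]

port = ToyPort()
try:
    port.set_mtu(1500)                # pre-patch order: MTU before start()
except TypeError as e:
    print("crash:", e)                # iterating None -> the "segfault"

port = ToyPort()
port.start()                          # patched order: start first,
port.set_mtu(1500)                    # then set the MTU once
print(port.rx_buffers[0])             # 1628
```

The patched order also means the MTU is set exactly once, instead of once per iteration of the queue-setup loop.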
[ovs-dev] [PATCH] ovs-lib: don't purge corrupted DB
In ovs-lib there is a function named upgrade_db which tries to convert
the database after OVS {up,down}grades. This function uses ovsdb-tool
to check if the DB needs to be upgraded; if the upgrade fails, it
purges the DB and creates an empty one.

ovsdb-tool returns "yes" or "no" to indicate whether the DB needs
upgrading, but if the DB is corrupted it returns a list of errors
instead. Change the condition from "!= no" to "= yes", because in case
of DB corruption upgrade_db would otherwise purge the existing DB
without writing anything to the logs.

Signed-off-by: Matteo Croce <mcr...@redhat.com>
---
 utilities/ovs-lib.in | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/utilities/ovs-lib.in b/utilities/ovs-lib.in
index 8665698bb..ea5a12375 100644
--- a/utilities/ovs-lib.in
+++ b/utilities/ovs-lib.in
@@ -400,7 +400,7 @@ upgrade_db () {
         log_warning_msg "$DB_FILE does not exist"
         install_dir `dirname $DB_FILE`
         create_db "$DB_FILE" "$DB_SCHEMA"
-    elif test X"`ovsdb_tool needs-conversion "$DB_FILE" "$DB_SCHEMA"`" != Xno; then
+    elif test X"`ovsdb_tool needs-conversion "$DB_FILE" "$DB_SCHEMA"`" = Xyes; then
         # Back up the old version.
         version=`ovsdb_tool db-version "$DB_FILE"`
         cksum=`ovsdb_tool db-cksum "$DB_FILE" | awk '{print $1}'`
--
2.13.5
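Why "!= no" is unsafe: `ovsdb-tool needs-conversion` prints "yes" or "no", but on a corrupted database it prints an error message instead, which the old test misreads as "needs conversion" and so triggers the purge path. A sketch of both conditions (the error string below is an assumption, not verbatim ovsdb-tool output):

```python
def should_convert_old(output):
    # pre-patch shell test: ... != Xno
    return output.strip() != "no"

def should_convert_new(output):
    # post-patch shell test: ... = Xyes
    return output.strip() == "yes"

# A hypothetical error from a corrupted DB (illustrative string only).
corrupted = "ovsdb-tool: conf.db: cannot identify file type"

print(should_convert_old(corrupted))   # True  -> old code purges the DB
print(should_convert_new(corrupted))   # False -> new code leaves it alone
```

Matching the positive answer exactly means any unexpected output, including corruption errors, falls through without destroying the existing database.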