Re: [ovs-dev] [PATCH] route-table: Avoid routes from non-standard routing tables.
Ilya Maximets writes: > On 3/19/24 20:56, Aaron Conole wrote: >> Ilya Maximets writes: >> >>> Currently, ovs-vswitchd is subscribed to all the routing changes in the >>> kernel. On each change, it marks the internal routing table cache as >>> invalid, then resets it and dumps all the routes from the kernel from >>> scratch. The reason for that is kernel routing updates not being >>> reliable in a sense that it's hard to tell which route is getting >>> removed or modified. Userspace application has to track the order in >>> which route entries are dumped from the kernel. Updates can get lost >>> or even duplicated and the kernel doesn't provide a good mechanism to >>> distinguish one route from another. To my knowledge, dumping all the >>> routes from a kernel after each change is the only way to keep the >>> cache consistent. Some more info can be found in the following never >>> addressed issues: >>> https://bugzilla.redhat.com/1337860 >>> https://bugzilla.redhat.com/1337855 >>> >>> It seems to be believed that NetworkManager "mostly" does incremental >>> updates right. But it is still not completely correct, will re-dump >>> the whole table in certain cases, and it takes a huge amount of very >>> complicated code to do the accounting and route comparisons. >>> >>> Going back to ovs-vswitchd, it currently dumps routes from all the >>> routing tables. If it will get conflicting routes from multiple >>> tables, the cache will not be useful. The routing cache in userspace >>> is primarily used for checking the egress port for tunneled traffic >>> and this way also detecting link state changes for a tunnel port. >>> For userspace datapath it is used for actual routing of the packet >>> after sending to a native tunnel. >>> With kernel datapath we don't really have a mechanism to know which >>> routing table will actually be used by the kernel after encapsulation, >>> so our lookups on a cache may be incorrect because of this as well. >>> >>> So, unless all the relevant routes are in the standard tables, the >>> lookup in userspace route cache is unreliable. >>> >>> Luckily, most setups are not using any complicated routing in >>> non-standard tables that OVS has to be aware of. >>> >>> It is possible, but unlikely, that standard routing tables are >>> completely empty while some other custom table is not, and all the OVS >>> tunnel traffic is directed to that table. That would be the only >>> scenario where dumping non-standard tables would make sense. But it >>> seems like this kind of setup will likely need a way to tell OVS from >>> which table the routes should be taken, or we'll need to dump routing >>> rules and keep a separate cache for each table, so we can first match >>> on rules and then lookup correct routes in a specific table. I'm not >>> sure if trying to implement all that is justified. >>> >>> For now, stop considering routes from non-standard tables to avoid >>> mixing different tables together and also wasting CPU resources. >>> >>> This fixes a high CPU usage in ovs-vswitchd in case a BGP daemon is >>> running on a same host and in a same network namespace with OVS using >>> its own custom routing table. >>> >>> Unfortunately, there seems to be no way to tell the kernel to send >>> updates only for particular tables. So, we'll still receive and parse >>> all of them. But they will not result in a full cache invalidation in >>> most cases. >>> >>> Linux kernel v4.20 introduced filtering support for RTM_GETROUTE dumps. >>> So, we can make use of it and dump only standard tables when we get a >>> relevant route update. NETLINK_GET_STRICT_CHK has to be enabled on >>> the socket for filtering to work. There is no reason to not enable it >>> by default, if supported. It is not used outside of NETLINK_ROUTE. >>> >>> Fixes: f0e167f0dbad ("route-table: Handle route updates more robustly.") >>> Fixes: ea83a2fcd0d3 ("lib: Show tunnel egress interface in ovsdb") >>> Reported-at: https://github.com/openvswitch/ovs-issues/issues/185 >>> Reported-at: >>> https://mail.openvswitch.org/pipermail/ovs-discuss/2022-October/052091.html >>> Signed-off-by: Ilya Maximets >>> --- >>> lib/netlink-protocol.h | 10 ++ >>> lib/netlink-socket.c | 8 + >>> lib/route-table.c | 80 +- >>> tests/system-route.at | 64 + >>> 4 files changed, 146 insertions(+), 16 deletions(-) >>> >>> diff --git a/lib/netlink-protocol.h b/lib/netlink-protocol.h >>> index 6eaa7035a..e4bb28ac9 100644 >>> --- a/lib/netlink-protocol.h >>> +++ b/lib/netlink-protocol.h >>> @@ -155,6 +155,11 @@ enum { >>> #define NLA_TYPE_MASK ~(NLA_F_NESTED | NLA_F_NET_BYTEORDER) >>> #endif >>> >>> +/* Introduced in v4.4. */ >>> +#ifndef NLM_F_DUMP_FILTERED >>> +#define NLM_F_DUMP_FILTERED 0x20 >>> +#endif >>> + >>> /* These were introduced all together in 2.6.14. (We want our programs to >>> * support the newer kernel features even if
Re: [ovs-dev] [PATCH] route-table: Avoid routes from non-standard routing tables.
On 3/19/24 20:56, Aaron Conole wrote: > Ilya Maximets writes: > >> Currently, ovs-vswitchd is subscribed to all the routing changes in the >> kernel. On each change, it marks the internal routing table cache as >> invalid, then resets it and dumps all the routes from the kernel from >> scratch. The reason for that is kernel routing updates not being >> reliable in a sense that it's hard to tell which route is getting >> removed or modified. Userspace application has to track the order in >> which route entries are dumped from the kernel. Updates can get lost >> or even duplicated and the kernel doesn't provide a good mechanism to >> distinguish one route from another. To my knowledge, dumping all the >> routes from a kernel after each change is the only way to keep the >> cache consistent. Some more info can be found in the following never >> addressed issues: >> https://bugzilla.redhat.com/1337860 >> https://bugzilla.redhat.com/1337855 >> >> It seems to be believed that NetworkManager "mostly" does incremental >> updates right. But it is still not completely correct, will re-dump >> the whole table in certain cases, and it takes a huge amount of very >> complicated code to do the accounting and route comparisons. >> >> Going back to ovs-vswitchd, it currently dumps routes from all the >> routing tables. If it will get conflicting routes from multiple >> tables, the cache will not be useful. The routing cache in userspace >> is primarily used for checking the egress port for tunneled traffic >> and this way also detecting link state changes for a tunnel port. >> For userspace datapath it is used for actual routing of the packet >> after sending to a native tunnel. >> With kernel datapath we don't really have a mechanism to know which >> routing table will actually be used by the kernel after encapsulation, >> so our lookups on a cache may be incorrect because of this as well. >> >> So, unless all the relevant routes are in the standard tables, the >> lookup in userspace route cache is unreliable. >> >> Luckily, most setups are not using any complicated routing in >> non-standard tables that OVS has to be aware of. >> >> It is possible, but unlikely, that standard routing tables are >> completely empty while some other custom table is not, and all the OVS >> tunnel traffic is directed to that table. That would be the only >> scenario where dumping non-standard tables would make sense. But it >> seems like this kind of setup will likely need a way to tell OVS from >> which table the routes should be taken, or we'll need to dump routing >> rules and keep a separate cache for each table, so we can first match >> on rules and then lookup correct routes in a specific table. I'm not >> sure if trying to implement all that is justified. >> >> For now, stop considering routes from non-standard tables to avoid >> mixing different tables together and also wasting CPU resources. >> >> This fixes a high CPU usage in ovs-vswitchd in case a BGP daemon is >> running on a same host and in a same network namespace with OVS using >> its own custom routing table. >> >> Unfortunately, there seems to be no way to tell the kernel to send >> updates only for particular tables. So, we'll still receive and parse >> all of them. But they will not result in a full cache invalidation in >> most cases. >> >> Linux kernel v4.20 introduced filtering support for RTM_GETROUTE dumps. >> So, we can make use of it and dump only standard tables when we get a >> relevant route update. NETLINK_GET_STRICT_CHK has to be enabled on >> the socket for filtering to work. There is no reason to not enable it >> by default, if supported. It is not used outside of NETLINK_ROUTE. >> >> Fixes: f0e167f0dbad ("route-table: Handle route updates more robustly.") >> Fixes: ea83a2fcd0d3 ("lib: Show tunnel egress interface in ovsdb") >> Reported-at: https://github.com/openvswitch/ovs-issues/issues/185 >> Reported-at: >> https://mail.openvswitch.org/pipermail/ovs-discuss/2022-October/052091.html >> Signed-off-by: Ilya Maximets >> --- >> lib/netlink-protocol.h | 10 ++ >> lib/netlink-socket.c | 8 + >> lib/route-table.c | 80 +- >> tests/system-route.at | 64 + >> 4 files changed, 146 insertions(+), 16 deletions(-) >> >> diff --git a/lib/netlink-protocol.h b/lib/netlink-protocol.h >> index 6eaa7035a..e4bb28ac9 100644 >> --- a/lib/netlink-protocol.h >> +++ b/lib/netlink-protocol.h >> @@ -155,6 +155,11 @@ enum { >> #define NLA_TYPE_MASK ~(NLA_F_NESTED | NLA_F_NET_BYTEORDER) >> #endif >> >> +/* Introduced in v4.4. */ >> +#ifndef NLM_F_DUMP_FILTERED >> +#define NLM_F_DUMP_FILTERED 0x20 >> +#endif >> + >> /* These were introduced all together in 2.6.14. (We want our programs to >> * support the newer kernel features even if compiled with older headers.) >> */ >> #ifndef NETLINK_ADD_MEMBERSHIP >> @@ -168,6 +173,11 @@ enum { >> #define
Re: [ovs-dev] [PATCH] route-table: Avoid routes from non-standard routing tables.
Ilya Maximets writes: > Currently, ovs-vswitchd is subscribed to all the routing changes in the > kernel. On each change, it marks the internal routing table cache as > invalid, then resets it and dumps all the routes from the kernel from > scratch. The reason for that is kernel routing updates not being > reliable in a sense that it's hard to tell which route is getting > removed or modified. Userspace application has to track the order in > which route entries are dumped from the kernel. Updates can get lost > or even duplicated and the kernel doesn't provide a good mechanism to > distinguish one route from another. To my knowledge, dumping all the > routes from a kernel after each change is the only way to keep the > cache consistent. Some more info can be found in the following never > addressed issues: > https://bugzilla.redhat.com/1337860 > https://bugzilla.redhat.com/1337855 > > It seems to be believed that NetworkManager "mostly" does incremental > updates right. But it is still not completely correct, will re-dump > the whole table in certain cases, and it takes a huge amount of very > complicated code to do the accounting and route comparisons. > > Going back to ovs-vswitchd, it currently dumps routes from all the > routing tables. If it will get conflicting routes from multiple > tables, the cache will not be useful. The routing cache in userspace > is primarily used for checking the egress port for tunneled traffic > and this way also detecting link state changes for a tunnel port. > For userspace datapath it is used for actual routing of the packet > after sending to a native tunnel. > With kernel datapath we don't really have a mechanism to know which > routing table will actually be used by the kernel after encapsulation, > so our lookups on a cache may be incorrect because of this as well. > > So, unless all the relevant routes are in the standard tables, the > lookup in userspace route cache is unreliable. > > Luckily, most setups are not using any complicated routing in > non-standard tables that OVS has to be aware of. > > It is possible, but unlikely, that standard routing tables are > completely empty while some other custom table is not, and all the OVS > tunnel traffic is directed to that table. That would be the only > scenario where dumping non-standard tables would make sense. But it > seems like this kind of setup will likely need a way to tell OVS from > which table the routes should be taken, or we'll need to dump routing > rules and keep a separate cache for each table, so we can first match > on rules and then lookup correct routes in a specific table. I'm not > sure if trying to implement all that is justified. > > For now, stop considering routes from non-standard tables to avoid > mixing different tables together and also wasting CPU resources. > > This fixes a high CPU usage in ovs-vswitchd in case a BGP daemon is > running on a same host and in a same network namespace with OVS using > its own custom routing table. > > Unfortunately, there seems to be no way to tell the kernel to send > updates only for particular tables. So, we'll still receive and parse > all of them. But they will not result in a full cache invalidation in > most cases. > > Linux kernel v4.20 introduced filtering support for RTM_GETROUTE dumps. > So, we can make use of it and dump only standard tables when we get a > relevant route update. NETLINK_GET_STRICT_CHK has to be enabled on > the socket for filtering to work. There is no reason to not enable it > by default, if supported. It is not used outside of NETLINK_ROUTE. > > Fixes: f0e167f0dbad ("route-table: Handle route updates more robustly.") > Fixes: ea83a2fcd0d3 ("lib: Show tunnel egress interface in ovsdb") > Reported-at: https://github.com/openvswitch/ovs-issues/issues/185 > Reported-at: > https://mail.openvswitch.org/pipermail/ovs-discuss/2022-October/052091.html > Signed-off-by: Ilya Maximets > --- > lib/netlink-protocol.h | 10 ++ > lib/netlink-socket.c | 8 + > lib/route-table.c | 80 +- > tests/system-route.at | 64 + > 4 files changed, 146 insertions(+), 16 deletions(-) > > diff --git a/lib/netlink-protocol.h b/lib/netlink-protocol.h > index 6eaa7035a..e4bb28ac9 100644 > --- a/lib/netlink-protocol.h > +++ b/lib/netlink-protocol.h > @@ -155,6 +155,11 @@ enum { > #define NLA_TYPE_MASK ~(NLA_F_NESTED | NLA_F_NET_BYTEORDER) > #endif > > +/* Introduced in v4.4. */ > +#ifndef NLM_F_DUMP_FILTERED > +#define NLM_F_DUMP_FILTERED 0x20 > +#endif > + > /* These were introduced all together in 2.6.14. (We want our programs to > * support the newer kernel features even if compiled with older headers.) */ > #ifndef NETLINK_ADD_MEMBERSHIP > @@ -168,6 +173,11 @@ enum { > #define NETLINK_LISTEN_ALL_NSID 8 > #endif > > +/* Strict checking of netlink arguments introduced in Linux kernel v4.20. */ > +#ifndef
[ovs-dev] [PATCH] route-table: Avoid routes from non-standard routing tables.
Currently, ovs-vswitchd is subscribed to all the routing changes in the kernel. On each change, it marks the internal routing table cache as invalid, then resets it and dumps all the routes from the kernel from scratch. The reason for that is kernel routing updates not being reliable in a sense that it's hard to tell which route is getting removed or modified. Userspace application has to track the order in which route entries are dumped from the kernel. Updates can get lost or even duplicated and the kernel doesn't provide a good mechanism to distinguish one route from another. To my knowledge, dumping all the routes from a kernel after each change is the only way to keep the cache consistent. Some more info can be found in the following never addressed issues: https://bugzilla.redhat.com/1337860 https://bugzilla.redhat.com/1337855 It seems to be believed that NetworkManager "mostly" does incremental updates right. But it is still not completely correct, will re-dump the whole table in certain cases, and it takes a huge amount of very complicated code to do the accounting and route comparisons. Going back to ovs-vswitchd, it currently dumps routes from all the routing tables. If it will get conflicting routes from multiple tables, the cache will not be useful. The routing cache in userspace is primarily used for checking the egress port for tunneled traffic and this way also detecting link state changes for a tunnel port. For userspace datapath it is used for actual routing of the packet after sending to a native tunnel. With kernel datapath we don't really have a mechanism to know which routing table will actually be used by the kernel after encapsulation, so our lookups on a cache may be incorrect because of this as well. So, unless all the relevant routes are in the standard tables, the lookup in userspace route cache is unreliable. Luckily, most setups are not using any complicated routing in non-standard tables that OVS has to be aware of. It is possible, but unlikely, that standard routing tables are completely empty while some other custom table is not, and all the OVS tunnel traffic is directed to that table. That would be the only scenario where dumping non-standard tables would make sense. But it seems like this kind of setup will likely need a way to tell OVS from which table the routes should be taken, or we'll need to dump routing rules and keep a separate cache for each table, so we can first match on rules and then lookup correct routes in a specific table. I'm not sure if trying to implement all that is justified. For now, stop considering routes from non-standard tables to avoid mixing different tables together and also wasting CPU resources. This fixes a high CPU usage in ovs-vswitchd in case a BGP daemon is running on a same host and in a same network namespace with OVS using its own custom routing table. Unfortunately, there seems to be no way to tell the kernel to send updates only for particular tables. So, we'll still receive and parse all of them. But they will not result in a full cache invalidation in most cases. Linux kernel v4.20 introduced filtering support for RTM_GETROUTE dumps. So, we can make use of it and dump only standard tables when we get a relevant route update. NETLINK_GET_STRICT_CHK has to be enabled on the socket for filtering to work. There is no reason to not enable it by default, if supported. It is not used outside of NETLINK_ROUTE. Fixes: f0e167f0dbad ("route-table: Handle route updates more robustly.") Fixes: ea83a2fcd0d3 ("lib: Show tunnel egress interface in ovsdb") Reported-at: https://github.com/openvswitch/ovs-issues/issues/185 Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2022-October/052091.html Signed-off-by: Ilya Maximets --- lib/netlink-protocol.h | 10 ++ lib/netlink-socket.c | 8 + lib/route-table.c | 80 +- tests/system-route.at | 64 + 4 files changed, 146 insertions(+), 16 deletions(-) diff --git a/lib/netlink-protocol.h b/lib/netlink-protocol.h index 6eaa7035a..e4bb28ac9 100644 --- a/lib/netlink-protocol.h +++ b/lib/netlink-protocol.h @@ -155,6 +155,11 @@ enum { #define NLA_TYPE_MASK ~(NLA_F_NESTED | NLA_F_NET_BYTEORDER) #endif +/* Introduced in v4.4. */ +#ifndef NLM_F_DUMP_FILTERED +#define NLM_F_DUMP_FILTERED 0x20 +#endif + /* These were introduced all together in 2.6.14. (We want our programs to * support the newer kernel features even if compiled with older headers.) */ #ifndef NETLINK_ADD_MEMBERSHIP @@ -168,6 +173,11 @@ enum { #define NETLINK_LISTEN_ALL_NSID 8 #endif +/* Strict checking of netlink arguments introduced in Linux kernel v4.20. */ +#ifndef NETLINK_GET_STRICT_CHK +#define NETLINK_GET_STRICT_CHK 12 +#endif + /* These were introduced all together in 2.6.23. (We want our programs to * support the newer kernel features even if compiled with older headers.) */