Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
On Sat, May 30, 2015 at 2:00 AM, Jiri Pirko j...@resnulli.us wrote: Fri, May 29, 2015 at 05:39:46PM CEST, sfel...@gmail.com wrote: On Fri, May 29, 2015 at 12:50 AM, Jiri Pirko j...@resnulli.us wrote: Thu, May 21, 2015 at 07:46:54AM CEST, sfel...@gmail.com wrote: On Tue, May 19, 2015 at 1:28 PM, David Miller da...@davemloft.net wrote: From: Andy Gospodarek go...@cumulusnetworks.com Date: Tue, 19 May 2015 15:47:32 -0400 Are you actually saying that if users complain loudly enough about the current behavior (not the change Roopa has proposed) that you would be open to considering a change the current behavior? I am saying that we have a contract with users not to break existing behavior. Full stop. After rehearing David's argument, we should probably explore option d) which is a refinement on the fib_offload_disable mechanism we have today. fib_offload_disable is global for all routes. Once we hit a HW install problem, the global flag is set and all routes fallback to SW. We did this because we can't allow the failed route to exist in SW and not in HW because it could mess up LPM searches (HW could hit on a lesser prefix even when SW has the true LPM, because HW gets first shot at match). The refinement on fib_offload_disable is this: make it per-related-prefix rather than global, and on a HW install problem, set the flag for the related-prefix and uninstall only those routes from HW. Related-prefix (is there a correct term for this?) are routes to the same dst addr but with different prefix lengths. I haven't parsed the fib_trie structure to see how routes are organized, but I suspect since it's optimized for lookup the related-prefix tracking is already there and we can build on that. This looks interesting. However, I'm not sure that it is acceptable for user to experience this hw evict of random entries. User knows what entries are essential to have in hw. With your solution, I can see no way user can actually say what should be offloaded or not. Kernel just automagically decides. The default eviction policy could be based on RTA_PRIORITY: evict lower priority routes first. It would be up to the device driver to decide between two routes of same priority. To help device driver make the decision, we could have eviction policy options: Priority-base (default) Prefer IPv6 over IPv4 Prefer IPv4 over IPv6 Prefer single path over multipath Prefer longer prefix lengths over shorter Optimize for resource utilization These are portable across different switches. They're in terms a user understands. It's up to the device driver which truly understands the device constraints to translates the user's eviction policy choices into something that makes sense to that device. This sounds tempting... You plan to throw in some patches, or should I take care of that? My L2 backlog keeps from L3 at the moment. What would be nice is if someone would look into the feasibility of option d) at the swithdev/fib level and see if it's doable. The policy/user interface stuff can come later. What do we need? My thoughts: - Remove the check for fi-fib_net-ipv4.fib_offload_disabled in switchdev_fib_ipv4_add() and replace it with a check to see if route is offload_disabled. A route is offload_disabled if the set of related routes is marked disabled. If route is offload_disabled, just return 0 in switchdev_fib_ipv4_add() so that route is only installed in the kernel fib. (It's fine now to have a route in SW but not in HW). - Find/invent in fib trie the structure that relates routes with the same prefix but different prefix lengths. So A/8 and A/16 and A/30 are related. B/16 is not related to the other three. Add a bool offload_disabled to that structure to mark them offload_disabled. We don't want LPM broken with something like this: HW: A/8, A/16, B/8 SW: A/8, A/16, A/30, B/8 Here when HW gets an A pkt, it'll match A/16 when really SW's A/30 should have won. Now LPM is broken. - In the driver's fib obj add handler, if it can't install the route because the device flat out can't support the route, then return -EOPNOTSUPP and switchdev_fib_ipv4_add() will mark the route as offload_disabled. Actually, switchdev_fib_ipv4_add() needs to additionally remove related routes from device. - In the driver's fib obj add handler, if the route can be installed, but at the expense of evicting some other route(s) due to evict policy (driver's choice on which route(s) get evicted), the handler returns 0 and switchdev_fib_ipv4_add() is happy. The driver should send a switchdev notification to the fib layer to mark the evicted route(s) as offload disable. - Routes marked with offload_disabled should have RTNH_F_OFFLOAD cleared. I think I'm being sloppy with making sure once a route is removed from the device, all the related routes are also removed from the device. It's late; brain not working 100%. But I think you have the gist of it. I don't think there is a lot
Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
On Sat, May 30, 2015 at 9:19 PM, John Fastabend john.fastab...@gmail.com wrote: On 05/30/2015 02:00 AM, Jiri Pirko wrote: Fri, May 29, 2015 at 05:39:46PM CEST, sfel...@gmail.com wrote: On Fri, May 29, 2015 at 12:50 AM, Jiri Pirko j...@resnulli.us wrote: Thu, May 21, 2015 at 07:46:54AM CEST, sfel...@gmail.com wrote: On Tue, May 19, 2015 at 1:28 PM, David Miller da...@davemloft.net wrote: From: Andy Gospodarek go...@cumulusnetworks.com Date: Tue, 19 May 2015 15:47:32 -0400 Are you actually saying that if users complain loudly enough about the current behavior (not the change Roopa has proposed) that you would be open to considering a change the current behavior? I am saying that we have a contract with users not to break existing behavior. Full stop. After rehearing David's argument, we should probably explore option d) which is a refinement on the fib_offload_disable mechanism we have today. fib_offload_disable is global for all routes. Once we hit a HW install problem, the global flag is set and all routes fallback to SW. We did this because we can't allow the failed route to exist in SW and not in HW because it could mess up LPM searches (HW could hit on a lesser prefix even when SW has the true LPM, because HW gets first shot at match). The refinement on fib_offload_disable is this: make it per-related-prefix rather than global, and on a HW install problem, set the flag for the related-prefix and uninstall only those routes from HW. Related-prefix (is there a correct term for this?) are routes to the same dst addr but with different prefix lengths. I haven't parsed the fib_trie structure to see how routes are organized, but I suspect since it's optimized for lookup the related-prefix tracking is already there and we can build on that. This looks interesting. However, I'm not sure that it is acceptable for user to experience this hw evict of random entries. User knows what entries are essential to have in hw. With your solution, I can see no way user can actually say what should be offloaded or not. Kernel just automagically decides. The default eviction policy could be based on RTA_PRIORITY: evict lower priority routes first. It would be up to the device driver to decide between two routes of same priority. To help device driver make the decision, we could have eviction policy options: Priority-base (default) Prefer IPv6 over IPv4 Prefer IPv4 over IPv6 Prefer single path over multipath Prefer longer prefix lengths over shorter Optimize for resource utilization These are portable across different switches. They're in terms a user understands. It's up to the device driver which truly understands the device constraints to translates the user's eviction policy choices into something that makes sense to that device. This sounds tempting... You plan to throw in some patches, or should I take care of that? This is encoding specific policies into the kernel. I was hoping to avoid this and let user space develop whatever policy it wants. If you use Jiri's proposed NLM_F_SKIP_{KERNEL|OFFLOAD} flags you get this. Also I don't understand the truly understands the device constraints comment. We can export a model of the device and know how many rules of each type will fit exactly into the table. This doesn't seem like much of a problem to me. In fact the driver developer should know this anyway. Part of my motivation here is I really don't want to get stuck with a case where each driver writer gets to translate the eviction policy onto their device in some device specific and slightly different way. But this is _exactly_ what I want. Here's why: my claim is it will be impossible for us (device vendors) to define a universal set of resource constraints that works for all devices from all vendors. I was kind of hoping some vendor would throw out a set to get us started. Ok, I'll start with rocker: rocker will enforce in the device these constraints listed below. There will be a device command to query the raw constraints. So here goes: VLAN table max entries: 16K // a VLAN on a port takes one entry Term MAC table max entries: no limit Bridging table: Unicast max entries: 12K Multicast max entries: 4K Unicast Routing table (shared for v4 and v6 entries): Prefix max slots: 16K IPv4 route takes one slot IPv6 prefix len = 64 route takes two slots IPv6 prefix len 64 takes four slots Nexthop max slots: 4K Max ECMP width: 32 Each nexthop MAC takes one slot, but there is a stride of 4 slots Multicast Routing table (shared for v4 and v6 entries): (same as unicast routing, except max slots are 1/2 as big) ACL table: IPv4 max entries: 8K IPv6 max entries: 8K Combined IPv4 and IPv6 entries: 12K The table names are OF-DPA-specific. The limits are contrived, obviously, but are representative of real-world devices. The
Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
On 05/30/2015 02:00 AM, Jiri Pirko wrote: Fri, May 29, 2015 at 05:39:46PM CEST, sfel...@gmail.com wrote: On Fri, May 29, 2015 at 12:50 AM, Jiri Pirko j...@resnulli.us wrote: Thu, May 21, 2015 at 07:46:54AM CEST, sfel...@gmail.com wrote: On Tue, May 19, 2015 at 1:28 PM, David Miller da...@davemloft.net wrote: From: Andy Gospodarek go...@cumulusnetworks.com Date: Tue, 19 May 2015 15:47:32 -0400 Are you actually saying that if users complain loudly enough about the current behavior (not the change Roopa has proposed) that you would be open to considering a change the current behavior? I am saying that we have a contract with users not to break existing behavior. Full stop. After rehearing David's argument, we should probably explore option d) which is a refinement on the fib_offload_disable mechanism we have today. fib_offload_disable is global for all routes. Once we hit a HW install problem, the global flag is set and all routes fallback to SW. We did this because we can't allow the failed route to exist in SW and not in HW because it could mess up LPM searches (HW could hit on a lesser prefix even when SW has the true LPM, because HW gets first shot at match). The refinement on fib_offload_disable is this: make it per-related-prefix rather than global, and on a HW install problem, set the flag for the related-prefix and uninstall only those routes from HW. Related-prefix (is there a correct term for this?) are routes to the same dst addr but with different prefix lengths. I haven't parsed the fib_trie structure to see how routes are organized, but I suspect since it's optimized for lookup the related-prefix tracking is already there and we can build on that. This looks interesting. However, I'm not sure that it is acceptable for user to experience this hw evict of random entries. User knows what entries are essential to have in hw. With your solution, I can see no way user can actually say what should be offloaded or not. Kernel just automagically decides. The default eviction policy could be based on RTA_PRIORITY: evict lower priority routes first. It would be up to the device driver to decide between two routes of same priority. To help device driver make the decision, we could have eviction policy options: Priority-base (default) Prefer IPv6 over IPv4 Prefer IPv4 over IPv6 Prefer single path over multipath Prefer longer prefix lengths over shorter Optimize for resource utilization These are portable across different switches. They're in terms a user understands. It's up to the device driver which truly understands the device constraints to translates the user's eviction policy choices into something that makes sense to that device. This sounds tempting... You plan to throw in some patches, or should I take care of that? This is encoding specific policies into the kernel. I was hoping to avoid this and let user space develop whatever policy it wants. If you use Jiri's proposed NLM_F_SKIP_{KERNEL|OFFLOAD} flags you get this. Also I don't understand the truly understands the device constraints comment. We can export a model of the device and know how many rules of each type will fit exactly into the table. This doesn't seem like much of a problem to me. In fact the driver developer should know this anyway. Part of my motivation here is I really don't want to get stuck with a case where each driver writer gets to translate the eviction policy onto their device in some device specific and slightly different way. It means every developer has to write a new mapping and get it correct. At very least we should put a layer in switchdev that reads the table out of the driver and does the mapping so we have it one spot. At least then the kernel is enforcing policy the same on all devices. Better still IMO would be to develop the policy in user space and have a library/tool that does this so we don't end up with a bunch of policy blobs in the kernel. The 6 above is a good start but over time we more policy blobs will surely pop up. I would for example put 'optimize for throughput' on the list. .John -- John Fastabend Intel Corporation -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
Fri, May 29, 2015 at 05:39:46PM CEST, sfel...@gmail.com wrote: On Fri, May 29, 2015 at 12:50 AM, Jiri Pirko j...@resnulli.us wrote: Thu, May 21, 2015 at 07:46:54AM CEST, sfel...@gmail.com wrote: On Tue, May 19, 2015 at 1:28 PM, David Miller da...@davemloft.net wrote: From: Andy Gospodarek go...@cumulusnetworks.com Date: Tue, 19 May 2015 15:47:32 -0400 Are you actually saying that if users complain loudly enough about the current behavior (not the change Roopa has proposed) that you would be open to considering a change the current behavior? I am saying that we have a contract with users not to break existing behavior. Full stop. After rehearing David's argument, we should probably explore option d) which is a refinement on the fib_offload_disable mechanism we have today. fib_offload_disable is global for all routes. Once we hit a HW install problem, the global flag is set and all routes fallback to SW. We did this because we can't allow the failed route to exist in SW and not in HW because it could mess up LPM searches (HW could hit on a lesser prefix even when SW has the true LPM, because HW gets first shot at match). The refinement on fib_offload_disable is this: make it per-related-prefix rather than global, and on a HW install problem, set the flag for the related-prefix and uninstall only those routes from HW. Related-prefix (is there a correct term for this?) are routes to the same dst addr but with different prefix lengths. I haven't parsed the fib_trie structure to see how routes are organized, but I suspect since it's optimized for lookup the related-prefix tracking is already there and we can build on that. This looks interesting. However, I'm not sure that it is acceptable for user to experience this hw evict of random entries. User knows what entries are essential to have in hw. With your solution, I can see no way user can actually say what should be offloaded or not. Kernel just automagically decides. The default eviction policy could be based on RTA_PRIORITY: evict lower priority routes first. It would be up to the device driver to decide between two routes of same priority. To help device driver make the decision, we could have eviction policy options: Priority-base (default) Prefer IPv6 over IPv4 Prefer IPv4 over IPv6 Prefer single path over multipath Prefer longer prefix lengths over shorter Optimize for resource utilization These are portable across different switches. They're in terms a user understands. It's up to the device driver which truly understands the device constraints to translates the user's eviction policy choices into something that makes sense to that device. This sounds tempting... You plan to throw in some patches, or should I take care of that? -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
Thu, May 28, 2015 at 05:35:05PM CEST, john.fastab...@gmail.com wrote: On 05/28/2015 02:42 AM, Jiri Pirko wrote: Mon, May 18, 2015 at 10:19:16PM CEST, da...@davemloft.net wrote: From: Roopa Prabhu ro...@cumulusnetworks.com Date: Sun, 17 May 2015 16:42:05 -0700 On most systems where you can offload routes to hardware, doing routing in software is not an option (the cpu limitations make routing impossible in software). You absolutely do not get to determine this policy, none of us do. What matters is that by default the damn switch device being there is %100 transparent to the user. And the way to achieve that default is to do software routes as a fallback. I am not going to entertain changes of this nature which fail route loading by default just because we've exceeded a device's HW capacity to offload. I thought I was _really_ clear about this at netdev 0.1 I certainly agree that by default, transparency 1:1 sw:hw mapping is what we need for fib. The current code is a good start! I see couple of issues regarding switchdev_fib_ipv4_abort: 1) If user adds and entry, switchdev_fib_ipv4_add fails, abort is executed - and, error returned. I would expect that route entry should be added in this case. The next attempt of adding the same entry will be successful. The current behaviour breaks the transparency you are reffering to. 2) When switchdev_fib_ipv4_abort happens to be executed, the offload is disabled for good (until reboot). That is certainly not nice, alhough I understand that is the easiest solution for now. I believe that we all agree that the 1:1 transparency, although it is a default, may not be optimal for real-life usage. HW resources are limited and user does not know them. The danger of hitting _abort and screwing-up the whole system is huge, unacceptable. So here, there are couple of more or less simple things that I suggest to do in order to move a little bit forward: 1) Introduce system-wide option to switch _abort to just plain fail. When HW does not have capacity, do not flush and fallback to sw, but rather just fail to add the entry. This would not break anything. Userspace has to be prepared that entry add could fail. 2) Introduce a way to propagate resources to userspace. Driver knows about resources used/available/potentially_available. Switchdev infra could be extended in order to propagate the info to the user. I currently use the FlowAPI work I presented at netdev conference for this. Perhaps I was a bit reaching by trying to also push it as a replacement for the ethtool flow classification mechanism all in one shot. For what it is worth replacing 'ethtool' flow classifier when I have a pipeline of tables in a NIC is really my first use case for the 'set' operations but that is off-topic probably. The benefits I see of using this interface (or if you want rename it and push it into a different netlink type) is it gives you the entire view of the switch resources and pipeline from a single interface. Also because you are talking about system-wide behaviour above it nicely rolls up into user space software where we can act on it with the flags we have for l2 already and if we pursue your option (3) also l3. I like the single interface vs. scattering the information across many different interfaces this way we can do it once and be done with it. If you scatter it across all the interfaces just l2,l3 for now but we will get more then each interface will have its own mechanism and I have no idea where you put global information such as table ordering. I think that for fib capacities/capabilities, user should be able to use extended existing Netlink interface. Not some parallel one. I'm still not convinced that user should care about the actual hw pipeline. We already have a pipeline in kernel. Switch drivers should just do mapping, easy as that. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
Thu, May 21, 2015 at 07:46:54AM CEST, sfel...@gmail.com wrote: On Tue, May 19, 2015 at 1:28 PM, David Miller da...@davemloft.net wrote: From: Andy Gospodarek go...@cumulusnetworks.com Date: Tue, 19 May 2015 15:47:32 -0400 Are you actually saying that if users complain loudly enough about the current behavior (not the change Roopa has proposed) that you would be open to considering a change the current behavior? I am saying that we have a contract with users not to break existing behavior. Full stop. After rehearing David's argument, we should probably explore option d) which is a refinement on the fib_offload_disable mechanism we have today. fib_offload_disable is global for all routes. Once we hit a HW install problem, the global flag is set and all routes fallback to SW. We did this because we can't allow the failed route to exist in SW and not in HW because it could mess up LPM searches (HW could hit on a lesser prefix even when SW has the true LPM, because HW gets first shot at match). The refinement on fib_offload_disable is this: make it per-related-prefix rather than global, and on a HW install problem, set the flag for the related-prefix and uninstall only those routes from HW. Related-prefix (is there a correct term for this?) are routes to the same dst addr but with different prefix lengths. I haven't parsed the fib_trie structure to see how routes are organized, but I suspect since it's optimized for lookup the related-prefix tracking is already there and we can build on that. This looks interesting. However, I'm not sure that it is acceptable for user to experience this hw evict of random entries. User knows what entries are essential to have in hw. With your solution, I can see no way user can actually say what should be offloaded or not. Kernel just automagically decides. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
Thu, May 28, 2015 at 05:40:11PM CEST, sfel...@gmail.com wrote: On Thu, May 28, 2015 at 2:42 AM, Jiri Pirko j...@resnulli.us wrote: Mon, May 18, 2015 at 10:19:16PM CEST, da...@davemloft.net wrote: From: Roopa Prabhu ro...@cumulusnetworks.com Date: Sun, 17 May 2015 16:42:05 -0700 On most systems where you can offload routes to hardware, doing routing in software is not an option (the cpu limitations make routing impossible in software). You absolutely do not get to determine this policy, none of us do. What matters is that by default the damn switch device being there is %100 transparent to the user. And the way to achieve that default is to do software routes as a fallback. I am not going to entertain changes of this nature which fail route loading by default just because we've exceeded a device's HW capacity to offload. I thought I was _really_ clear about this at netdev 0.1 I certainly agree that by default, transparency 1:1 sw:hw mapping is what we need for fib. The current code is a good start! I see couple of issues regarding switchdev_fib_ipv4_abort: 1) If user adds and entry, switchdev_fib_ipv4_add fails, abort is executed - and, error returned. I would expect that route entry should be added in this case. The next attempt of adding the same entry will be successful. The current behaviour breaks the transparency you are reffering to. 2) When switchdev_fib_ipv4_abort happens to be executed, the offload is disabled for good (until reboot). That is certainly not nice, alhough I understand that is the easiest solution for now. I believe that we all agree that the 1:1 transparency, although it is a default, may not be optimal for real-life usage. HW resources are limited and user does not know them. The danger of hitting _abort and screwing-up the whole system is huge, unacceptable. So here, there are couple of more or less simple things that I suggest to do in order to move a little bit forward: 1) Introduce system-wide option to switch _abort to just plain fail. When HW does not have capacity, do not flush and fallback to sw, but rather just fail to add the entry. This would not break anything. Userspace has to be prepared that entry add could fail. 2) Introduce a way to propagate resources to userspace. Driver knows about resources used/available/potentially_available. Switchdev infra could be extended in order to propagate the info to the user. 3) Introduce couple of flags for entry add that would alter the default behaviour. Something like: NLM_F_SKIP_KERNEL NLM_F_SKIP_OFFLOAD Again, this does not break the current users. On the other hand, this gives new users a leverage to instruct kernel where the entry should be added to (or not added to). Any thoughts? Objections? I don't like these. Breaks transparency and forces the user in a position of having to know hardware failures modes (unique to each Can you please elaborate on this a bit more? I fail to see transparency breaking in my proposal :/ Maybe it is by different understanding of the term? Also I do not understand the remark about user having to know hardware failure modes. Could you please explain? hardware device). I presented an option d) which avoids this issues; was it not understood? I just commented on option d) it other email. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
Fri, May 29, 2015 at 05:12:35PM CEST, sfel...@gmail.com wrote: On Thu, May 28, 2015 at 2:42 AM, Jiri Pirko j...@resnulli.us wrote: Mon, May 18, 2015 at 10:19:16PM CEST, da...@davemloft.net wrote: From: Roopa Prabhu ro...@cumulusnetworks.com Date: Sun, 17 May 2015 16:42:05 -0700 On most systems where you can offload routes to hardware, doing routing in software is not an option (the cpu limitations make routing impossible in software). You absolutely do not get to determine this policy, none of us do. What matters is that by default the damn switch device being there is %100 transparent to the user. And the way to achieve that default is to do software routes as a fallback. I am not going to entertain changes of this nature which fail route loading by default just because we've exceeded a device's HW capacity to offload. I thought I was _really_ clear about this at netdev 0.1 I certainly agree that by default, transparency 1:1 sw:hw mapping is what we need for fib. The current code is a good start! I see couple of issues regarding switchdev_fib_ipv4_abort: 1) If user adds and entry, switchdev_fib_ipv4_add fails, abort is executed - and, error returned. I would expect that route entry should be added in this case. The next attempt of adding the same entry will be successful. The current behaviour breaks the transparency you are reffering to. 2) When switchdev_fib_ipv4_abort happens to be executed, the offload is disabled for good (until reboot). That is certainly not nice, alhough I understand that is the easiest solution for now. I believe that we all agree that the 1:1 transparency, although it is a default, may not be optimal for real-life usage. HW resources are limited and user does not know them. The danger of hitting _abort and screwing-up the whole system is huge, unacceptable. So here, there are couple of more or less simple things that I suggest to do in order to move a little bit forward: 1) Introduce system-wide option to switch _abort to just plain fail. When HW does not have capacity, do not flush and fallback to sw, but rather just fail to add the entry. This would not break anything. Userspace has to be prepared that entry add could fail. This breaks 1:1 transparency. A route now fails to install and the user is scratching his/her head as to why it failed. It used to work when there was no switch offload. It works with switch offload on this other device. So it must be a failure due to switch offload on this device. But why this route? I just installed 20 IPv4 routes and 10 IPv6 routes. Why did this 11th IPv6 route fail to install? See, now user needs to learn about details of that particular device's limits to understand failure. When they move their application to another device, they need to re-learn failure modes. I don't want this behaviour as the default. Default should be what is at this moment. This would be tunable by user. That, I believe is correct. 2) Introduce a way to propagate resources to userspace. Driver knows about resources used/available/potentially_available. Switchdev infra could be extended in order to propagate the info to the user. 3) Introduce couple of flags for entry add that would alter the default behaviour. Something like: NLM_F_SKIP_KERNEL NLM_F_SKIP_OFFLOAD Again, this does not break the current users. On the other hand, this gives new users a leverage to instruct kernel where the entry should be added to (or not added to). I don't think we want an NLM_F_SKIP_KERNEL option and only have the route installed on the device. We want offload to be an acceleration of the kernel's FIB, not a bypass. Okay, fair enough. Let's have NLM_F_SKIP_OFFLOAD only. SKIP_OFFLOAD can mess up LPM if the user is not really really careful. Any thoughts? Objections? Thanks! Jiri -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
On Thu, May 28, 2015 at 2:42 AM, Jiri Pirko j...@resnulli.us wrote: Mon, May 18, 2015 at 10:19:16PM CEST, da...@davemloft.net wrote: From: Roopa Prabhu ro...@cumulusnetworks.com Date: Sun, 17 May 2015 16:42:05 -0700 On most systems where you can offload routes to hardware, doing routing in software is not an option (the cpu limitations make routing impossible in software). You absolutely do not get to determine this policy, none of us do. What matters is that by default the damn switch device being there is %100 transparent to the user. And the way to achieve that default is to do software routes as a fallback. I am not going to entertain changes of this nature which fail route loading by default just because we've exceeded a device's HW capacity to offload. I thought I was _really_ clear about this at netdev 0.1 I certainly agree that by default, transparency 1:1 sw:hw mapping is what we need for fib. The current code is a good start! I see couple of issues regarding switchdev_fib_ipv4_abort: 1) If user adds and entry, switchdev_fib_ipv4_add fails, abort is executed - and, error returned. I would expect that route entry should be added in this case. The next attempt of adding the same entry will be successful. The current behaviour breaks the transparency you are reffering to. 2) When switchdev_fib_ipv4_abort happens to be executed, the offload is disabled for good (until reboot). That is certainly not nice, alhough I understand that is the easiest solution for now. I believe that we all agree that the 1:1 transparency, although it is a default, may not be optimal for real-life usage. HW resources are limited and user does not know them. The danger of hitting _abort and screwing-up the whole system is huge, unacceptable. So here, there are couple of more or less simple things that I suggest to do in order to move a little bit forward: 1) Introduce system-wide option to switch _abort to just plain fail. When HW does not have capacity, do not flush and fallback to sw, but rather just fail to add the entry. This would not break anything. Userspace has to be prepared that entry add could fail. This breaks 1:1 transparency. A route now fails to install and the user is scratching his/her head as to why it failed. It used to work when there was no switch offload. It works with switch offload on this other device. So it must be a failure due to switch offload on this device. But why this route? I just installed 20 IPv4 routes and 10 IPv6 routes. Why did this 11th IPv6 route fail to install? See, now user needs to learn about details of that particular device's limits to understand failure. When they move their application to another device, they need to re-learn failure modes. 2) Introduce a way to propagate resources to userspace. Driver knows about resources used/available/potentially_available. Switchdev infra could be extended in order to propagate the info to the user. 3) Introduce couple of flags for entry add that would alter the default behaviour. Something like: NLM_F_SKIP_KERNEL NLM_F_SKIP_OFFLOAD Again, this does not break the current users. On the other hand, this gives new users a leverage to instruct kernel where the entry should be added to (or not added to). I don't think we want an NLM_F_SKIP_KERNEL option and only have the route installed on the device. We want offload to be an acceleration of the kernel's FIB, not a bypass. SKIP_OFFLOAD can mess up LPM if the user is not really really careful. Any thoughts? Objections? Thanks! Jiri -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
On Fri, May 29, 2015 at 12:50 AM, Jiri Pirko j...@resnulli.us wrote: Thu, May 21, 2015 at 07:46:54AM CEST, sfel...@gmail.com wrote: On Tue, May 19, 2015 at 1:28 PM, David Miller da...@davemloft.net wrote: From: Andy Gospodarek go...@cumulusnetworks.com Date: Tue, 19 May 2015 15:47:32 -0400 Are you actually saying that if users complain loudly enough about the current behavior (not the change Roopa has proposed) that you would be open to considering a change the current behavior? I am saying that we have a contract with users not to break existing behavior. Full stop. After rehearing David's argument, we should probably explore option d) which is a refinement on the fib_offload_disable mechanism we have today. fib_offload_disable is global for all routes. Once we hit a HW install problem, the global flag is set and all routes fallback to SW. We did this because we can't allow the failed route to exist in SW and not in HW because it could mess up LPM searches (HW could hit on a lesser prefix even when SW has the true LPM, because HW gets first shot at match). The refinement on fib_offload_disable is this: make it per-related-prefix rather than global, and on a HW install problem, set the flag for the related-prefix and uninstall only those routes from HW. Related-prefix (is there a correct term for this?) are routes to the same dst addr but with different prefix lengths. I haven't parsed the fib_trie structure to see how routes are organized, but I suspect since it's optimized for lookup the related-prefix tracking is already there and we can build on that. This looks interesting. However, I'm not sure that it is acceptable for user to experience this hw evict of random entries. User knows what entries are essential to have in hw. With your solution, I can see no way user can actually say what should be offloaded or not. Kernel just automagically decides. The default eviction policy could be based on RTA_PRIORITY: evict lower priority routes first. It would be up to the device driver to decide between two routes of same priority. To help device driver make the decision, we could have eviction policy options: Priority-base (default) Prefer IPv6 over IPv4 Prefer IPv4 over IPv6 Prefer single path over multipath Prefer longer prefix lengths over shorter Optimize for resource utilization These are portable across different switches. They're in terms a user understands. It's up to the device driver which truly understands the device constraints to translates the user's eviction policy choices into something that makes sense to that device. -scott -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
On Thu, May 28, 2015 at 08:40:11AM -0700, Scott Feldman wrote: On Thu, May 28, 2015 at 2:42 AM, Jiri Pirko j...@resnulli.us wrote: Mon, May 18, 2015 at 10:19:16PM CEST, da...@davemloft.net wrote: From: Roopa Prabhu ro...@cumulusnetworks.com Date: Sun, 17 May 2015 16:42:05 -0700 On most systems where you can offload routes to hardware, doing routing in software is not an option (the cpu limitations make routing impossible in software). You absolutely do not get to determine this policy, none of us do. What matters is that by default the damn switch device being there is %100 transparent to the user. And the way to achieve that default is to do software routes as a fallback. I am not going to entertain changes of this nature which fail route loading by default just because we've exceeded a device's HW capacity to offload. I thought I was _really_ clear about this at netdev 0.1 I certainly agree that by default, transparency 1:1 sw:hw mapping is what we need for fib. The current code is a good start! I see couple of issues regarding switchdev_fib_ipv4_abort: 1) If user adds and entry, switchdev_fib_ipv4_add fails, abort is executed - and, error returned. I would expect that route entry should be added in this case. The next attempt of adding the same entry will be successful. The current behaviour breaks the transparency you are reffering to. 2) When switchdev_fib_ipv4_abort happens to be executed, the offload is disabled for good (until reboot). That is certainly not nice, alhough I understand that is the easiest solution for now. I believe that we all agree that the 1:1 transparency, although it is a default, may not be optimal for real-life usage. HW resources are limited and user does not know them. The danger of hitting _abort and screwing-up the whole system is huge, unacceptable. So here, there are couple of more or less simple things that I suggest to do in order to move a little bit forward: 1) Introduce system-wide option to switch _abort to just plain fail. When HW does not have capacity, do not flush and fallback to sw, but rather just fail to add the entry. This would not break anything. Userspace has to be prepared that entry add could fail. 2) Introduce a way to propagate resources to userspace. Driver knows about resources used/available/potentially_available. Switchdev infra could be extended in order to propagate the info to the user. 3) Introduce couple of flags for entry add that would alter the default behaviour. Something like: NLM_F_SKIP_KERNEL NLM_F_SKIP_OFFLOAD Again, this does not break the current users. On the other hand, this gives new users a leverage to instruct kernel where the entry should be added to (or not added to). Any thoughts? Objections? I don't like these. Breaks transparency and forces the user in a position of having to know hardware failures modes (unique to each hardware device). I presented an option d) which avoids this issues; was it not understood? I actually really like the way Jiri succinctly covered the different cases to move us forward from what we have today (Thanks, Jiri!). I completely agree with you on both of your problem statements and the idea that what have is fine for the short-term. I see definite room to improve the the user experience available via upstream kernels. Option 1 has appeal since userspace applications that control FDB, FIB, etc entries could work without modification (the when in this mode the kernel could choose to ignore any NLM_F_* flags Jiri proposed), but I agree that a system-wide (or maybe offload-device-wide?) configuration option needs to exist as this should not be the default behavior. Option 2 could also work as userspace applications could query for space availability before attempting to add a route. This could be nice during bootup as then apps could periodically double check that their view of the world is accurate. Option 3 also has appeal since there exists the ability to allow fine-grained control from userspace applications since less used routes (or routes that could be summarized) could be combined in userspace if needed. The great part about all suggestions is that when combined they can provide a great user experience, but doing all 3 at once is probably too aggressive. My vote would be to see if we can work together on a combination of Option 1 and 3 together as they seem to provide a great first start to this... If an application tried to add a route (called A) to the route table in the kernel and code to support Option 1 existed (similar to what Roopa posted to start this series) then the kernel could fail to add route A. If the user noted that some other route (called B) was lower priority for _any_ reason, the user could delete route B from the kernel and hardware and add route A to hardware and kernel. Then the
Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
On 5/28/15, 3:35 PM, Andy Gospodarek wrote: On Thu, May 28, 2015 at 08:40:11AM -0700, Scott Feldman wrote: On Thu, May 28, 2015 at 2:42 AM, Jiri Pirko j...@resnulli.us wrote: Mon, May 18, 2015 at 10:19:16PM CEST, da...@davemloft.net wrote: From: Roopa Prabhu ro...@cumulusnetworks.com Date: Sun, 17 May 2015 16:42:05 -0700 On most systems where you can offload routes to hardware, doing routing in software is not an option (the cpu limitations make routing impossible in software). You absolutely do not get to determine this policy, none of us do. What matters is that by default the damn switch device being there is %100 transparent to the user. And the way to achieve that default is to do software routes as a fallback. I am not going to entertain changes of this nature which fail route loading by default just because we've exceeded a device's HW capacity to offload. I thought I was _really_ clear about this at netdev 0.1 I certainly agree that by default, transparency 1:1 sw:hw mapping is what we need for fib. The current code is a good start! I see couple of issues regarding switchdev_fib_ipv4_abort: 1) If user adds and entry, switchdev_fib_ipv4_add fails, abort is executed - and, error returned. I would expect that route entry should be added in this case. The next attempt of adding the same entry will be successful. The current behaviour breaks the transparency you are reffering to. 2) When switchdev_fib_ipv4_abort happens to be executed, the offload is disabled for good (until reboot). That is certainly not nice, alhough I understand that is the easiest solution for now. I believe that we all agree that the 1:1 transparency, although it is a default, may not be optimal for real-life usage. HW resources are limited and user does not know them. The danger of hitting _abort and screwing-up the whole system is huge, unacceptable. So here, there are couple of more or less simple things that I suggest to do in order to move a little bit forward: 1) Introduce system-wide option to switch _abort to just plain fail. When HW does not have capacity, do not flush and fallback to sw, but rather just fail to add the entry. This would not break anything. Userspace has to be prepared that entry add could fail. 2) Introduce a way to propagate resources to userspace. Driver knows about resources used/available/potentially_available. Switchdev infra could be extended in order to propagate the info to the user. 3) Introduce couple of flags for entry add that would alter the default behaviour. Something like: NLM_F_SKIP_KERNEL NLM_F_SKIP_OFFLOAD Again, this does not break the current users. On the other hand, this gives new users a leverage to instruct kernel where the entry should be added to (or not added to). Any thoughts? Objections? I don't like these. Breaks transparency and forces the user in a position of having to know hardware failures modes (unique to each hardware device). I presented an option d) which avoids this issues; was it not understood? I actually really like the way Jiri succinctly covered the different cases to move us forward from what we have today (Thanks, Jiri!). I completely agree with you on both of your problem statements and the idea that what have is fine for the short-term. I see definite room to improve the the user experience available via upstream kernels. Option 1 has appeal since userspace applications that control FDB, FIB, etc entries could work without modification (the when in this mode the kernel could choose to ignore any NLM_F_* flags Jiri proposed), but I agree that a system-wide (or maybe offload-device-wide?) configuration option needs to exist as this should not be the default behavior. +1...i started with a sysctl for this as an example in my patch. But, instead of adding 10 different sysctls in different places, a switchdev policy infra/api and flags is due. I have been planning to post a v3 of my patch with some of these policy flags. Option 2 could also work as userspace applications could query for space availability before attempting to add a route. This could be nice during bootup as then apps could periodically double check that their view of the world is accurate. Option 3 also has appeal since there exists the ability to allow fine-grained control from userspace applications since less used routes (or routes that could be summarized) could be combined in userspace if needed. The great part about all suggestions is that when combined they can provide a great user experience, but doing all 3 at once is probably too aggressive. My vote would be to see if we can work together on a combination of Option 1 and 3 together as they seem to provide a great first start to this... If an application tried to add a route (called A) to the route table in the kernel and code to support Option 1 existed (similar to what Roopa posted to start this series) then the kernel
Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
On 5/28/15, 9:10 AM, John Fastabend wrote: On 05/28/2015 08:40 AM, Scott Feldman wrote: On Thu, May 28, 2015 at 2:42 AM, Jiri Pirko j...@resnulli.us wrote: Mon, May 18, 2015 at 10:19:16PM CEST, da...@davemloft.net wrote: From: Roopa Prabhu ro...@cumulusnetworks.com Date: Sun, 17 May 2015 16:42:05 -0700 On most systems where you can offload routes to hardware, doing routing in software is not an option (the cpu limitations make routing impossible in software). You absolutely do not get to determine this policy, none of us do. What matters is that by default the damn switch device being there is %100 transparent to the user. And the way to achieve that default is to do software routes as a fallback. I am not going to entertain changes of this nature which fail route loading by default just because we've exceeded a device's HW capacity to offload. I thought I was _really_ clear about this at netdev 0.1 I certainly agree that by default, transparency 1:1 sw:hw mapping is what we need for fib. The current code is a good start! I see couple of issues regarding switchdev_fib_ipv4_abort: 1) If user adds and entry, switchdev_fib_ipv4_add fails, abort is executed - and, error returned. I would expect that route entry should be added in this case. The next attempt of adding the same entry will be successful. The current behaviour breaks the transparency you are reffering to. 2) When switchdev_fib_ipv4_abort happens to be executed, the offload is disabled for good (until reboot). That is certainly not nice, alhough I understand that is the easiest solution for now. I believe that we all agree that the 1:1 transparency, although it is a default, may not be optimal for real-life usage. HW resources are limited and user does not know them. The danger of hitting _abort and screwing-up the whole system is huge, unacceptable. So here, there are couple of more or less simple things that I suggest to do in order to move a little bit forward: 1) Introduce system-wide option to switch _abort to just plain fail. When HW does not have capacity, do not flush and fallback to sw, but rather just fail to add the entry. This would not break anything. Userspace has to be prepared that entry add could fail. 2) Introduce a way to propagate resources to userspace. Driver knows about resources used/available/potentially_available. Switchdev infra could be extended in order to propagate the info to the user. 3) Introduce couple of flags for entry add that would alter the default behaviour. Something like: NLM_F_SKIP_KERNEL NLM_F_SKIP_OFFLOAD Again, this does not break the current users. On the other hand, this gives new users a leverage to instruct kernel where the entry should be added to (or not added to). Any thoughts? Objections? I don't like these. Breaks transparency and forces the user in a position of having to know hardware failures modes (unique to each hardware device). I presented an option d) which avoids this issues; was it not understood? Hi Scott, I understood your proposal. One caveat I had is in response to this, Actually, now that I think of it, the device/driver could decide which related-prefix to evict from HW, if driver/device wanted to have a sense of which routes are more important to offload than other hardware/driver/device shouldn't have a sense of which routes are more important than others. correct. The routing daemons know this best. I think this is where the NLM_F_* flags come in. If userspace _wants_ to push policy into the kernel about what is important it can. If it doesn't we get a sensible heuristic that does a reasonable job offloading rules transparently. This is how we did L2 and I think that seems to work fairly well. At least for me but, always interested to hear other use cases though. agree. Also I guess I'm not seeing the multitude of hardware failure modes. I see two either the hardware doesn't support the operation or it is out of resources. Both can be learned if the hardware exports a model of its capabilities and resources. agree, A switchdev api to query hardware resource and capability is due. We can start with rocker. It gives an app the choice to control the policy. But, for our usecase/deployments today, i am more interested in a system wide policy because it is easier on my apps today. thanks. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
Mon, May 18, 2015 at 10:19:16PM CEST, da...@davemloft.net wrote: From: Roopa Prabhu ro...@cumulusnetworks.com Date: Sun, 17 May 2015 16:42:05 -0700 On most systems where you can offload routes to hardware, doing routing in software is not an option (the cpu limitations make routing impossible in software). You absolutely do not get to determine this policy, none of us do. What matters is that by default the damn switch device being there is %100 transparent to the user. And the way to achieve that default is to do software routes as a fallback. I am not going to entertain changes of this nature which fail route loading by default just because we've exceeded a device's HW capacity to offload. I thought I was _really_ clear about this at netdev 0.1 I certainly agree that by default, transparency 1:1 sw:hw mapping is what we need for fib. The current code is a good start! I see couple of issues regarding switchdev_fib_ipv4_abort: 1) If user adds and entry, switchdev_fib_ipv4_add fails, abort is executed - and, error returned. I would expect that route entry should be added in this case. The next attempt of adding the same entry will be successful. The current behaviour breaks the transparency you are reffering to. 2) When switchdev_fib_ipv4_abort happens to be executed, the offload is disabled for good (until reboot). That is certainly not nice, alhough I understand that is the easiest solution for now. I believe that we all agree that the 1:1 transparency, although it is a default, may not be optimal for real-life usage. HW resources are limited and user does not know them. The danger of hitting _abort and screwing-up the whole system is huge, unacceptable. So here, there are couple of more or less simple things that I suggest to do in order to move a little bit forward: 1) Introduce system-wide option to switch _abort to just plain fail. When HW does not have capacity, do not flush and fallback to sw, but rather just fail to add the entry. This would not break anything. Userspace has to be prepared that entry add could fail. 2) Introduce a way to propagate resources to userspace. Driver knows about resources used/available/potentially_available. Switchdev infra could be extended in order to propagate the info to the user. 3) Introduce couple of flags for entry add that would alter the default behaviour. Something like: NLM_F_SKIP_KERNEL NLM_F_SKIP_OFFLOAD Again, this does not break the current users. On the other hand, this gives new users a leverage to instruct kernel where the entry should be added to (or not added to). Any thoughts? Objections? Thanks! Jiri -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
On 05/28/2015 08:40 AM, Scott Feldman wrote: On Thu, May 28, 2015 at 2:42 AM, Jiri Pirko j...@resnulli.us wrote: Mon, May 18, 2015 at 10:19:16PM CEST, da...@davemloft.net wrote: From: Roopa Prabhu ro...@cumulusnetworks.com Date: Sun, 17 May 2015 16:42:05 -0700 On most systems where you can offload routes to hardware, doing routing in software is not an option (the cpu limitations make routing impossible in software). You absolutely do not get to determine this policy, none of us do. What matters is that by default the damn switch device being there is %100 transparent to the user. And the way to achieve that default is to do software routes as a fallback. I am not going to entertain changes of this nature which fail route loading by default just because we've exceeded a device's HW capacity to offload. I thought I was _really_ clear about this at netdev 0.1 I certainly agree that by default, transparency 1:1 sw:hw mapping is what we need for fib. The current code is a good start! I see couple of issues regarding switchdev_fib_ipv4_abort: 1) If user adds and entry, switchdev_fib_ipv4_add fails, abort is executed - and, error returned. I would expect that route entry should be added in this case. The next attempt of adding the same entry will be successful. The current behaviour breaks the transparency you are reffering to. 2) When switchdev_fib_ipv4_abort happens to be executed, the offload is disabled for good (until reboot). That is certainly not nice, alhough I understand that is the easiest solution for now. I believe that we all agree that the 1:1 transparency, although it is a default, may not be optimal for real-life usage. HW resources are limited and user does not know them. The danger of hitting _abort and screwing-up the whole system is huge, unacceptable. So here, there are couple of more or less simple things that I suggest to do in order to move a little bit forward: 1) Introduce system-wide option to switch _abort to just plain fail. When HW does not have capacity, do not flush and fallback to sw, but rather just fail to add the entry. This would not break anything. Userspace has to be prepared that entry add could fail. 2) Introduce a way to propagate resources to userspace. Driver knows about resources used/available/potentially_available. Switchdev infra could be extended in order to propagate the info to the user. 3) Introduce couple of flags for entry add that would alter the default behaviour. Something like: NLM_F_SKIP_KERNEL NLM_F_SKIP_OFFLOAD Again, this does not break the current users. On the other hand, this gives new users a leverage to instruct kernel where the entry should be added to (or not added to). Any thoughts? Objections? I don't like these. Breaks transparency and forces the user in a position of having to know hardware failures modes (unique to each hardware device). I presented an option d) which avoids this issues; was it not understood? Hi Scott, I understood your proposal. One caveat I had is in response to this, Actually, now that I think of it, the device/driver could decide which related-prefix to evict from HW, if driver/device wanted to have a sense of which routes are more important to offload than other hardware/driver/device shouldn't have a sense of which routes are more important than others. I think this is where the NLM_F_* flags come in. If userspace _wants_ to push policy into the kernel about what is important it can. If it doesn't we get a sensible heuristic that does a reasonable job offloading rules transparently. This is how we did L2 and I think that seems to work fairly well. At least for me but, always interested to hear other use cases though. Also I guess I'm not seeing the multitude of hardware failure modes. I see two either the hardware doesn't support the operation or it is out of resources. Both can be learned if the hardware exports a model of its capabilities and resources. Thanks, John -- John Fastabend Intel Corporation -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
On Thu, May 28, 2015 at 2:42 AM, Jiri Pirko j...@resnulli.us wrote: Mon, May 18, 2015 at 10:19:16PM CEST, da...@davemloft.net wrote: From: Roopa Prabhu ro...@cumulusnetworks.com Date: Sun, 17 May 2015 16:42:05 -0700 On most systems where you can offload routes to hardware, doing routing in software is not an option (the cpu limitations make routing impossible in software). You absolutely do not get to determine this policy, none of us do. What matters is that by default the damn switch device being there is %100 transparent to the user. And the way to achieve that default is to do software routes as a fallback. I am not going to entertain changes of this nature which fail route loading by default just because we've exceeded a device's HW capacity to offload. I thought I was _really_ clear about this at netdev 0.1 I certainly agree that by default, transparency 1:1 sw:hw mapping is what we need for fib. The current code is a good start! I see couple of issues regarding switchdev_fib_ipv4_abort: 1) If user adds and entry, switchdev_fib_ipv4_add fails, abort is executed - and, error returned. I would expect that route entry should be added in this case. The next attempt of adding the same entry will be successful. The current behaviour breaks the transparency you are reffering to. 2) When switchdev_fib_ipv4_abort happens to be executed, the offload is disabled for good (until reboot). That is certainly not nice, alhough I understand that is the easiest solution for now. I believe that we all agree that the 1:1 transparency, although it is a default, may not be optimal for real-life usage. HW resources are limited and user does not know them. The danger of hitting _abort and screwing-up the whole system is huge, unacceptable. So here, there are couple of more or less simple things that I suggest to do in order to move a little bit forward: 1) Introduce system-wide option to switch _abort to just plain fail. When HW does not have capacity, do not flush and fallback to sw, but rather just fail to add the entry. This would not break anything. Userspace has to be prepared that entry add could fail. 2) Introduce a way to propagate resources to userspace. Driver knows about resources used/available/potentially_available. Switchdev infra could be extended in order to propagate the info to the user. 3) Introduce couple of flags for entry add that would alter the default behaviour. Something like: NLM_F_SKIP_KERNEL NLM_F_SKIP_OFFLOAD Again, this does not break the current users. On the other hand, this gives new users a leverage to instruct kernel where the entry should be added to (or not added to). Any thoughts? Objections? I don't like these. Breaks transparency and forces the user in a position of having to know hardware failures modes (unique to each hardware device). I presented an option d) which avoids this issues; was it not understood? -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
On 05/28/2015 02:42 AM, Jiri Pirko wrote: Mon, May 18, 2015 at 10:19:16PM CEST, da...@davemloft.net wrote: From: Roopa Prabhu ro...@cumulusnetworks.com Date: Sun, 17 May 2015 16:42:05 -0700 On most systems where you can offload routes to hardware, doing routing in software is not an option (the cpu limitations make routing impossible in software). You absolutely do not get to determine this policy, none of us do. What matters is that by default the damn switch device being there is %100 transparent to the user. And the way to achieve that default is to do software routes as a fallback. I am not going to entertain changes of this nature which fail route loading by default just because we've exceeded a device's HW capacity to offload. I thought I was _really_ clear about this at netdev 0.1 I certainly agree that by default, transparency 1:1 sw:hw mapping is what we need for fib. The current code is a good start! I see couple of issues regarding switchdev_fib_ipv4_abort: 1) If user adds and entry, switchdev_fib_ipv4_add fails, abort is executed - and, error returned. I would expect that route entry should be added in this case. The next attempt of adding the same entry will be successful. The current behaviour breaks the transparency you are reffering to. 2) When switchdev_fib_ipv4_abort happens to be executed, the offload is disabled for good (until reboot). That is certainly not nice, alhough I understand that is the easiest solution for now. I believe that we all agree that the 1:1 transparency, although it is a default, may not be optimal for real-life usage. HW resources are limited and user does not know them. The danger of hitting _abort and screwing-up the whole system is huge, unacceptable. So here, there are couple of more or less simple things that I suggest to do in order to move a little bit forward: 1) Introduce system-wide option to switch _abort to just plain fail. When HW does not have capacity, do not flush and fallback to sw, but rather just fail to add the entry. This would not break anything. Userspace has to be prepared that entry add could fail. 2) Introduce a way to propagate resources to userspace. Driver knows about resources used/available/potentially_available. Switchdev infra could be extended in order to propagate the info to the user. I currently use the FlowAPI work I presented at netdev conference for this. Perhaps I was a bit reaching by trying to also push it as a replacement for the ethtool flow classification mechanism all in one shot. For what it is worth replacing 'ethtool' flow classifier when I have a pipeline of tables in a NIC is really my first use case for the 'set' operations but that is off-topic probably. The benefits I see of using this interface (or if you want rename it and push it into a different netlink type) is it gives you the entire view of the switch resources and pipeline from a single interface. Also because you are talking about system-wide behaviour above it nicely rolls up into user space software where we can act on it with the flags we have for l2 already and if we pursue your option (3) also l3. I like the single interface vs. scattering the information across many different interfaces this way we can do it once and be done with it. If you scatter it across all the interfaces just l2,l3 for now but we will get more then each interface will have its own mechanism and I have no idea where you put global information such as table ordering. IMO we are going to need at least the base operations I outlined when we want to work on many different pipelines possibly with different ordering of tables, different amounts of resource sharing (l2 vs l3 vs acls vs...), different levels of support (mac/vlan or just mac). And I don't think it fits into an existing netlink structure because its not specific to any one thing but the model of the hardware. Also I believe that match/action tables are a really nice way to work with hardware so this aligns with that. That said I think the interface would need some tweaks to fit into the current code base. The biggest one I would want is to make l2/l3 tables 'well-defined' e.g. give them a #define value so we can always track them down easily, drop the set operation (at least for now because the tables we have already have defined interfaces l2/l3 I'll reopen this in the context of extending flow classification on the NIC), and clean up the action bits so they are well defined. I've pushed an update to my code on github to restrict the hardware from exporting arbitrary actions which should be a reasonable first step. What do you think? I would like to try to make the above updates and resubmit if we can get an agreement that knowing the hardware resources and capabilities is useful. It is at least useful for my software stacks/use cases. 3) Introduce couple of flags for entry add that would alter the default behaviour. Something like:
Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
On 5/20/15, 10:46 PM, Scott Feldman wrote: On Tue, May 19, 2015 at 1:28 PM, David Miller da...@davemloft.net wrote: From: Andy Gospodarek go...@cumulusnetworks.com Date: Tue, 19 May 2015 15:47:32 -0400 Are you actually saying that if users complain loudly enough about the current behavior (not the change Roopa has proposed) that you would be open to considering a change the current behavior? I am saying that we have a contract with users not to break existing behavior. Full stop. After rehearing David's argument, we should probably explore option d) which is a refinement on the fib_offload_disable mechanism we have today. fib_offload_disable is global for all routes. Once we hit a HW install problem, the global flag is set and all routes fallback to SW. We did this because we can't allow the failed route to exist in SW and not in HW because it could mess up LPM searches (HW could hit on a lesser prefix even when SW has the true LPM, because HW gets first shot at match). The refinement on fib_offload_disable is this: make it per-related-prefix rather than global, and on a HW install problem, set the flag for the related-prefix and uninstall only those routes from HW. Related-prefix (is there a correct term for this?) are routes to the same dst addr but with different prefix lengths. I haven't parsed the fib_trie structure to see how routes are organized, but I suspect since it's optimized for lookup the related-prefix tracking is already there and we can build on that. Option d) requires no application changes. It requires no additional understanding or input from the user. Kernel fallback happens transparently. In the case where the HW install failure was due to out-of-resource in HW, there may be some oscillation as related-prefixes are removed from HW, freeing up a little space, only to be filled as new routes come in, and so on. Actually, now that I think of it, the device/driver could decide which related-prefix to evict from HW, if driver/device wanted to have a sense of which routes are more important to offload than others, when resources are limited. I think the parts we need are: 1) A new fib_offload_disable bit for related-prefixes. 2) On switchdev fib offload, look up if route is marked as fib_offload_disabled based on it's related-prefix membership 3) A notification mechanism from driver to indicate a related-prefix is fib_offload_disabled. Feasible? neat idea scott. But in practice I would be a bit nervous about getting the distribution of related prefixes between hardware and software right (need to think about the nuances a bit more). I am still leaning towards the possibility of making this a global system policy decision and leave the app and the switch driver less surprised. But this needs a better explicit offload policy/resource manager infra. I don't have any concrete thoughts on it yet. (Understand that it will also need to make sure it does not break the existing contract with Linux apps (that dave was pointing too)). thanks. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
On Tue, May 19, 2015 at 1:28 PM, David Miller da...@davemloft.net wrote: From: Andy Gospodarek go...@cumulusnetworks.com Date: Tue, 19 May 2015 15:47:32 -0400 Are you actually saying that if users complain loudly enough about the current behavior (not the change Roopa has proposed) that you would be open to considering a change the current behavior? I am saying that we have a contract with users not to break existing behavior. Full stop. After rehearing David's argument, we should probably explore option d) which is a refinement on the fib_offload_disable mechanism we have today. fib_offload_disable is global for all routes. Once we hit a HW install problem, the global flag is set and all routes fallback to SW. We did this because we can't allow the failed route to exist in SW and not in HW because it could mess up LPM searches (HW could hit on a lesser prefix even when SW has the true LPM, because HW gets first shot at match). The refinement on fib_offload_disable is this: make it per-related-prefix rather than global, and on a HW install problem, set the flag for the related-prefix and uninstall only those routes from HW. Related-prefix (is there a correct term for this?) are routes to the same dst addr but with different prefix lengths. I haven't parsed the fib_trie structure to see how routes are organized, but I suspect since it's optimized for lookup the related-prefix tracking is already there and we can build on that. Option d) requires no application changes. It requires no additional understanding or input from the user. Kernel fallback happens transparently. In the case where the HW install failure was due to out-of-resource in HW, there may be some oscillation as related-prefixes are removed from HW, freeing up a little space, only to be filled as new routes come in, and so on. Actually, now that I think of it, the device/driver could decide which related-prefix to evict from HW, if driver/device wanted to have a sense of which routes are more important to offload than others, when resources are limited. I think the parts we need are: 1) A new fib_offload_disable bit for related-prefixes. 2) On switchdev fib offload, look up if route is marked as fib_offload_disabled based on it's related-prefix membership 3) A notification mechanism from driver to indicate a related-prefix is fib_offload_disabled. Feasible? -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
From: Andy Gospodarek go...@cumulusnetworks.com Date: Tue, 19 May 2015 15:47:32 -0400 Are you actually saying that if users complain loudly enough about the current behavior (not the change Roopa has proposed) that you would be open to considering a change the current behavior? I am saying that we have a contract with users not to break existing behavior. Full stop. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
From: roopa ro...@cumulusnetworks.com Date: Mon, 18 May 2015 22:58:54 -0700 from where I see, with the limitations on these boxes, this requires every app, every `ip route` cmd running on the box to explicitly specify offload when running on this hardware. Users know what they are doing, and will ask for a policy change. The default should be what happens right now. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
On 5/18/15, 1:19 PM, David Miller wrote: From: Roopa Prabhu ro...@cumulusnetworks.com Date: Sun, 17 May 2015 16:42:05 -0700 On most systems where you can offload routes to hardware, doing routing in software is not an option (the cpu limitations make routing impossible in software). You absolutely do not get to determine this policy, none of us do. What matters is that by default the damn switch device being there is %100 transparent to the user. And the way to achieve that default is to do software routes as a fallback. I am not going to entertain changes of this nature which fail route loading by default just because we've exceeded a device's HW capacity to offload. I thought I was _really_ clear about this at netdev 0.1 sorry if i missed you emphasize this. Current fib hw offload code aborts hw offload on the first hw offload failure and moves all routes to software and before 4.1 goes out I was trying to revisit this because the existing defaults will not work for most devices. To that effect, i started with an RFC patch to get some current thoughts and hence this patch. I will follow up on the other options mentioned in the comment. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
From: John Fastabend john.r.fastab...@intel.com Date: Mon, 18 May 2015 17:21:29 -0700 So how about having an error strategy sysctl field that we can set at provisioning time. I think this would align to Roopa's option (b). This way we can default to transparent mode and the users where this wont work can set the error mode. This way user land software stacks that work today should continue to work in both modes. Alert: This is not a switch provisioning issue. You can frame it like that all day, and continue to talk about low power cpus or other things which are completely and utterly irrelevant. Stop looking at how some specific piece of hardware is configured, and think about what actually is asking the kernel to do stuff. That's because the real issue is _semantics_ and what a Linux machine is expected to do when you insert a route and valid reasons why a route insertion can fail. That is the _only_ issue. And that has to do with what semantics _applcations_ making these routing change requests expect. There is nothing else that matters. And since it is an issue of what semantics those application want and are able to handle, that is where the request of changed behavior belongs. If we added your suggested sysctl, we'd have to name it sysctl_break_all_my_apps_please because that is exactly what it would be doing. :-) -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
From: Roopa Prabhu ro...@cumulusnetworks.com Date: Sun, 17 May 2015 16:42:05 -0700 On most systems where you can offload routes to hardware, doing routing in software is not an option (the cpu limitations make routing impossible in software). You absolutely do not get to determine this policy, none of us do. What matters is that by default the damn switch device being there is %100 transparent to the user. And the way to achieve that default is to do software routes as a fallback. I am not going to entertain changes of this nature which fail route loading by default just because we've exceeded a device's HW capacity to offload. I thought I was _really_ clear about this at netdev 0.1 -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html