Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware

2015-05-31 Thread Scott Feldman
On Sat, May 30, 2015 at 2:00 AM, Jiri Pirko j...@resnulli.us wrote:
 Fri, May 29, 2015 at 05:39:46PM CEST, sfel...@gmail.com wrote:
On Fri, May 29, 2015 at 12:50 AM, Jiri Pirko j...@resnulli.us wrote:
 Thu, May 21, 2015 at 07:46:54AM CEST, sfel...@gmail.com wrote:
On Tue, May 19, 2015 at 1:28 PM, David Miller da...@davemloft.net wrote:
 From: Andy Gospodarek go...@cumulusnetworks.com
 Date: Tue, 19 May 2015 15:47:32 -0400

 Are you actually saying that if users complain loudly enough about
 the current behavior (not the change Roopa has proposed) that you
 would be open to considering a change the current behavior?

 I am saying that we have a contract with users not to break existing
 behavior.  Full stop.

After rehearing David's argument, we should probably explore option d)
which is a refinement on the fib_offload_disable mechanism we have
today.  fib_offload_disable is global for all routes.  Once we hit a
HW install problem, the global flag is set and all routes fallback to
SW.  We did this because we can't allow the failed route to exist in
SW and not in HW because it could mess up LPM searches (HW could hit
on a lesser prefix even when SW has the true LPM, because HW gets
first shot at match).  The refinement on fib_offload_disable is this:
make it per-related-prefix rather than global, and on a HW install
problem, set the flag for the related-prefix and uninstall only those
routes from HW.  Related-prefix (is there a correct term for this?)
are routes to the same dst addr but with different prefix lengths.  I
haven't parsed the fib_trie structure to see how routes are organized,
but I suspect since it's optimized for lookup the related-prefix
tracking is already there and we can build on that.

 This looks interesting. However, I'm not sure that it is acceptable for
 user to experience this hw evict of random entries. User knows what
 entries are essential to have in hw. With your solution, I can see no way
 user can actually say what should be offloaded or not. Kernel just
 automagically decides.

The default eviction policy could be based on RTA_PRIORITY: evict
lower priority routes first.  It would be up to the device driver to
decide between two routes of same priority.

To help device driver make the decision, we could have eviction policy 
options:

Priority-base (default)
Prefer IPv6 over IPv4
Prefer IPv4 over IPv6
Prefer single path over multipath
Prefer longer prefix lengths over shorter
Optimize for resource utilization

These are portable across different switches.   They're in terms a
user understands.  It's up to the device driver which truly
understands the device constraints to translates the user's eviction
policy choices into something that makes sense to that device.

 This sounds tempting... You plan to throw in some patches, or should I
 take care of that?

My L2 backlog keeps from L3 at the moment.  What would be nice is if
someone would look into the feasibility of option d) at the
swithdev/fib level and see if it's doable.  The policy/user interface
stuff can come later.  What do we need?  My thoughts:

- Remove the check for fi-fib_net-ipv4.fib_offload_disabled in
switchdev_fib_ipv4_add() and replace it with a check to see if route
is offload_disabled.  A route is offload_disabled if the set of
related routes is marked disabled.  If route is offload_disabled, just
return 0 in switchdev_fib_ipv4_add() so that route is only installed
in the kernel fib.  (It's fine now to have a route in SW but not in
HW).

- Find/invent in fib trie the structure that relates routes with the
same prefix but different prefix lengths.  So A/8 and A/16 and A/30
are related.  B/16 is not related to the other three.  Add a bool
offload_disabled to that structure to mark them offload_disabled.  We
don't want LPM broken with something like this:

   HW: A/8, A/16, B/8
   SW: A/8, A/16, A/30, B/8

Here when HW gets an A pkt, it'll match A/16 when really SW's A/30
should have won.  Now LPM is broken.

- In the driver's fib obj add handler, if it can't install the route
because the device flat out can't support the route, then return
-EOPNOTSUPP and switchdev_fib_ipv4_add() will mark the route as
offload_disabled.  Actually, switchdev_fib_ipv4_add() needs to
additionally remove related routes from device.

- In the driver's fib obj add handler, if the route can be installed,
but at the expense of evicting some other route(s) due to evict policy
(driver's choice on which route(s) get evicted), the handler returns 0
and switchdev_fib_ipv4_add() is happy.  The driver should send a
switchdev notification to the fib layer to mark the evicted route(s)
as offload disable.

- Routes marked with offload_disabled should have RTNH_F_OFFLOAD cleared.

I think I'm being sloppy with making sure once a route is removed from
the device, all the related routes are also removed from the device.
It's late; brain not working 100%.  But I think you have the gist of
it.  I don't think there is a lot 

Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware

2015-05-31 Thread Scott Feldman
On Sat, May 30, 2015 at 9:19 PM, John Fastabend
john.fastab...@gmail.com wrote:
 On 05/30/2015 02:00 AM, Jiri Pirko wrote:

 Fri, May 29, 2015 at 05:39:46PM CEST, sfel...@gmail.com wrote:

 On Fri, May 29, 2015 at 12:50 AM, Jiri Pirko j...@resnulli.us wrote:

 Thu, May 21, 2015 at 07:46:54AM CEST, sfel...@gmail.com wrote:

 On Tue, May 19, 2015 at 1:28 PM, David Miller da...@davemloft.net
 wrote:

 From: Andy Gospodarek go...@cumulusnetworks.com
 Date: Tue, 19 May 2015 15:47:32 -0400

 Are you actually saying that if users complain loudly enough about
 the current behavior (not the change Roopa has proposed) that you
 would be open to considering a change the current behavior?


 I am saying that we have a contract with users not to break existing
 behavior.  Full stop.


 After rehearing David's argument, we should probably explore option d)
 which is a refinement on the fib_offload_disable mechanism we have
 today.  fib_offload_disable is global for all routes.  Once we hit a
 HW install problem, the global flag is set and all routes fallback to
 SW.  We did this because we can't allow the failed route to exist in
 SW and not in HW because it could mess up LPM searches (HW could hit
 on a lesser prefix even when SW has the true LPM, because HW gets
 first shot at match).  The refinement on fib_offload_disable is this:
 make it per-related-prefix rather than global, and on a HW install
 problem, set the flag for the related-prefix and uninstall only those
 routes from HW.  Related-prefix (is there a correct term for this?)
 are routes to the same dst addr but with different prefix lengths.  I
 haven't parsed the fib_trie structure to see how routes are organized,
 but I suspect since it's optimized for lookup the related-prefix
 tracking is already there and we can build on that.


 This looks interesting. However, I'm not sure that it is acceptable for
 user to experience this hw evict of random entries. User knows what
 entries are essential to have in hw. With your solution, I can see no
 way
 user can actually say what should be offloaded or not. Kernel just
 automagically decides.


 The default eviction policy could be based on RTA_PRIORITY: evict
 lower priority routes first.  It would be up to the device driver to
 decide between two routes of same priority.

 To help device driver make the decision, we could have eviction policy
 options:

 Priority-base (default)
 Prefer IPv6 over IPv4
 Prefer IPv4 over IPv6
 Prefer single path over multipath
 Prefer longer prefix lengths over shorter
 Optimize for resource utilization

 These are portable across different switches.   They're in terms a
 user understands.  It's up to the device driver which truly
 understands the device constraints to translates the user's eviction
 policy choices into something that makes sense to that device.


 This sounds tempting... You plan to throw in some patches, or should I
 take care of that?


 This is encoding specific policies into the kernel. I was hoping to
 avoid this and let user space develop whatever policy it wants. If you
 use Jiri's proposed NLM_F_SKIP_{KERNEL|OFFLOAD} flags you get this.

 Also I don't understand the truly  understands the device constraints
 comment. We can export a model of the device and know how many rules
 of each type will fit exactly into the table. This doesn't seem like
 much of a problem to me. In fact the driver developer should know this
 anyway.

 Part of my motivation here is I really don't want to get stuck with a
 case where each driver writer gets to translate the eviction policy
 onto their device in some device specific and slightly different way.

But this is _exactly_ what I want.  Here's why: my claim is it will be
impossible for us (device vendors) to define a universal set of
resource constraints that works for all devices from all vendors.  I
was kind of hoping some vendor would throw out a set to get us
started.  Ok, I'll start with rocker: rocker will enforce in the
device these constraints listed below.  There will be a device command
to query the raw constraints.   So here goes:

VLAN table max entries: 16K // a VLAN on a port takes one entry
Term MAC table max entries: no limit
Bridging table:
 Unicast max entries: 12K
 Multicast max entries: 4K
Unicast Routing table (shared for v4 and v6 entries):
 Prefix max slots: 16K
 IPv4 route takes one slot
 IPv6 prefix len = 64 route takes two slots
 IPv6 prefix len  64 takes four slots
Nexthop max slots: 4K
 Max ECMP width: 32
 Each nexthop MAC takes one slot, but there is a stride of 4 slots
Multicast Routing table (shared for v4 and v6 entries):
(same as unicast routing, except max slots are 1/2 as big)
ACL table:
IPv4 max entries: 8K
IPv6 max entries: 8K
Combined IPv4 and IPv6 entries: 12K

The table names are OF-DPA-specific.  The limits are contrived,
obviously, but are representative of real-world devices.  The 

Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware

2015-05-30 Thread John Fastabend

On 05/30/2015 02:00 AM, Jiri Pirko wrote:

Fri, May 29, 2015 at 05:39:46PM CEST, sfel...@gmail.com wrote:

On Fri, May 29, 2015 at 12:50 AM, Jiri Pirko j...@resnulli.us wrote:

Thu, May 21, 2015 at 07:46:54AM CEST, sfel...@gmail.com wrote:

On Tue, May 19, 2015 at 1:28 PM, David Miller da...@davemloft.net wrote:

From: Andy Gospodarek go...@cumulusnetworks.com
Date: Tue, 19 May 2015 15:47:32 -0400


Are you actually saying that if users complain loudly enough about
the current behavior (not the change Roopa has proposed) that you
would be open to considering a change the current behavior?


I am saying that we have a contract with users not to break existing
behavior.  Full stop.


After rehearing David's argument, we should probably explore option d)
which is a refinement on the fib_offload_disable mechanism we have
today.  fib_offload_disable is global for all routes.  Once we hit a
HW install problem, the global flag is set and all routes fallback to
SW.  We did this because we can't allow the failed route to exist in
SW and not in HW because it could mess up LPM searches (HW could hit
on a lesser prefix even when SW has the true LPM, because HW gets
first shot at match).  The refinement on fib_offload_disable is this:
make it per-related-prefix rather than global, and on a HW install
problem, set the flag for the related-prefix and uninstall only those
routes from HW.  Related-prefix (is there a correct term for this?)
are routes to the same dst addr but with different prefix lengths.  I
haven't parsed the fib_trie structure to see how routes are organized,
but I suspect since it's optimized for lookup the related-prefix
tracking is already there and we can build on that.


This looks interesting. However, I'm not sure that it is acceptable for
user to experience this hw evict of random entries. User knows what
entries are essential to have in hw. With your solution, I can see no way
user can actually say what should be offloaded or not. Kernel just
automagically decides.


The default eviction policy could be based on RTA_PRIORITY: evict
lower priority routes first.  It would be up to the device driver to
decide between two routes of same priority.

To help device driver make the decision, we could have eviction policy options:

Priority-base (default)
Prefer IPv6 over IPv4
Prefer IPv4 over IPv6
Prefer single path over multipath
Prefer longer prefix lengths over shorter
Optimize for resource utilization

These are portable across different switches.   They're in terms a
user understands.  It's up to the device driver which truly
understands the device constraints to translates the user's eviction
policy choices into something that makes sense to that device.


This sounds tempting... You plan to throw in some patches, or should I
take care of that?



This is encoding specific policies into the kernel. I was hoping to
avoid this and let user space develop whatever policy it wants. If you
use Jiri's proposed NLM_F_SKIP_{KERNEL|OFFLOAD} flags you get this.

Also I don't understand the truly  understands the device constraints
comment. We can export a model of the device and know how many rules
of each type will fit exactly into the table. This doesn't seem like
much of a problem to me. In fact the driver developer should know this
anyway.

Part of my motivation here is I really don't want to get stuck with a
case where each driver writer gets to translate the eviction policy
onto their device in some device specific and slightly different way.
It means every developer has to write a new mapping and get it correct.
At very least we should put a layer in switchdev that reads the table
out of the driver and does the mapping so we have it one spot. At least
then the kernel is enforcing policy the same on all devices. Better
still IMO would be to develop the policy in user space and have a
library/tool that does this so we don't end up with a bunch of policy
blobs in the kernel. The 6 above is a good start but over time we more
policy blobs will surely pop up. I would for example put 'optimize for
throughput' on the list.

.John

--
John Fastabend Intel Corporation
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware

2015-05-30 Thread Jiri Pirko
Fri, May 29, 2015 at 05:39:46PM CEST, sfel...@gmail.com wrote:
On Fri, May 29, 2015 at 12:50 AM, Jiri Pirko j...@resnulli.us wrote:
 Thu, May 21, 2015 at 07:46:54AM CEST, sfel...@gmail.com wrote:
On Tue, May 19, 2015 at 1:28 PM, David Miller da...@davemloft.net wrote:
 From: Andy Gospodarek go...@cumulusnetworks.com
 Date: Tue, 19 May 2015 15:47:32 -0400

 Are you actually saying that if users complain loudly enough about
 the current behavior (not the change Roopa has proposed) that you
 would be open to considering a change the current behavior?

 I am saying that we have a contract with users not to break existing
 behavior.  Full stop.

After rehearing David's argument, we should probably explore option d)
which is a refinement on the fib_offload_disable mechanism we have
today.  fib_offload_disable is global for all routes.  Once we hit a
HW install problem, the global flag is set and all routes fallback to
SW.  We did this because we can't allow the failed route to exist in
SW and not in HW because it could mess up LPM searches (HW could hit
on a lesser prefix even when SW has the true LPM, because HW gets
first shot at match).  The refinement on fib_offload_disable is this:
make it per-related-prefix rather than global, and on a HW install
problem, set the flag for the related-prefix and uninstall only those
routes from HW.  Related-prefix (is there a correct term for this?)
are routes to the same dst addr but with different prefix lengths.  I
haven't parsed the fib_trie structure to see how routes are organized,
but I suspect since it's optimized for lookup the related-prefix
tracking is already there and we can build on that.

 This looks interesting. However, I'm not sure that it is acceptable for
 user to experience this hw evict of random entries. User knows what
 entries are essential to have in hw. With your solution, I can see no way
 user can actually say what should be offloaded or not. Kernel just
 automagically decides.

The default eviction policy could be based on RTA_PRIORITY: evict
lower priority routes first.  It would be up to the device driver to
decide between two routes of same priority.

To help device driver make the decision, we could have eviction policy options:

Priority-base (default)
Prefer IPv6 over IPv4
Prefer IPv4 over IPv6
Prefer single path over multipath
Prefer longer prefix lengths over shorter
Optimize for resource utilization

These are portable across different switches.   They're in terms a
user understands.  It's up to the device driver which truly
understands the device constraints to translates the user's eviction
policy choices into something that makes sense to that device.

This sounds tempting... You plan to throw in some patches, or should I
take care of that?
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware

2015-05-29 Thread Jiri Pirko
Thu, May 28, 2015 at 05:35:05PM CEST, john.fastab...@gmail.com wrote:
On 05/28/2015 02:42 AM, Jiri Pirko wrote:
Mon, May 18, 2015 at 10:19:16PM CEST, da...@davemloft.net wrote:
From: Roopa Prabhu ro...@cumulusnetworks.com
Date: Sun, 17 May 2015 16:42:05 -0700

On most systems where you can offload routes to hardware,
doing routing in software is not an option (the cpu limitations
make routing impossible in software).

You absolutely do not get to determine this policy, none of us
do.

What matters is that by default the damn switch device being there
is %100 transparent to the user.

And the way to achieve that default is to do software routes as
a fallback.

I am not going to entertain changes of this nature which fail
route loading by default just because we've exceeded a device's
HW capacity to offload.

I thought I was _really_ clear about this at netdev 0.1

I certainly agree that by default, transparency 1:1 sw:hw mapping is
what we need for fib. The current code is a good start!

I see couple of issues regarding switchdev_fib_ipv4_abort:
1) If user adds and entry, switchdev_fib_ipv4_add fails, abort is
executed - and, error returned. I would expect that route entry should
be added in this case. The next attempt of adding the same entry will
be successful.
The current behaviour breaks the transparency you are reffering to.
2) When switchdev_fib_ipv4_abort happens to be executed, the offload is
disabled for good (until reboot). That is certainly not nice, alhough
I understand that is the easiest solution for now.

I believe that we all agree that the 1:1 transparency, although it is a
default, may not be optimal for real-life usage. HW resources are
limited and user does not know them. The danger of hitting _abort and
screwing-up the whole system is huge, unacceptable.

So here, there are couple of more or less simple things that I suggest to
do in order to move a little bit forward:
1) Introduce system-wide option to switch _abort to just plain fail.
When HW does not have capacity, do not flush and fallback to sw, but
rather just fail to add the entry. This would not break anything.
Userspace has to be prepared that entry add could fail.
2) Introduce a way to propagate resources to userspace. Driver knows about
resources used/available/potentially_available. Switchdev infra could
be extended in order to propagate the info to the user.

I currently use the FlowAPI work I presented at netdev conference for
this. Perhaps I was a bit reaching by trying to also push it as a
replacement for the ethtool flow classification mechanism all in one
shot. For what it is worth replacing 'ethtool' flow classifier when
I have a pipeline of tables in a NIC is really my first use case for
the 'set' operations but that is off-topic probably.

The benefits I see of using this interface (or if you want rename
it and push it into a different netlink type) is it gives you the entire
view of the switch resources and pipeline from a single interface. Also
because you are talking about system-wide behaviour above it nicely
rolls up into user space software where we can act on it with the
flags we have for l2 already and if we pursue your option (3) also l3.
I like the single interface vs. scattering the information across many
different interfaces this way we can do it once and be done with it.
If you scatter it across all the interfaces just l2,l3 for now but
we will get more then each interface will have its own mechanism and
I have no idea where you put global information such as table ordering.

I think that for fib capacities/capabilities, user should be able to use
extended existing Netlink interface. Not some parallel one.

I'm still not convinced that user should care about the actual hw
pipeline. We already have a pipeline in kernel. Switch drivers should just
do mapping, easy as that.

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware

2015-05-29 Thread Jiri Pirko
Thu, May 21, 2015 at 07:46:54AM CEST, sfel...@gmail.com wrote:
On Tue, May 19, 2015 at 1:28 PM, David Miller da...@davemloft.net wrote:
 From: Andy Gospodarek go...@cumulusnetworks.com
 Date: Tue, 19 May 2015 15:47:32 -0400

 Are you actually saying that if users complain loudly enough about
 the current behavior (not the change Roopa has proposed) that you
 would be open to considering a change the current behavior?

 I am saying that we have a contract with users not to break existing
 behavior.  Full stop.

After rehearing David's argument, we should probably explore option d)
which is a refinement on the fib_offload_disable mechanism we have
today.  fib_offload_disable is global for all routes.  Once we hit a
HW install problem, the global flag is set and all routes fallback to
SW.  We did this because we can't allow the failed route to exist in
SW and not in HW because it could mess up LPM searches (HW could hit
on a lesser prefix even when SW has the true LPM, because HW gets
first shot at match).  The refinement on fib_offload_disable is this:
make it per-related-prefix rather than global, and on a HW install
problem, set the flag for the related-prefix and uninstall only those
routes from HW.  Related-prefix (is there a correct term for this?)
are routes to the same dst addr but with different prefix lengths.  I
haven't parsed the fib_trie structure to see how routes are organized,
but I suspect since it's optimized for lookup the related-prefix
tracking is already there and we can build on that.

This looks interesting. However, I'm not sure that it is acceptable for
user to experience this hw evict of random entries. User knows what
entries are essential to have in hw. With your solution, I can see no way
user can actually say what should be offloaded or not. Kernel just
automagically decides.


--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware

2015-05-29 Thread Jiri Pirko
Thu, May 28, 2015 at 05:40:11PM CEST, sfel...@gmail.com wrote:
On Thu, May 28, 2015 at 2:42 AM, Jiri Pirko j...@resnulli.us wrote:
 Mon, May 18, 2015 at 10:19:16PM CEST, da...@davemloft.net wrote:
From: Roopa Prabhu ro...@cumulusnetworks.com
Date: Sun, 17 May 2015 16:42:05 -0700

 On most systems where you can offload routes to hardware,
 doing routing in software is not an option (the cpu limitations
 make routing impossible in software).

You absolutely do not get to determine this policy, none of us
do.

What matters is that by default the damn switch device being there
is %100 transparent to the user.

And the way to achieve that default is to do software routes as
a fallback.

I am not going to entertain changes of this nature which fail
route loading by default just because we've exceeded a device's
HW capacity to offload.

I thought I was _really_ clear about this at netdev 0.1

 I certainly agree that by default, transparency 1:1 sw:hw mapping is
 what we need for fib. The current code is a good start!

 I see couple of issues regarding switchdev_fib_ipv4_abort:
 1) If user adds and entry, switchdev_fib_ipv4_add fails, abort is
executed - and, error returned. I would expect that route entry should
be added in this case. The next attempt of adding the same entry will
be successful.
The current behaviour breaks the transparency you are reffering to.
 2) When switchdev_fib_ipv4_abort happens to be executed, the offload is
disabled for good (until reboot). That is certainly not nice, alhough
I understand that is the easiest solution for now.

 I believe that we all agree that the 1:1 transparency, although it is a
 default, may not be optimal for real-life usage. HW resources are
 limited and user does not know them. The danger of hitting _abort and
 screwing-up the whole system is huge, unacceptable.

 So here, there are couple of more or less simple things that I suggest to
 do in order to move a little bit forward:
 1) Introduce system-wide option to switch _abort to just plain fail.
When HW does not have capacity, do not flush and fallback to sw, but
rather just fail to add the entry. This would not break anything.
Userspace has to be prepared that entry add could fail.
 2) Introduce a way to propagate resources to userspace. Driver knows about
resources used/available/potentially_available. Switchdev infra could
be extended in order to propagate the info to the user.
 3) Introduce couple of flags for entry add that would alter the default
behaviour. Something like:
 NLM_F_SKIP_KERNEL
 NLM_F_SKIP_OFFLOAD
Again, this does not break the current users. On the other hand, this
gives new users a leverage to instruct kernel where the entry should
be added to (or not added to).

 Any thoughts? Objections?

I don't like these.  Breaks transparency and forces the user in a
position of having to know hardware failures modes (unique to each


Can you please elaborate on this a bit more? I fail to see transparency
breaking in my proposal :/ Maybe it is by different understanding of the
term?

Also I do not understand the remark about user having to know hardware
failure modes. Could you please explain?


hardware device).  I presented an option d) which avoids this issues;
was it not understood?

I just commented on option d) it other email.

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware

2015-05-29 Thread Jiri Pirko
Fri, May 29, 2015 at 05:12:35PM CEST, sfel...@gmail.com wrote:
On Thu, May 28, 2015 at 2:42 AM, Jiri Pirko j...@resnulli.us wrote:
 Mon, May 18, 2015 at 10:19:16PM CEST, da...@davemloft.net wrote:
From: Roopa Prabhu ro...@cumulusnetworks.com
Date: Sun, 17 May 2015 16:42:05 -0700

 On most systems where you can offload routes to hardware,
 doing routing in software is not an option (the cpu limitations
 make routing impossible in software).

You absolutely do not get to determine this policy, none of us
do.

What matters is that by default the damn switch device being there
is %100 transparent to the user.

And the way to achieve that default is to do software routes as
a fallback.

I am not going to entertain changes of this nature which fail
route loading by default just because we've exceeded a device's
HW capacity to offload.

I thought I was _really_ clear about this at netdev 0.1

 I certainly agree that by default, transparency 1:1 sw:hw mapping is
 what we need for fib. The current code is a good start!

 I see couple of issues regarding switchdev_fib_ipv4_abort:
 1) If user adds and entry, switchdev_fib_ipv4_add fails, abort is
executed - and, error returned. I would expect that route entry should
be added in this case. The next attempt of adding the same entry will
be successful.
The current behaviour breaks the transparency you are reffering to.
 2) When switchdev_fib_ipv4_abort happens to be executed, the offload is
disabled for good (until reboot). That is certainly not nice, alhough
I understand that is the easiest solution for now.

 I believe that we all agree that the 1:1 transparency, although it is a
 default, may not be optimal for real-life usage. HW resources are
 limited and user does not know them. The danger of hitting _abort and
 screwing-up the whole system is huge, unacceptable.

 So here, there are couple of more or less simple things that I suggest to
 do in order to move a little bit forward:
 1) Introduce system-wide option to switch _abort to just plain fail.
When HW does not have capacity, do not flush and fallback to sw, but
rather just fail to add the entry. This would not break anything.
Userspace has to be prepared that entry add could fail.

This breaks 1:1 transparency.  A route now fails to install and the
user is scratching his/her head as to why it failed.  It used to work
when there was no switch offload.  It works with switch offload on
this other device.  So it must be a failure due to switch offload on
this device.  But why this route?  I just installed 20 IPv4 routes and
10 IPv6 routes.  Why did this 11th IPv6 route fail to install?  See,
now user needs to learn about details of that particular device's
limits to understand failure.  When they move their application to
another device, they need to re-learn failure modes.

I don't want this behaviour as the default. Default should be what is at
this moment. This would be tunable by user. That, I believe is correct.



 2) Introduce a way to propagate resources to userspace. Driver knows about
resources used/available/potentially_available. Switchdev infra could
be extended in order to propagate the info to the user.
 3) Introduce couple of flags for entry add that would alter the default
behaviour. Something like:
 NLM_F_SKIP_KERNEL
 NLM_F_SKIP_OFFLOAD
Again, this does not break the current users. On the other hand, this
gives new users a leverage to instruct kernel where the entry should
be added to (or not added to).

I don't think we want an NLM_F_SKIP_KERNEL option and only have the
route installed on the device.  We want offload to be an acceleration
of the kernel's FIB, not a bypass.

Okay, fair enough. Let's have NLM_F_SKIP_OFFLOAD only.


SKIP_OFFLOAD can mess up LPM if the user is not really really careful.

 Any thoughts? Objections?

 Thanks!

 Jiri
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware

2015-05-29 Thread Scott Feldman
On Thu, May 28, 2015 at 2:42 AM, Jiri Pirko j...@resnulli.us wrote:
 Mon, May 18, 2015 at 10:19:16PM CEST, da...@davemloft.net wrote:
From: Roopa Prabhu ro...@cumulusnetworks.com
Date: Sun, 17 May 2015 16:42:05 -0700

 On most systems where you can offload routes to hardware,
 doing routing in software is not an option (the cpu limitations
 make routing impossible in software).

You absolutely do not get to determine this policy, none of us
do.

What matters is that by default the damn switch device being there
is %100 transparent to the user.

And the way to achieve that default is to do software routes as
a fallback.

I am not going to entertain changes of this nature which fail
route loading by default just because we've exceeded a device's
HW capacity to offload.

I thought I was _really_ clear about this at netdev 0.1

 I certainly agree that by default, transparency 1:1 sw:hw mapping is
 what we need for fib. The current code is a good start!

 I see couple of issues regarding switchdev_fib_ipv4_abort:
 1) If user adds and entry, switchdev_fib_ipv4_add fails, abort is
executed - and, error returned. I would expect that route entry should
be added in this case. The next attempt of adding the same entry will
be successful.
The current behaviour breaks the transparency you are reffering to.
 2) When switchdev_fib_ipv4_abort happens to be executed, the offload is
disabled for good (until reboot). That is certainly not nice, alhough
I understand that is the easiest solution for now.

 I believe that we all agree that the 1:1 transparency, although it is a
 default, may not be optimal for real-life usage. HW resources are
 limited and user does not know them. The danger of hitting _abort and
 screwing-up the whole system is huge, unacceptable.

 So here, there are couple of more or less simple things that I suggest to
 do in order to move a little bit forward:
 1) Introduce system-wide option to switch _abort to just plain fail.
When HW does not have capacity, do not flush and fallback to sw, but
rather just fail to add the entry. This would not break anything.
Userspace has to be prepared that entry add could fail.

This breaks 1:1 transparency.  A route now fails to install and the
user is scratching his/her head as to why it failed.  It used to work
when there was no switch offload.  It works with switch offload on
this other device.  So it must be a failure due to switch offload on
this device.  But why this route?  I just installed 20 IPv4 routes and
10 IPv6 routes.  Why did this 11th IPv6 route fail to install?  See,
now user needs to learn about details of that particular device's
limits to understand failure.  When they move their application to
another device, they need to re-learn failure modes.

 2) Introduce a way to propagate resources to userspace. Driver knows about
resources used/available/potentially_available. Switchdev infra could
be extended in order to propagate the info to the user.
 3) Introduce couple of flags for entry add that would alter the default
behaviour. Something like:
 NLM_F_SKIP_KERNEL
 NLM_F_SKIP_OFFLOAD
Again, this does not break the current users. On the other hand, this
gives new users a leverage to instruct kernel where the entry should
be added to (or not added to).

I don't think we want an NLM_F_SKIP_KERNEL option and only have the
route installed on the device.  We want offload to be an acceleration
of the kernel's FIB, not a bypass.

SKIP_OFFLOAD can mess up LPM if the user is not really really careful.

 Any thoughts? Objections?

 Thanks!

 Jiri
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware

2015-05-29 Thread Scott Feldman
On Fri, May 29, 2015 at 12:50 AM, Jiri Pirko j...@resnulli.us wrote:
 Thu, May 21, 2015 at 07:46:54AM CEST, sfel...@gmail.com wrote:
On Tue, May 19, 2015 at 1:28 PM, David Miller da...@davemloft.net wrote:
 From: Andy Gospodarek go...@cumulusnetworks.com
 Date: Tue, 19 May 2015 15:47:32 -0400

 Are you actually saying that if users complain loudly enough about
 the current behavior (not the change Roopa has proposed) that you
 would be open to considering a change the current behavior?

 I am saying that we have a contract with users not to break existing
 behavior.  Full stop.

After rehearing David's argument, we should probably explore option d)
which is a refinement on the fib_offload_disable mechanism we have
today.  fib_offload_disable is global for all routes.  Once we hit a
HW install problem, the global flag is set and all routes fallback to
SW.  We did this because we can't allow the failed route to exist in
SW and not in HW because it could mess up LPM searches (HW could hit
on a lesser prefix even when SW has the true LPM, because HW gets
first shot at match).  The refinement on fib_offload_disable is this:
make it per-related-prefix rather than global, and on a HW install
problem, set the flag for the related-prefix and uninstall only those
routes from HW.  Related-prefix (is there a correct term for this?)
are routes to the same dst addr but with different prefix lengths.  I
haven't parsed the fib_trie structure to see how routes are organized,
but I suspect since it's optimized for lookup the related-prefix
tracking is already there and we can build on that.

 This looks interesting. However, I'm not sure that it is acceptable for
 user to experience this hw evict of random entries. User knows what
 entries are essential to have in hw. With your solution, I can see no way
 user can actually say what should be offloaded or not. Kernel just
 automagically decides.

The default eviction policy could be based on RTA_PRIORITY: evict
lower priority routes first.  It would be up to the device driver to
decide between two routes of same priority.

To help device driver make the decision, we could have eviction policy options:

Priority-base (default)
Prefer IPv6 over IPv4
Prefer IPv4 over IPv6
Prefer single path over multipath
Prefer longer prefix lengths over shorter
Optimize for resource utilization

These are portable across different switches.   They're in terms a
user understands.  It's up to the device driver which truly
understands the device constraints to translates the user's eviction
policy choices into something that makes sense to that device.

-scott
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware

2015-05-28 Thread Andy Gospodarek
On Thu, May 28, 2015 at 08:40:11AM -0700, Scott Feldman wrote:
 On Thu, May 28, 2015 at 2:42 AM, Jiri Pirko j...@resnulli.us wrote:
  Mon, May 18, 2015 at 10:19:16PM CEST, da...@davemloft.net wrote:
 From: Roopa Prabhu ro...@cumulusnetworks.com
 Date: Sun, 17 May 2015 16:42:05 -0700
 
  On most systems where you can offload routes to hardware,
  doing routing in software is not an option (the cpu limitations
  make routing impossible in software).
 
 You absolutely do not get to determine this policy, none of us
 do.
 
 What matters is that by default the damn switch device being there
 is %100 transparent to the user.
 
 And the way to achieve that default is to do software routes as
 a fallback.
 
 I am not going to entertain changes of this nature which fail
 route loading by default just because we've exceeded a device's
 HW capacity to offload.
 
 I thought I was _really_ clear about this at netdev 0.1
 
  I certainly agree that by default, transparency 1:1 sw:hw mapping is
  what we need for fib. The current code is a good start!
 
  I see couple of issues regarding switchdev_fib_ipv4_abort:
  1) If user adds and entry, switchdev_fib_ipv4_add fails, abort is
 executed - and, error returned. I would expect that route entry should
 be added in this case. The next attempt of adding the same entry will
 be successful.
 The current behaviour breaks the transparency you are reffering to.
  2) When switchdev_fib_ipv4_abort happens to be executed, the offload is
 disabled for good (until reboot). That is certainly not nice, alhough
 I understand that is the easiest solution for now.
 
  I believe that we all agree that the 1:1 transparency, although it is a
  default, may not be optimal for real-life usage. HW resources are
  limited and user does not know them. The danger of hitting _abort and
  screwing-up the whole system is huge, unacceptable.
 
  So here, there are couple of more or less simple things that I suggest to
  do in order to move a little bit forward:
  1) Introduce system-wide option to switch _abort to just plain fail.
 When HW does not have capacity, do not flush and fallback to sw, but
 rather just fail to add the entry. This would not break anything.
 Userspace has to be prepared that entry add could fail.
  2) Introduce a way to propagate resources to userspace. Driver knows about
 resources used/available/potentially_available. Switchdev infra could
 be extended in order to propagate the info to the user.
  3) Introduce couple of flags for entry add that would alter the default
 behaviour. Something like:
  NLM_F_SKIP_KERNEL
  NLM_F_SKIP_OFFLOAD
 Again, this does not break the current users. On the other hand, this
 gives new users a leverage to instruct kernel where the entry should
 be added to (or not added to).
 
  Any thoughts? Objections?
 
 I don't like these.  Breaks transparency and forces the user in a
 position of having to know hardware failures modes (unique to each
 hardware device).  I presented an option d) which avoids this issues;
 was it not understood?

I actually really like the way Jiri succinctly covered the different
cases to move us forward from what we have today (Thanks, Jiri!).  I
completely agree with you on both of your problem statements and the
idea that what have is fine for the short-term.  I see definite room to
improve the the user experience available via upstream kernels.  

Option 1 has appeal since userspace applications that control FDB, FIB,
etc entries could work without modification (the when in this mode the
kernel could choose to ignore any NLM_F_* flags Jiri proposed), but I
agree that a system-wide (or maybe offload-device-wide?) configuration
option needs to exist as this should not be the default behavior. 

Option 2 could also work as userspace applications could query for
space availability before attempting to add a route.  This could be
nice during bootup as then apps could periodically double check that
their view of the world is accurate.

Option 3 also has appeal since there exists the ability to allow
fine-grained control from userspace applications since less used routes
(or routes that could be summarized) could be combined in userspace if
needed.

The great part about all suggestions is that when combined they can
provide a great user experience, but doing all 3 at once is probably too
aggressive.  My vote would be to see if we can work together on a
combination of Option 1 and 3 together as they seem to provide a great
first start to this...

If an application tried to add a route (called A) to the route table
in the kernel and code to support Option 1 existed (similar to what
Roopa posted to start this series) then the kernel could fail to add
route A.  

If the user noted that some other route (called B) was lower priority
for _any_ reason, the user could delete route B from the kernel and
hardware and add route A to hardware and kernel.  Then the 

Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware

2015-05-28 Thread roopa

On 5/28/15, 3:35 PM, Andy Gospodarek wrote:

On Thu, May 28, 2015 at 08:40:11AM -0700, Scott Feldman wrote:

On Thu, May 28, 2015 at 2:42 AM, Jiri Pirko j...@resnulli.us wrote:

Mon, May 18, 2015 at 10:19:16PM CEST, da...@davemloft.net wrote:

From: Roopa Prabhu ro...@cumulusnetworks.com
Date: Sun, 17 May 2015 16:42:05 -0700


On most systems where you can offload routes to hardware,
doing routing in software is not an option (the cpu limitations
make routing impossible in software).

You absolutely do not get to determine this policy, none of us
do.

What matters is that by default the damn switch device being there
is %100 transparent to the user.

And the way to achieve that default is to do software routes as
a fallback.

I am not going to entertain changes of this nature which fail
route loading by default just because we've exceeded a device's
HW capacity to offload.

I thought I was _really_ clear about this at netdev 0.1

I certainly agree that by default, transparency 1:1 sw:hw mapping is
what we need for fib. The current code is a good start!

I see couple of issues regarding switchdev_fib_ipv4_abort:
1) If user adds and entry, switchdev_fib_ipv4_add fails, abort is
executed - and, error returned. I would expect that route entry should
be added in this case. The next attempt of adding the same entry will
be successful.
The current behaviour breaks the transparency you are reffering to.
2) When switchdev_fib_ipv4_abort happens to be executed, the offload is
disabled for good (until reboot). That is certainly not nice, alhough
I understand that is the easiest solution for now.

I believe that we all agree that the 1:1 transparency, although it is a
default, may not be optimal for real-life usage. HW resources are
limited and user does not know them. The danger of hitting _abort and
screwing-up the whole system is huge, unacceptable.

So here, there are couple of more or less simple things that I suggest to
do in order to move a little bit forward:
1) Introduce system-wide option to switch _abort to just plain fail.
When HW does not have capacity, do not flush and fallback to sw, but
rather just fail to add the entry. This would not break anything.
Userspace has to be prepared that entry add could fail.
2) Introduce a way to propagate resources to userspace. Driver knows about
resources used/available/potentially_available. Switchdev infra could
be extended in order to propagate the info to the user.
3) Introduce couple of flags for entry add that would alter the default
behaviour. Something like:
 NLM_F_SKIP_KERNEL
 NLM_F_SKIP_OFFLOAD
Again, this does not break the current users. On the other hand, this
gives new users a leverage to instruct kernel where the entry should
be added to (or not added to).

Any thoughts? Objections?

I don't like these.  Breaks transparency and forces the user in a
position of having to know hardware failures modes (unique to each
hardware device).  I presented an option d) which avoids this issues;
was it not understood?

I actually really like the way Jiri succinctly covered the different
cases to move us forward from what we have today (Thanks, Jiri!).  I
completely agree with you on both of your problem statements and the
idea that what have is fine for the short-term.  I see definite room to
improve the the user experience available via upstream kernels.

Option 1 has appeal since userspace applications that control FDB, FIB,
etc entries could work without modification (the when in this mode the
kernel could choose to ignore any NLM_F_* flags Jiri proposed), but I
agree that a system-wide (or maybe offload-device-wide?) configuration
option needs to exist as this should not be the default behavior.


+1...i started with a sysctl for this as an example in my patch. But, 
instead of adding 10
different sysctls in different places, a switchdev policy infra/api and 
flags is due.


I have been planning to post a v3 of my patch with some of these policy 
flags.


Option 2 could also work as userspace applications could query for
space availability before attempting to add a route.  This could be
nice during bootup as then apps could periodically double check that
their view of the world is accurate.

Option 3 also has appeal since there exists the ability to allow
fine-grained control from userspace applications since less used routes
(or routes that could be summarized) could be combined in userspace if
needed.

The great part about all suggestions is that when combined they can
provide a great user experience, but doing all 3 at once is probably too
aggressive.  My vote would be to see if we can work together on a
combination of Option 1 and 3 together as they seem to provide a great
first start to this...

If an application tried to add a route (called A) to the route table
in the kernel and code to support Option 1 existed (similar to what
Roopa posted to start this series) then the kernel 

Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware

2015-05-28 Thread roopa

On 5/28/15, 9:10 AM, John Fastabend wrote:

On 05/28/2015 08:40 AM, Scott Feldman wrote:

On Thu, May 28, 2015 at 2:42 AM, Jiri Pirko j...@resnulli.us wrote:

Mon, May 18, 2015 at 10:19:16PM CEST, da...@davemloft.net wrote:

From: Roopa Prabhu ro...@cumulusnetworks.com
Date: Sun, 17 May 2015 16:42:05 -0700


On most systems where you can offload routes to hardware,
doing routing in software is not an option (the cpu limitations
make routing impossible in software).


You absolutely do not get to determine this policy, none of us
do.

What matters is that by default the damn switch device being there
is %100 transparent to the user.

And the way to achieve that default is to do software routes as
a fallback.

I am not going to entertain changes of this nature which fail
route loading by default just because we've exceeded a device's
HW capacity to offload.

I thought I was _really_ clear about this at netdev 0.1


I certainly agree that by default, transparency 1:1 sw:hw mapping is
what we need for fib. The current code is a good start!

I see couple of issues regarding switchdev_fib_ipv4_abort:
1) If user adds and entry, switchdev_fib_ipv4_add fails, abort is
executed - and, error returned. I would expect that route entry 
should
be added in this case. The next attempt of adding the same entry 
will

be successful.
The current behaviour breaks the transparency you are reffering to.
2) When switchdev_fib_ipv4_abort happens to be executed, the offload is
disabled for good (until reboot). That is certainly not nice, 
alhough

I understand that is the easiest solution for now.

I believe that we all agree that the 1:1 transparency, although it is a
default, may not be optimal for real-life usage. HW resources are
limited and user does not know them. The danger of hitting _abort and
screwing-up the whole system is huge, unacceptable.

So here, there are couple of more or less simple things that I 
suggest to

do in order to move a little bit forward:
1) Introduce system-wide option to switch _abort to just plain fail.
When HW does not have capacity, do not flush and fallback to sw, 
but

rather just fail to add the entry. This would not break anything.
Userspace has to be prepared that entry add could fail.
2) Introduce a way to propagate resources to userspace. Driver knows 
about
resources used/available/potentially_available. Switchdev infra 
could

be extended in order to propagate the info to the user.
3) Introduce couple of flags for entry add that would alter the default
behaviour. Something like:
 NLM_F_SKIP_KERNEL
 NLM_F_SKIP_OFFLOAD
Again, this does not break the current users. On the other hand, 
this
gives new users a leverage to instruct kernel where the entry 
should

be added to (or not added to).

Any thoughts? Objections?


I don't like these.  Breaks transparency and forces the user in a
position of having to know hardware failures modes (unique to each
hardware device).  I presented an option d) which avoids this issues;
was it not understood?



Hi Scott,

I understood your proposal. One caveat I had is in response to this,

Actually, now that I think of it, the device/driver could decide which
related-prefix to evict from HW, if driver/device wanted to have a
sense of which routes are more important to offload than other

hardware/driver/device shouldn't have a sense of which routes are more
important than others. 


correct. The routing daemons know this best.

I think this is where the NLM_F_* flags come in.
If userspace _wants_ to push policy into the kernel about what is
important it can. If it doesn't we get a sensible heuristic that does
a reasonable job offloading rules transparently. This is how we did
L2 and I think that seems to work fairly well. At least for me but,
always interested to hear other use cases though.

agree.


Also I guess I'm not seeing the multitude of hardware failure modes. I
see two either the hardware doesn't support the operation or it is out
of resources. Both can be learned if the hardware exports a model of its
capabilities and resources.

agree, A switchdev api to query hardware resource and capability is due.
We can start with rocker. It gives an app the choice to control the policy.

But, for our usecase/deployments today,  i am more interested in a 
system wide policy because

it is easier on my apps today.

thanks.

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware

2015-05-28 Thread Jiri Pirko
Mon, May 18, 2015 at 10:19:16PM CEST, da...@davemloft.net wrote:
From: Roopa Prabhu ro...@cumulusnetworks.com
Date: Sun, 17 May 2015 16:42:05 -0700

 On most systems where you can offload routes to hardware,
 doing routing in software is not an option (the cpu limitations
 make routing impossible in software).

You absolutely do not get to determine this policy, none of us
do.

What matters is that by default the damn switch device being there
is %100 transparent to the user.

And the way to achieve that default is to do software routes as
a fallback.

I am not going to entertain changes of this nature which fail
route loading by default just because we've exceeded a device's
HW capacity to offload.

I thought I was _really_ clear about this at netdev 0.1

I certainly agree that by default, transparency 1:1 sw:hw mapping is
what we need for fib. The current code is a good start!

I see couple of issues regarding switchdev_fib_ipv4_abort:
1) If user adds and entry, switchdev_fib_ipv4_add fails, abort is
   executed - and, error returned. I would expect that route entry should
   be added in this case. The next attempt of adding the same entry will
   be successful.
   The current behaviour breaks the transparency you are reffering to.
2) When switchdev_fib_ipv4_abort happens to be executed, the offload is
   disabled for good (until reboot). That is certainly not nice, alhough
   I understand that is the easiest solution for now.

I believe that we all agree that the 1:1 transparency, although it is a
default, may not be optimal for real-life usage. HW resources are
limited and user does not know them. The danger of hitting _abort and
screwing-up the whole system is huge, unacceptable.

So here, there are couple of more or less simple things that I suggest to
do in order to move a little bit forward:
1) Introduce system-wide option to switch _abort to just plain fail.
   When HW does not have capacity, do not flush and fallback to sw, but
   rather just fail to add the entry. This would not break anything.
   Userspace has to be prepared that entry add could fail.
2) Introduce a way to propagate resources to userspace. Driver knows about
   resources used/available/potentially_available. Switchdev infra could
   be extended in order to propagate the info to the user.
3) Introduce couple of flags for entry add that would alter the default
   behaviour. Something like:
NLM_F_SKIP_KERNEL
NLM_F_SKIP_OFFLOAD
   Again, this does not break the current users. On the other hand, this
   gives new users a leverage to instruct kernel where the entry should
   be added to (or not added to).

Any thoughts? Objections?

Thanks!

Jiri
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware

2015-05-28 Thread John Fastabend

On 05/28/2015 08:40 AM, Scott Feldman wrote:

On Thu, May 28, 2015 at 2:42 AM, Jiri Pirko j...@resnulli.us wrote:

Mon, May 18, 2015 at 10:19:16PM CEST, da...@davemloft.net wrote:

From: Roopa Prabhu ro...@cumulusnetworks.com
Date: Sun, 17 May 2015 16:42:05 -0700


On most systems where you can offload routes to hardware,
doing routing in software is not an option (the cpu limitations
make routing impossible in software).


You absolutely do not get to determine this policy, none of us
do.

What matters is that by default the damn switch device being there
is %100 transparent to the user.

And the way to achieve that default is to do software routes as
a fallback.

I am not going to entertain changes of this nature which fail
route loading by default just because we've exceeded a device's
HW capacity to offload.

I thought I was _really_ clear about this at netdev 0.1


I certainly agree that by default, transparency 1:1 sw:hw mapping is
what we need for fib. The current code is a good start!

I see couple of issues regarding switchdev_fib_ipv4_abort:
1) If user adds and entry, switchdev_fib_ipv4_add fails, abort is
executed - and, error returned. I would expect that route entry should
be added in this case. The next attempt of adding the same entry will
be successful.
The current behaviour breaks the transparency you are reffering to.
2) When switchdev_fib_ipv4_abort happens to be executed, the offload is
disabled for good (until reboot). That is certainly not nice, alhough
I understand that is the easiest solution for now.

I believe that we all agree that the 1:1 transparency, although it is a
default, may not be optimal for real-life usage. HW resources are
limited and user does not know them. The danger of hitting _abort and
screwing-up the whole system is huge, unacceptable.

So here, there are couple of more or less simple things that I suggest to
do in order to move a little bit forward:
1) Introduce system-wide option to switch _abort to just plain fail.
When HW does not have capacity, do not flush and fallback to sw, but
rather just fail to add the entry. This would not break anything.
Userspace has to be prepared that entry add could fail.
2) Introduce a way to propagate resources to userspace. Driver knows about
resources used/available/potentially_available. Switchdev infra could
be extended in order to propagate the info to the user.
3) Introduce couple of flags for entry add that would alter the default
behaviour. Something like:
 NLM_F_SKIP_KERNEL
 NLM_F_SKIP_OFFLOAD
Again, this does not break the current users. On the other hand, this
gives new users a leverage to instruct kernel where the entry should
be added to (or not added to).

Any thoughts? Objections?


I don't like these.  Breaks transparency and forces the user in a
position of having to know hardware failures modes (unique to each
hardware device).  I presented an option d) which avoids this issues;
was it not understood?



Hi Scott,

I understood your proposal. One caveat I had is in response to this,

Actually, now that I think of it, the device/driver could decide which
related-prefix to evict from HW, if driver/device wanted to have a
sense of which routes are more important to offload than other

hardware/driver/device shouldn't have a sense of which routes are more
important than others. I think this is where the NLM_F_* flags come in.
If userspace _wants_ to push policy into the kernel about what is
important it can. If it doesn't we get a sensible heuristic that does
a reasonable job offloading rules transparently. This is how we did
L2 and I think that seems to work fairly well. At least for me but,
always interested to hear other use cases though.

Also I guess I'm not seeing the multitude of hardware failure modes. I
see two either the hardware doesn't support the operation or it is out
of resources. Both can be learned if the hardware exports a model of its
capabilities and resources.

Thanks,
John

--
John Fastabend Intel Corporation
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware

2015-05-28 Thread Scott Feldman
On Thu, May 28, 2015 at 2:42 AM, Jiri Pirko j...@resnulli.us wrote:
 Mon, May 18, 2015 at 10:19:16PM CEST, da...@davemloft.net wrote:
From: Roopa Prabhu ro...@cumulusnetworks.com
Date: Sun, 17 May 2015 16:42:05 -0700

 On most systems where you can offload routes to hardware,
 doing routing in software is not an option (the cpu limitations
 make routing impossible in software).

You absolutely do not get to determine this policy, none of us
do.

What matters is that by default the damn switch device being there
is %100 transparent to the user.

And the way to achieve that default is to do software routes as
a fallback.

I am not going to entertain changes of this nature which fail
route loading by default just because we've exceeded a device's
HW capacity to offload.

I thought I was _really_ clear about this at netdev 0.1

 I certainly agree that by default, transparency 1:1 sw:hw mapping is
 what we need for fib. The current code is a good start!

 I see couple of issues regarding switchdev_fib_ipv4_abort:
 1) If user adds and entry, switchdev_fib_ipv4_add fails, abort is
executed - and, error returned. I would expect that route entry should
be added in this case. The next attempt of adding the same entry will
be successful.
The current behaviour breaks the transparency you are reffering to.
 2) When switchdev_fib_ipv4_abort happens to be executed, the offload is
disabled for good (until reboot). That is certainly not nice, alhough
I understand that is the easiest solution for now.

 I believe that we all agree that the 1:1 transparency, although it is a
 default, may not be optimal for real-life usage. HW resources are
 limited and user does not know them. The danger of hitting _abort and
 screwing-up the whole system is huge, unacceptable.

 So here, there are couple of more or less simple things that I suggest to
 do in order to move a little bit forward:
 1) Introduce system-wide option to switch _abort to just plain fail.
When HW does not have capacity, do not flush and fallback to sw, but
rather just fail to add the entry. This would not break anything.
Userspace has to be prepared that entry add could fail.
 2) Introduce a way to propagate resources to userspace. Driver knows about
resources used/available/potentially_available. Switchdev infra could
be extended in order to propagate the info to the user.
 3) Introduce couple of flags for entry add that would alter the default
behaviour. Something like:
 NLM_F_SKIP_KERNEL
 NLM_F_SKIP_OFFLOAD
Again, this does not break the current users. On the other hand, this
gives new users a leverage to instruct kernel where the entry should
be added to (or not added to).

 Any thoughts? Objections?

I don't like these.  Breaks transparency and forces the user in a
position of having to know hardware failures modes (unique to each
hardware device).  I presented an option d) which avoids this issues;
was it not understood?
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware

2015-05-28 Thread John Fastabend

On 05/28/2015 02:42 AM, Jiri Pirko wrote:

Mon, May 18, 2015 at 10:19:16PM CEST, da...@davemloft.net wrote:

From: Roopa Prabhu ro...@cumulusnetworks.com
Date: Sun, 17 May 2015 16:42:05 -0700


On most systems where you can offload routes to hardware,
doing routing in software is not an option (the cpu limitations
make routing impossible in software).


You absolutely do not get to determine this policy, none of us
do.

What matters is that by default the damn switch device being there
is %100 transparent to the user.

And the way to achieve that default is to do software routes as
a fallback.

I am not going to entertain changes of this nature which fail
route loading by default just because we've exceeded a device's
HW capacity to offload.

I thought I was _really_ clear about this at netdev 0.1


I certainly agree that by default, transparency 1:1 sw:hw mapping is
what we need for fib. The current code is a good start!

I see couple of issues regarding switchdev_fib_ipv4_abort:
1) If user adds and entry, switchdev_fib_ipv4_add fails, abort is
executed - and, error returned. I would expect that route entry should
be added in this case. The next attempt of adding the same entry will
be successful.
The current behaviour breaks the transparency you are reffering to.
2) When switchdev_fib_ipv4_abort happens to be executed, the offload is
disabled for good (until reboot). That is certainly not nice, alhough
I understand that is the easiest solution for now.

I believe that we all agree that the 1:1 transparency, although it is a
default, may not be optimal for real-life usage. HW resources are
limited and user does not know them. The danger of hitting _abort and
screwing-up the whole system is huge, unacceptable.

So here, there are couple of more or less simple things that I suggest to
do in order to move a little bit forward:
1) Introduce system-wide option to switch _abort to just plain fail.
When HW does not have capacity, do not flush and fallback to sw, but
rather just fail to add the entry. This would not break anything.
Userspace has to be prepared that entry add could fail.
2) Introduce a way to propagate resources to userspace. Driver knows about
resources used/available/potentially_available. Switchdev infra could
be extended in order to propagate the info to the user.


I currently use the FlowAPI work I presented at netdev conference for
this. Perhaps I was a bit reaching by trying to also push it as a
replacement for the ethtool flow classification mechanism all in one
shot. For what it is worth replacing 'ethtool' flow classifier when
I have a pipeline of tables in a NIC is really my first use case for
the 'set' operations but that is off-topic probably.

The benefits I see of using this interface (or if you want rename
it and push it into a different netlink type) is it gives you the entire
view of the switch resources and pipeline from a single interface. Also
because you are talking about system-wide behaviour above it nicely
rolls up into user space software where we can act on it with the
flags we have for l2 already and if we pursue your option (3) also l3.
I like the single interface vs. scattering the information across many
different interfaces this way we can do it once and be done with it.
If you scatter it across all the interfaces just l2,l3 for now but
we will get more then each interface will have its own mechanism and
I have no idea where you put global information such as table ordering.

IMO we are going to need at least the base operations I outlined when
we want to work on many different pipelines possibly with different
ordering of tables, different amounts of resource sharing (l2 vs l3 vs
acls vs...), different levels of support (mac/vlan or just mac). And I
don't think it fits into an existing netlink structure because its not
specific to any one thing but the model of the hardware.

Also I believe that match/action tables are a really nice way to work
with hardware so this aligns with that. That said I think the interface
would need some tweaks to fit into the current code base. The biggest
one I would want is to make l2/l3 tables 'well-defined' e.g. give them
a #define value so we can always track them down easily, drop the set
operation (at least for now because the tables we have already have
defined interfaces l2/l3  I'll reopen this in the context of extending
flow classification on the NIC), and clean up the action bits so they
are well defined. I've pushed an update to my code on github to restrict
the hardware from exporting arbitrary actions which should be a
reasonable first step.

What do you think? I would like to try to make the above updates and
resubmit if we can get an agreement that knowing the hardware
resources and capabilities is useful. It is at least useful for my
software stacks/use cases.


3) Introduce couple of flags for entry add that would alter the default
behaviour. Something like:

Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware

2015-05-21 Thread roopa

On 5/20/15, 10:46 PM, Scott Feldman wrote:

On Tue, May 19, 2015 at 1:28 PM, David Miller da...@davemloft.net wrote:

From: Andy Gospodarek go...@cumulusnetworks.com
Date: Tue, 19 May 2015 15:47:32 -0400


Are you actually saying that if users complain loudly enough about
the current behavior (not the change Roopa has proposed) that you
would be open to considering a change the current behavior?

I am saying that we have a contract with users not to break existing
behavior.  Full stop.

After rehearing David's argument, we should probably explore option d)
which is a refinement on the fib_offload_disable mechanism we have
today.  fib_offload_disable is global for all routes.  Once we hit a
HW install problem, the global flag is set and all routes fallback to
SW.  We did this because we can't allow the failed route to exist in
SW and not in HW because it could mess up LPM searches (HW could hit
on a lesser prefix even when SW has the true LPM, because HW gets
first shot at match).  The refinement on fib_offload_disable is this:
make it per-related-prefix rather than global, and on a HW install
problem, set the flag for the related-prefix and uninstall only those
routes from HW.  Related-prefix (is there a correct term for this?)
are routes to the same dst addr but with different prefix lengths.  I
haven't parsed the fib_trie structure to see how routes are organized,
but I suspect since it's optimized for lookup the related-prefix
tracking is already there and we can build on that.

Option d) requires no application changes.  It requires no additional
understanding or input from the user.  Kernel fallback happens
transparently.  In the case where the HW install failure was due to
out-of-resource in HW, there may be some oscillation as
related-prefixes are removed from HW, freeing up a little space, only
to be filled as new routes come in, and so on.  Actually, now that I
think of it, the device/driver could decide which related-prefix to
evict from HW, if driver/device wanted to have a sense of which routes
are more important to offload than others, when resources are limited.

I think the parts we need are:

1) A new fib_offload_disable bit for related-prefixes.
2) On switchdev fib offload, look up if route is marked as
fib_offload_disabled based on it's related-prefix membership
3) A notification mechanism from driver to indicate a related-prefix
is fib_offload_disabled.

Feasible?
neat idea scott. But in practice I would be a bit nervous about getting 
the distribution of related prefixes between

hardware and software right (need to think about the nuances a bit more).

I am still leaning towards the possibility of making this a global 
system policy decision and leave the app and the switch driver less 
surprised. But this needs a better explicit offload policy/resource 
manager infra. I don't have any concrete thoughts on it yet.
(Understand that it will also need to make sure it does not break the 
existing contract with Linux apps (that dave was pointing too)).



thanks.


--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware

2015-05-20 Thread Scott Feldman
On Tue, May 19, 2015 at 1:28 PM, David Miller da...@davemloft.net wrote:
 From: Andy Gospodarek go...@cumulusnetworks.com
 Date: Tue, 19 May 2015 15:47:32 -0400

 Are you actually saying that if users complain loudly enough about
 the current behavior (not the change Roopa has proposed) that you
 would be open to considering a change the current behavior?

 I am saying that we have a contract with users not to break existing
 behavior.  Full stop.

After rehearing David's argument, we should probably explore option d)
which is a refinement on the fib_offload_disable mechanism we have
today.  fib_offload_disable is global for all routes.  Once we hit a
HW install problem, the global flag is set and all routes fallback to
SW.  We did this because we can't allow the failed route to exist in
SW and not in HW because it could mess up LPM searches (HW could hit
on a lesser prefix even when SW has the true LPM, because HW gets
first shot at match).  The refinement on fib_offload_disable is this:
make it per-related-prefix rather than global, and on a HW install
problem, set the flag for the related-prefix and uninstall only those
routes from HW.  Related-prefix (is there a correct term for this?)
are routes to the same dst addr but with different prefix lengths.  I
haven't parsed the fib_trie structure to see how routes are organized,
but I suspect since it's optimized for lookup the related-prefix
tracking is already there and we can build on that.

Option d) requires no application changes.  It requires no additional
understanding or input from the user.  Kernel fallback happens
transparently.  In the case where the HW install failure was due to
out-of-resource in HW, there may be some oscillation as
related-prefixes are removed from HW, freeing up a little space, only
to be filled as new routes come in, and so on.  Actually, now that I
think of it, the device/driver could decide which related-prefix to
evict from HW, if driver/device wanted to have a sense of which routes
are more important to offload than others, when resources are limited.

I think the parts we need are:

1) A new fib_offload_disable bit for related-prefixes.
2) On switchdev fib offload, look up if route is marked as
fib_offload_disabled based on it's related-prefix membership
3) A notification mechanism from driver to indicate a related-prefix
is fib_offload_disabled.

Feasible?
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware

2015-05-19 Thread David Miller
From: Andy Gospodarek go...@cumulusnetworks.com
Date: Tue, 19 May 2015 15:47:32 -0400

 Are you actually saying that if users complain loudly enough about
 the current behavior (not the change Roopa has proposed) that you
 would be open to considering a change the current behavior?

I am saying that we have a contract with users not to break existing
behavior.  Full stop.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware

2015-05-19 Thread David Miller
From: roopa ro...@cumulusnetworks.com
Date: Mon, 18 May 2015 22:58:54 -0700

 from where I see, with the limitations on these boxes, this requires
 every app, every `ip route` cmd running on the box to explicitly
 specify offload when running on this hardware.

Users know what they are doing, and will ask for a policy change.

The default should be what happens right now.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware

2015-05-18 Thread roopa

On 5/18/15, 1:19 PM, David Miller wrote:

From: Roopa Prabhu ro...@cumulusnetworks.com
Date: Sun, 17 May 2015 16:42:05 -0700


On most systems where you can offload routes to hardware,
doing routing in software is not an option (the cpu limitations
make routing impossible in software).

You absolutely do not get to determine this policy, none of us
do.

What matters is that by default the damn switch device being there
is %100 transparent to the user.

And the way to achieve that default is to do software routes as
a fallback.

I am not going to entertain changes of this nature which fail
route loading by default just because we've exceeded a device's
HW capacity to offload.

I thought I was _really_ clear about this at netdev 0.1

sorry if i missed you emphasize this.
Current fib hw offload code aborts hw offload on the first hw offload 
failure and moves all routes to software and before 4.1 goes out I was 
trying to revisit this because the existing defaults will not work for 
most devices. To that effect, i started with an RFC patch to get some 
current thoughts and hence this patch.


I will follow up on the other options mentioned in the comment.

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware

2015-05-18 Thread David Miller
From: John Fastabend john.r.fastab...@intel.com
Date: Mon, 18 May 2015 17:21:29 -0700

 So how about having an error strategy sysctl field that we can set
 at provisioning time. I think this would align to Roopa's option (b).
 This way we can default to transparent mode and the users where
 this wont work can set the error mode. This way user land software
 stacks that work today should continue to work in both modes.

Alert: This is not a switch provisioning issue.

You can frame it like that all day, and continue to talk about
low power cpus or other things which are completely and utterly
irrelevant.

Stop looking at how some specific piece of hardware is configured,
and think about what actually is asking the kernel to do stuff.

That's because the real issue is _semantics_ and what a Linux machine
is expected to do when you insert a route and valid reasons why a
route insertion can fail.

That is the _only_ issue.

And that has to do with what semantics _applcations_ making these
routing change requests expect.

There is nothing else that matters.

And since it is an issue of what semantics those application want and
are able to handle, that is where the request of changed behavior
belongs.

If we added your suggested sysctl, we'd have to name it
sysctl_break_all_my_apps_please because that is exactly what it
would be doing. :-)
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware

2015-05-18 Thread David Miller
From: Roopa Prabhu ro...@cumulusnetworks.com
Date: Sun, 17 May 2015 16:42:05 -0700

 On most systems where you can offload routes to hardware,
 doing routing in software is not an option (the cpu limitations
 make routing impossible in software).

You absolutely do not get to determine this policy, none of us
do.

What matters is that by default the damn switch device being there
is %100 transparent to the user.

And the way to achieve that default is to do software routes as
a fallback.

I am not going to entertain changes of this nature which fail
route loading by default just because we've exceeded a device's
HW capacity to offload.

I thought I was _really_ clear about this at netdev 0.1
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html