Re: [PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier
From: Jiri Pirko
Date: Wed, 2 Nov 2016 08:35:02 +0100

> How do you imagine this mode should look like? Could you draw me some
> example?

Well, first of all, there is no reason we can't provide a mechanism by which the driver can request and obtain a FIB dump. And it can be designed in a way to be preemptible, or at least not require RTNL to be held during the entire operation. Sequence counters or similar can be used to make sure that if the table changes mid-dump due to RTNL being dropped, the dump can be rewound and restarted.
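The restart-on-change dump described above maps naturally onto a generation (sequence) counter: snapshot the counter, walk the table, and rewind if the counter moved. Below is a minimal userspace C sketch of that pattern; all names (`fib_table_sim`, `fib_sim_dump`, etc.) are hypothetical stand-ins, not the real kernel FIB structures, and the `mutator` callback simulates a concurrent writer sneaking in while "RTNL" is dropped:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct fib_table_sim {
    unsigned int gen;        /* bumped on every route add/del */
    int routes[16];
    size_t nroutes;
};

/* Simulates a table change happening concurrently with the dump. */
static void fib_sim_add_route(struct fib_table_sim *t, int dst)
{
    t->routes[t->nroutes++] = dst;
    t->gen++;
}

static void fib_sim_mutate_once(struct fib_table_sim *t)
{
    fib_sim_add_route(t, 30);
}

/* Dump the table into buf; returns the number of restarts needed.
 * 'mutator' injects one concurrent change mid-dump, as if RTNL had
 * been dropped between entries. */
static int fib_sim_dump(struct fib_table_sim *t, int *buf, size_t *n,
                        void (*mutator)(struct fib_table_sim *))
{
    int retries = 0;
    bool again;

    do {
        unsigned int start = t->gen;   /* snapshot the generation */
        size_t i;

        *n = 0;
        for (i = 0; i < t->nroutes; i++) {
            buf[(*n)++] = t->routes[i];
            if (mutator) {             /* "RTNL dropped" window */
                mutator(t);
                mutator = NULL;        /* mutate only once */
            }
        }
        again = (t->gen != start);     /* table changed: rewind+restart */
        if (again)
            retries++;
    } while (again);

    return retries;
}
```

The point of the sketch is only the snapshot/compare/restart shape; a real implementation would also bound the number of restarts.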
Re: [PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier
Wed, Nov 02, 2016 at 03:35:03PM CET, ro...@cumulusnetworks.com wrote: >On 11/2/16, 6:48 AM, Jiri Pirko wrote: >> Wed, Nov 02, 2016 at 02:29:40PM CET, ro...@cumulusnetworks.com wrote: >>> On Wed, Nov 2, 2016 at 12:20 AM, Jiri Pirko wrote: Wed, Nov 02, 2016 at 03:13:42AM CET, ro...@cumulusnetworks.com wrote: >>> [snip] >>> > I understand..but, if you are adding some core infrastructure for > switchdev ..it cannot be > based on the number of simple use-cases or data you have today. > > I won't be surprised if tomorrow other switch drivers have a case where > they need to > reset the hw routing table state and reprogram all routes again. > Re-registering the notifier to just > get the routing state of the kernel will not scale. For the long term, > since the driver does not maintain a cache, Drivers (mlxsw, rocker) maintain a cache. So I'm not sure why you say otherwise. > a pull api with efficient use of rtnl will be useful for other such cases > as well. How do you imagine this "pull API" should look like? >>> >>> Just like you already have added fib notifiers to parallel fib netlink >>> notifications, the pull API is a parallel to 'netlink dump'. >>> Is my imagination too wild ? :) >> Perhaps I'm slow, but I don't understand what you mean. > > > If you don't want to get to the complexity of a new api right away > because of the > simple case of management interface routes you have, Can your driver > register the notifier early ? > (I am sure you have probably already thought about this) Register early? What would it resolve? I must be missing something. We register as early as possible. But the thing is, we cannot register in the past. And that is what this patch resolves. >>> sure, you must be having a valid problem then. I was just curious why >>> your driver is not up and initialized before any of the addresses or >>> routes get configured in the system (even on a management port). Ours >> If you unload the module and load it again for example.
This is a valid >> usecase. > >I see, so you are optimizing for this use case. sure it is a valid use-case >but a narrow one It is not an optimization, it's a bug fix. >compared to the rtnl overhead the api may bring > (note that i am not saying you should not solve it). > >> >> >>> does. But i agree there can be races and you cannot always guarantee >>> (I was just responding to ido's comment about adding complexity for a >>> small problem he has to solve for management routes). Our driver does >>> a pull before it starts. This helps when we want to reset the hardware >>> routing table state too. >> Can you point me to you driver in the tree? I would like to see how you >> do "the pull". >:), you know all this... but, if i must explicitly say it, yes, we don't have >a driver in the tree and >we don't own the hardware. My analogy here is of a netlink dump that we use >heavily for the >same scale that you will probably deploy. You are comparing netlink kernel-user api with in kernel api. I don't think that is comparable, at all. Therefore I asked how you imagine "the pull" should look like, in kernel. Stating it should look like some user api part does not help me much :( >i do give you full credit for the hardware and the driver and switchdev >support and all that!. > >> >>> >>> But, my point was, when you are defining an API, you cannot quantify >>> the 'past' to be just the very 'close past' or 'the past is just the >>> management routes that were added' . Tomorrow the 'past' can be the >>> full routing table if you need to reset the hardware state. >> Sure. > >This pull api was a suggestion for an efficient use of rtnl ...similar to how >the netlink >routing dump handles it. If you cannot imagine an api like that..., sure, your >call. No, that's why I'm asking, because I was under impression you can imagine that :)
Re: [PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier
On 11/2/16, 6:48 AM, Jiri Pirko wrote: > Wed, Nov 02, 2016 at 02:29:40PM CET, ro...@cumulusnetworks.com wrote: >> On Wed, Nov 2, 2016 at 12:20 AM, Jiri Pirko wrote: >>> Wed, Nov 02, 2016 at 03:13:42AM CET, ro...@cumulusnetworks.com wrote: >> [snip] >> I understand..but, if you are adding some core infrastructure for switchdev ..it cannot be based on the number of simple use-cases or data you have today. I won't be surprised if tomorrow other switch drivers have a case where they need to reset the hw routing table state and reprogram all routes again. Re-registering the notifier to just get the routing state of the kernel will not scale. For the long term, since the driver does not maintain a cache, >>> Drivers (mlxsw, rocker) maintain a cache. So I'm not sure why you say >>> otherwise. >>> >>> a pull api with efficient use of rtnl will be useful for other such cases as well. >>> How do you imagine this "pull API" should look like? >> >> Just like you already have added fib notifiers to parallel fib netlink >> notifications, the pull API is a parallel to 'netlink dump'. >> Is my imagination too wild ? :) > Perhaps I'm slow, but I don't understand what you mean. If you don't want to get to the complexity of a new api right away because of the simple case of management interface routes you have, Can your driver register the notifier early ? (I am sure you have probably already thought about this) >>> Register early? What would it resolve? I must be missing something. We >>> register as early as possible. But the thing is, we cannot register >>> in the past. And that is what this patch resolves. >> sure, you must be having a valid problem then. I was just curious why >> your driver is not up and initialized before any of the addresses or >> routes get configured in the system (even on a management port). Ours > If you unload the module and load it again for example. This is a valid > usecase. I see, so you are optimizing for this use case.
sure it is a valid use-case but a narrow one compared to the rtnl overhead the api may bring (note that i am not saying you should not solve it). > > >> does. But i agree there can be races and you cannot always guarantee >> (I was just responding to ido's comment about adding complexity for a >> small problem he has to solve for management routes). Our driver does >> a pull before it starts. This helps when we want to reset the hardware >> routing table state too. > Can you point me to you driver in the tree? I would like to see how you > do "the pull". :), you know all this... but, if i must explicitly say it, yes, we don't have a driver in the tree and we don't own the hardware. My analogy here is of a netlink dump that we use heavily for the same scale that you will probably deploy. i do give you full credit for the hardware and the driver and switchdev support and all that!. > >> >> But, my point was, when you are defining an API, you cannot quantify >> the 'past' to be just the very 'close past' or 'the past is just the >> management routes that were added' . Tomorrow the 'past' can be the >> full routing table if you need to reset the hardware state. > Sure. This pull api was a suggestion for an efficient use of rtnl ...similar to how the netlink routing dump handles it. If you cannot imagine an api like that..., sure, your call.
Re: [PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier
Wed, Nov 02, 2016 at 02:29:40PM CET, ro...@cumulusnetworks.com wrote: >On Wed, Nov 2, 2016 at 12:20 AM, Jiri Pirko wrote: >> Wed, Nov 02, 2016 at 03:13:42AM CET, ro...@cumulusnetworks.com wrote: >>> >[snip] > >>>I understand..but, if you are adding some core infrastructure for switchdev >>>..it cannot be >>>based on the number of simple use-cases or data you have today. >>> >>>I won't be surprised if tomorrow other switch drivers have a case where they >>>need to >>>reset the hw routing table state and reprogram all routes again. >>>Re-registering the notifier to just >>>get the routing state of the kernel will not scale. For the long term, since >>>the driver does not maintain a cache, >> >> Drivers (mlxsw, rocker) maintain a cache. So I'm not sure why you say >> otherwise. >> >> >>>a pull api with efficient use of rtnl will be useful for other such cases as >>>well. >> >> How do you imagine this "pull API" should look like? > > >Just like you already have added fib notifiers to parallel fib netlink >notifications, the pull API is a parallel to 'netlink dump'. >Is my imagination too wild ? :) Perhaps I'm slow, but I don't understand what you mean. > > >> >> >>> >>> >>>If you don't want to get to the complexity of a new api right away because >>>of the >>>simple case of management interface routes you have, Can your driver >>>register the notifier early ? >>>(I am sure you have probably already thought about this) >> >> Register early? What would it resolve? I must be missing something. We >> register as early as possible. But the thing is, we cannot register >> in the past. And that is what this patch resolves. > >sure, you must be having a valid problem then. I was just curious why >your driver is not up and initialized before any of the addresses or >routes get configured in the system (even on a management port). Ours If you unload the module and load it again for example. This is a valid usecase. >does.
But i agree there can be races and you cannot always guarantee >(I was just responding to ido's comment about adding complexity for a >small problem he has to solve for management routes). Our driver does >a pull before it starts. This helps when we want to reset the hardware >routing table state too. Can you point me to your driver in the tree? I would like to see how you do "the pull". > > >But, my point was, when you are defining an API, you cannot quantify >the 'past' to be just the very 'close past' or 'the past is just the >management routes that were added' . Tomorrow the 'past' can be the >full routing table if you need to reset the hardware state. Sure.
Re: [PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier
On Wed, Nov 02, 2016 at 06:29:40AM -0700, Roopa Prabhu wrote: > On Wed, Nov 2, 2016 at 12:20 AM, Jiri Pirko wrote: > > Wed, Nov 02, 2016 at 03:13:42AM CET, ro...@cumulusnetworks.com wrote: > >> > [snip] > > >>I understand..but, if you are adding some core infrastructure for switchdev > >>..it cannot be > >>based on the number of simple use-cases or data you have today. > >> > >>I won't be surprised if tomorrow other switch drivers have a case where > >>they need to > >>reset the hw routing table state and reprogram all routes again. > >>Re-registering the notifier to just > >>get the routing state of the kernel will not scale. For the long term, > >>since the driver does not maintain a cache, > > > > Drivers (mlxsw, rocker) maintain a cache. So I'm not sure why you say > > otherwise. > > > > > >>a pull api with efficient use of rtnl will be useful for other such cases > >>as well. > > > > How do you imagine this "pull API" should look like? > > > Just like you already have added fib notifiers to parallel fib netlink > notifications, the pull API is a parallel to 'netlink dump'. > Is my imagination too wild ? :) The question is more about the mechanics of this pull API, because it's not very clear to me how that should look. You want consumers to dump the tables in batches, so that rtnl is held only during the batch but not in between them? How are the routes passed down? Does the fib code fill up some struct or does it use the fib chain? > >>If you don't want to get to the complexity of a new api right away because > >>of the > >>simple case of management interface routes you have, Can your driver > >>register the notifier early ? > >>(I am sure you have probably already thought about this) > > > > Register early? What would it resolve? I must be missing something. We > > register as early as possible. But the thing is, we cannot register > > in the past. And that is what this patch resolves. > > sure, you must be having a valid problem then.
I was just curious why > your driver is not up and initialized before any of the addresses or > routes get configured in the system (even on a management port). Ours > does. But i agree there can be races and you cannot always guarantee > (I was just responding to ido's comment about adding complexity for a > small problem he has to solve for management routes). Our driver does > a pull before it starts. This helps when we want to reset the hardware > routing table state too. One can modprobe the module after routes are already present on other netdevs. That's actually how I tested the patch.
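The batched pull being asked about above could look roughly like the loop below: take the lock (standing in here for RTNL), copy one batch out of the table, drop the lock, hand the batch to the driver callback for the slow hardware work, and repeat from a cursor. This is a userspace C sketch under assumed names (`fib_pull_table`, `fib_pull_dump`), not an in-tree interface:

```c
#include <assert.h>
#include <stddef.h>

#define FIB_PULL_BATCH 4

struct fib_pull_table {
    int lock_held;           /* stands in for RTNL */
    int routes[64];
    size_t nroutes;
};

static void tbl_lock(struct fib_pull_table *t)   { t->lock_held = 1; }
static void tbl_unlock(struct fib_pull_table *t) { t->lock_held = 0; }

/* Deliver the whole table to 'cb' in fixed-size batches; the lock is
 * dropped between batches so other writers can make progress.
 * Returns the number of batches delivered. */
static int fib_pull_dump(struct fib_pull_table *t,
                         void (*cb)(const int *batch, size_t n, void *arg),
                         void *arg)
{
    size_t cursor = 0;
    int batches = 0;

    for (;;) {
        int batch[FIB_PULL_BATCH];
        size_t n = 0;

        tbl_lock(t);
        while (cursor < t->nroutes && n < FIB_PULL_BATCH)
            batch[n++] = t->routes[cursor++];
        tbl_unlock(t);

        if (n == 0)
            break;
        cb(batch, n, arg);   /* slow (e.g. hardware) work, lock not held */
        batches++;
    }
    return batches;
}

/* Small helper for the test below: counts routes delivered. */
static void fib_pull_count(const int *batch, size_t n, void *arg)
{
    (void)batch;
    *(size_t *)arg += n;
}
```

Note the open question from the thread remains: a real version would need invalidation or a sequence check, since the cursor can be stale by the time the lock is retaken.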
Re: [PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier
On Wed, Nov 2, 2016 at 12:20 AM, Jiri Pirko wrote: > Wed, Nov 02, 2016 at 03:13:42AM CET, ro...@cumulusnetworks.com wrote: >> [snip] >>I understand..but, if you are adding some core infrastructure for switchdev >>..it cannot be >>based on the number of simple use-cases or data you have today. >> >>I won't be surprised if tomorrow other switch drivers have a case where they >>need to >>reset the hw routing table state and reprogram all routes again. >>Re-registering the notifier to just >>get the routing state of the kernel will not scale. For the long term, since >>the driver does not maintain a cache, > > Drivers (mlxsw, rocker) maintain a cache. So I'm not sure why you say > otherwise. > > >>a pull api with efficient use of rtnl will be useful for other such cases as >>well. > > How do you imagine this "pull API" should look like? Just like you already have added fib notifiers to parallel fib netlink notifications, the pull API is a parallel to 'netlink dump'. Is my imagination too wild ? :) > > >> >> >>If you don't want to get to the complexity of a new api right away because of >>the >>simple case of management interface routes you have, Can your driver register >>the notifier early ? >>(I am sure you have probably already thought about this) > > Register early? What would it resolve? I must be missing something. We > register as early as possible. But the thing is, we cannot register > in the past. And that is what this patch resolves. sure, you must be having a valid problem then. I was just curious why your driver is not up and initialized before any of the addresses or routes get configured in the system (even on a management port). Ours does. But i agree there can be races and you cannot always guarantee (I was just responding to ido's comment about adding complexity for a small problem he has to solve for management routes). Our driver does a pull before it starts. This helps when we want to reset the hardware routing table state too.
But, my point was, when you are defining an API, you cannot quantify the 'past' to be just the very 'close past' or 'the past is just the management routes that were added' . Tomorrow the 'past' can be the full routing table if you need to reset the hardware state.
Re: [PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier
Tue, Nov 01, 2016 at 04:36:50PM CET, da...@davemloft.net wrote: >From: Roopa Prabhu >Date: Tue, 01 Nov 2016 08:14:14 -0700 > >> On 11/1/16, 7:19 AM, Eric Dumazet wrote: >>> On Tue, 2016-11-01 at 00:57 +0200, Ido Schimmel wrote: On Mon, Oct 31, 2016 at 02:24:06PM -0700, Eric Dumazet wrote: > How well will this work for large FIB tables ? > > Holding rtnl while sending thousands of skb will prevent consumers to > make progress ? Can you please clarify what do you mean by "while sending thousands of skb"? This patch doesn't generate notifications to user space, but instead invokes notification routines inside the kernel. I probably misunderstood you. Are you suggesting this be done using RCU instead? Well, there are a couple of reasons why I took RTNL here: >>> No, I do not believe RCU is wanted here, in control path where we might >>> sleep anyway. >>> 1) The FIB notification chain is blocking, so listeners are expected to be able to sleep. This isn't possible if we use RCU. Note that this chain is mainly useful for drivers that reflect the FIB table into a capable device and hardware operations usually involve sleeping. 2) The insertion of a single route is done with RTNL held. I didn't want to differentiate between both cases. This property is really useful for listeners, as they don't need to worry about locking in writer-side. Access to data structs is serialized by RTNL. >>> My concern was that for large iterations, you might hold RTNL and/or >>> current cpu for hundred of ms or even seconds... >>> >> I have the same concern as Eric here. >> >> I understand why you need it, but can the driver request for an initial dump >> and that >> dump be made more efficient somehow ie not hold rtnl for the whole dump ?. >> instead of making the fib notifier registration code doing it. >> >> these routing table sizes can be huge and an analogy for this in user-space: >> We do request a netlink dump of routing tables at initialization (on driver >> starts or resets)...
>> but, existing netlink routing table dumps for that scale don't hold rtnl for >> the whole dump. >> The dump is split into multiple responses to the user and hence it does not >> starve other rtnl users. >> >> In-fact I don't think netlink routing table dumps from user-space hold >> rtnl_lock for the whole dump. >> IIRC this was done to allow route add/dels to be allowed in parallel for >> performance reasons. >> (I will need to double check to confirm this). > >I've always had some reservations about using notifiers for getting >the FIB entries down to the offloaded device. Yeah, me too. But there is really no other way to do it. I thought about it quite long. But maybe I missed something. > >And this problem is just another symptom that it is the wrong >mechanism for propagating this information. > >As suggested by Roopa here, perhaps we're looking at the problem from >the wrong direction. We tend to design NDO ops and notifiers, to >"push" things to the driver, but maybe something more like a push+pull >model is better. How do you imagine this mode should look like? Could you draw me some example? Thanks!
Re: [PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier
Wed, Nov 02, 2016 at 03:13:42AM CET, ro...@cumulusnetworks.com wrote: >On 11/1/16, 10:03 AM, Ido Schimmel wrote: >> Hi Roopa, >> >> On Tue, Nov 01, 2016 at 08:14:14AM -0700, Roopa Prabhu wrote: >>> >[snip] >>> I have the same concern as Eric here. >>> >>> I understand why you need it, but can the driver request for an initial >>> dump and that >>> dump be made more efficient somehow ie not hold rtnl for the whole dump ?. >>> instead of making the fib notifier registration code doing it. >> We can do what we suggested in the last bi-weekly meeting, which is >> still holding rtnl, but moving the hardware operation to delayed work. >> This is possible because upper layers always assume operation was >> successful and driver is responsible for invoking its abort mechanism in >> case of failure. >> >>> these routing table sizes can be huge and an analogy for this in user-space: >>> We do request a netlink dump of routing tables at initialization (on >>> driver starts or resets)... >>> but, existing netlink routing table dumps for that scale don't hold rtnl >>> for the whole dump. >>> The dump is split into multiple responses to the user and hence it does not >>> starve other rtnl users. >> In my reply to Eric I mentioned that when we register and unregister >> from this chain the tables aren't really huge, but instead quite small. >> I understand your concerns, but I don't wish to make things more >> complicated than they should be only to address concerns that aren't >> really realistic. > >I understand..but, if you are adding some core infrastructure for switchdev >..it cannot be >based on the number of simple use-cases or data you have today. > >I won't be surprised if tomorrow other switch drivers have a case where they >need to >reset the hw routing table state and reprogram all routes again. >Re-registering the notifier to just >get the routing state of the kernel will not scale. 
For the long term, since >the driver does not maintain a cache, Drivers (mlxsw, rocker) maintain a cache. So I'm not sure why you say otherwise. >a pull api with efficient use of rtnl will be useful for other such cases as >well. How do you imagine this "pull API" should look like? > > >If you don't want to get to the complexity of a new api right away because of >the >simple case of management interface routes you have, Can your driver register >the notifier early ? >(I am sure you have probably already thought about this) Register early? What would it resolve? I must be missing something. We register as early as possible. But the thing is, we cannot register in the past. And that is what this patch resolves. > >> >> I believe current patch is quite simple and also consistent with other >> notification chains in the kernel, such as the netdevice, where rtnl is >> held during replay of events. >> http://lxr.free-electrons.com/source/net/core/dev.c#L1535 >as you know, netdev and routing scale are not the same thing. >Looking at the current code for netdevices (replay and rollback on failure), >a pull api (equivalent to the netlink dump api) may end up being less >complex...with an >ability to batch in the future.
Re: [PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier
On 11/1/16, 10:03 AM, Ido Schimmel wrote: > Hi Roopa, > > On Tue, Nov 01, 2016 at 08:14:14AM -0700, Roopa Prabhu wrote: >> [snip] >> I have the same concern as Eric here. >> >> I understand why you need it, but can the driver request for an initial dump >> and that >> dump be made more efficient somehow ie not hold rtnl for the whole dump ?. >> instead of making the fib notifier registration code doing it. > We can do what we suggested in the last bi-weekly meeting, which is > still holding rtnl, but moving the hardware operation to delayed work. > This is possible because upper layers always assume operation was > successful and driver is responsible for invoking its abort mechanism in > case of failure. > >> these routing table sizes can be huge and an analogy for this in user-space: >> We do request a netlink dump of routing tables at initialization (on driver >> starts or resets)... >> but, existing netlink routing table dumps for that scale don't hold rtnl for >> the whole dump. >> The dump is split into multiple responses to the user and hence it does not >> starve other rtnl users. > In my reply to Eric I mentioned that when we register and unregister > from this chain the tables aren't really huge, but instead quite small. > I understand your concerns, but I don't wish to make things more > complicated than they should be only to address concerns that aren't > really realistic. I understand..but, if you are adding some core infrastructure for switchdev ..it cannot be based on the number of simple use-cases or data you have today. I won't be surprised if tomorrow other switch drivers have a case where they need to reset the hw routing table state and reprogram all routes again. Re-registering the notifier to just get the routing state of the kernel will not scale. For the long term, since the driver does not maintain a cache, a pull api with efficient use of rtnl will be useful for other such cases as well. 
If you don't want to get to the complexity of a new api right away because of the simple case of management interface routes you have, Can your driver register the notifier early ? (I am sure you have probably already thought about this) > > I believe current patch is quite simple and also consistent with other > notification chains in the kernel, such as the netdevice, where rtnl is > held during replay of events. > http://lxr.free-electrons.com/source/net/core/dev.c#L1535 as you know, netdev and routing scale are not the same thing. Looking at the current code for netdevices (replay and rollback on failure), a pull api (equivalent to the netlink dump api) may end up being less complex...with an ability to batch in the future.
Re: [PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier
Hi Roopa, On Tue, Nov 01, 2016 at 08:14:14AM -0700, Roopa Prabhu wrote: > On 11/1/16, 7:19 AM, Eric Dumazet wrote: > > On Tue, 2016-11-01 at 00:57 +0200, Ido Schimmel wrote: > >> On Mon, Oct 31, 2016 at 02:24:06PM -0700, Eric Dumazet wrote: > >>> How well will this work for large FIB tables ? > >>> > >>> Holding rtnl while sending thousands of skb will prevent consumers to > >>> make progress ? > >> Can you please clarify what do you mean by "while sending thousands of > >> skb"? This patch doesn't generate notifications to user space, but > >> instead invokes notification routines inside the kernel. I probably > >> misunderstood you. > >> > >> Are you suggesting this be done using RCU instead? Well, there are a > >> couple of reasons why I took RTNL here: > >> > > No, I do not believe RCU is wanted here, in control path where we might > > sleep anyway. > > > >> 1) The FIB notification chain is blocking, so listeners are expected to > >> be able to sleep. This isn't possible if we use RCU. Note that this > >> chain is mainly useful for drivers that reflect the FIB table into a > >> capable device and hardware operations usually involve sleeping. > >> > >> 2) The insertion of a single route is done with RTNL held. I didn't want > >> to differentiate between both cases. This property is really useful for > >> listeners, as they don't need to worry about locking in writer-side. > >> Access to data structs is serialized by RTNL. > > My concern was that for large iterations, you might hold RTNL and/or > > current cpu for hundred of ms or even seconds... > > > I have the same concern as Eric here. > > I understand why you need it, but can the driver request for an initial dump > and that > dump be made more efficient somehow ie not hold rtnl for the whole dump ?. > instead of making the fib notifier registration code doing it. We can do what we suggested in the last bi-weekly meeting, which is still holding rtnl, but moving the hardware operation to delayed work. 
This is possible because upper layers always assume operation was successful and driver is responsible for invoking its abort mechanism in case of failure. > these routing table sizes can be huge and an analogy for this in user-space: > We do request a netlink dump of routing tables at initialization (on driver > starts or resets)... > but, existing netlink routing table dumps for that scale don't hold rtnl for > the whole dump. > The dump is split into multiple responses to the user and hence it does not > starve other rtnl users. In my reply to Eric I mentioned that when we register and unregister from this chain the tables aren't really huge, but instead quite small. I understand your concerns, but I don't wish to make things more complicated than they should be only to address concerns that aren't really realistic. I believe current patch is quite simple and also consistent with other notification chains in the kernel, such as the netdevice, where rtnl is held during replay of events. http://lxr.free-electrons.com/source/net/core/dev.c#L1535
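The delayed-work idea above (the notifier, called under RTNL, only queues the route; the slow hardware write happens later; a failure fires the driver's abort mechanism) can be sketched in plain C. This mirrors the shape of the mlxsw/rocker abort behavior as described in the thread, but the names and the failure model below are made up for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct hw_ctx {
    int programmed;          /* routes accepted by "hardware" */
    int capacity;            /* fake hardware table size */
    bool aborted;            /* abort fired: fall back to software */
    int pending[32];
    size_t npending;
};

/* Notifier path: cheap, just queue. Upper layers assume success,
 * which is what makes deferring the hardware work possible. */
static void fib_work_queue(struct hw_ctx *hw, int route)
{
    hw->pending[hw->npending++] = route;
}

/* Worker: the slow hardware writes happen here, outside the notifier
 * and without holding "RTNL". On the first failure, invoke the
 * driver's abort path instead of reporting an error upward. */
static void fib_work_flush(struct hw_ctx *hw)
{
    size_t i;

    for (i = 0; i < hw->npending; i++) {
        if (hw->programmed >= hw->capacity) {   /* HW insert failed */
            hw->aborted = true;                 /* abort mechanism */
            break;
        }
        hw->programmed++;
    }
    hw->npending = 0;
}
```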
Re: [PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier
On Tue, Nov 01, 2016 at 07:19:59AM -0700, Eric Dumazet wrote: > On Tue, 2016-11-01 at 00:57 +0200, Ido Schimmel wrote: > > On Mon, Oct 31, 2016 at 02:24:06PM -0700, Eric Dumazet wrote: > > > > How well will this work for large FIB tables ? > > > > > > Holding rtnl while sending thousands of skb will prevent consumers to > > > make progress ? > > > > Can you please clarify what do you mean by "while sending thousands of > > skb"? This patch doesn't generate notifications to user space, but > > instead invokes notification routines inside the kernel. I probably > > misunderstood you. > > > > Are you suggesting this be done using RCU instead? Well, there are a > > couple of reasons why I took RTNL here: > > > > No, I do not believe RCU is wanted here, in control path where we might > sleep anyway. > > > 1) The FIB notification chain is blocking, so listeners are expected to > > be able to sleep. This isn't possible if we use RCU. Note that this > > chain is mainly useful for drivers that reflect the FIB table into a > > capable device and hardware operations usually involve sleeping. > > > > 2) The insertion of a single route is done with RTNL held. I didn't want > > to differentiate between both cases. This property is really useful for > > listeners, as they don't need to worry about locking in writer-side. > > Access to data structs is serialized by RTNL. > > My concern was that for large iterations, you might hold RTNL and/or > current cpu for hundred of ms or even seconds... I understand your concern, but I think it's helpful to look at the users of this API. It was only recently introduced [1] because nobody needed it beside switch drivers that reflect the FIB table and I believe it'll stay that way. Currently, only mlxsw and rocker use it. Now, in these use cases when register_fib_notifier() is called the switch ports are still not present in the system, so we really only have a few routes used for management. 
Similarly, when unregister_fib_notifier() is called, the switch ports are already gone and most FIBs were flushed due to NETDEV_UNREGISTER, so again we only have a handful of FIBs to iterate over. Does that sound reasonable to you? 1. https://www.spinics.net/lists/netdev/msg397444.html
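For reference, the behavior the patch adds (replaying existing routes to a listener that registers late, e.g. after modprobe with routes already present) can be modeled in a few lines of userspace C. The types and names below are simplified stand-ins, not the real fib_notifier interface:

```c
#include <assert.h>
#include <stddef.h>

#define FIB_EVENT_ADD 1

struct fib_listener {
    void (*cb)(int event, int route, void *priv);
    void *priv;
};

struct fib_db {
    int routes[32];
    size_t nroutes;
    struct fib_listener *listeners[4];
    size_t nlisteners;
};

/* Normal route insertion: push an event to every live listener. */
static void fib_db_add(struct fib_db *db, int route)
{
    size_t i;

    db->routes[db->nroutes++] = route;
    for (i = 0; i < db->nlisteners; i++)
        db->listeners[i]->cb(FIB_EVENT_ADD, route, db->listeners[i]->priv);
}

/* Registration replays all existing routes before going live; this is
 * the "we cannot register in the past" problem the patch solves. In
 * the real kernel this walk happens under RTNL, so nothing can change
 * mid-replay. */
static void fib_register(struct fib_db *db, struct fib_listener *l)
{
    size_t i;

    for (i = 0; i < db->nroutes; i++)
        l->cb(FIB_EVENT_ADD, db->routes[i], l->priv);
    db->listeners[db->nlisteners++] = l;
}

/* Test helper: counts add events seen by a listener. */
static void fib_count_cb(int event, int route, void *priv)
{
    (void)route;
    if (event == FIB_EVENT_ADD)
        (*(int *)priv)++;
}
```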
Re: [PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier
From: Roopa Prabhu
Date: Tue, 01 Nov 2016 08:14:14 -0700

> On 11/1/16, 7:19 AM, Eric Dumazet wrote: >> On Tue, 2016-11-01 at 00:57 +0200, Ido Schimmel wrote: >>> On Mon, Oct 31, 2016 at 02:24:06PM -0700, Eric Dumazet wrote: How well will this work for large FIB tables ? Holding rtnl while sending thousands of skb will prevent consumers to make progress ? >>> Can you please clarify what do you mean by "while sending thousands of >>> skb"? This patch doesn't generate notifications to user space, but >>> instead invokes notification routines inside the kernel. I probably >>> misunderstood you. >>> >>> Are you suggesting this be done using RCU instead? Well, there are a >>> couple of reasons why I took RTNL here: >>> >> No, I do not believe RCU is wanted here, in control path where we might >> sleep anyway. >> >>> 1) The FIB notification chain is blocking, so listeners are expected to >>> be able to sleep. This isn't possible if we use RCU. Note that this >>> chain is mainly useful for drivers that reflect the FIB table into a >>> capable device and hardware operations usually involve sleeping. >>> >>> 2) The insertion of a single route is done with RTNL held. I didn't want >>> to differentiate between both cases. This property is really useful for >>> listeners, as they don't need to worry about locking in writer-side. >>> Access to data structs is serialized by RTNL. >> My concern was that for large iterations, you might hold RTNL and/or >> current cpu for hundred of ms or even seconds... >> > I have the same concern as Eric here. > > I understand why you need it, but can the driver request for an initial dump > and that > dump be made more efficient somehow ie not hold rtnl for the whole dump ?. > instead of making the fib notifier registration code doing it. > > these routing table sizes can be huge and an analogy for this in user-space: > We do request a netlink dump of routing tables at initialization (on driver > starts or resets)...
> but, existing netlink routing table dumps for that scale don't hold rtnl for > the whole dump. > The dump is split into multiple responses to the user and hence it does not > starve other rtnl users. > > In-fact I don't think netlink routing table dumps from user-space hold > rtnl_lock for the whole dump. > IIRC this was done to allow route add/dels to be allowed in parallel for > performance reasons. > (I will need to double check to confirm this). I've always had some reservations about using notifiers for getting the FIB entries down to the offloaded device. And this problem is just another symptom that it is the wrong mechanism for propagating this information. As suggested by Roopa here, perhaps we're looking at the problem from the wrong direction. We tend to design NDO ops and notifiers, to "push" things to the driver, but maybe something more like a push+pull model is better.
Re: [PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier
On 11/1/16, 7:19 AM, Eric Dumazet wrote:
> On Tue, 2016-11-01 at 00:57 +0200, Ido Schimmel wrote:
>> On Mon, Oct 31, 2016 at 02:24:06PM -0700, Eric Dumazet wrote:
>>> How well will this work for large FIB tables ?
>>>
>>> Holding rtnl while sending thousands of skb will prevent consumers to
>>> make progress ?
>> Can you please clarify what do you mean by "while sending thousands of
>> skb"? This patch doesn't generate notifications to user space, but
>> instead invokes notification routines inside the kernel. I probably
>> misunderstood you.
>>
>> Are you suggesting this be done using RCU instead? Well, there are a
>> couple of reasons why I took RTNL here:
>>
> No, I do not believe RCU is wanted here, in control path where we might
> sleep anyway.
>
>> 1) The FIB notification chain is blocking, so listeners are expected to
>> be able to sleep. This isn't possible if we use RCU. Note that this
>> chain is mainly useful for drivers that reflect the FIB table into a
>> capable device and hardware operations usually involve sleeping.
>>
>> 2) The insertion of a single route is done with RTNL held. I didn't want
>> to differentiate between both cases. This property is really useful for
>> listeners, as they don't need to worry about locking in writer-side.
>> Access to data structs is serialized by RTNL.
> My concern was that for large iterations, you might hold RTNL and/or
> current cpu for hundred of ms or even seconds...
>
I have the same concern as Eric here.

I understand why you need it, but can the driver request for an initial dump and that dump be made more efficient somehow ie not hold rtnl for the whole dump ?. instead of making the fib notifier registration code doing it.

these routing table sizes can be huge and an analogy for this in user-space: We do request a netlink dump of routing tables at initialization (on driver starts or resets)... but, existing netlink routing table dumps for that scale don't hold rtnl for the whole dump. The dump is split into multiple responses to the user and hence it does not starve other rtnl users.

In-fact I don't think netlink routing table dumps from user-space hold rtnl_lock for the whole dump. IIRC this was done to allow route add/dels to be allowed in parallel for performance reasons. (I will need to double check to confirm this).
Re: [PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier
On Tue, 2016-11-01 at 00:57 +0200, Ido Schimmel wrote:
> On Mon, Oct 31, 2016 at 02:24:06PM -0700, Eric Dumazet wrote:
>> How well will this work for large FIB tables ?
>>
>> Holding rtnl while sending thousands of skb will prevent consumers to
>> make progress ?
> Can you please clarify what do you mean by "while sending thousands of
> skb"? This patch doesn't generate notifications to user space, but
> instead invokes notification routines inside the kernel. I probably
> misunderstood you.
>
> Are you suggesting this be done using RCU instead? Well, there are a
> couple of reasons why I took RTNL here:
>
No, I do not believe RCU is wanted here, in control path where we might sleep anyway.

> 1) The FIB notification chain is blocking, so listeners are expected to
> be able to sleep. This isn't possible if we use RCU. Note that this
> chain is mainly useful for drivers that reflect the FIB table into a
> capable device and hardware operations usually involve sleeping.
>
> 2) The insertion of a single route is done with RTNL held. I didn't want
> to differentiate between both cases. This property is really useful for
> listeners, as they don't need to worry about locking in writer-side.
> Access to data structs is serialized by RTNL.

My concern was that for large iterations, you might hold RTNL and/or current cpu for hundred of ms or even seconds...
Re: [PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier
On Mon, Oct 31, 2016 at 02:24:06PM -0700, Eric Dumazet wrote:
> On Mon, 2016-10-31 at 23:13 +0200, ido...@idosch.org wrote:
>> From: Ido Schimmel
>>
>> When registering a FIB notifier block we should notify the caller of
>> already existing FIB entries and rules, as it would otherwise have an
>> incomplete view of the FIB tables. This is especially important for
>> switchdev drivers that support FIB offloads. Failing to notify them of
>> existing entries may lead to packet loss.
>>
>> Upon registration, walk the leafs of all the routing tables and for each
>> leaf send notification of existing FIB aliases. Similarly, when
>> unregistering the notifier synthesize a deletion event, thereby
>> relieving potential callers from the need to perform cleanup.
>>
>> The above is consistent with the netdevice notification chain, where
>> "registration and up events are replayed to the new notifier" upon
>> registration.
>>
>> Signed-off-by: Ido Schimmel
>> Reviewed-by: Jiri Pirko
>
> How well will this work for large FIB tables ?
>
> Holding rtnl while sending thousands of skb will prevent consumers to
> make progress ?

Can you please clarify what do you mean by "while sending thousands of skb"? This patch doesn't generate notifications to user space, but instead invokes notification routines inside the kernel. I probably misunderstood you.

Are you suggesting this be done using RCU instead? Well, there are a couple of reasons why I took RTNL here:

1) The FIB notification chain is blocking, so listeners are expected to be able to sleep. This isn't possible if we use RCU. Note that this chain is mainly useful for drivers that reflect the FIB table into a capable device and hardware operations usually involve sleeping.

2) The insertion of a single route is done with RTNL held. I didn't want to differentiate between both cases. This property is really useful for listeners, as they don't need to worry about locking in writer-side. Access to data structs is serialized by RTNL.

Thanks for reviewing.
Re: [PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier
On Mon, 2016-10-31 at 23:13 +0200, ido...@idosch.org wrote:
> From: Ido Schimmel
>
> When registering a FIB notifier block we should notify the caller of
> already existing FIB entries and rules, as it would otherwise have an
> incomplete view of the FIB tables. This is especially important for
> switchdev drivers that support FIB offloads. Failing to notify them of
> existing entries may lead to packet loss.
>
> Upon registration, walk the leafs of all the routing tables and for each
> leaf send notification of existing FIB aliases. Similarly, when
> unregistering the notifier synthesize a deletion event, thereby
> relieving potential callers from the need to perform cleanup.
>
> The above is consistent with the netdevice notification chain, where
> "registration and up events are replayed to the new notifier" upon
> registration.
>
> Signed-off-by: Ido Schimmel
> Reviewed-by: Jiri Pirko

How well will this work for large FIB tables ?

Holding rtnl while sending thousands of skb will prevent consumers to make progress ?
[PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier
From: Ido Schimmel

When registering a FIB notifier block we should notify the caller of already existing FIB entries and rules, as it would otherwise have an incomplete view of the FIB tables. This is especially important for switchdev drivers that support FIB offloads. Failing to notify them of existing entries may lead to packet loss.

Upon registration, walk the leafs of all the routing tables and for each leaf send notification of existing FIB aliases. Similarly, when unregistering the notifier synthesize a deletion event, thereby relieving potential callers from the need to perform cleanup.

The above is consistent with the netdevice notification chain, where "registration and up events are replayed to the new notifier" upon registration.

Signed-off-by: Ido Schimmel
Reviewed-by: Jiri Pirko
---
v1->v2:
* Add #ifdef fixing compilation error reported by kbuild robot
---
 net/ipv4/fib_trie.c | 152 +++-
 1 file changed, 150 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 31cef36..455d94b 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -86,15 +86,107 @@
 static BLOCKING_NOTIFIER_HEAD(fib_chain);
 
+static int call_fib_notifier(struct notifier_block *nb, struct net *net,
+                             enum fib_event_type event_type,
+                             struct fib_notifier_info *info)
+{
+        info->net = net;
+        return nb->notifier_call(nb, event_type, info);
+}
+
+static void fib_rules_notify(struct net *net, struct notifier_block *nb,
+                             enum fib_event_type event_type)
+{
+#ifdef CONFIG_IP_MULTIPLE_TABLES
+        struct fib_notifier_info info;
+
+        if (net->ipv4.fib_has_custom_rules)
+                call_fib_notifier(nb, net, event_type, &info);
+#endif
+}
+
+static void fib_notify(struct net *net, struct notifier_block *nb,
+                       enum fib_event_type event_type);
+
+static int call_fib_entry_notifier(struct notifier_block *nb, struct net *net,
+                                   enum fib_event_type event_type, u32 dst,
+                                   int dst_len, struct fib_info *fi,
+                                   u8 tos, u8 type, u32 tb_id, u32 nlflags)
+{
+        struct fib_entry_notifier_info info = {
+                .dst = dst,
+                .dst_len = dst_len,
+                .fi = fi,
+                .tos = tos,
+                .type = type,
+                .tb_id = tb_id,
+                .nlflags = nlflags,
+        };
+        return call_fib_notifier(nb, net, event_type, &info.info);
+}
+
+/**
+ * register_fib_notifier - register a fib notifier block
+ * @nb: notifier
+ *
+ * Register a notifier to be called when FIB entries or rules are
+ * added or removed. A negative errno code is returned on failure.
+ *
+ * When registered, all FIB addition events are replayed to the new
+ * notifier to allow the caller to have a complete view of the FIB
+ * tables.
+ */
 int register_fib_notifier(struct notifier_block *nb)
 {
-        return blocking_notifier_chain_register(&fib_chain, nb);
+        struct net *net;
+        int err;
+
+        rtnl_lock();
+        err = blocking_notifier_chain_register(&fib_chain, nb);
+        if (err)
+                goto unlock;
+        for_each_net(net) {
+                fib_rules_notify(net, nb, FIB_EVENT_RULE_ADD);
+                fib_notify(net, nb, FIB_EVENT_ENTRY_ADD);
+        }
+
+unlock:
+        rtnl_unlock();
+        return err;
 }
 EXPORT_SYMBOL(register_fib_notifier);
 
+/**
+ * unregister_fib_notifier - unregister a fib notifier block
+ * @nb: notifier
+ *
+ * unregister a notifier previously registered by
+ * register_fib_notifier(). A negative errno code is returned on
+ * failure.
+ *
+ * After unregistering, FIB deletion events are synthesized to the
+ * removed notifier for all present FIB entries. This removes the
+ * need for special case cleanup code.
+ */
 int unregister_fib_notifier(struct notifier_block *nb)
 {
-        return blocking_notifier_chain_unregister(&fib_chain, nb);
+        struct net *net;
+        int err;
+
+        rtnl_lock();
+        err = blocking_notifier_chain_unregister(&fib_chain, nb);
+        if (err)
+                goto unlock;
+        for_each_net(net) {
+                fib_notify(net, nb, FIB_EVENT_ENTRY_DEL);
+                fib_rules_notify(net, nb, FIB_EVENT_RULE_DEL);
+        }
+
+unlock:
+        rtnl_unlock();
+        return err;
 }
 EXPORT_SYMBOL(unregister_fib_notifier);
 
@@ -1834,6 +1926,62 @@ int fib_table_flush(struct net *net, struct fib_table *tb)
         return found;
 }
 
+static void fib_leaf_notify(struct net *net, struct key_vector *l,
+                            struct fib_table *tb, struct notifier_block *nb,
+                            enum fib_event_type event_type)
+{
+        struct fib_alias *fa;
+
+        hlist_for_each_entry(fa,