Re: [PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier

2016-11-02 Thread David Miller
From: Jiri Pirko 
Date: Wed, 2 Nov 2016 08:35:02 +0100

> How do you imagine this mode should looks like? Could you draw me some
> example?

Well, first of all, there is no reason we can't provide a mechanism by
which the driver can request and obtain a FIB dump.

And it can be designed in a way to be preemptible or at least not
require RTNL to be held during the entire operation.  Sequence
counters or similar can be used to make sure that if the table changes
mid-dump due to RTNL being dropped, the dump can be rewound and
restarted.


Re: [PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier

2016-11-02 Thread Jiri Pirko
Wed, Nov 02, 2016 at 03:35:03PM CET, ro...@cumulusnetworks.com wrote:
>On 11/2/16, 6:48 AM, Jiri Pirko wrote:
>> Wed, Nov 02, 2016 at 02:29:40PM CET, ro...@cumulusnetworks.com wrote:
>>> On Wed, Nov 2, 2016 at 12:20 AM, Jiri Pirko  wrote:
 Wed, Nov 02, 2016 at 03:13:42AM CET, ro...@cumulusnetworks.com wrote:
>>> [snip]
>>>
> I understand..but, if you are adding some core infrastructure for 
> switchdev ..it cannot be
> based on the number of simple use-cases or data you have today.
>
> I won't be surprised if tomorrow other switch drivers have a case where 
> they need to
> reset the hw routing table state and reprogram all routes again. 
> Re-registering the notifier to just
> get the routing state of the kernel will not scale. For the long term, 
> since the driver does not maintain a cache,
 Driver (mlxsw, rocker) maintain a cache. So I'm not sure why you say
 otherwise.


> a pull api with efficient use of rtnl will be useful for other such cases 
> as well.
 How do you imagine this "pull API" should look like?
>>>
>>> Just like you already have added fib notifiers to parallel fib netlink
>>> notifications, the pull API is  a parallel to 'netlink dump'.
>>> Is my imagination too wild  ? :)
>> Perhaps I'm slow, but I don't understand what you mean.
> 
>
> If you don't want to get to the complexity of a new api right away 
> because of the
> simple case of management interface routes you have, Can your driver 
> register the notifier early  ?
> (I am sure you have probably already thought about this)
 Register early? What it would resolve? I must be missing something. We
 register as early as possible. But the thing is, we cannot register
 in a past. And that is what this patch resolves.
>>> sure, you must be having a valid problem then. I was just curious why
>>> your driver is not up and initialized before any of the addresses or
>>> routes get configured in the system (even on a management port). Ours
>> If you unload the module and load it again for example. This is a valid
>> usecase.
>
>I see, so you are optimizing for this use case. sure it is a valid use-case 
>but a narrow one

It is not an optimization, it's a bug fix.


>compared to the rtnl overhead the api may bring
> (note that i am not saying you should not solve it).
>
>>
>>
>>> does. But i agree there can be races and you cannot always guarantee
>>> (I was just responding to ido's comment about adding complexity for a
>>> small problem he has to solve for management routes). Our driver does
>>> a pull before it starts. This helps when we want to reset the hardware
>>> routing table state too.
>> Can you point me to you driver in the tree? I would like to see how you
>> do "the pull".
>:), you know all this... but, if i must explicitly say it, yes,  we don't have 
>a driver in the tree and
>we don't own the hardware. My analogy here is of a netlink dump that we use 
>heavily for the
>same scale that you will probably deploy.

You are comparing netlink kernel-user api with in kernel api. I don't
think that is comparable, at all. Therefore I asked how you imagine
"the pull" should look like, in kernel. Stating it should look like
some user api part does not help me much :(



>i do give you full credit for the hardware and the driver and switchdev 
>support and all that!.
>
>>
>>>
>>> But, my point was, when you are defining an API, you cannot quantify
>>> the 'past' to be just the very 'close past' or 'the past is just the
>>> management routes that were added' . Tomorrow the 'past' can be the
>>> full routing table if you need to reset the hardware state.
>> Sure.
>
>This pull api was a suggestion for an efficient use of rtnl ...similar to how 
>the netlink
>routing dump handles it. If you cannot imagine an api like that..., sure, your 
>call.

No, that's why I'm asking, because I was under impression you can
imagine that :)


Re: [PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier

2016-11-02 Thread Roopa Prabhu
On 11/2/16, 6:48 AM, Jiri Pirko wrote:
> Wed, Nov 02, 2016 at 02:29:40PM CET, ro...@cumulusnetworks.com wrote:
>> On Wed, Nov 2, 2016 at 12:20 AM, Jiri Pirko  wrote:
>>> Wed, Nov 02, 2016 at 03:13:42AM CET, ro...@cumulusnetworks.com wrote:
>> [snip]
>>
 I understand..but, if you are adding some core infrastructure for 
 switchdev ..it cannot be
 based on the number of simple use-cases or data you have today.

 I won't be surprised if tomorrow other switch drivers have a case where 
 they need to
 reset the hw routing table state and reprogram all routes again. 
 Re-registering the notifier to just
 get the routing state of the kernel will not scale. For the long term, 
 since the driver does not maintain a cache,
>>> Driver (mlxsw, rocker) maintain a cache. So I'm not sure why you say
>>> otherwise.
>>>
>>>
 a pull api with efficient use of rtnl will be useful for other such cases 
 as well.
>>> How do you imagine this "pull API" should look like?
>>
>> Just like you already have added fib notifiers to parallel fib netlink
>> notifications, the pull API is  a parallel to 'netlink dump'.
>> Is my imagination too wild  ? :)
> Perhaps I'm slow, but I don't understand what you mean.
 

 If you don't want to get to the complexity of a new api right away because 
 of the
 simple case of management interface routes you have, Can your driver 
 register the notifier early  ?
 (I am sure you have probably already thought about this)
>>> Register early? What it would resolve? I must be missing something. We
>>> register as early as possible. But the thing is, we cannot register
>>> in a past. And that is what this patch resolves.
>> sure, you must be having a valid problem then. I was just curious why
>> your driver is not up and initialized before any of the addresses or
>> routes get configured in the system (even on a management port). Ours
> If you unload the module and load it again for example. This is a valid
> usecase.

I see, so you are optimizing for this use case. sure it is a valid use-case but 
a narrow one
compared to the rtnl overhead the api may bring
 (note that i am not saying you should not solve it).

>
>
>> does. But i agree there can be races and you cannot always guarantee
>> (I was just responding to ido's comment about adding complexity for a
>> small problem he has to solve for management routes). Our driver does
>> a pull before it starts. This helps when we want to reset the hardware
>> routing table state too.
> Can you point me to you driver in the tree? I would like to see how you
> do "the pull".
:), you know all this... but, if i must explicitly say it, yes,  we don't have 
a driver in the tree and
we don't own the hardware. My analogy here is of a netlink dump that we use 
heavily for the
same scale that you will probably deploy.
i do give you full credit for the hardware and the driver and switchdev support 
and all that!.

>
>>
>> But, my point was, when you are defining an API, you cannot quantify
>> the 'past' to be just the very 'close past' or 'the past is just the
>> management routes that were added' . Tomorrow the 'past' can be the
>> full routing table if you need to reset the hardware state.
> Sure.

This pull api was a suggestion for an efficient use of rtnl ...similar to how 
the netlink
routing dump handles it. If you cannot imagine an api like that..., sure, your 
call.



Re: [PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier

2016-11-02 Thread Jiri Pirko
Wed, Nov 02, 2016 at 02:29:40PM CET, ro...@cumulusnetworks.com wrote:
>On Wed, Nov 2, 2016 at 12:20 AM, Jiri Pirko  wrote:
>> Wed, Nov 02, 2016 at 03:13:42AM CET, ro...@cumulusnetworks.com wrote:
>>>
>[snip]
>
>>>I understand..but, if you are adding some core infrastructure for switchdev 
>>>..it cannot be
>>>based on the number of simple use-cases or data you have today.
>>>
>>>I won't be surprised if tomorrow other switch drivers have a case where they 
>>>need to
>>>reset the hw routing table state and reprogram all routes again. 
>>>Re-registering the notifier to just
>>>get the routing state of the kernel will not scale. For the long term, since 
>>>the driver does not maintain a cache,
>>
>> Driver (mlxsw, rocker) maintain a cache. So I'm not sure why you say
>> otherwise.
>>
>>
>>>a pull api with efficient use of rtnl will be useful for other such cases as 
>>>well.
>>
>> How do you imagine this "pull API" should look like?
>
>
>Just like you already have added fib notifiers to parallel fib netlink
>notifications, the pull API is  a parallel to 'netlink dump'.
>Is my imagination too wild  ? :)

Perhaps I'm slow, but I don't understand what you mean.


>
>
>>
>>
>>>
>>>
>>>If you don't want to get to the complexity of a new api right away because 
>>>of the
>>>simple case of management interface routes you have, Can your driver 
>>>register the notifier early  ?
>>>(I am sure you have probably already thought about this)
>>
>> Register early? What it would resolve? I must be missing something. We
>> register as early as possible. But the thing is, we cannot register
>> in a past. And that is what this patch resolves.
>
>sure, you must be having a valid problem then. I was just curious why
>your driver is not up and initialized before any of the addresses or
>routes get configured in the system (even on a management port). Ours

If you unload the module and load it again for example. This is a valid
usecase.


>does. But i agree there can be races and you cannot always guarantee
>(I was just responding to ido's comment about adding complexity for a
>small problem he has to solve for management routes). Our driver does
>a pull before it starts. This helps when we want to reset the hardware
>routing table state too.

Can you point me to you driver in the tree? I would like to see how you
do "the pull".


>
>
>But, my point was, when you are defining an API, you cannot quantify
>the 'past' to be just the very 'close past' or 'the past is just the
>management routes that were added' . Tomorrow the 'past' can be the
>full routing table if you need to reset the hardware state.

Sure.


Re: [PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier

2016-11-02 Thread Ido Schimmel
On Wed, Nov 02, 2016 at 06:29:40AM -0700, Roopa Prabhu wrote:
> On Wed, Nov 2, 2016 at 12:20 AM, Jiri Pirko  wrote:
> > Wed, Nov 02, 2016 at 03:13:42AM CET, ro...@cumulusnetworks.com wrote:
> >>
> [snip]
> 
> >>I understand..but, if you are adding some core infrastructure for switchdev 
> >>..it cannot be
> >>based on the number of simple use-cases or data you have today.
> >>
> >>I won't be surprised if tomorrow other switch drivers have a case where 
> >>they need to
> >>reset the hw routing table state and reprogram all routes again. 
> >>Re-registering the notifier to just
> >>get the routing state of the kernel will not scale. For the long term, 
> >>since the driver does not maintain a cache,
> >
> > Driver (mlxsw, rocker) maintain a cache. So I'm not sure why you say
> > otherwise.
> >
> >
> >>a pull api with efficient use of rtnl will be useful for other such cases 
> >>as well.
> >
> > How do you imagine this "pull API" should look like?
> 
> 
> Just like you already have added fib notifiers to parallel fib netlink
> notifications, the pull API is  a parallel to 'netlink dump'.
> Is my imagination too wild  ? :)

The question is more about the mechanics of this pull API, because it's
not very clear to me how that should look like. You want consumers to
dump the tables in batches, so that rtnl is held only during the batch
but not in between them? How are the routes passed down? Does the fib
code fill up some struct or does it use the fib chain?

> >>If you don't want to get to the complexity of a new api right away because 
> >>of the
> >>simple case of management interface routes you have, Can your driver 
> >>register the notifier early  ?
> >>(I am sure you have probably already thought about this)
> >
> > Register early? What it would resolve? I must be missing something. We
> > register as early as possible. But the thing is, we cannot register
> > in a past. And that is what this patch resolves.
> 
> sure, you must be having a valid problem then. I was just curious why
> your driver is not up and initialized before any of the addresses or
> routes get configured in the system (even on a management port). Ours
> does. But i agree there can be races and you cannot always guarantee
> (I was just responding to ido's comment about adding complexity for a
> small problem he has to solve for management routes). Our driver does
> a pull before it starts. This helps when we want to reset the hardware
> routing table state too.

One can modprobe the module after routes are already present on other
netdevs. That's actually how I tested the patch.


Re: [PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier

2016-11-02 Thread Roopa Prabhu
On Wed, Nov 2, 2016 at 12:20 AM, Jiri Pirko  wrote:
> Wed, Nov 02, 2016 at 03:13:42AM CET, ro...@cumulusnetworks.com wrote:
>>
[snip]

>>I understand..but, if you are adding some core infrastructure for switchdev 
>>..it cannot be
>>based on the number of simple use-cases or data you have today.
>>
>>I won't be surprised if tomorrow other switch drivers have a case where they 
>>need to
>>reset the hw routing table state and reprogram all routes again. 
>>Re-registering the notifier to just
>>get the routing state of the kernel will not scale. For the long term, since 
>>the driver does not maintain a cache,
>
> Driver (mlxsw, rocker) maintain a cache. So I'm not sure why you say
> otherwise.
>
>
>>a pull api with efficient use of rtnl will be useful for other such cases as 
>>well.
>
> How do you imagine this "pull API" should look like?


Just like you already have added fib notifiers to parallel fib netlink
notifications, the pull API is  a parallel to 'netlink dump'.
Is my imagination too wild  ? :)


>
>
>>
>>
>>If you don't want to get to the complexity of a new api right away because of 
>>the
>>simple case of management interface routes you have, Can your driver register 
>>the notifier early  ?
>>(I am sure you have probably already thought about this)
>
> Register early? What it would resolve? I must be missing something. We
> register as early as possible. But the thing is, we cannot register
> in a past. And that is what this patch resolves.

sure, you must be having a valid problem then. I was just curious why
your driver is not up and initialized before any of the addresses or
routes get configured in the system (even on a management port). Ours
does. But i agree there can be races and you cannot always guarantee
(I was just responding to ido's comment about adding complexity for a
small problem he has to solve for management routes). Our driver does
a pull before it starts. This helps when we want to reset the hardware
routing table state too.


But, my point was, when you are defining an API, you cannot quantify
the 'past' to be just the very 'close past' or 'the past is just the
management routes that were added' . Tomorrow the 'past' can be the
full routing table if you need to reset the hardware state.


Re: [PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier

2016-11-02 Thread Jiri Pirko
Tue, Nov 01, 2016 at 04:36:50PM CET, da...@davemloft.net wrote:
>From: Roopa Prabhu 
>Date: Tue, 01 Nov 2016 08:14:14 -0700
>
>> On 11/1/16, 7:19 AM, Eric Dumazet wrote:
>>> On Tue, 2016-11-01 at 00:57 +0200, Ido Schimmel wrote:
 On Mon, Oct 31, 2016 at 02:24:06PM -0700, Eric Dumazet wrote:
> How well will this work for large FIB tables ?
>
> Holding rtnl while sending thousands of skb will prevent consumers to
> make progress ?
 Can you please clarify what do you mean by "while sending thousands of
 skb"? This patch doesn't generate notifications to user space, but
 instead invokes notification routines inside the kernel. I probably
 misunderstood you.

 Are you suggesting this be done using RCU instead? Well, there are a
 couple of reasons why I took RTNL here:

>>> No, I do not believe RCU is wanted here, in control path where we might
>>> sleep anyway.
>>>
 1) The FIB notification chain is blocking, so listeners are expected to
 be able to sleep. This isn't possible if we use RCU. Note that this
 chain is mainly useful for drivers that reflect the FIB table into a
 capable device and hardware operations usually involve sleeping.

 2) The insertion of a single route is done with RTNL held. I didn't want
 to differentiate between both cases. This property is really useful for
 listeners, as they don't need to worry about locking in writer-side.
 Access to data structs is serialized by RTNL.
>>> My concern was that for large iterations, you might hold RTNL and/or
>>> current cpu for hundred of ms or even seconds...
>>>
>> I have the same concern as Eric here.
>> 
>> I understand why you need it, but can the driver request for an initial dump 
>> and that
>> dump be made more efficient somehow ie not hold rtnl for the whole dump ?.
>> instead of making the fib notifier registration code doing it.
>> 
>> these routing table sizes can be huge and an analogy for this in user-space:
>> We do request a netlink dump of  routing tables at initialization (on driver 
>> starts or resets)...
>> but, existing netlink routing table dumps for that scale don't hold rtnl for 
>> the whole dump.
>> The dump is split into multiple responses to the user and hence it does not 
>> starve other rtnl users.
>> 
>> In-fact I don't think netlink routing table dumps from user-space hold 
>> rtnl_lock for the whole dump.
>> IIRC this was done to allow route add/dels to be allowed in parallel for 
>> performance reasons.
>> (I will need to double check to confirm this).
>
>I've always had some reservations about using notifiers for getting
>the FIB entries down to the offloaded device.

Yeah, me too. But there is no really other way to do it. I thought about
it quite long. But maybe I missed something.


>
>And this problem is just another symptom that it is the wrong
>mechanism for propagating this information.
>
>As suggested by Roopa here, perhaps we're looking at the problem from
>the wrong direction.  We tend to design NDO ops and notifiers, to
>"push" things to the driver, but maybe something more like a push+pull
>model is better.

How do you imagine this mode should looks like? Could you draw me some
example?

Thanks!


Re: [PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier

2016-11-02 Thread Jiri Pirko
Wed, Nov 02, 2016 at 03:13:42AM CET, ro...@cumulusnetworks.com wrote:
>On 11/1/16, 10:03 AM, Ido Schimmel wrote:
>> Hi Roopa,
>>
>> On Tue, Nov 01, 2016 at 08:14:14AM -0700, Roopa Prabhu wrote:
>>>
>[snip]
>>> I have the same concern as Eric here.
>>>
>>> I understand why you need it, but can the driver request for an initial 
>>> dump and that
>>> dump be made more efficient somehow ie not hold rtnl for the whole dump ?.
>>> instead of making the fib notifier registration code doing it.
>> We can do what we suggested in the last bi-weekly meeting, which is
>> still holding rtnl, but moving the hardware operation to delayed work.
>> This is possible because upper layers always assume operation was
>> successful and driver is responsible for invoking its abort mechanism in
>> case of failure.
>>
>>> these routing table sizes can be huge and an analogy for this in user-space:
>>> We do request a netlink dump of  routing tables at initialization (on 
>>> driver starts or resets)...
>>> but, existing netlink routing table dumps for that scale don't hold rtnl 
>>> for the whole dump.
>>> The dump is split into multiple responses to the user and hence it does not 
>>> starve other rtnl users.
>> In my reply to Eric I mentioned that when we register and unregister
>> from this chain the tables aren't really huge, but instead quite small.
>> I understand your concerns, but I don't wish to make things more
>> complicated than they should be only to address concerns that aren't
>> really realistic.
>
>I understand..but, if you are adding some core infrastructure for switchdev 
>..it cannot be
>based on the number of simple use-cases or data you have today.
>
>I won't be surprised if tomorrow other switch drivers have a case where they 
>need to
>reset the hw routing table state and reprogram all routes again. 
>Re-registering the notifier to just
>get the routing state of the kernel will not scale. For the long term, since 
>the driver does not maintain a cache,

Driver (mlxsw, rocker) maintain a cache. So I'm not sure why you say
otherwise.


>a pull api with efficient use of rtnl will be useful for other such cases as 
>well.

How do you imagine this "pull API" should look like?


>
>
>If you don't want to get to the complexity of a new api right away because of 
>the
>simple case of management interface routes you have, Can your driver register 
>the notifier early  ?
>(I am sure you have probably already thought about this)

Register early? What it would resolve? I must be missing something. We
register as early as possible. But the thing is, we cannot register
in a past. And that is what this patch resolves.


>
>>
>> I believe current patch is quite simple and also consistent with other
>> notification chains in the kernel, such as the netdevice, where rtnl is
>> held during replay of events.
>> http://lxr.free-electrons.com/source/net/core/dev.c#L1535
>as you know, netdev and routing scale are not the same thing.
>Looking at the current code for netdevices (replay and rollback on failure),
>a pull api (equivalent to the netlink dump api) may end up being less 
>complex...with an
>ability to batch in the future.
>
>
>
>
>
>
>


Re: [PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier

2016-11-01 Thread Roopa Prabhu
On 11/1/16, 10:03 AM, Ido Schimmel wrote:
> Hi Roopa,
>
> On Tue, Nov 01, 2016 at 08:14:14AM -0700, Roopa Prabhu wrote:
>>
[snip]
>> I have the same concern as Eric here.
>>
>> I understand why you need it, but can the driver request for an initial dump 
>> and that
>> dump be made more efficient somehow ie not hold rtnl for the whole dump ?.
>> instead of making the fib notifier registration code doing it.
> We can do what we suggested in the last bi-weekly meeting, which is
> still holding rtnl, but moving the hardware operation to delayed work.
> This is possible because upper layers always assume operation was
> successful and driver is responsible for invoking its abort mechanism in
> case of failure.
>
>> these routing table sizes can be huge and an analogy for this in user-space:
>> We do request a netlink dump of  routing tables at initialization (on driver 
>> starts or resets)...
>> but, existing netlink routing table dumps for that scale don't hold rtnl for 
>> the whole dump.
>> The dump is split into multiple responses to the user and hence it does not 
>> starve other rtnl users.
> In my reply to Eric I mentioned that when we register and unregister
> from this chain the tables aren't really huge, but instead quite small.
> I understand your concerns, but I don't wish to make things more
> complicated than they should be only to address concerns that aren't
> really realistic.

I understand..but, if you are adding some core infrastructure for switchdev 
..it cannot be
based on the number of simple use-cases or data you have today.

I won't be surprised if tomorrow other switch drivers have a case where they 
need to
reset the hw routing table state and reprogram all routes again. Re-registering 
the notifier to just
get the routing state of the kernel will not scale. For the long term, since 
the driver does not maintain a cache,
a pull api with efficient use of rtnl will be useful for other such cases as 
well.


If you don't want to get to the complexity of a new api right away because of 
the
simple case of management interface routes you have, Can your driver register 
the notifier early  ?
(I am sure you have probably already thought about this)

>
> I believe current patch is quite simple and also consistent with other
> notification chains in the kernel, such as the netdevice, where rtnl is
> held during replay of events.
> http://lxr.free-electrons.com/source/net/core/dev.c#L1535
as you know, netdev and routing scale are not the same thing.
Looking at the current code for netdevices (replay and rollback on failure),
a pull api (equivalent to the netlink dump api) may end up being less 
complex...with an
ability to batch in the future.









Re: [PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier

2016-11-01 Thread Ido Schimmel
Hi Roopa,

On Tue, Nov 01, 2016 at 08:14:14AM -0700, Roopa Prabhu wrote:
> On 11/1/16, 7:19 AM, Eric Dumazet wrote:
> > On Tue, 2016-11-01 at 00:57 +0200, Ido Schimmel wrote:
> >> On Mon, Oct 31, 2016 at 02:24:06PM -0700, Eric Dumazet wrote:
> >>> How well will this work for large FIB tables ?
> >>>
> >>> Holding rtnl while sending thousands of skb will prevent consumers to
> >>> make progress ?
> >> Can you please clarify what do you mean by "while sending thousands of
> >> skb"? This patch doesn't generate notifications to user space, but
> >> instead invokes notification routines inside the kernel. I probably
> >> misunderstood you.
> >>
> >> Are you suggesting this be done using RCU instead? Well, there are a
> >> couple of reasons why I took RTNL here:
> >>
> > No, I do not believe RCU is wanted here, in control path where we might
> > sleep anyway.
> >
> >> 1) The FIB notification chain is blocking, so listeners are expected to
> >> be able to sleep. This isn't possible if we use RCU. Note that this
> >> chain is mainly useful for drivers that reflect the FIB table into a
> >> capable device and hardware operations usually involve sleeping.
> >>
> >> 2) The insertion of a single route is done with RTNL held. I didn't want
> >> to differentiate between both cases. This property is really useful for
> >> listeners, as they don't need to worry about locking in writer-side.
> >> Access to data structs is serialized by RTNL.
> > My concern was that for large iterations, you might hold RTNL and/or
> > current cpu for hundred of ms or even seconds...
> >
> I have the same concern as Eric here.
> 
> I understand why you need it, but can the driver request for an initial dump 
> and that
> dump be made more efficient somehow ie not hold rtnl for the whole dump ?.
> instead of making the fib notifier registration code doing it.

We can do what we suggested in the last bi-weekly meeting, which is
still holding rtnl, but moving the hardware operation to delayed work.
This is possible because upper layers always assume operation was
successful and driver is responsible for invoking its abort mechanism in
case of failure.

> these routing table sizes can be huge and an analogy for this in user-space:
> We do request a netlink dump of  routing tables at initialization (on driver 
> starts or resets)...
> but, existing netlink routing table dumps for that scale don't hold rtnl for 
> the whole dump.
> The dump is split into multiple responses to the user and hence it does not 
> starve other rtnl users.

In my reply to Eric I mentioned that when we register and unregister
from this chain the tables aren't really huge, but instead quite small.
I understand your concerns, but I don't wish to make things more
complicated than they should be only to address concerns that aren't
really realistic.

I believe current patch is quite simple and also consistent with other
notification chains in the kernel, such as the netdevice, where rtnl is
held during replay of events.
http://lxr.free-electrons.com/source/net/core/dev.c#L1535


Re: [PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier

2016-11-01 Thread Ido Schimmel
On Tue, Nov 01, 2016 at 07:19:59AM -0700, Eric Dumazet wrote:
> On Tue, 2016-11-01 at 00:57 +0200, Ido Schimmel wrote:
> > On Mon, Oct 31, 2016 at 02:24:06PM -0700, Eric Dumazet wrote:
> 
> > > How well will this work for large FIB tables ?
> > > 
> > > Holding rtnl while sending thousands of skb will prevent consumers to
> > > make progress ?
> > 
> > Can you please clarify what do you mean by "while sending thousands of
> > skb"? This patch doesn't generate notifications to user space, but
> > instead invokes notification routines inside the kernel. I probably
> > misunderstood you.
> > 
> > Are you suggesting this be done using RCU instead? Well, there are a
> > couple of reasons why I took RTNL here:
> > 
> 
> No, I do not believe RCU is wanted here, in control path where we might
> sleep anyway.
> 
> > 1) The FIB notification chain is blocking, so listeners are expected to
> > be able to sleep. This isn't possible if we use RCU. Note that this
> > chain is mainly useful for drivers that reflect the FIB table into a
> > capable device and hardware operations usually involve sleeping.
> > 
> > 2) The insertion of a single route is done with RTNL held. I didn't want
> > to differentiate between both cases. This property is really useful for
> > listeners, as they don't need to worry about locking in writer-side.
> > Access to data structs is serialized by RTNL.
> 
> My concern was that for large iterations, you might hold RTNL and/or
> current cpu for hundred of ms or even seconds...

I understand your concern, but I think it's helpful to look at the users
of this API. It was only recently introduced [1] because nobody needed
it beside switch drivers that reflect the FIB table and I believe it'll
stay that way. Currently, only mlxsw and rocker use it.

Now, in these use cases when register_fib_notifier() is called the
switch ports are still not present in the system, so we really only have
a few routes used for management. Similarly, when
unregister_fib_notifier() is called, the switch ports are already gone
and most FIBs were flushed due to NETDEV_UNREGISTER, so again we only
have a handful of FIBs to iterate over.

Does that sound reasonable to you?

1. https://www.spinics.net/lists/netdev/msg397444.html


Re: [PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier

2016-11-01 Thread David Miller
From: Roopa Prabhu 
Date: Tue, 01 Nov 2016 08:14:14 -0700

> On 11/1/16, 7:19 AM, Eric Dumazet wrote:
>> On Tue, 2016-11-01 at 00:57 +0200, Ido Schimmel wrote:
>>> On Mon, Oct 31, 2016 at 02:24:06PM -0700, Eric Dumazet wrote:
 How well will this work for large FIB tables ?

 Holding rtnl while sending thousands of skb will prevent consumers to
 make progress ?
>>> Can you please clarify what do you mean by "while sending thousands of
>>> skb"? This patch doesn't generate notifications to user space, but
>>> instead invokes notification routines inside the kernel. I probably
>>> misunderstood you.
>>>
>>> Are you suggesting this be done using RCU instead? Well, there are a
>>> couple of reasons why I took RTNL here:
>>>
>> No, I do not believe RCU is wanted here, in control path where we might
>> sleep anyway.
>>
>>> 1) The FIB notification chain is blocking, so listeners are expected to
>>> be able to sleep. This isn't possible if we use RCU. Note that this
>>> chain is mainly useful for drivers that reflect the FIB table into a
>>> capable device and hardware operations usually involve sleeping.
>>>
>>> 2) The insertion of a single route is done with RTNL held. I didn't want
>>> to differentiate between both cases. This property is really useful for
>>> listeners, as they don't need to worry about locking in writer-side.
>>> Access to data structs is serialized by RTNL.
>> My concern was that for large iterations, you might hold RTNL and/or
>> current cpu for hundred of ms or even seconds...
>>
> I have the same concern as Eric here.
> 
> I understand why you need it, but can the driver request for an initial dump 
> and that
> dump be made more efficient somehow ie not hold rtnl for the whole dump ?.
> instead of making the fib notifier registration code doing it.
> 
> these routing table sizes can be huge and an analogy for this in user-space:
> We do request a netlink dump of  routing tables at initialization (on driver 
> starts or resets)...
> but, existing netlink routing table dumps for that scale don't hold rtnl for 
> the whole dump.
> The dump is split into multiple responses to the user and hence it does not 
> starve other rtnl users.
> 
> In-fact I don't think netlink routing table dumps from user-space hold 
> rtnl_lock for the whole dump.
> IIRC this was done to allow route add/dels to be allowed in parallel for 
> performance reasons.
> (I will need to double check to confirm this).

I've always had some reservations about using notifiers for getting
the FIB entries down to the offloaded device.

And this problem is just another symptom that it is the wrong
mechanism for propagating this information.

As suggested by Roopa here, perhaps we're looking at the problem from
the wrong direction.  We tend to design NDO ops and notifiers, to
"push" things to the driver, but maybe something more like a push+pull
model is better.


Re: [PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier

2016-11-01 Thread Roopa Prabhu
On 11/1/16, 7:19 AM, Eric Dumazet wrote:
> On Tue, 2016-11-01 at 00:57 +0200, Ido Schimmel wrote:
>> On Mon, Oct 31, 2016 at 02:24:06PM -0700, Eric Dumazet wrote:
>>> How well will this work for large FIB tables ?
>>>
>>> Holding rtnl while sending thousands of skb will prevent consumers to
>>> make progress ?
>> Can you please clarify what do you mean by "while sending thousands of
>> skb"? This patch doesn't generate notifications to user space, but
>> instead invokes notification routines inside the kernel. I probably
>> misunderstood you.
>>
>> Are you suggesting this be done using RCU instead? Well, there are a
>> couple of reasons why I took RTNL here:
>>
> No, I do not believe RCU is wanted here, in control path where we might
> sleep anyway.
>
>> 1) The FIB notification chain is blocking, so listeners are expected to
>> be able to sleep. This isn't possible if we use RCU. Note that this
>> chain is mainly useful for drivers that reflect the FIB table into a
>> capable device and hardware operations usually involve sleeping.
>>
>> 2) The insertion of a single route is done with RTNL held. I didn't want
>> to differentiate between both cases. This property is really useful for
>> listeners, as they don't need to worry about locking in writer-side.
>> Access to data structs is serialized by RTNL.
> My concern was that for large iterations, you might hold RTNL and/or
> current cpu for hundred of ms or even seconds...
>
I have the same concern as Eric here.

I understand why you need it, but can the driver request for an initial dump 
and that
dump be made more efficient somehow ie not hold rtnl for the whole dump ?.
instead of making the fib notifier registration code doing it.

these routing table sizes can be huge and an analogy for this in user-space:
We do request a netlink dump of  routing tables at initialization (on driver 
starts or resets)...
but, existing netlink routing table dumps for that scale don't hold rtnl for 
the whole dump.
The dump is split into multiple responses to the user and hence it does not 
starve other rtnl users.

In-fact I don't think netlink routing table dumps from user-space hold 
rtnl_lock for the whole dump.
IIRC this was done to allow route add/dels to be allowed in parallel for 
performance reasons.
(I will need to double check to confirm this).




Re: [PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier

2016-11-01 Thread Eric Dumazet
On Tue, 2016-11-01 at 00:57 +0200, Ido Schimmel wrote:
> On Mon, Oct 31, 2016 at 02:24:06PM -0700, Eric Dumazet wrote:

> > How well will this work for large FIB tables ?
> > 
> > Holding rtnl while sending thousands of skb will prevent consumers to
> > make progress ?
> 
> Can you please clarify what do you mean by "while sending thousands of
> skb"? This patch doesn't generate notifications to user space, but
> instead invokes notification routines inside the kernel. I probably
> misunderstood you.
> 
> Are you suggesting this be done using RCU instead? Well, there are a
> couple of reasons why I took RTNL here:
> 

No, I do not believe RCU is wanted here, in control path where we might
sleep anyway.

> 1) The FIB notification chain is blocking, so listeners are expected to
> be able to sleep. This isn't possible if we use RCU. Note that this
> chain is mainly useful for drivers that reflect the FIB table into a
> capable device and hardware operations usually involve sleeping.
> 
> 2) The insertion of a single route is done with RTNL held. I didn't want
> to differentiate between both cases. This property is really useful for
> listeners, as they don't need to worry about locking in writer-side.
> Access to data structs is serialized by RTNL.

My concern was that for large iterations, you might hold RTNL and/or
current cpu for hundred of ms or even seconds...





Re: [PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier

2016-10-31 Thread Ido Schimmel
On Mon, Oct 31, 2016 at 02:24:06PM -0700, Eric Dumazet wrote:
> On Mon, 2016-10-31 at 23:13 +0200, ido...@idosch.org wrote:
> > From: Ido Schimmel 
> > 
> > When registering a FIB notifier block we should notify the caller of
> > already existing FIB entries and rules, as it would otherwise have an
> > incomplete view of the FIB tables. This is especially important for
> > switchdev drivers that support FIB offloads. Failing to notify them of
> > existing entries may lead to packet loss.
> > 
> > Upon registration, walk the leafs of all the routing tables and for each
> > leaf send notification of existing FIB aliases. Similarly, when
> > unregistering the notifier synthesize a deletion event, thereby
> > relieving potential callers from the need to perform cleanup.
> > 
> > The above is consistent with the netdevice notification chain, where
> > "registration and up events are replayed to the new notifier" upon
> > registration.
> > 
> > Signed-off-by: Ido Schimmel 
> > Reviewed-by: Jiri Pirko 
> 
> How well will this work for large FIB tables ?
> 
> Holding rtnl while sending thousands of skb will prevent consumers to
> make progress ?

Can you please clarify what do you mean by "while sending thousands of
skb"? This patch doesn't generate notifications to user space, but
instead invokes notification routines inside the kernel. I probably
misunderstood you.

Are you suggesting this be done using RCU instead? Well, there are a
couple of reasons why I took RTNL here:

1) The FIB notification chain is blocking, so listeners are expected to
be able to sleep. This isn't possible if we use RCU. Note that this
chain is mainly useful for drivers that reflect the FIB table into a
capable device and hardware operations usually involve sleeping.

2) The insertion of a single route is done with RTNL held. I didn't want
to differentiate between both cases. This property is really useful for
listeners, as they don't need to worry about locking in writer-side.
Access to data structs is serialized by RTNL.

Thanks for reviewing.


Re: [PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier

2016-10-31 Thread Eric Dumazet
On Mon, 2016-10-31 at 23:13 +0200, ido...@idosch.org wrote:
> From: Ido Schimmel 
> 
> When registering a FIB notifier block we should notify the caller of
> already existing FIB entries and rules, as it would otherwise have an
> incomplete view of the FIB tables. This is especially important for
> switchdev drivers that support FIB offloads. Failing to notify them of
> existing entries may lead to packet loss.
> 
> Upon registration, walk the leafs of all the routing tables and for each
> leaf send notification of existing FIB aliases. Similarly, when
> unregistering the notifier synthesize a deletion event, thereby
> relieving potential callers from the need to perform cleanup.
> 
> The above is consistent with the netdevice notification chain, where
> "registration and up events are replayed to the new notifier" upon
> registration.
> 
> Signed-off-by: Ido Schimmel 
> Reviewed-by: Jiri Pirko 



How well will this work for large FIB tables ?

Holding rtnl while sending thousands of skb will prevent consumers to
make progress ?






[PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier

2016-10-31 Thread idosch
From: Ido Schimmel 

When registering a FIB notifier block we should notify the caller of
already existing FIB entries and rules, as it would otherwise have an
incomplete view of the FIB tables. This is especially important for
switchdev drivers that support FIB offloads. Failing to notify them of
existing entries may lead to packet loss.

Upon registration, walk the leafs of all the routing tables and for each
leaf send notification of existing FIB aliases. Similarly, when
unregistering the notifier synthesize a deletion event, thereby
relieving potential callers from the need to perform cleanup.

The above is consistent with the netdevice notification chain, where
"registration and up events are replayed to the new notifier" upon
registration.

Signed-off-by: Ido Schimmel 
Reviewed-by: Jiri Pirko 
---
v1->v2:
* Add #ifdef fixing compilation error reported by kbuild robot
---
 net/ipv4/fib_trie.c | 152 +++-
 1 file changed, 150 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 31cef36..455d94b 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -86,15 +86,107 @@
 
 static BLOCKING_NOTIFIER_HEAD(fib_chain);
 
+static int call_fib_notifier(struct notifier_block *nb, struct net *net,
+enum fib_event_type event_type,
+struct fib_notifier_info *info)
+{
+   info->net = net;
+   return nb->notifier_call(nb, event_type, info);
+}
+
+static void fib_rules_notify(struct net *net, struct notifier_block *nb,
+enum fib_event_type event_type)
+{
+#ifdef CONFIG_IP_MULTIPLE_TABLES
+   struct fib_notifier_info info;
+
+   if (net->ipv4.fib_has_custom_rules)
+   call_fib_notifier(nb, net, event_type, );
+#endif
+}
+
+static void fib_notify(struct net *net, struct notifier_block *nb,
+  enum fib_event_type event_type);
+
+static int call_fib_entry_notifier(struct notifier_block *nb, struct net *net,
+  enum fib_event_type event_type, u32 dst,
+  int dst_len, struct fib_info *fi,
+  u8 tos, u8 type, u32 tb_id, u32 nlflags)
+{
+   struct fib_entry_notifier_info info = {
+   .dst = dst,
+   .dst_len = dst_len,
+   .fi = fi,
+   .tos = tos,
+   .type = type,
+   .tb_id = tb_id,
+   .nlflags = nlflags,
+   };
+   return call_fib_notifier(nb, net, event_type, );
+}
+
+/**
+ * register_fib_notifier - register a fib notifier block
+ * @nb: notifier
+ *
+ * Register a notifier to be called when FIB entries or rules are
+ * added or removed. A negative errno code is returned on failure.
+ *
+ * When registered, all FIB addition events are replayed to the new
+ * notifier to allow the caller to have a complete view of the FIB
+ * tables.
+ */
+
 int register_fib_notifier(struct notifier_block *nb)
 {
-   return blocking_notifier_chain_register(_chain, nb);
+   struct net *net;
+   int err;
+
+   rtnl_lock();
+   err = blocking_notifier_chain_register(_chain, nb);
+   if (err)
+   goto unlock;
+   for_each_net(net) {
+   fib_rules_notify(net, nb, FIB_EVENT_RULE_ADD);
+   fib_notify(net, nb, FIB_EVENT_ENTRY_ADD);
+   }
+
+unlock:
+   rtnl_unlock();
+   return err;
 }
 EXPORT_SYMBOL(register_fib_notifier);
 
+/**
+ * unregister_fib_notifier - unregister a fib notifier block
+ * @nb: notifier
+ *
+ * unregister a notifier previously registered by
+ * register_fib_notifier(). A negative errno code is returned on
+ * failure.
+ *
+ * After unregistering, FIB deletion events are synthesized to the
+ * removed notifier for all present FIB entries. This removes the
+ * need for special case cleanup code.
+ */
+
 int unregister_fib_notifier(struct notifier_block *nb)
 {
-   return blocking_notifier_chain_unregister(_chain, nb);
+   struct net *net;
+   int err;
+
+   rtnl_lock();
+   err = blocking_notifier_chain_unregister(_chain, nb);
+   if (err)
+   goto unlock;
+   for_each_net(net) {
+   fib_notify(net, nb, FIB_EVENT_ENTRY_DEL);
+   fib_rules_notify(net, nb, FIB_EVENT_RULE_DEL);
+   }
+
+unlock:
+   rtnl_unlock();
+   return err;
 }
 EXPORT_SYMBOL(unregister_fib_notifier);
 
@@ -1834,6 +1926,62 @@ int fib_table_flush(struct net *net, struct fib_table 
*tb)
return found;
 }
 
+static void fib_leaf_notify(struct net *net, struct key_vector *l,
+   struct fib_table *tb, struct notifier_block *nb,
+   enum fib_event_type event_type)
+{
+   struct fib_alias *fa;
+
+   hlist_for_each_entry(fa,