Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump

2016-11-17 Thread Hannes Frederic Sowa
On 17.11.2016 19:32, Ido Schimmel wrote:
> On Thu, Nov 17, 2016 at 06:20:39PM +0100, Hannes Frederic Sowa wrote:
>> On 17.11.2016 17:45, David Miller wrote:
>>> From: Hannes Frederic Sowa 
>>> Date: Thu, 17 Nov 2016 15:36:48 +0100
>>>
 The other way is the journal idea I had, which uses an rb-tree with
 timestamps as keys (can be lamport timestamps). You insert into the tree
 until the dump is finished and use it as queue later to shuffle stuff
 into the hardware.
>>>
>>> If you have this "place" where pending inserts are stored, you have
>>> a policy decision to make.
>>>
>>> First of all what do other lookups see when there are pending entires?
>>
>> I think this is a problem with the current approach already, as the
>> delayed work queue already postpones the insert for an undecidable
>> amount of time (and reorders depending on which CPU the entry was
>> inserted and the fib notifier was called).
> 
> The delayed work queue only postpones the insert into the listener's
> table, but the entries will be present in the kernel's table as usual.
> Therefore, other lookup made on the kernel's table will see the pending
> entries.
> 
> Note that both listeners that currently call the dump API do that during
> their init, before any lookups can be made on their tables. If you think
> it's critical, then we can make sure the workqueues are flushed before
> the listeners register their netdevs and effectively expose their tables
> to lookups.

I do see routing anyway as a best-effort. ;)

That means in not too long time the hardware needs to be fully
synchronized to the software path and either have all correct entries or
abort must have been called.

> I'm looking into the reordering issues you mentioned. I belive it's a
> valid point.

Thanks!

>> For user space queries we would still query the in-kernel table.
> 
> Right. No changes here.
> 
>>> If, once inserted into the pending queue, you return success to the
>>> inserting entity, then you must make those pending entries visible
>>> to lookups.
>>
>> I think this same problem is in this patch set already. The way I
>> understood it, is, that if a problem during insert emerges, the driver
>> calls abort and every packet will be send to user space, as the routing
>> table cannot be offloaded and it won't try it again, Ido?
> 
> First of all, this abort mechanism is already in place and isn't part of
> this patchset. Secondly, why is this a problem? The all-or-nothing
> approach is an hard requirement and current implementation is infinitely
> better than previous approach in which the kernel's tables were flushed
> upon route insertion failure. It rendered the system unusable. Current
> implementation of abort mechanism keeps the system usable, but with
> reduced performance.

Yes, I argued below that I am toally fine with this approach.

>> Probably this is an artifact of the mellanox implementation and we can
>> implement this differently for other cards, but the schema to abort all
>> if the modification doesn't work, seems to be fundamental (I think we
>> require the all-or-nothing approach for now).
> 
> Yes, it's an hard requirement for all implementations. mlxsw implements
> it by evicting all routes from its tables and inserting a default route
> that traps packets to CPU.

Correct, I don't see how fib offloading can do something else, besides
semantically looking at the routing table and figure out where to insert
"exceptions" for subtrees but keep most packets flowing over the hw
directly. But this for another time... ;)

>>> If you block the inserting entity, well that doesn't make a lot of
>>> sense.  If blocking is a workable solution, then we can just block the
>>> entire insert while this FIB dump to the device driver is happening.
>>
>> I don't think we should really block (as in kernel-block) at any time.
>>
>> I was suggesting something like:
>>
>> rtnl_lock();
>> synchronize_rcu_expedited(); // barrier, all rounting modifications are
>> stable and no new can be added due to rtnl_lock
>> register notifier(); // notifier adds entries also into journal();
>> rtnl_unlock();
>> walk_fib_tree_rcu_into_journal();
>> // walk finished
>> start_syncing_journal_to_hw(); // if new entries show up we sync them
>> asap after this point
>>
>> The journal would need a spin lock to protect its internal state and
>> order events correctly.
>>
>>> But I am pretty sure the idea of blocking modifications for so long
>>> was considered undesirable.
>>
>> Yes, this is also still my opinion.
> 
> It was indeed rejected :)
> https://marc.info/?l=linux-netdev=147794848224884=2

Bye,
Hannes



Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump

2016-11-17 Thread Hannes Frederic Sowa
On 17.11.2016 19:16, David Miller wrote:
> From: Hannes Frederic Sowa 
> Date: Thu, 17 Nov 2016 18:20:39 +0100
> 
>> Hi,
>>
>> On 17.11.2016 17:45, David Miller wrote:
>>> From: Hannes Frederic Sowa 
>>> Date: Thu, 17 Nov 2016 15:36:48 +0100
>>>
 The other way is the journal idea I had, which uses an rb-tree with
 timestamps as keys (can be lamport timestamps). You insert into the tree
 until the dump is finished and use it as queue later to shuffle stuff
 into the hardware.
>>>
>>> If you have this "place" where pending inserts are stored, you have
>>> a policy decision to make.
>>>
>>> First of all what do other lookups see when there are pending entires?
>>
>> I think this is a problem with the current approach already, as the
>> delayed work queue already postpones the insert for an undecidable
>> amount of time (and reorders depending on which CPU the entry was
>> inserted and the fib notifier was called).
>>
>> For user space queries we would still query the in-kernel table.
> 
> Ok, I think I might misunderstand something.
> 
> What is going into this journal exactly?  The actual full software and
> hardware insert operation, or just the notification to the hardware
> device driver notifiers?

The journal is only used as a timely ordered queue for updating the
hardware in correct order.

The enqueue side is the fib notifier only. If no fib notifier is
registered we don't use this code at all (and also don't hit the lock
protecting this journal in fib insertion/deletion path - fast in-kernel
path is untouched - otherwise just a spin_lock already under rtnl_lock
in slow path).

The fib notifier enqueues the packet with a timestamp into this journal
and can also merge entries while they are in the queue, e.g. we got a
delete from the fib notifier but the rcu walk indicated an addition of
the entry, so we can merge that at this point and depending on the
timestamp remove the entry or drop the deletion event.

We start dequeueing the fib entries into the hardware as soon as the rcu
dump is finished, at this point we are up-to-date in the queue with all
events. New events can be added to the journal (with appropriate
locking) during this time, as the queue was once in proper synced state
we stay proper synchronized. We keep up with the queue in steady state
after the dump, so syncing happens ASAP. Maybe we can also drop the
journal then.

Something alike this described queue is implemented here (haven't
checked if it exactly matches the specification, certainly it provides
more features):
https://github.com/bus1/bus1/blob/master/ipc/bus1/util/queue.h
https://github.com/bus1/bus1/blob/master/ipc/bus1/util/queue.c

For this to work the config path needs to add timestamps to the
fib_infos or fib_aliases.

> The "lookup" I'm mostly concerned with is the fast path where the
> packets being processed actually look up a route.

This doesn't change at all. All code will be hidden in a library that
gets attached to the fib notifier, which is configuration code path.

> I do not think we can return success on the insert to the user yet
> have the route lookup dataplace not return that route on a lookup.

We don't change kernel fast path at all.

If we add/delete a route in software and hardware, kernel indicates
success as soon as software has the entry added. It also gets queued up
in the journal. Journal will be lazily processed, if error happens
during that (e.g. hardware signals table full), abort is called and all
packets go to user space ASAP. User space will always show the route as
it is added to in the first place and after the driver called abort also
process the packets accordingly.

I can imagine this can get very complicated. David's approach with a
counter to check for modifications and a limited number of retries
probably works too, especially because the hardware will probably be
initialized before routing daemons start up and will be synced up
hopefully all the time.

So maybe this is over engineered, but I have no idea how long hardware
needs to sync up a e.g. full IPv4 routing table into hardware (if that
is actually the current goal of this).

Bye,
Hannes



Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump

2016-11-17 Thread Ido Schimmel
On Thu, Nov 17, 2016 at 06:20:39PM +0100, Hannes Frederic Sowa wrote:
> On 17.11.2016 17:45, David Miller wrote:
> > From: Hannes Frederic Sowa 
> > Date: Thu, 17 Nov 2016 15:36:48 +0100
> > 
> >> The other way is the journal idea I had, which uses an rb-tree with
> >> timestamps as keys (can be lamport timestamps). You insert into the tree
> >> until the dump is finished and use it as queue later to shuffle stuff
> >> into the hardware.
> > 
> > If you have this "place" where pending inserts are stored, you have
> > a policy decision to make.
> > 
> > First of all what do other lookups see when there are pending entires?
> 
> I think this is a problem with the current approach already, as the
> delayed work queue already postpones the insert for an undecidable
> amount of time (and reorders depending on which CPU the entry was
> inserted and the fib notifier was called).

The delayed work queue only postpones the insert into the listener's
table, but the entries will be present in the kernel's table as usual.
Therefore, other lookup made on the kernel's table will see the pending
entries.

Note that both listeners that currently call the dump API do that during
their init, before any lookups can be made on their tables. If you think
it's critical, then we can make sure the workqueues are flushed before
the listeners register their netdevs and effectively expose their tables
to lookups.

I'm looking into the reordering issues you mentioned. I belive it's a
valid point.

> For user space queries we would still query the in-kernel table.

Right. No changes here.

> > If, once inserted into the pending queue, you return success to the
> > inserting entity, then you must make those pending entries visible
> > to lookups.
> 
> I think this same problem is in this patch set already. The way I
> understood it, is, that if a problem during insert emerges, the driver
> calls abort and every packet will be send to user space, as the routing
> table cannot be offloaded and it won't try it again, Ido?

First of all, this abort mechanism is already in place and isn't part of
this patchset. Secondly, why is this a problem? The all-or-nothing
approach is an hard requirement and current implementation is infinitely
better than previous approach in which the kernel's tables were flushed
upon route insertion failure. It rendered the system unusable. Current
implementation of abort mechanism keeps the system usable, but with
reduced performance.

> Probably this is an artifact of the mellanox implementation and we can
> implement this differently for other cards, but the schema to abort all
> if the modification doesn't work, seems to be fundamental (I think we
> require the all-or-nothing approach for now).

Yes, it's an hard requirement for all implementations. mlxsw implements
it by evicting all routes from its tables and inserting a default route
that traps packets to CPU.

> > If you block the inserting entity, well that doesn't make a lot of
> > sense.  If blocking is a workable solution, then we can just block the
> > entire insert while this FIB dump to the device driver is happening.
> 
> I don't think we should really block (as in kernel-block) at any time.
> 
> I was suggesting something like:
> 
> rtnl_lock();
> synchronize_rcu_expedited(); // barrier, all rounting modifications are
> stable and no new can be added due to rtnl_lock
> register notifier(); // notifier adds entries also into journal();
> rtnl_unlock();
> walk_fib_tree_rcu_into_journal();
> // walk finished
> start_syncing_journal_to_hw(); // if new entries show up we sync them
> asap after this point
> 
> The journal would need a spin lock to protect its internal state and
> order events correctly.
> 
> > But I am pretty sure the idea of blocking modifications for so long
> > was considered undesirable.
> 
> Yes, this is also still my opinion.

It was indeed rejected :)
https://marc.info/?l=linux-netdev=147794848224884=2


Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump

2016-11-17 Thread David Miller
From: Hannes Frederic Sowa 
Date: Thu, 17 Nov 2016 18:20:39 +0100

> Hi,
> 
> On 17.11.2016 17:45, David Miller wrote:
>> From: Hannes Frederic Sowa 
>> Date: Thu, 17 Nov 2016 15:36:48 +0100
>> 
>>> The other way is the journal idea I had, which uses an rb-tree with
>>> timestamps as keys (can be lamport timestamps). You insert into the tree
>>> until the dump is finished and use it as queue later to shuffle stuff
>>> into the hardware.
>> 
>> If you have this "place" where pending inserts are stored, you have
>> a policy decision to make.
>> 
>> First of all what do other lookups see when there are pending entires?
> 
> I think this is a problem with the current approach already, as the
> delayed work queue already postpones the insert for an undecidable
> amount of time (and reorders depending on which CPU the entry was
> inserted and the fib notifier was called).
> 
> For user space queries we would still query the in-kernel table.

Ok, I think I might misunderstand something.

What is going into this journal exactly?  The actual full software and
hardware insert operation, or just the notification to the hardware
device driver notifiers?

The "lookup" I'm mostly concerned with is the fast path where the
packets being processed actually look up a route.

I do not think we can return success on the insert to the user yet
have the route lookup dataplace not return that route on a lookup.


Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump

2016-11-17 Thread Hannes Frederic Sowa
Hi,

On 17.11.2016 17:45, David Miller wrote:
> From: Hannes Frederic Sowa 
> Date: Thu, 17 Nov 2016 15:36:48 +0100
> 
>> The other way is the journal idea I had, which uses an rb-tree with
>> timestamps as keys (can be lamport timestamps). You insert into the tree
>> until the dump is finished and use it as queue later to shuffle stuff
>> into the hardware.
> 
> If you have this "place" where pending inserts are stored, you have
> a policy decision to make.
> 
> First of all what do other lookups see when there are pending entires?

I think this is a problem with the current approach already, as the
delayed work queue already postpones the insert for an undecidable
amount of time (and reorders depending on which CPU the entry was
inserted and the fib notifier was called).

For user space queries we would still query the in-kernel table.

> If, once inserted into the pending queue, you return success to the
> inserting entity, then you must make those pending entries visible
> to lookups.

I think this same problem is in this patch set already. The way I
understood it, is, that if a problem during insert emerges, the driver
calls abort and every packet will be send to user space, as the routing
table cannot be offloaded and it won't try it again, Ido?

Probably this is an artifact of the mellanox implementation and we can
implement this differently for other cards, but the schema to abort all
if the modification doesn't work, seems to be fundamental (I think we
require the all-or-nothing approach for now).

> If you block the inserting entity, well that doesn't make a lot of
> sense.  If blocking is a workable solution, then we can just block the
> entire insert while this FIB dump to the device driver is happening.

I don't think we should really block (as in kernel-block) at any time.

I was suggesting something like:

rtnl_lock();
synchronize_rcu_expedited(); // barrier, all rounting modifications are
stable and no new can be added due to rtnl_lock
register notifier(); // notifier adds entries also into journal();
rtnl_unlock();
walk_fib_tree_rcu_into_journal();
// walk finished
start_syncing_journal_to_hw(); // if new entries show up we sync them
asap after this point

The journal would need a spin lock to protect its internal state and
order events correctly.

> But I am pretty sure the idea of blocking modifications for so long
> was considered undesirable.

Yes, this is also still my opinion.

Bye,
Hannes



Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump

2016-11-17 Thread Hannes Frederic Sowa
On 17.11.2016 14:10, Ido Schimmel wrote:
> Hi Hannes,
> 
> On Wed, Nov 16, 2016 at 08:43:25PM +0100, Hannes Frederic Sowa wrote:
>> On 16.11.2016 19:51, Ido Schimmel wrote:
>>> On Wed, Nov 16, 2016 at 06:35:45PM +0100, Hannes Frederic Sowa wrote:
 On 16.11.2016 16:18, Ido Schimmel wrote:
> On Wed, Nov 16, 2016 at 03:51:01PM +0100, Hannes Frederic Sowa wrote:
>> On 16.11.2016 15:09, Jiri Pirko wrote:
>>> From: Ido Schimmel 
>>>
>>> Commit b90eb7549499 ("fib: introduce FIB notification infrastructure")
>>> introduced a new notification chain to notify listeners (f.e., switchdev
>>> drivers) about addition and deletion of routes.
>>>
>>> However, upon registration to the chain the FIB tables can already be
>>> populated, which means potential listeners will have an incomplete view
>>> of the tables.
>>>
>>> Solve that by adding an API to request a FIB dump. The dump itself it
>>> done using RCU in order not to starve consumers that need RTNL to make
>>> progress.
>>>
>>> Signed-off-by: Ido Schimmel 
>>> Signed-off-by: Jiri Pirko 
>>
>> Have you looked at potential inconsistencies resulting of RCU walking
>> the table and having concurrent inserts?
>
> Yes. I did try to think about situations in which this approach will
> fail, but I could only find problems with concurrent removals, which I
> addressed in 5/8. In case of concurrent insertions, even if you missed
> the node, you would still get the ENTRY_ADD event to your listener.

 Theoretically a node could still be installed while the deletion event
 fired before registering the notifier. E.g. a synchronize_net before
 dumping could help here?
>>>
>>> If the deletion event fired for some fib alias, then by 5/8 we are
>>> guaranteed that it was already unlinked from the fib alias list in the
>>> leaf in which it was contained. So, while it's possible we didn't
>>> register our listener in time for the deletion event, we won't traverse
>>> this fib alias while dumping the trie anyway. Did I understand you
>>> correctly?
>>>
>>
>> Theoretically we can have the same problem for insertion:
>>
>> You receive a delete event from the notifier that is queued up first but
>> the dump will still see the entry in the fib due to being managed by RCU
>> (the notifier running on another CPU).
>>
>> The problem is that the fib_remove_alias->hlist_del_rcu->WRITE_ONCE is
>> still not strongly ordered against the local fib dump trie walk.
> 
> OK, so I believe my analysis in 5/8 was wrong. Despite CPU A invoking
> fib_remove_alias() in t0 for fa1, it's possible for CPU B doing the fib
> dump to see fa1 in t1, which will lead to fa1 being permanently present
> in the listener's table.

Yep. :(

Also the delayed work queue is only partial ordered (only on one CPU),
so you don't know about when an event is processed from the dump and the
notifier. I think you need to create your own workqueue for doing so.

> Given the above, I think Dave's suggestion is the only applicable
> solution. Do you agree? Any other suggestions?

As I wrote before the problem with seqcounter is, that with a quagga
running on top and pushing updates you end up never getting a stable
view, so maybe some logic to abort in case it cannot be loaded after a
few tries would be fine.

The other way is the journal idea I had, which uses an rb-tree with
timestamps as keys (can be lamport timestamps). You insert into the tree
until the dump is finished and use it as queue later to shuffle stuff
into the hardware.

Because you should be able to hold refs to the fa_info or other structs
directly, I don't expect gigantic memory overhead, just for a cell to
store timetamp and pointer (rb_node).

Bye,
Hannes



Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump

2016-11-17 Thread Ido Schimmel
Hi Hannes,

On Wed, Nov 16, 2016 at 08:43:25PM +0100, Hannes Frederic Sowa wrote:
> On 16.11.2016 19:51, Ido Schimmel wrote:
> > On Wed, Nov 16, 2016 at 06:35:45PM +0100, Hannes Frederic Sowa wrote:
> >> On 16.11.2016 16:18, Ido Schimmel wrote:
> >>> On Wed, Nov 16, 2016 at 03:51:01PM +0100, Hannes Frederic Sowa wrote:
>  On 16.11.2016 15:09, Jiri Pirko wrote:
> > From: Ido Schimmel 
> >
> > Commit b90eb7549499 ("fib: introduce FIB notification infrastructure")
> > introduced a new notification chain to notify listeners (f.e., switchdev
> > drivers) about addition and deletion of routes.
> >
> > However, upon registration to the chain the FIB tables can already be
> > populated, which means potential listeners will have an incomplete view
> > of the tables.
> >
> > Solve that by adding an API to request a FIB dump. The dump itself it
> > done using RCU in order not to starve consumers that need RTNL to make
> > progress.
> >
> > Signed-off-by: Ido Schimmel 
> > Signed-off-by: Jiri Pirko 
> 
>  Have you looked at potential inconsistencies resulting of RCU walking
>  the table and having concurrent inserts?
> >>>
> >>> Yes. I did try to think about situations in which this approach will
> >>> fail, but I could only find problems with concurrent removals, which I
> >>> addressed in 5/8. In case of concurrent insertions, even if you missed
> >>> the node, you would still get the ENTRY_ADD event to your listener.
> >>
> >> Theoretically a node could still be installed while the deletion event
> >> fired before registering the notifier. E.g. a synchronize_net before
> >> dumping could help here?
> > 
> > If the deletion event fired for some fib alias, then by 5/8 we are
> > guaranteed that it was already unlinked from the fib alias list in the
> > leaf in which it was contained. So, while it's possible we didn't
> > register our listener in time for the deletion event, we won't traverse
> > this fib alias while dumping the trie anyway. Did I understand you
> > correctly?
> > 
> 
> Theoretically we can have the same problem for insertion:
> 
> You receive a delete event from the notifier that is queued up first but
> the dump will still see the entry in the fib due to being managed by RCU
> (the notifier running on another CPU).
> 
> The problem is that the fib_remove_alias->hlist_del_rcu->WRITE_ONCE is
> still not strongly ordered against the local fib dump trie walk.

OK, so I believe my analysis in 5/8 was wrong. Despite CPU A invoking
fib_remove_alias() in t0 for fa1, it's possible for CPU B doing the fib
dump to see fa1 in t1, which will lead to fa1 being permanently present
in the listener's table.

Given the above, I think Dave's suggestion is the only applicable
solution. Do you agree? Any other suggestions?

Thanks


Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump

2016-11-17 Thread David Miller
From: Hannes Frederic Sowa 
Date: Thu, 17 Nov 2016 15:36:48 +0100

> The other way is the journal idea I had, which uses an rb-tree with
> timestamps as keys (can be lamport timestamps). You insert into the tree
> until the dump is finished and use it as queue later to shuffle stuff
> into the hardware.

If you have this "place" where pending inserts are stored, you have
a policy decision to make.

First of all what do other lookups see when there are pending entires?

If, once inserted into the pending queue, you return success to the
inserting entity, then you must make those pending entries visible
to lookups.

If you block the inserting entity, well that doesn't make a lot of
sense.  If blocking is a workable solution, then we can just block the
entire insert while this FIB dump to the device driver is happening.

But I am pretty sure the idea of blocking modifications for so long
was considered undesirable.


Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump

2016-11-16 Thread Ido Schimmel
On Wed, Nov 16, 2016 at 08:43:25PM +0100, Hannes Frederic Sowa wrote:
> On 16.11.2016 19:51, Ido Schimmel wrote:
> > On Wed, Nov 16, 2016 at 06:35:45PM +0100, Hannes Frederic Sowa wrote:
> >> On 16.11.2016 16:18, Ido Schimmel wrote:
> >>> On Wed, Nov 16, 2016 at 03:51:01PM +0100, Hannes Frederic Sowa wrote:
>  On 16.11.2016 15:09, Jiri Pirko wrote:
> > From: Ido Schimmel 
> >
> > Commit b90eb7549499 ("fib: introduce FIB notification infrastructure")
> > introduced a new notification chain to notify listeners (f.e., switchdev
> > drivers) about addition and deletion of routes.
> >
> > However, upon registration to the chain the FIB tables can already be
> > populated, which means potential listeners will have an incomplete view
> > of the tables.
> >
> > Solve that by adding an API to request a FIB dump. The dump itself it
> > done using RCU in order not to starve consumers that need RTNL to make
> > progress.
> >
> > Signed-off-by: Ido Schimmel 
> > Signed-off-by: Jiri Pirko 
> 
>  Have you looked at potential inconsistencies resulting of RCU walking
>  the table and having concurrent inserts?
> >>>
> >>> Yes. I did try to think about situations in which this approach will
> >>> fail, but I could only find problems with concurrent removals, which I
> >>> addressed in 5/8. In case of concurrent insertions, even if you missed
> >>> the node, you would still get the ENTRY_ADD event to your listener.
> >>
> >> Theoretically a node could still be installed while the deletion event
> >> fired before registering the notifier. E.g. a synchronize_net before
> >> dumping could help here?
> > 
> > If the deletion event fired for some fib alias, then by 5/8 we are
> > guaranteed that it was already unlinked from the fib alias list in the
> > leaf in which it was contained. So, while it's possible we didn't
> > register our listener in time for the deletion event, we won't traverse
> > this fib alias while dumping the trie anyway. Did I understand you
> > correctly?
> > 
> 
> Theoretically we can have the same problem for insertion:
> 
> You receive a delete event from the notifier that is queued up first but
> the dump will still see the entry in the fib due to being managed by RCU
> (the notifier running on another CPU).
> 
> The problem is that the fib_remove_alias->hlist_del_rcu->WRITE_ONCE is
> still not strongly ordered against the local fib dump trie walk.

It's pretty late here so I would have to check this out tomorrow
morning. If this is indeed the case (not saying you're wrong, just want
to verify for myself), then I guess 5/8 can be dropped and instead we
should go with Dave's suggestion? I don't see any other way given the
constraints...

Thanks a lot Hannes!


Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump

2016-11-16 Thread Hannes Frederic Sowa
On 16.11.2016 19:51, Ido Schimmel wrote:
> Hi,
> 
> On Wed, Nov 16, 2016 at 06:35:45PM +0100, Hannes Frederic Sowa wrote:
>> On 16.11.2016 16:18, Ido Schimmel wrote:
>>> On Wed, Nov 16, 2016 at 03:51:01PM +0100, Hannes Frederic Sowa wrote:
 On 16.11.2016 15:09, Jiri Pirko wrote:
> From: Ido Schimmel 
>
> Commit b90eb7549499 ("fib: introduce FIB notification infrastructure")
> introduced a new notification chain to notify listeners (f.e., switchdev
> drivers) about addition and deletion of routes.
>
> However, upon registration to the chain the FIB tables can already be
> populated, which means potential listeners will have an incomplete view
> of the tables.
>
> Solve that by adding an API to request a FIB dump. The dump itself it
> done using RCU in order not to starve consumers that need RTNL to make
> progress.
>
> Signed-off-by: Ido Schimmel 
> Signed-off-by: Jiri Pirko 

 Have you looked at potential inconsistencies resulting of RCU walking
 the table and having concurrent inserts?
>>>
>>> Yes. I did try to think about situations in which this approach will
>>> fail, but I could only find problems with concurrent removals, which I
>>> addressed in 5/8. In case of concurrent insertions, even if you missed
>>> the node, you would still get the ENTRY_ADD event to your listener.
>>
>> Theoretically a node could still be installed while the deletion event
>> fired before registering the notifier. E.g. a synchronize_net before
>> dumping could help here?
> 
> If the deletion event fired for some fib alias, then by 5/8 we are
> guaranteed that it was already unlinked from the fib alias list in the
> leaf in which it was contained. So, while it's possible we didn't
> register our listener in time for the deletion event, we won't traverse
> this fib alias while dumping the trie anyway. Did I understand you
> correctly?
> 

Theoretically we can have the same problem for insertion:

You receive a delete event from the notifier that is queued up first but
the dump will still see the entry in the fib due to being managed by RCU
(the notifier running on another CPU).

The problem is that the fib_remove_alias->hlist_del_rcu->WRITE_ONCE is
still not strongly ordered against the local fib dump trie walk.

>> I don't know how you prepare the data structures for inserting in into
>> the hardware, but if ordering matters, the notifier for a delete event
>> can be called before the dump installed the fib entry?
> 
> Right. It's possible for the listener to receive a deletion event for a
> fib entry it doesn't have, in which case it should just ignore it (as
> current listeners do).

Yep, for this specific case.

Bye,
Hannes



Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump

2016-11-16 Thread Ido Schimmel
Hi,

On Wed, Nov 16, 2016 at 06:35:45PM +0100, Hannes Frederic Sowa wrote:
> On 16.11.2016 16:18, Ido Schimmel wrote:
> > On Wed, Nov 16, 2016 at 03:51:01PM +0100, Hannes Frederic Sowa wrote:
> >> On 16.11.2016 15:09, Jiri Pirko wrote:
> >>> From: Ido Schimmel 
> >>>
> >>> Commit b90eb7549499 ("fib: introduce FIB notification infrastructure")
> >>> introduced a new notification chain to notify listeners (f.e., switchdev
> >>> drivers) about addition and deletion of routes.
> >>>
> >>> However, upon registration to the chain the FIB tables can already be
> >>> populated, which means potential listeners will have an incomplete view
> >>> of the tables.
> >>>
> >>> Solve that by adding an API to request a FIB dump. The dump itself it
> >>> done using RCU in order not to starve consumers that need RTNL to make
> >>> progress.
> >>>
> >>> Signed-off-by: Ido Schimmel 
> >>> Signed-off-by: Jiri Pirko 
> >>
> >> Have you looked at potential inconsistencies resulting of RCU walking
> >> the table and having concurrent inserts?
> > 
> > Yes. I did try to think about situations in which this approach will
> > fail, but I could only find problems with concurrent removals, which I
> > addressed in 5/8. In case of concurrent insertions, even if you missed
> > the node, you would still get the ENTRY_ADD event to your listener.
> 
> Theoretically a node could still be installed while the deletion event
> fired before registering the notifier. E.g. a synchronize_net before
> dumping could help here?

If the deletion event fired for some fib alias, then by 5/8 we are
guaranteed that it was already unlinked from the fib alias list in the
leaf in which it was contained. So, while it's possible we didn't
register our listener in time for the deletion event, we won't traverse
this fib alias while dumping the trie anyway. Did I understand you
correctly?

> I don't know how you prepare the data structures for inserting in into
> the hardware, but if ordering matters, the notifier for a delete event
> can be called before the dump installed the fib entry?

Right. It's possible for the listener to receive a deletion event for a
fib entry it doesn't have, in which case it should just ignore it (as
current listeners do).


Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump

2016-11-16 Thread Hannes Frederic Sowa
On 16.11.2016 16:18, Ido Schimmel wrote:
> Hi Hannes,
> 
> On Wed, Nov 16, 2016 at 03:51:01PM +0100, Hannes Frederic Sowa wrote:
>> On 16.11.2016 15:09, Jiri Pirko wrote:
>>> From: Ido Schimmel 
>>>
>>> Commit b90eb7549499 ("fib: introduce FIB notification infrastructure")
>>> introduced a new notification chain to notify listeners (f.e., switchdev
>>> drivers) about addition and deletion of routes.
>>>
>>> However, upon registration to the chain the FIB tables can already be
>>> populated, which means potential listeners will have an incomplete view
>>> of the tables.
>>>
>>> Solve that by adding an API to request a FIB dump. The dump itself it
>>> done using RCU in order not to starve consumers that need RTNL to make
>>> progress.
>>>
>>> Signed-off-by: Ido Schimmel 
>>> Signed-off-by: Jiri Pirko 
>>
>> Have you looked at potential inconsistencies resulting of RCU walking
>> the table and having concurrent inserts?
> 
> Yes. I did try to think about situations in which this approach will
> fail, but I could only find problems with concurrent removals, which I
> addressed in 5/8. In case of concurrent insertions, even if you missed
> the node, you would still get the ENTRY_ADD event to your listener.

Theoretically a node could still be installed while the deletion event
fired before registering the notifier. E.g. a synchronize_net before
dumping could help here?

I don't know how you prepare the data structures for inserting in into
the hardware, but if ordering matters, the notifier for a delete event
can be called before the dump installed the fib entry? E.g. you also
don't check what events are already queued in the delayed work queue so
far. I hope I understand the patches correctly?

Thanks,
Hannes



Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump

2016-11-16 Thread Hannes Frederic Sowa
On 16.11.2016 17:27, David Miller wrote:
> From: Hannes Frederic Sowa 
> Date: Wed, 16 Nov 2016 15:51:01 +0100
> 
>> I don't see a way around doing a journal like in filesystems somehow,
> 
> We really just need a sequence counter incremented for each insert/remove,
> and restart the dump from the beginning if it changes mid-dump.

I thought about this at first, too, but became afraid if we could end up
looping endlessly because of routing changes happen constantly while
trying to upload the fib into the hardware.




Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump

2016-11-16 Thread Ido Schimmel
Hi Dave,

On Wed, Nov 16, 2016 at 11:27:35AM -0500, David Miller wrote:
> From: Hannes Frederic Sowa 
> Date: Wed, 16 Nov 2016 15:51:01 +0100
> 
> > I don't see a way around doing a journal like in filesystems somehow,
> 
> We really just need a sequence counter incremented for each insert/remove,
> and restart the dump from the beginning if it changes mid-dump.

When you dump the FIB table your listener is already registered to the
chain, so any changes happening mid-dump will be propagated to it. I've
noted this in 5/8 and in my reply to Hannes.


Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump

2016-11-16 Thread David Miller
From: Hannes Frederic Sowa 
Date: Wed, 16 Nov 2016 15:51:01 +0100

> I don't see a way around doing a journal like in filesystems somehow,

We really just need a sequence counter incremented for each insert/remove,
and restart the dump from the beginning if it changes mid-dump.


Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump

2016-11-16 Thread Ido Schimmel
Hi Hannes,

On Wed, Nov 16, 2016 at 03:51:01PM +0100, Hannes Frederic Sowa wrote:
> On 16.11.2016 15:09, Jiri Pirko wrote:
> > From: Ido Schimmel 
> > 
> > Commit b90eb7549499 ("fib: introduce FIB notification infrastructure")
> > introduced a new notification chain to notify listeners (f.e., switchdev
> > drivers) about addition and deletion of routes.
> > 
> > However, upon registration to the chain the FIB tables can already be
> > populated, which means potential listeners will have an incomplete view
> > of the tables.
> > 
> > Solve that by adding an API to request a FIB dump. The dump itself it
> > done using RCU in order not to starve consumers that need RTNL to make
> > progress.
> > 
> > Signed-off-by: Ido Schimmel 
> > Signed-off-by: Jiri Pirko 
> 
> Have you looked at potential inconsistencies resulting of RCU walking
> the table and having concurrent inserts?

Yes. I did try to think about situations in which this approach will
fail, but I could only find problems with concurrent removals, which I
addressed in 5/8. In case of concurrent insertions, even if you missed
the node, you would still get the ENTRY_ADD event to your listener.


Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump

2016-11-16 Thread Hannes Frederic Sowa
On 16.11.2016 15:09, Jiri Pirko wrote:
> From: Ido Schimmel 
> 
> Commit b90eb7549499 ("fib: introduce FIB notification infrastructure")
> introduced a new notification chain to notify listeners (f.e., switchdev
> drivers) about addition and deletion of routes.
> 
> However, upon registration to the chain the FIB tables can already be
> populated, which means potential listeners will have an incomplete view
> of the tables.
> 
> Solve that by adding an API to request a FIB dump. The dump itself it
> done using RCU in order not to starve consumers that need RTNL to make
> progress.
> 
> Signed-off-by: Ido Schimmel 
> Signed-off-by: Jiri Pirko 

Have you looked at potential inconsistencies resulting of RCU walking
the table and having concurrent inserts?

I don't see a way around doing a journal like in filesystems somehow,
but maybe the effects are not that severe and it is not a problem after all.

Bye,
Hannes