On Fri, Aug 26, 2022 at 10:58:38AM +0200, Claudio Jeker wrote:
> Noticed on a route collector with >100 full feeds and well 80Mio prefixes.
> On startup the RDE slurps in a lot of messages and then slowly processes
> them. Those are mostly IMSG_UDPATE but the current code also queues
> IMSG_SESSION_DOWN, IMSG_SESSION_UP and the graceful restart imsgs.
> It does not queue IMSG_SESSION_ADD and this is a problem.
>
> If the queue of a neighbor is long then an IMSG_SESSION_DOWN can be
> delayed after processing the IMSG_SESSION_ADD which is sent when the
> session is reestablished. The result is that a peer is removed in the
> RDE that is actually active. Most noticable is that the log is filled
> with "rde_dispatch: unknown peer id XYZ".
>
> To solve this issue IMSG_SESSION_DOWN needs to be called immeditatly.
> For graceful restart the same is required. In the graceful restart case
> two messages indicate a session reset (IMSG_SESSION_STALE and the new
> IMSG_SESSION_NOGRACE). Depending on the AFI/SAFI settings in the GR
> capability either one or the other is sent when the session is reset.
>
> For the above 3 messages the RDE should stop all rib dump runners and more
> importantly flush the peer imsg queue. All those queued messages no longer
> matter.
>
> IMSG_SESSION_UP could still be handled through the queue since it is the
> first message sent after IMSG_SESSION_ADD. But I decided to also handle it
> immediatly.
>
> The graceful restart imsgs IMSG_SESSION_FLUSH and IMSG_SESSION_RESTARTED
> are there to clean up stale routes (either after a timeout or a successful
> restart). Again it is best to handle those messages immediately and clean
> up the RIB (even though a backlog of UPDATES may be present).
> In this case the imsg queue should not be flushed since there may be an
> active connection. Note: IMSG_SESSION_RESTARTED is sent based on an EoR
> marker from an IMSG_UPDATE (so it is delayed and in order because of that).
>
> With this only IMSG_UPDATE and IMSG_REFRESH are queued.
>
> I'm running this diff on the route collector and some test systems and it
> seems to solve the issue.
This reads fine to me. One comment:
> Index: rde_peer.c
> ===================================================================
> RCS file: /cvs/src/usr.sbin/bgpd/rde_peer.c,v
> retrieving revision 1.21
> diff -u -p -r1.21 rde_peer.c
> --- rde_peer.c 17 Aug 2022 15:15:26 -0000 1.21
> +++ rde_peer.c 26 Aug 2022 08:24:49 -0000
> @@ -194,7 +194,7 @@ peer_add(uint32_t id, struct peer_config
>
> if ((peer = peer_get(id))) {
> memcpy(&peer->conf, p_conf, sizeof(struct peer_config));
> - return (NULL);
> + return (peer);
In peer_init() there is a fatalx() that ensures that PEER_ID_SELF is
not present in the peer list. This no longer works with this change.
I can't tell if this sanity check is important or whether it can go.