Re: bgpd fix peer signaling bug for busy systems

Claudio Jeker Fri, 26 Aug 2022 05:07:59 -0700

On Fri, Aug 26, 2022 at 01:42:15PM +0200, Theo Buehler wrote:
> On Fri, Aug 26, 2022 at 10:58:38AM +0200, Claudio Jeker wrote:
> > Noticed on a route collector with >100 full feeds and well 80Mio prefixes.
> > On startup the RDE slurps in a lot of messages and then slowly processes
> > them. Those are mostly IMSG_UDPATE but the current code also queues
> > IMSG_SESSION_DOWN, IMSG_SESSION_UP and the graceful restart imsgs.
> > It does not queue IMSG_SESSION_ADD and this is a problem.
> > 
> > If the queue of a neighbor is long then an IMSG_SESSION_DOWN can be
> > delayed after processing the IMSG_SESSION_ADD which is sent when the
> > session is reestablished. The result is that a peer is removed in the
> > RDE that is actually active. Most noticable is that the log is filled
> > with "rde_dispatch: unknown peer id XYZ".
> > 
> > To solve this issue IMSG_SESSION_DOWN needs to be called immeditatly.
> > For graceful restart the same is required. In the graceful restart case
> > two messages indicate a session reset (IMSG_SESSION_STALE and the new
> > IMSG_SESSION_NOGRACE). Depending on the AFI/SAFI settings in the GR
> > capability either one or the other is sent when the session is reset.
> > 
> > For the above 3 messages the RDE should stop all rib dump runners and more
> > importantly flush the peer imsg queue. All those queued messages no longer
> > matter.
> > 
> > IMSG_SESSION_UP could still be handled through the queue since it is the
> > first message sent after IMSG_SESSION_ADD. But I decided to also handle it
> > immediatly.
> > 
> > The graceful restart imsgs IMSG_SESSION_FLUSH and IMSG_SESSION_RESTARTED
> > are there to clean up stale routes (either after a timeout or a successful
> > restart). Again it is best to handle those messages immediately and clean
> > up the RIB (even though a backlog of UPDATES may be present).
> > In this case the imsg queue should not be flushed since there may be an
> > active connection. Note: IMSG_SESSION_RESTARTED is sent based on an EoR
> > marker from an IMSG_UPDATE (so it is delayed and in order because of that).
> > 
> > With this only IMSG_UPDATE and IMSG_REFRESH are queued.
> > 
> > I'm running this diff on the route collector and some test systems and it
> > seems to solve the issue.
> 
> This reads fine to me. One comment:
> 
> > Index: rde_peer.c
> > ===================================================================
> > RCS file: /cvs/src/usr.sbin/bgpd/rde_peer.c,v
> > retrieving revision 1.21
> > diff -u -p -r1.21 rde_peer.c
> > --- rde_peer.c      17 Aug 2022 15:15:26 -0000      1.21
> > +++ rde_peer.c      26 Aug 2022 08:24:49 -0000
> > @@ -194,7 +194,7 @@ peer_add(uint32_t id, struct peer_config
> >  
> >     if ((peer = peer_get(id))) {
> >             memcpy(&peer->conf, p_conf, sizeof(struct peer_config));
> > -           return (NULL);
> > +           return (peer);
> 
> In peer_init() there is a fatalx() that ensures that PEER_ID_SELF is
> not present in the peer list. This no longer works with this change.
> I can't tell if this sanity check is important or whether it can go.
 
Hmm. I had to add it because I use the return value now in the
IMSG_SESSION_ADD case to track the rde_eval_all flag.
I think the check in peer_init() can go it should be impossible to trigger
this case there.


-- 
:wq Claudio

Re: bgpd fix peer signaling bug for busy systems

Reply via email to