Re: bgpd async nexthop update loop

2019-06-19 Thread Sebastian Benoit
ok!

Claudio Jeker(cje...@diehard.n-r-g.com) on 2019.06.17 23:08:37 +0200:
> On Mon, Jun 17, 2019 at 10:20:57PM +0200, Sebastian Benoit wrote:
> > Claudio Jeker(cje...@diehard.n-r-g.com) on 2019.06.17 21:34:30 +0200:
> > > Hi,
> > > 
> > > Now that the community rewrite is in, here is another diff to make bgpd
> > > more awesome. There was one loop left that did not run asynchronously
> > > with respect to the main event loop. This loop was in nexthop_update();
> > > a route server or a big route reflector would be hit hard by it since
> > > it sends out many updates when a nexthop becomes valid. I have seen
> > > cases where the RDE was busy for 15 minutes in nexthop_update().
> > > 
> > > This diff changes nexthop_update() to do this work asynchronously and
> > > therefore returns to the main event loop much more often.
> > > Nexthops now have a prefix pointer next_prefix which is used for looping
> > > through all prefixes. The for loop is broken up into steps of
> > > RDE_RUNNER_ROUNDS prefixes, so each call should return quickly.
> > > 
> > > Additionally, prefixes belonging to a RIB that is not evaluated (no
> > > decision process running) are no longer linked into this nexthop list,
> > > which should also shorten the loop a bit (at least Adj-RIB-In and
> > > Adj-RIB-Out are now skipped). To make this work correctly, the linking
> > > and unlinking of prefixes has been adjusted to always happen in the
> > > right order (nexthop last on link and first on unlink).
> > 
> > reads ok, one comment.
> 
> Fixed.
>  
> > > When testing please watch out for this message:
> > > prefix_updateall: prefix with F_RIB_NOEVALUATE hit
> > > This should never happen.
> > > 
> > > I'm running this on a few routers with good success. Please check if you see
> > > a) worse convergence time (updates propagating much slower)
> > > b) better responsiveness (to e.g. bgpctl commands)
> > 
> > I see things much faster, actually too fast; I can't meaningfully test
> > bgpctl responsiveness.
> 
> You just need around 500 peers which are affected by a big nexthop update :)
>  
> > > There is some potential for tuning but I would prefer to do this at a
> > > later stage.
> > 
> > sure. one question about that: is the TAILQ_INSERT_TAIL/TAILQ_INSERT_HEAD
> > optimal?
> 
> The idea is to make this like a ring. Start with the latest arrival
> first and then process one after the other. This makes sure that every
> entry gets about the same amount of processing time. The
> TAILQ_INSERT_HEAD in nexthop_update() will reduce initial latency for
> nexthops with only a few prefixes, since they will be done in a single
> nexthop_runner() call.
> 
> There are also additional interactions with the other runners
> (imsg processing, rib_dump_runner and rde_update_queue_runner) which I
> need to profile to understand better. It looks like
> rde_dispatch_imsg_session() runs ahead and produces a lot of work that
> the other runners cannot absorb quickly enough. At least that is the
> case in the big route server setup that I'm working with.
> 
> > > -- 
> > > :wq Claudio
> > > 
> > > Index: rde.c
> > > ===
> > > RCS file: /cvs/src/usr.sbin/bgpd/rde.c,v
> > > retrieving revision 1.470
> > > diff -u -p -r1.470 rde.c
> > > --- rde.c 17 Jun 2019 13:35:43 -  1.470
> > > +++ rde.c 17 Jun 2019 13:56:16 -
> > > @@ -256,9 +256,6 @@ rde_main(int debug, int verbose)
> > >   set_pollfd(&pfd[PFD_PIPE_SESSION], ibuf_se);
> > >   set_pollfd(&pfd[PFD_PIPE_SESSION_CTL], ibuf_se_ctl);
> > >  
> > > - if (rib_dump_pending() || rde_update_queue_pending())
> > > - timeout = 0;
> > > -
> > >   i = PFD_PIPE_COUNT;
> > >   for (mctx = LIST_FIRST(&rde_mrts); mctx != 0; mctx = xmctx) {
> > >   xmctx = LIST_NEXT(mctx, entry);
> > > @@ -275,6 +272,10 @@ rde_main(int debug, int verbose)
> > >   }
> > >   }
> > >  
> > > + if (rib_dump_pending() || rde_update_queue_pending() ||
> > > + nexthop_pending())
> > > + timeout = 0;
> > > +
> > >   if (poll(pfd, i, timeout) == -1) {
> > >   if (errno != EINTR)
> > >   fatal("poll error");
> > > @@ -311,12 +312,13 @@ rde_main(int debug, int verbose)
> > >   mctx = LIST_NEXT(mctx, entry);
> > >   }
> > >  
> > > + rib_dump_runner();
> > > + nexthop_runner();
> > >   if (ibuf_se && ibuf_se->w.queued < SESS_MSG_HIGH_MARK) {
> > >   rde_update_queue_runner();
> > >   for (aid = AID_INET6; aid < AID_MAX; aid++)
> > >   rde_update6_queue_runner(aid);
> > >   }
> > > - rib_dump_runner();
> > >   }
> > >  
> > >   /* do not clean up on shutdown on production, it takes ages. */
> > > Index: rde.h
> > > ===
> > > RCS file: 

Re: bgpd async nexthop update loop

2019-06-17 Thread Claudio Jeker
On Mon, Jun 17, 2019 at 10:20:57PM +0200, Sebastian Benoit wrote:
> Claudio Jeker(cje...@diehard.n-r-g.com) on 2019.06.17 21:34:30 +0200:
> > Hi,
> > 
> > Now that the community rewrite is in, here is another diff to make bgpd
> > more awesome. There was one loop left that did not run asynchronously
> > with respect to the main event loop. This loop was in nexthop_update();
> > a route server or a big route reflector would be hit hard by it since
> > it sends out many updates when a nexthop becomes valid. I have seen
> > cases where the RDE was busy for 15 minutes in nexthop_update().
> > 
> > This diff changes nexthop_update() to do this work asynchronously and
> > therefore returns to the main event loop much more often.
> > Nexthops now have a prefix pointer next_prefix which is used for looping
> > through all prefixes. The for loop is broken up into steps of
> > RDE_RUNNER_ROUNDS prefixes, so each call should return quickly.
> > 
> > Additionally, prefixes belonging to a RIB that is not evaluated (no
> > decision process running) are no longer linked into this nexthop list,
> > which should also shorten the loop a bit (at least Adj-RIB-In and
> > Adj-RIB-Out are now skipped). To make this work correctly, the linking
> > and unlinking of prefixes has been adjusted to always happen in the
> > right order (nexthop last on link and first on unlink).
> 
> reads ok, one comment.

Fixed.
 
> > When testing please watch out for this message:
> > prefix_updateall: prefix with F_RIB_NOEVALUATE hit
> > This should never happen.
> > 
> > I'm running this on a few routers with good success. Please check if you see
> > a) worse convergence time (updates propagating much slower)
> > b) better responsiveness (to e.g. bgpctl commands)
> 
> I see things much faster, actually too fast; I can't meaningfully test
> bgpctl responsiveness.

You just need around 500 peers which are affected by a big nexthop update :)
 
> > There is some potential for tuning but I would prefer to do this at a
> > later stage.
> 
> sure. one question about that: is the TAILQ_INSERT_TAIL/TAILQ_INSERT_HEAD
> optimal?

The idea is to make this like a ring. Start with the latest arrival
first and then process one after the other. This makes sure that every
entry gets about the same amount of processing time. The
TAILQ_INSERT_HEAD in nexthop_update() will reduce initial latency for
nexthops with only a few prefixes, since they will be done in a single
nexthop_runner() call.
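
A minimal sketch of that queueing discipline, assuming the runner requeues
an unfinished nexthop at the tail (the relevant rde_rib.c hunk is not fully
quoted here); the queue type and the work_left counter are stand-ins, not
the real bgpd structures:

#include <sys/queue.h>

TAILQ_HEAD(nexthop_queue, nexthop);

struct nexthop {
	TAILQ_ENTRY(nexthop)	 runner_l;
	int			 work_left;	/* remaining prefixes (stand-in) */
};

static struct nexthop_queue nexthop_runners =
    TAILQ_HEAD_INITIALIZER(nexthop_runners);

/* nexthop_update() side: a freshly changed nexthop goes to the head, so a
 * nexthop with only a few prefixes finishes in its first runner slice. */
void
runner_enqueue(struct nexthop *nh)
{
	TAILQ_INSERT_HEAD(&nexthop_runners, nh, runner_l);
}

/* nexthop_runner() side: take the head, do one bounded slice of work, and
 * requeue at the tail if work remains, so the queue behaves like a ring
 * and every entry gets a similar share of processing time. */
void
runner_slice(void)
{
	struct nexthop	*nh;

	if ((nh = TAILQ_FIRST(&nexthop_runners)) == NULL)
		return;
	TAILQ_REMOVE(&nexthop_runners, nh, runner_l);

	nh->work_left--;			/* stand-in for the real work */
	if (nh->work_left > 0)
		TAILQ_INSERT_TAIL(&nexthop_runners, nh, runner_l);
}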

There are also additional interactions with the other runners
(imsg processing, rib_dump_runner and rde_update_queue_runner) which I
need to profile to understand better. It looks like
rde_dispatch_imsg_session() runs ahead and produces a lot of work that
the other runners cannot absorb quickly enough. At least that is the
case in the big route server setup that I'm working with.

> > -- 
> > :wq Claudio
> > 
> > Index: rde.c
> > ===
> > RCS file: /cvs/src/usr.sbin/bgpd/rde.c,v
> > retrieving revision 1.470
> > diff -u -p -r1.470 rde.c
> > --- rde.c   17 Jun 2019 13:35:43 -  1.470
> > +++ rde.c   17 Jun 2019 13:56:16 -
> > @@ -256,9 +256,6 @@ rde_main(int debug, int verbose)
> > set_pollfd(&pfd[PFD_PIPE_SESSION], ibuf_se);
> > set_pollfd(&pfd[PFD_PIPE_SESSION_CTL], ibuf_se_ctl);
> >  
> > -   if (rib_dump_pending() || rde_update_queue_pending())
> > -   timeout = 0;
> > -
> > i = PFD_PIPE_COUNT;
> > for (mctx = LIST_FIRST(&rde_mrts); mctx != 0; mctx = xmctx) {
> > xmctx = LIST_NEXT(mctx, entry);
> > @@ -275,6 +272,10 @@ rde_main(int debug, int verbose)
> > }
> > }
> >  
> > +   if (rib_dump_pending() || rde_update_queue_pending() ||
> > +   nexthop_pending())
> > +   timeout = 0;
> > +
> > if (poll(pfd, i, timeout) == -1) {
> > if (errno != EINTR)
> > fatal("poll error");
> > @@ -311,12 +312,13 @@ rde_main(int debug, int verbose)
> > mctx = LIST_NEXT(mctx, entry);
> > }
> >  
> > +   rib_dump_runner();
> > +   nexthop_runner();
> > if (ibuf_se && ibuf_se->w.queued < SESS_MSG_HIGH_MARK) {
> > rde_update_queue_runner();
> > for (aid = AID_INET6; aid < AID_MAX; aid++)
> > rde_update6_queue_runner(aid);
> > }
> > -   rib_dump_runner();
> > }
> >  
> > /* do not clean up on shutdown on production, it takes ages. */
> > Index: rde.h
> > ===
> > RCS file: /cvs/src/usr.sbin/bgpd/rde.h,v
> > retrieving revision 1.214
> > diff -u -p -r1.214 rde.h
> > --- rde.h   17 Jun 2019 13:35:43 -  1.214
> > +++ rde.h   17 Jun 2019 13:56:16 -
> > @@ -228,12 +228,15 @@ struct rde_aspath {
> >  enum nexthop_state {
> 

Re: bgpd async nexthop update loop

2019-06-17 Thread Sebastian Benoit
Claudio Jeker(cje...@diehard.n-r-g.com) on 2019.06.17 21:34:30 +0200:
> Hi,
> 
> Now that the community rewrite is in, here is another diff to make bgpd
> more awesome. There was one loop left that did not run asynchronously
> with respect to the main event loop. This loop was in nexthop_update();
> a route server or a big route reflector would be hit hard by it since
> it sends out many updates when a nexthop becomes valid. I have seen
> cases where the RDE was busy for 15 minutes in nexthop_update().
> 
> This diff changes nexthop_update() to do this work asynchronously and
> therefore returns to the main event loop much more often.
> Nexthops now have a prefix pointer next_prefix which is used for looping
> through all prefixes. The for loop is broken up into steps of
> RDE_RUNNER_ROUNDS prefixes, so each call should return quickly.
> 
> Additionally, prefixes belonging to a RIB that is not evaluated (no
> decision process running) are no longer linked into this nexthop list,
> which should also shorten the loop a bit (at least Adj-RIB-In and
> Adj-RIB-Out are now skipped). To make this work correctly, the linking
> and unlinking of prefixes has been adjusted to always happen in the
> right order (nexthop last on link and first on unlink).

reads ok, one comment.

> When testing please watch out for this message:
> prefix_updateall: prefix with F_RIB_NOEVALUATE hit
> This should never happen.
> 
> I'm running this on a few routers with good success. Please check if you see
> a) worse convergence time (updates propagating much slower)
> b) better responsiveness (to e.g. bgpctl commands)

I see things much faster, actually too fast; I can't meaningfully test
bgpctl responsiveness.

> There is some potential for tuning but I would prefer to do this at a
> later stage.

sure. one question about that: is the TAILQ_INSERT_TAIL/TAILQ_INSERT_HEAD
optimal?

> -- 
> :wq Claudio
> 
> Index: rde.c
> ===
> RCS file: /cvs/src/usr.sbin/bgpd/rde.c,v
> retrieving revision 1.470
> diff -u -p -r1.470 rde.c
> --- rde.c 17 Jun 2019 13:35:43 -  1.470
> +++ rde.c 17 Jun 2019 13:56:16 -
> @@ -256,9 +256,6 @@ rde_main(int debug, int verbose)
>   set_pollfd(&pfd[PFD_PIPE_SESSION], ibuf_se);
>   set_pollfd(&pfd[PFD_PIPE_SESSION_CTL], ibuf_se_ctl);
>  
> - if (rib_dump_pending() || rde_update_queue_pending())
> - timeout = 0;
> -
>   i = PFD_PIPE_COUNT;
>   for (mctx = LIST_FIRST(&rde_mrts); mctx != 0; mctx = xmctx) {
>   xmctx = LIST_NEXT(mctx, entry);
> @@ -275,6 +272,10 @@ rde_main(int debug, int verbose)
>   }
>   }
>  
> + if (rib_dump_pending() || rde_update_queue_pending() ||
> + nexthop_pending())
> + timeout = 0;
> +
>   if (poll(pfd, i, timeout) == -1) {
>   if (errno != EINTR)
>   fatal("poll error");
> @@ -311,12 +312,13 @@ rde_main(int debug, int verbose)
>   mctx = LIST_NEXT(mctx, entry);
>   }
>  
> + rib_dump_runner();
> + nexthop_runner();
>   if (ibuf_se && ibuf_se->w.queued < SESS_MSG_HIGH_MARK) {
>   rde_update_queue_runner();
>   for (aid = AID_INET6; aid < AID_MAX; aid++)
>   rde_update6_queue_runner(aid);
>   }
> - rib_dump_runner();
>   }
>  
>   /* do not clean up on shutdown on production, it takes ages. */
> Index: rde.h
> ===
> RCS file: /cvs/src/usr.sbin/bgpd/rde.h,v
> retrieving revision 1.214
> diff -u -p -r1.214 rde.h
> --- rde.h 17 Jun 2019 13:35:43 -  1.214
> +++ rde.h 17 Jun 2019 13:56:16 -
> @@ -228,12 +228,15 @@ struct rde_aspath {
>  enum nexthop_state {
>   NEXTHOP_LOOKUP,
>   NEXTHOP_UNREACH,
> - NEXTHOP_REACH
> + NEXTHOP_REACH,
> + NEXTHOP_FLAPPED
>  };
>  
>  struct nexthop {
>   LIST_ENTRY(nexthop) nexthop_l;
> +	TAILQ_ENTRY(nexthop)	runner_l;
>   struct prefix_list  prefix_h;
> + struct prefix   *next_prefix;
> 	struct bgpd_addr	exit_nexthop;
> 	struct bgpd_addr	true_nexthop;
> 	struct bgpd_addr	nexthop_net;
> @@ -246,6 +249,7 @@ struct nexthop {
>  #endif
>   int refcnt;
>   enum nexthop_state  state;
> + enum nexthop_state  oldstate;
> 	u_int8_t		nexthop_netlen;
> 	u_int8_t		flags;
> #define NEXTHOP_CONNECTED	0x01
> @@ -593,6 +597,8 @@ prefix_vstate(struct prefix *p)
>  
>  void  nexthop_init(u_int32_t);
>  void  nexthop_shutdown(void);
> +int   nexthop_pending(void);
> +void  nexthop_runner(void);
>  void  nexthop_modify(struct nexthop *, enum 

bgpd async nexthop update loop

2019-06-17 Thread Claudio Jeker
Hi,

Now that the community rewrite is in, here is another diff to make bgpd
more awesome. There was one loop left that did not run asynchronously
with respect to the main event loop. This loop was in nexthop_update();
a route server or a big route reflector would be hit hard by it since
it sends out many updates when a nexthop becomes valid. I have seen
cases where the RDE was busy for 15 minutes in nexthop_update().

This diff changes nexthop_update() to do this work asynchronously and
therefore returns to the main event loop much more often.
Nexthops now have a prefix pointer next_prefix which is used for looping
through all prefixes. The for loop is broken up into steps of
RDE_RUNNER_ROUNDS prefixes, so each call should return quickly.
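
To illustrate the mechanism, here is a small, self-contained C sketch of
such a bounded runner that resumes from a saved next_prefix cursor. It is
only a sketch with stub types; RUNNER_ROUNDS, do_prefix_work() and
nexthop_runner_slice() are illustrative names, not the actual rde_rib.c
code (which is in the diff below):

#include <stdio.h>
#include <sys/queue.h>

#define RUNNER_ROUNDS	2	/* prefixes per slice; tiny demo value, the real constant is larger */

TAILQ_HEAD(prefix_list, prefix);

struct prefix {
	TAILQ_ENTRY(prefix)	 entry;
	int			 id;
};

struct nexthop {
	struct prefix_list	 prefix_h;	/* prefixes using this nexthop */
	struct prefix		*next_prefix;	/* where the runner resumes */
};

/* stand-in for the real per-prefix work (re-evaluation, update generation) */
static void
do_prefix_work(struct prefix *p)
{
	printf("re-evaluate prefix %d\n", p->id);
}

/* process at most RUNNER_ROUNDS prefixes, then hand control back */
static int
nexthop_runner_slice(struct nexthop *nh)
{
	struct prefix	*p = nh->next_prefix;
	int		 i;

	for (i = 0; i < RUNNER_ROUNDS && p != NULL; i++) {
		do_prefix_work(p);
		p = TAILQ_NEXT(p, entry);
	}
	nh->next_prefix = p;		/* remember where to continue */
	return (p != NULL);		/* 1 while work remains */
}

int
main(void)
{
	struct nexthop	 nh;
	struct prefix	 p[5];
	int		 i;

	TAILQ_INIT(&nh.prefix_h);
	for (i = 0; i < 5; i++) {
		p[i].id = i;
		TAILQ_INSERT_TAIL(&nh.prefix_h, &p[i], entry);
	}
	nh.next_prefix = TAILQ_FIRST(&nh.prefix_h);

	/* in bgpd each slice would run between returns to the poll() loop */
	while (nexthop_runner_slice(&nh))
		;
	return (0);
}

In bgpd the equivalent work happens in nexthop_runner(), driven from the
rde_main() poll loop as shown in the rde.c hunk below.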

Additionally, prefixes belonging to a RIB that is not evaluated (no
decision process running) are no longer linked into this nexthop list,
which should also shorten the loop a bit (at least Adj-RIB-In and
Adj-RIB-Out are now skipped). To make this work correctly, the linking
and unlinking of prefixes has been adjusted to always happen in the
right order (nexthop last on link and first on unlink).
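
As a rough illustration of that ordering (rib_tree_insert() and
rib_tree_remove() are purely hypothetical stand-ins here; only
nexthop_link()/nexthop_unlink() exist in the diff, and they are stubbed
out so this compiles on its own):

/* sketch of the invariant: a prefix is only visible on the nexthop list
 * while it is fully linked into the RIB */
struct prefix;

static void rib_tree_insert(struct prefix *p) { (void)p; }	/* hypothetical */
static void rib_tree_remove(struct prefix *p) { (void)p; }	/* hypothetical */
static void nexthop_link(struct prefix *p) { (void)p; }		/* stub */
static void nexthop_unlink(struct prefix *p) { (void)p; }	/* stub */

static void
prefix_link_order(struct prefix *p)
{
	rib_tree_insert(p);	/* make the prefix valid in the RIB first */
	nexthop_link(p);	/* nexthop last on link */
}

static void
prefix_unlink_order(struct prefix *p)
{
	nexthop_unlink(p);	/* nexthop first on unlink */
	rib_tree_remove(p);	/* then take it out of the RIB */
}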

When testing please watch out for this message:
prefix_updateall: prefix with F_RIB_NOEVALUATE hit
This should never happen.

I'm running this on a few routers with good success. Please check if you see
a) worse convergence time (updates propagating much slower)
b) better responsiveness (to e.g. bgpctl commands)

There is some potential for tuning but I would prefer to do this at a
later stage.
-- 
:wq Claudio

Index: rde.c
===
RCS file: /cvs/src/usr.sbin/bgpd/rde.c,v
retrieving revision 1.470
diff -u -p -r1.470 rde.c
--- rde.c   17 Jun 2019 13:35:43 -  1.470
+++ rde.c   17 Jun 2019 13:56:16 -
@@ -256,9 +256,6 @@ rde_main(int debug, int verbose)
	set_pollfd(&pfd[PFD_PIPE_SESSION], ibuf_se);
	set_pollfd(&pfd[PFD_PIPE_SESSION_CTL], ibuf_se_ctl);
 
-   if (rib_dump_pending() || rde_update_queue_pending())
-   timeout = 0;
-
i = PFD_PIPE_COUNT;
	for (mctx = LIST_FIRST(&rde_mrts); mctx != 0; mctx = xmctx) {
xmctx = LIST_NEXT(mctx, entry);
@@ -275,6 +272,10 @@ rde_main(int debug, int verbose)
}
}
 
+   if (rib_dump_pending() || rde_update_queue_pending() ||
+   nexthop_pending())
+   timeout = 0;
+
if (poll(pfd, i, timeout) == -1) {
if (errno != EINTR)
fatal("poll error");
@@ -311,12 +312,13 @@ rde_main(int debug, int verbose)
mctx = LIST_NEXT(mctx, entry);
}
 
+   rib_dump_runner();
+   nexthop_runner();
if (ibuf_se && ibuf_se->w.queued < SESS_MSG_HIGH_MARK) {
rde_update_queue_runner();
for (aid = AID_INET6; aid < AID_MAX; aid++)
rde_update6_queue_runner(aid);
}
-   rib_dump_runner();
}
 
/* do not clean up on shutdown on production, it takes ages. */
Index: rde.h
===
RCS file: /cvs/src/usr.sbin/bgpd/rde.h,v
retrieving revision 1.214
diff -u -p -r1.214 rde.h
--- rde.h   17 Jun 2019 13:35:43 -  1.214
+++ rde.h   17 Jun 2019 13:56:16 -
@@ -228,12 +228,15 @@ struct rde_aspath {
 enum nexthop_state {
NEXTHOP_LOOKUP,
NEXTHOP_UNREACH,
-   NEXTHOP_REACH
+   NEXTHOP_REACH,
+   NEXTHOP_FLAPPED
 };
 
 struct nexthop {
LIST_ENTRY(nexthop) nexthop_l;
+	TAILQ_ENTRY(nexthop)	runner_l;
struct prefix_list  prefix_h;
+   struct prefix   *next_prefix;
	struct bgpd_addr	exit_nexthop;
	struct bgpd_addr	true_nexthop;
	struct bgpd_addr	nexthop_net;
@@ -246,6 +249,7 @@ struct nexthop {
 #endif
int refcnt;
enum nexthop_state  state;
+   enum nexthop_state  oldstate;
	u_int8_t		nexthop_netlen;
	u_int8_t		flags;
 #define NEXTHOP_CONNECTED  0x01
@@ -593,6 +597,8 @@ prefix_vstate(struct prefix *p)
 
void	nexthop_init(u_int32_t);
void	nexthop_shutdown(void);
+int	nexthop_pending(void);
+void	nexthop_runner(void);
void	nexthop_modify(struct nexthop *, enum action_types, u_int8_t,
	    struct nexthop **, u_int8_t *);
void	nexthop_link(struct prefix *);
Index: rde_rib.c
===
RCS file: /cvs/src/usr.sbin/bgpd/rde_rib.c,v
retrieving revision 1.191
diff -u -p -r1.191 rde_rib.c
--- rde_rib.c   17 Jun 2019 11:02:20 -  1.191
+++ rde_rib.c   17 Jun 2019