Re: bgpd decision process and bad dmetric

2022-07-16 Thread Claudio Jeker
On Sat, Jul 16, 2022 at 01:51:58PM +0200, Theo Buehler wrote:
> On Sat, Jul 16, 2022 at 12:41:07PM +0200, Claudio Jeker wrote:
> > I deployed bgpd on one of more core routers and triggered the fatal
> > "bad dmetric in decision process" from time to time.
> > 
> > I realized after a longer debugging session that one reason this happens
> > is when nexthops become valid. The state change affects all prefixes at
> > once but then they are reevaluated one by one (see prefix_evaluate_all()
> > which is called by nexthop_runner()).
> > 
> > I currently have no good solution for this issue. I think the problem is
> > that invalid prefixes are not sorted when added. There may be a similar
> > issue when flipping a rib from no-evaluate to evaluate in the reload code.
> > 
> > For now neuter the fatalx and convert it to a log_debug() until I figured
> > out a proper fix.
> 
> ok.
> 
> Now that the scope_id is part of struct bgpd_addr, the XXX in
> pt_getaddr() can go.

Not yet. The code in rde_prefix.c itself is not ready to handle scope_id.
It is not really needed there since it makes no sense to put link-local
prefixes into BGP. I need to look what the the next step is on the
scope_id quest.
 
> > -- 
> > :wq Claudio
> > 
> > Index: rde_decide.c
> > ===
> > RCS file: /cvs/src/usr.sbin/bgpd/rde_decide.c,v
> > retrieving revision 1.95
> > diff -u -p -r1.95 rde_decide.c
> > --- rde_decide.c11 Jul 2022 16:46:41 -  1.95
> > +++ rde_decide.c16 Jul 2022 10:28:19 -
> > @@ -331,8 +331,12 @@ prefix_set_dmetric(struct prefix *pp, st
> > PREFIX_DMETRIC_BEST : PREFIX_DMETRIC_INVALID;
> > else
> > np->dmetric = prefix_cmp(pp, np, &testall);
> > -   if (np->dmetric < 0)
> > -   fatalx("bad dmetric in decision process");
> > +   if (np->dmetric < 0) {
> > +   struct bgpd_addr addr;
> > +   pt_getaddr(np->pt, &addr);
> > +   log_debug("bad dmetric in decision process: %s/%u",
> > +   log_addr(&addr), np->pt->prefixlen);
> > +   }
> > }
> >  }
> >  
> > 
> 

-- 
:wq Claudio



Re: bgpd decision process and bad dmetric

2022-07-16 Thread Theo Buehler
On Sat, Jul 16, 2022 at 12:41:07PM +0200, Claudio Jeker wrote:
> I deployed bgpd on one of more core routers and triggered the fatal
> "bad dmetric in decision process" from time to time.
> 
> I realized after a longer debugging session that one reason this happens
> is when nexthops become valid. The state change affects all prefixes at
> once but then they are reevaluated one by one (see prefix_evaluate_all()
> which is called by nexthop_runner()).
> 
> I currently have no good solution for this issue. I think the problem is
> that invalid prefixes are not sorted when added. There may be a similar
> issue when flipping a rib from no-evaluate to evaluate in the reload code.
> 
> For now neuter the fatalx and convert it to a log_debug() until I figured
> out a proper fix.

ok.

Now that the scope_id is part of struct bgpd_addr, the XXX in
pt_getaddr() can go.

> -- 
> :wq Claudio
> 
> Index: rde_decide.c
> ===
> RCS file: /cvs/src/usr.sbin/bgpd/rde_decide.c,v
> retrieving revision 1.95
> diff -u -p -r1.95 rde_decide.c
> --- rde_decide.c  11 Jul 2022 16:46:41 -  1.95
> +++ rde_decide.c  16 Jul 2022 10:28:19 -
> @@ -331,8 +331,12 @@ prefix_set_dmetric(struct prefix *pp, st
>   PREFIX_DMETRIC_BEST : PREFIX_DMETRIC_INVALID;
>   else
>   np->dmetric = prefix_cmp(pp, np, &testall);
> - if (np->dmetric < 0)
> - fatalx("bad dmetric in decision process");
> + if (np->dmetric < 0) {
> + struct bgpd_addr addr;
> + pt_getaddr(np->pt, &addr);
> + log_debug("bad dmetric in decision process: %s/%u",
> + log_addr(&addr), np->pt->prefixlen);
> + }
>   }
>  }
>  
> 



bgpd decision process and bad dmetric

2022-07-16 Thread Claudio Jeker
I deployed bgpd on one of more core routers and triggered the fatal
"bad dmetric in decision process" from time to time.

I realized after a longer debugging session that one reason this happens
is when nexthops become valid. The state change affects all prefixes at
once but then they are reevaluated one by one (see prefix_evaluate_all()
which is called by nexthop_runner()).

I currently have no good solution for this issue. I think the problem is
that invalid prefixes are not sorted when added. There may be a similar
issue when flipping a rib from no-evaluate to evaluate in the reload code.

For now neuter the fatalx and convert it to a log_debug() until I figured
out a proper fix.
-- 
:wq Claudio

Index: rde_decide.c
===
RCS file: /cvs/src/usr.sbin/bgpd/rde_decide.c,v
retrieving revision 1.95
diff -u -p -r1.95 rde_decide.c
--- rde_decide.c11 Jul 2022 16:46:41 -  1.95
+++ rde_decide.c16 Jul 2022 10:28:19 -
@@ -331,8 +331,12 @@ prefix_set_dmetric(struct prefix *pp, st
PREFIX_DMETRIC_BEST : PREFIX_DMETRIC_INVALID;
else
np->dmetric = prefix_cmp(pp, np, &testall);
-   if (np->dmetric < 0)
-   fatalx("bad dmetric in decision process");
+   if (np->dmetric < 0) {
+   struct bgpd_addr addr;
+   pt_getaddr(np->pt, &addr);
+   log_debug("bad dmetric in decision process: %s/%u",
+   log_addr(&addr), np->pt->prefixlen);
+   }
}
 }