Re: openBGPd crashes in 6.2 and 6.3: "a politician in the decision process"

2018-09-10 Thread Daniel Stocker
Hi,

> We haven't got any dumps from when the problem happened. Would there
> still be a point now that the RS is stable?

i spoke with the Guys which caused the crash (https://communityrack.org) and 
asked them what they did exactly.

they said that they changed their AS Nr on their frr 5 router. after a short 
shut / no shut, the famous „fatal in RDE“ appeared.

Important detail. it happened after restarting the ipv6 session. Seems like 
ipv4 didn’t trigger the bug.

hope that helps.

cheers
stoege



Re: openBGPd crashes in 6.2 and 6.3: "a politician in the decision process"

2018-08-24 Thread Remi Locherer
Hi Pietro,

On Thu, Aug 23, 2018 at 10:05:30AM +0200, Pietro Stäheli wrote:
> Hi,
> 
> openBGPd is running at an internet exchange, two openBSD route servers
> (rs3 on openBSD 6.3 and rs4 on openBSD 6.2, both virtual machines on
> different hypervisors in different locations) connect with peering
> customers.
> 
> We've experienced crashes in openBGPd twice in the past two weeks. Both
> times with the same error message: "fatal in RDE: Uh, oh a politician in
> the decision process". These error messages are logged on both route
> servers right before they crash within seconds of each other.
> 
> The route servers had been running quite reliably for a long time before
> the recent incidents. The daemon can then be restarted without an issue.
> CPU usage prior to the crash is minimal (<5%).
> 
> In the minutes before the crash we're seeing error messages like the
> following in daemon.log:
> 
> bgpd[23099]: no such peer: id=4294967037
> 
> 
> Sample of logs just before the crash:
> 
> Aug 22 15:38:58 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 91.206.52.170
> AS6939: update 81.163.124.0/24 via 91.206.52.170
> Aug 22 15:38:58 rs3 bgpd[23099]: no such peer: id=4294967037
> Aug 22 15:38:59 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::11
> AS31424: withdraw 2a01:6a8::/32
> Aug 22 15:38:59 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::bf
> AS33891: withdraw 2a01:6a8::/32
> Aug 22 15:38:59 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::aa
> AS6939: withdraw 2804:364c::/33
> Aug 22 15:38:59 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::aa
> AS6939: withdraw 2804:364c:8000::/33
> Aug 22 15:38:59 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::11
> AS31424: update 2a01:6a8::/32 via 2001:7f8:24::11
> Aug 22 15:38:59 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::aa
> AS6939: update 2804:364c::/33 via 2001:7f8:24::aa
> Aug 22 15:38:59 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::aa
> AS6939: update 2804:364c:8000::/33 via 2001:7f8:24::aa
> Aug 22 15:38:59 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::bf
> AS33891: update 2a01:6a8::/32 via 2001:7f8:24::bf
> Aug 22 15:39:00 rs3 bgpd[23099]: Connection attempt from neighbor
> 91.206.52.139 while session is in state Idle
> Aug 22 15:39:01 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 91.206.52.96
> AS31042: update 185.64.172.0/24 via 91.206.52.96
> Aug 22 15:39:01 rs3 bgpd[43763]: fatal in RDE: Uh, oh a politician in
> the decision process
> Aug 22 15:39:01 rs3 bgpd[99961]: peer closed imsg connection
> Aug 22 15:39:01 rs3 bgpd[99961]: main: Lost connection to RDE
> Aug 22 15:39:01 rs3 bgpd[23099]: peer closed imsg connection
> Aug 22 15:39:01 rs3 bgpd[23099]: SE: Lost connection to parent
> 
> 
> Logs just before the "no such peer" messages appear:
> 
> Aug 22 15:36:43 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 91.206.52.54
> AS34554: update 80.75.112.0/20 via 91.206.52.54
> Aug 22 15:36:43 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::36
> AS34554: update 2a01:6a8::/32 via 2001:7f8:24::36
> Aug 22 15:36:44 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::bf
> AS33891: update 2a0d:8d80::/32 via 2001:7f8:24::bf
> Aug 22 15:36:47 rs3 bgpd[23099]: neighbor 91.206.52.96: graceful restart
> of IPv4 unicast, keeping routes
> Aug 22 15:36:47 rs3 bgpd[23099]: neighbor 91.206.52.96: state change
> Established -> Idle, reason: Connection closed
> Aug 22 15:36:47 rs3 bgpd[23099]: neighbor 91.206.52.96: removed
> Aug 22 15:36:49 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::11
> AS31424: withdraw 2a01:6a8::/32
> Aug 22 15:36:49 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::bf
> AS33891: withdraw 2a01:6a8::/32
> Aug 22 15:36:49 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::11
> AS31424: update 2a01:6a8::/32 via 2001:7f8:24::11
> Aug 22 15:36:49 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::bf
> AS33891: update 2a01:6a8::/32 via 2001:7f8:24::bf
> Aug 22 15:36:54 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 2001:7f8:24::bf
> AS33891: update 2a0d:8d80::/32 via 2001:7f8:24::bf
> Aug 22 15:36:55 rs3 bgpd[43763]: Rib Loc-RIB: neighbor 91.206.52.170
> AS6939: update 197.249.160.0/19 via 91.206.52.170
> Aug 22 15:36:55 rs3 bgpd[23099]: no such peer: id=4294967037
> 
> 
> 
> I haven't found much about the error message apart from this mailing
> list thread: https://www.mail-archive.com/misc@openbsd.org/msg04565.html
> 
> The thread suggests that invoking bgpctl may cause the failure. 'bgpctl
> show' is invoked every few minutes through our monitoring system to
> check on the status of peer connections.
> 
> Can anybody point me to a possible cause or troubleshooting regarding
> this issue? Could a misconfigured/broken peer be the cause?  Has anybody
> dealt with a similar problem?
> 
> I can provide bgpd.conf and full logs of both incidents if necessary.

I move this over to bugs@ which feels more appropriate to me. Chances that
a bgpd developer sees it is higher.

Somehow there where two identical paths and bgpd could not figure out