Re: bgpd dying repeatedly on latest snapshot

Sebastien Marie Tue, 06 Oct 2015 02:48:53 -0700

On Tue, Oct 06, 2015 at 08:48:07AM +0200, Eric Ripa wrote:
> Hi,
> 
> I upgraded my private mail host today to the latest 5.8-current snapshot (5th 
> of october). Now bgpd keeps crashing. The previous snapshot (from around 14th 
> of september) was working fine.
> 
> I have a quite simple setup with only the reference configuration for spamd, 
> using http://bgp-spamd.net
> 
> Below is the output from "bgpd -d".
> 
> Could this be related to recent tame changes?


Yes it could.

I will first explain in general how to check whether a problem is
tame(2) related, and how to debug and report problems against tame(2).

After that, I will come back to the bgpd case :)


1. First, when a program violates a tame(2) policy, it is reported in
syslog. With dmesg(1), you will see an entry like:

generic(2198): syscall 89

the format is: PROGNAME(PID): syscall SYSCALL

so if you have entries like that, the kernel has enforced a policy and
killed a process, because it called the syscall number SYSCALL.

In the example, the program name is "generic" (it is the
src/regress/sys/kern/tame/generic regression test suite), which tried to
call syscall 89.

To get the syscall name, see the /usr/include/sys/syscall.h file.

$ grep -w 89 /usr/include/sys/syscall.h
#define SYS_clock_getres        89

So the "generic" program, pid 2198, was tamed and called clock_getres(2),
which its policy did not allow.
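
The lookup above can be scripted. The following is only a minimal sketch:
tame_violations is a hypothetical helper name (nothing in base provides it),
it assumes the log lines have exactly the "PROGNAME(PID): syscall SYSCALL"
form shown above, and it takes the path of the syscall header file as its
only argument.

```shell
# Hypothetical helper: read kernel messages on stdin and print
# "PROGNAME PID SYSCALL_NUMBER SYSCALL_NAME" for each violation line.
# $1 is the header file that holds the "#define SYS_name number" lines.
tame_violations() {
	hdr=$1
	# Keep only lines of the form "PROGNAME(PID): syscall N".
	sed -n 's/^\(.*\)(\([0-9][0-9]*\)): syscall \([0-9][0-9]*\)$/\1 \2 \3/p' |
	while read -r prog pid num; do
		# Compare the number as a whole field, so 89 does not match 189.
		name=$(awk -v n="$num" '
			$1 == "#define" && $3 == n { sub(/^SYS_/, "", $2); print $2; exit }
		' "$hdr")
		echo "$prog $pid $num ${name:-unknown}"
	done
}
```

Usage would be "dmesg | tame_violations HEADER", with HEADER being the
header file mentioned above. For the sample entry it would print
"generic 2198 89 clock_getres".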


2. Another way is to run the program under "ktrace -di":
 -d : Descendants
 -i : Inherit

$ ktrace -di ./generic
$ kdump
[...]
 32005 generic  CALL  tame(0x3730f018,0)
 32005 generic  STRU  tame request="inet"
 32005 generic  RET   tame 0
[...]
 32005 generic  CALL  kill(0,SIGINT)
 32005 generic  PSIG  SIGKILL SIG_DFL
[...]

ktrace(1) records the syscalls made by the program.

Here you can see the tame() call. The request was "inet".
Later the program tries to call the kill() syscall: it enters kill, but
never returns: the program receives a SIGKILL signal.

ktrace(1) lets you (quickly) investigate why a program has been
killed.
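
On a long trace, the interesting line is the last CALL before the SIGKILL.
A minimal sketch to pull it out follows; last_call_before_sigkill is a
hypothetical helper name, and it assumes the kdump(1) column layout shown
above (pid, program name, record type, then the record itself).

```shell
# Hypothetical helper: read kdump(1) output on stdin and, for each SIGKILL
# delivery, print the most recent syscall entered by that process.
last_call_before_sigkill() {
	awk '
	# Remember the latest syscall entry per pid.
	$3 == "CALL" { last[$1] = $4 }
	# On SIGKILL delivery, report the syscall that pid was last seen entering.
	$3 == "PSIG" && $4 == "SIGKILL" {
		print $1, $2, "killed in", (last[$1] != "" ? last[$1] : "unknown")
	}'
}
```

Usage would be "kdump | last_call_before_sigkill". For the trace above it
would print "32005 generic killed in kill(0,SIGINT)".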


3. In order to get a full backtrace of the "Killed" point, you have to
recompile your program.

tame(2) provides an "abort" request that slightly changes the
behaviour on policy violation: the program receives an
(uncatchable) SIGABRT instead, so it is killed and generates a
coredump.

This is the way to fully understand where (and, we hope, why) the program
has been killed.

Please note that issetugid(2) programs (like bgpd) don't generate
coredumps in the usual way. You have to tell the system that you want
them (refer to the sysctl man page for details):

# mkdir -m 700 /var/crash/bgpd
# sysctl kern.nosuidcoredump=3

Now, when you run the program, if the system kills the process, you will
get a coredump, usable with gdb(1).

$ gdb -q generic generic.core
Core was generated by `generic'.
Program terminated with signal 6, Aborted.
Loaded symbols for /usr/obj/regress/sys/kern/tame/generic/generic
Reading symbols from /usr/lib/libc.so.83.0...done.
Loaded symbols for /usr/lib/libc.so.83.0
Reading symbols from /usr/libexec/ld.so...done.
Loaded symbols for /usr/libexec/ld.so
#0  0x0e4eef0d in kill () at <stdin>:2
2       <stdin>: No such file or directory.
        in <stdin>
(gdb) bt
#0  0x0e4eef0d in kill () at <stdin>:2
#1  0x1641d696 in test_kill () at /usr/src/regress/sys/kern/tame/generic/main.c:58
#2  0x1641e499 in _start_test (ret=0xcf7c77c8, test_name=0x3641c1b6 "test_kill",
    request=0x3641c1ab "inet abort", paths=0x0, test_func=0x1641d670 <test_kill>)
    at /usr/src/regress/sys/kern/tame/generic/manager.c:231
#3  0x1641dcdd in main (argc=536910692, argv=0xcf7c7834) at /usr/src/regress/sys/kern/tame/generic/main.c:226
(gdb)

This third method is the preferred way to report a tame(2) problem: it
pinpoints the exact location of the policy violation.
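
For a report, the backtrace can also be captured without an interactive
session. This is only a sketch: core_backtrace is a hypothetical helper
name, and the GDB variable is an assumption of mine that lets you point it
at another debugger binary.

```shell
# Hypothetical helper: print the backtrace of a core file non-interactively.
# $1 is the program binary, $2 is the core file.
core_backtrace() {
	prog=$1
	core=$2
	# gdb reads its commands from a file when given -x; we only need "bt".
	cmds=$(mktemp) || return 1
	echo bt > "$cmds"
	"${GDB:-gdb}" -q -batch -x "$cmds" "$prog" "$core"
	status=$?
	rm -f "$cmds"
	return $status
}
```

Usage would be "core_backtrace generic generic.core > backtrace.txt", and
the resulting file can be attached to the report together with the dmesg.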



Now, if I go back to your bgpd problem...

First, check whether your dmesg contains an entry of the form
bgpd(pid): syscall XX

If so, could you recompile bgpd with the inlined patch at the
bottom? Please use a really -current system (or the latest available
snapshot): things are moving quickly in the tame area for now.

The patch adds the "abort" request to tame() in bgpd.

Next, allow bgpd to dump core:
# mkdir -m 700 /var/crash/bgpd
# sysctl kern.nosuidcoredump=3

Run the program, and finally report your dmesg and gdb backtrace.

I hope it helps.
-- 
Sebastien Marie



Index: usr.sbin/bgpd/rde.c
===================================================================
RCS file: /cvs/src/usr.sbin/bgpd/rde.c,v
retrieving revision 1.339
diff -u -p -u -r1.339 rde.c
--- usr.sbin/bgpd/rde.c 21 Sep 2015 09:47:15 -0000      1.339
+++ usr.sbin/bgpd/rde.c 4 Oct 2015 03:57:30 -0000
@@ -30,6 +30,7 @@
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>
+#include <err.h>
 
 #include "bgpd.h"
 #include "mrt.h"
@@ -185,6 +186,9 @@ rde_main(int debug, int verbose)
            setresgid(pw->pw_gid, pw->pw_gid, pw->pw_gid) ||
            setresuid(pw->pw_uid, pw->pw_uid, pw->pw_uid))
                fatal("can't drop privileges");
+
+       if (tame("malloc unix cmsg abort", NULL) == -1)
+               err(1, "tame");
 
        signal(SIGTERM, rde_sighdlr);
        signal(SIGINT, rde_sighdlr);

Index: session.c
===================================================================
RCS file: /cvs/src/usr.sbin/bgpd/session.c,v
retrieving revision 1.341
diff -u -p -r1.341 session.c
--- session.c   5 Oct 2015 16:16:41 -0000       1.341
+++ session.c   6 Oct 2015 08:42:47 -0000
@@ -219,7 +219,7 @@ session_main(int debug, int verbose)
            setresuid(pw->pw_uid, pw->pw_uid, pw->pw_uid))
                fatal("can't drop privileges");
 
-       if (tame("stdio inet cmsg", NULL) == -1)
+       if (tame("stdio inet cmsg abort", NULL) == -1)
                err(1, "tame");
 
        signal(SIGTERM, session_sighdlr);
