Hi.
I'm looking for guidance on how to troubleshoot a piece of software
which is spinning after calling fork(2).
I'm working on making the s2n-tls[1] library build on OpenBSD[2]. One
of the unit tests[3] does this:
1. The test framework forks the test (s2n_fork_generation_number_test)
2. The test process forks a child and calls waitpid()
3. The child creates 2 threads (pthread_create()), waits for the
threads to finish (pthread_join()), forks a grandchild, and calls
waitpid()
4. The grandchild is supposed to create 2 threads, wait for the
threads, fork a great grandchild, and call waitpid()
5. The great grandchild is supposed to create 2 threads, wait for the
threads, and exit
6. The processes exit in reverse order, and the test is finished.
During #3, the great grandchild never starts to execute and the
grandchild doesn't return from fork(). The test fails after a timeout.
ktrace[4] shows the grandchild (pid 91188) getting to the point where
it calls fork() (line 23862), but then it spins on sched_yield():
23860 91188 s2n_fork_generation_num CALL write(2,0x7f7d4f80,0x33)
23861 69848 ctestCALL
kevent(4,0x7f7e3d40,0,0x7f7e3d40,1024,0x7f7ebd48)
23862 91188 s2n_fork_generation_num GIO fd 2 wrote 51 bytes
23863"s2n_unit_test_fork: calling fork(); fork_level = 1
23864"
23865 69848 ctestSTRU struct timespec { 9.91400 }
23866 91188 s2n_fork_generation_num RET write 51/0x33
23867 69848 ctestSTRU struct kevent { ident=12,
filter=EVFILT_READ, flags=0x1, fflags=0<>, data=51, udata=0x0
}
23868 91188 s2n_fork_generation_num CALL sched_yield()
23869 69848 ctestRET kevent 1
23870 69848 ctestCALL read(12,0x6eb25d1f000,0x1)
23871 91188 s2n_fork_generation_num CALL __threxit(0xc296ca59e30)
23872 69848 ctestGIO fd 12 read 51 bytes
23873"s2n_unit_test_fork: calling fork(); fork_level = 1
23874"
23875 69848 ctestRET read 51/0x33
23876 91188 s2n_fork_generation_num CALL __threxit(0xc296ca59230)
23877 91188 s2n_fork_generation_num RET sched_yield 0
23878 91188 s2n_fork_generation_num CALL sched_yield()
23879 91188 s2n_fork_generation_num RET sched_yield 0
[...]
btrace also doesn't show that the grandchild enters fork()
1668777953577037810 ctest (69848/127562) syscall:fork:entry (0)
1668777953577858328 ctest (69848/127562) syscall:fork:return (87140)
# test process is pid 87140
1668777953577876672 ctest (87140/115691) syscall:fork:return (0)
1668777953586430021 s2n_fork_generation_num (87140/115691)
syscall:fork:entry (0)
1668777953586746921 s2n_fork_generation_num (87140/115691)
syscall:fork:return (62504)# child pid 62504
1668777953586831091 s2n_fork_generation_num (62504/463506)
syscall:fork:return (0)
1668777953587202790 s2n_fork_generation_num (87140/115691)
syscall:wait4:entry (62504)
1668777953658413274 s2n_fork_generation_num (62504/463506)
syscall:fork:entry (0)
1668777953658911264 s2n_fork_generation_num (62504/463506)
syscall:fork:return (91188)# grandchild pid 91188
1668777953658967606 s2n_fork_generation_num (91188/186182)
syscall:fork:return (0)
1668777953660934467 s2n_fork_generation_num (62504/463506)
syscall:wait4:entry (91188)
# no entry showing 91188 entering fork()
If I comment out the part of the test which creates the threads, the
test completes successfully. So some sort of locking/contention issue
in the threading code?
[1] https://github.com/aws/s2n-tls
[2] https://github.com/knightjoel/s2n-tls/tree/fix/build-on-openbsd
[3]
https://github.com/knightjoel/s2n-tls/blob/fix/build-on-openbsd/tests/unit/s2n_fork_generation_number_test.c
[4] https://www.packetmischief.ca/files/fork_test2.txt.gz
.joel