Re: child exit on self-proxy
On 7/7/21 11:45 AM, Stefan Eissing wrote:
> In my h2 test suite, I do a setup where I use proxy configs against the
> server itself. We seem to have a problem performing a clean child exit with
> that. Tested in 2.4.48 and trunk:
> - run tests with several graceful restarts
> - no proxied requests: clean exit
> - with proxied requests:
>     AH00045: child process 53682 still did not exit, sending a SIGTERM
>     AH00045: child process 53682 still did not exit, sending a SIGTERM
>     [often stops here, sometimes]
>     ...
>     AH00046: child process 53682 still did not exit, sending a SIGKILL
>
> Questions:
> - is such a test setup doomed to fail in general?
> - are we sure that we cannot encounter such states in "normal" setups?
>
> If someone wants a log at whatever LogLevel, let me know. It seems highly
> reproducible.

Do you have stack traces where these processes are hanging and a simple
config that causes this?

Regards

Rüdiger
Re: child exit on self-proxy
> Am 07.07.2021 um 11:49 schrieb Ruediger Pluem:
>
>> [...]
>> If someone wants a log at whatever LogLevel, let me know. It seems
>> highly reproducible.
>
> Do you have stack traces where these processes are hanging and a simple
> config that causes this?

I added a TRACE1 log in event.c before/after join_workers (line 2921) and see:

[Wed Jul 07 10:06:03.144044 2021] [mpm_event:trace1] [pid 72886:tid 4493635072] event.c(2921): graceful termination received, join workers
[Wed Jul 07 10:06:03.144213 2021] [mpm_event:trace1] [pid 72886:tid 123145435672576] event.c(1799): All workers are busy or dying, will close 1 keep-alive connections
[Wed Jul 07 10:06:08.079690 2021] [mpm_event:trace1] [pid 72886:tid 123145435672576] event.c(1799): All workers are busy or dying, will close 0 keep-alive connections
[Wed Jul 07 10:06:10.78 2021] [core:warn] [pid 72813:tid 4493635072] AH00045: child process 72886 still did not exit, sending a SIGTERM
[Wed Jul 07 10:06:12.789335 2021] [core:warn] [pid 72813:tid 4493635072] AH00045: child process 72886 still did not exit, sending a SIGTERM
[Wed Jul 07 10:06:14.791281 2021] [core:warn] [pid 72813:tid 4493635072] AH00045: child process 72886 still did not exit, sending a SIGTERM
[Wed Jul 07 10:06:16.792983 2021] [core:error] [pid 72813:tid 4493635072] AH00046: child process 72886 still did not exit, sending a SIGKILL

So I assume the keep-alive connection is the mod_proxy_http connection to
the server itself. Since join_workers() never returns, there seems to be a
thread not finishing.

On another run, I got a stack trace of the child:

Call graph:
    264 Thread_36766542   DispatchQueue_1: com.apple.main-thread (serial)
    + 264 start (in libdyld.dylib) + 1 [0x7fff2032cf5d]
    +   264 main (in httpd) + 2278 [0x104729b86] main.c:862
    +     264 ap_run_mpm (in httpd) + 75 [0x10473917b] mpm_common.c:100
    +       264 event_run (in mod_mpm_event.so) + 2994 [0x10502ae62] event.c:3398
    +         264 make_child (in mod_mpm_event.so) + 436 [0x10502bbc4] event.c:2997
    +           264 child_main (in mod_mpm_event.so) + 1734 [0x10502c2e6] event.c:2924
    +             264 join_workers (in mod_mpm_event.so) + 386 [0x10502cc72] event.c:2717
    +               264 apr_thread_join (in libapr-2.0.dylib) + 44 [0x1048b347c] thread.c:256
    +                 264 _pthread_join (in libsystem_pthread.dylib) + 362 [0x7fff20312f60]
    +                   264 __ulock_wait (in libsystem_kernel.dylib) + 10 [0x7fff202dd9ee]
    264 Thread_36766548
    + 264 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443]
    +   264 _pthread_start (in libsystem_pthread.dylib) + 224 [0x7fff203118fc]
    +     264 dummy_worker (in libapr-2.0.dylib) + 30 [0x1048b33ee] thread.c:148
    +       264 slot_run (in mod_http2.so) + 189 [0x104f99a8d] h2_workers.c:260
    +         264 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 [0x7fff20311e49]
    +           264 __psynch_cvwait (in libsystem_kernel.dylib) + 10 [0x7fff202decde]
    264 Thread_36766549
    + 264 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443]
    +   264 _pthread_start (in libsystem_pthread.dylib) + 224 [0x7fff203118fc]
    +     264 dummy_worker (in libapr-2.0.dylib) + 30 [0x1048b33ee] thread.c:148
    +       264 slot_run (in mod_http2.so) + 189 [0x104f99a8d] h2_workers.c:260
    +         264 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 [0x7fff20311e49]
    +           264 __psynch_cvwait (in libsystem_kernel.dylib) + 10 [0x7fff202decde]
    264 Thread_36766550
    + 264 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443]
    +   264 _pthread_start (in libsystem_pthread.dylib) + 224 [0x7fff203118fc]
    +     264 dummy_worker (in libapr-2.0.dylib) + 30 [0x1048b33ee] thread.c:148
    +       264 slot_run (in mod_http2.so) + 189 [0x104f99a8d] h2_workers.c:260
    +         264 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 [0x7fff20311e49]
    +           264 __psynch_cvwait (in libsystem_kernel.dylib) + 10 [0x7fff202decde]
    264 Thread_36766551
    + 264 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff2030d443]
Re: child exit on self-proxy
On Wed, Jul 7, 2021 at 12:16 PM Stefan Eissing wrote:
>
> I added a TRACE1 log in event.c before/after join_workers (line 2921) and see:
> [...]
> So, I assume the keep-alive connection is the mod_proxy_http connection to
> the server itself. Since the join_workers() never returns, there seems to be a
> thread not finishing.

No, MPM event doesn't handle mod_proxy kept alive connections, those
are client connections.

> On another run, I got a stacktrace of the child:
>
> Call graph:
> 264 Thread_36766542 DispatchQueue_1: com.apple.main-thread (serial)
> [...]
> +             264 join_workers (in mod_mpm_event.so) + 386 [0x10502cc72] event.c:2717
> +               264 apr_thread_join (in libapr-2.0.dylib) + 44 [0x1048b347c] thread.c:256
> [...]

OK, that's where the child waits for worker threads.

> 264 Thread_36766548
> [...]
> +       264 slot_run (in mod_http2.so) + 189 [0x104f99a8d] h2_workers.c:260
> +         264 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 [0x7fff20311e49]
> [...identical stacks for Threads 36766549, 36766550, 36766551...]

The above mod_http2 threads are not accounted for by the MPM I suppose
(they should be stopped by the future child_exit hook, but still the MPM
will only care about its own worker threads, i.e. thos
Re: child exit on self-proxy
Was busy on other things. Will try to get answers...

> Am 07.07.2021 um 13:51 schrieb Yann Ylavic:
>
> [...]
> No, MPM event doesn't handle mod_proxy kept alive connections, those
> are client connections.
> [...]
> OK, that's where the child waits for worker threads.
> [...]
Re: child exit on self-proxy
In order to reproduce the logs I see on this weird behaviour, I'll attach
the patch I made: [attachment: event_debug.diff]

With this, I see in a hanging process:

[Wed Jul 07 15:20:23.014980 2021] [core:trace1] [pid 42471:tid 123145448968192] core_filters.c(429): (32)Broken pipe: [client 127.0.0.1:49824] core_output_filter: writing data to the network
[Wed Jul 07 15:20:23.044541 2021] [mpm_event:debug] [pid 42471:tid 4525448704] event.c(599): AH: wakeup_listener: start
[Wed Jul 07 15:20:23.044601 2021] [mpm_event:debug] [pid 42471:tid 4525448704] event.c(603): AH: wakeup_listener: apr_pollset_wakeup
[Wed Jul 07 15:20:23.044715 2021] [mpm_event:debug] [pid 42471:tid 4525448704] event.c(610): AH: wakeup_listener: ap_queue_info_term
[Wed Jul 07 15:20:23.044861 2021] [mpm_event:debug] [pid 42471:tid 123145461309440] event.c(1985): (4)Interrupted system call: AH: pollset returned listener_may_exit=0
[Wed Jul 07 15:20:23.044817 2021] [mpm_event:debug] [pid 42471:tid 4525448704] event.c(629): AH: wakeup_listener: pthread_kill
[Wed Jul 07 15:20:23.044954 2021] [mpm_event:trace1] [pid 42471:tid 123145461309440] event.c(1809): All workers are busy or dying, will close 1 keep-alive connections
[Wed Jul 07 15:20:23.044987 2021] [mpm_event:debug] [pid 42471:tid 4525448704] event.c(635): AH: wakeup_listener: end
[Wed Jul 07 15:20:23.045154 2021] [mpm_event:trace1] [pid 42471:tid 4525448704] event.c(2940): graceful termination received, join workers
[Wed Jul 07 15:20:25.927437 2021] [mpm_event:debug] [pid 42471:tid 123145461309440] event.c(1985): (70007)The timeout specified has expired: AH: pollset returned listener_may_exit=0
[Wed Jul 07 15:20:25.927468 2021] [mpm_event:trace1] [pid 42471:tid 123145461309440] event.c(1809): All workers are busy or dying, will close 0 keep-alive connections

Which says (to my eyes) that the pollset_wakeup happened, but the listener
saw "listener_may_exit == 0". Is the volatile maybe not enough?
Re: child exit on self-proxy
On 7/7/21 5:25 PM, Stefan Eissing wrote:
> In order to reproduce the logs I see on this weird behaviour, I'll attach
> the patch I made:
> [...]
> [Wed Jul 07 15:20:23.044861 2021] [mpm_event:debug] [pid 42471:tid
> 123145461309440] event.c(1985): (4)Interrupted system call: AH: pollset
> returned listener_may_exit=0

Hm. The ap_log_error statically writes listener_may_exit=0. Can you put the
actual value of listener_may_exit in the log message?

> [...]
> [Wed Jul 07 15:20:25.927437 2021] [mpm_event:debug] [pid 42471:tid
> 123145461309440] event.c(1985): (70007)The timeout specified has expired:
> AH: pollset returned listener_may_exit=0

Looks like we get back in the poll and have to wait for the timeout.

> [Wed Jul 07 15:20:25.927468 2021] [mpm_event:trace1] [pid 42471:tid
> 123145461309440] event.c(1809): All workers are busy or dying, will close 0
> keep-alive connections
>
> Which says (to my eyes) that the pollset_wakeup happened, but the listener
> saw "listener_may_exit == 0". Is the volatile maybe not enough?

Regards

Rüdiger
Re: child exit on self-proxy
> Am 07.07.2021 um 17:39 schrieb Ruediger Pluem:
>
> Hm. The ap_log_error statically writes listener_may_exit=0. Can you put the
> actual value of listener_may_exit in the log message?

I put it statically since that was the value in the "if ()" 6 lines above.
It should have been visible then and there to work.

> [...]
> Looks like we get back in the poll and have to wait for the timeout.
Re: child exit on self-proxy
On Wed, Jul 7, 2021 at 5:39 PM Ruediger Pluem wrote:
>
> On 7/7/21 5:25 PM, Stefan Eissing wrote:
> > In order to reproduce the logs I see on this weird behaviour, I'll attach
> > the patch I made:
> > [...]
>
> Hm. The ap_log_error statically writes listener_may_exit=0. Can you put the
> actual value of listener_may_exit in the log message?

It would be interesting to log apr_atomic_read32(&connection_count) too.
Re: child exit on self-proxy
On Wed, Jul 7, 2021 at 5:45 PM Stefan Eissing wrote:
>
> > Am 07.07.2021 um 17:39 schrieb Ruediger Pluem:
> >
> > Hm. The ap_log_error statically writes listener_may_exit=0. Can you put
> > the actual value of listener_may_exit in the log message?
>
> I put it statically since that was the value in the "if ()" 6 lines above.
> It should have been visible then and there to work.

The above if () does not always break.

Cheers;
Yann.
Re: child exit on self-proxy
The things I do for you...

[Wed Jul 07 15:48:52.591983 2021] [mpm_event:debug] [pid 48747:tid 4385197568] event.c(599): AH: wakeup_listener: start
[Wed Jul 07 15:48:52.592000 2021] [mpm_event:debug] [pid 48747:tid 4385197568] event.c(603): AH: wakeup_listener: apr_pollset_wakeup
[Wed Jul 07 15:48:52.592046 2021] [mpm_event:debug] [pid 48747:tid 4385197568] event.c(610): AH: wakeup_listener: ap_queue_info_term
[Wed Jul 07 15:48:52.592188 2021] [mpm_event:debug] [pid 48747:tid 4385197568] event.c(629): AH: wakeup_listener: pthread_kill
[Wed Jul 07 15:48:52.592166 2021] [mpm_event:debug] [pid 48747:tid 123145502474240] event.c(1985): (4)Interrupted system call: AH: pollset returned listener_may_exit=1 connection_count=2
[Wed Jul 07 15:48:52.592523 2021] [mpm_event:debug] [pid 48747:tid 4385197568] event.c(635): AH: wakeup_listener: end
[Wed Jul 07 15:48:52.592727 2021] [mpm_event:trace1] [pid 48747:tid 4385197568] event.c(2942): graceful termination received, join workers
[Wed Jul 07 15:48:52.592515 2021] [mpm_event:trace1] [pid 48747:tid 123145502474240] event.c(1809): All workers are busy or dying, will close 2 keep-alive connections
[Wed Jul 07 15:48:52.593381 2021] [mpm_event:debug] [pid 48747:tid 123145502474240] event.c(1985): AH: pollset returned listener_may_exit=1 connection_count=2
[Wed Jul 07 15:48:52.593478 2021] [mpm_event:debug] [pid 48747:tid 123145502474240] event.c(1985): (4)Interrupted system call: AH: pollset returned listener_may_exit=1 connection_count=1
[Wed Jul 07 15:48:57.552623 2021] [mpm_event:debug] [pid 48747:tid 123145502474240] event.c(1985): (70007)The timeout specified has expired: AH: pollset returned listener_may_exit=1 connection_count=1
[Wed Jul 07 15:48:57.552658 2021] [mpm_event:trace1] [pid 48747:tid 123145502474240] event.c(1809): All workers are busy or dying, will close 0 keep-alive connections

[attachment: event_debugv2.diff]

Btw. I opened a bottle of nice wine - this is entertaining. ;-)

- Stefan

> Am 07.07.2021 um 17:48 schrieb Yann Ylavic:
> [...]
> The above if () does not always break.
Re: child exit on self-proxy
Does this attached patch help? This is to verify a hypothesis where killed keepalive connections are still in the defer_linger_chain without a worker to handle them. I don't see "interrupted, but continuing" triggering in your log but possibly it happened earlier..

On Wed, Jul 7, 2021 at 5:51 PM Stefan Eissing wrote: > > The things I do for you... > > [Wed Jul 07 15:48:52.591983 2021] [mpm_event:debug] [pid 48747:tid > 4385197568] event.c(599): AH: wakeup_listener: start > [Wed Jul 07 15:48:52.592000 2021] [mpm_event:debug] [pid 48747:tid > 4385197568] event.c(603): AH: wakeup_listener: apr_pollset_wakeup > [Wed Jul 07 15:48:52.592046 2021] [mpm_event:debug] [pid 48747:tid > 4385197568] event.c(610): AH: wakeup_listener: ap_queue_info_term > [Wed Jul 07 15:48:52.592188 2021] [mpm_event:debug] [pid 48747:tid > 4385197568] event.c(629): AH: wakeup_listener: pthread_kill > [Wed Jul 07 15:48:52.592166 2021] [mpm_event:debug] [pid 48747:tid > 123145502474240] event.c(1985): (4)Interrupted system call: AH: pollset > returned listener_may_exit=1 connection_count=2 > [Wed Jul 07 15:48:52.592523 2021] [mpm_event:debug] [pid 48747:tid > 4385197568] event.c(635): AH: wakeup_listener: end > [Wed Jul 07 15:48:52.592727 2021] [mpm_event:trace1] [pid 48747:tid > 4385197568] event.c(2942): graceful termination received, join workers > [Wed Jul 07 15:48:52.592515 2021] [mpm_event:trace1] [pid 48747:tid > 123145502474240] event.c(1809): All workers are busy or dying, will close 2 > keep-alive connections > [Wed Jul 07 15:48:52.593381 2021] [mpm_event:debug] [pid 48747:tid > 123145502474240] event.c(1985): AH: pollset returned listener_may_exit=1 > connection_count=2 > [Wed Jul 07 15:48:52.593478 2021] [mpm_event:debug] [pid 48747:tid > 123145502474240] event.c(1985): (4)Interrupted system call: AH: pollset > returned listener_may_exit=1 connection_count=1 > [Wed Jul 07 15:48:57.552623 2021] [mpm_event:debug] [pid 48747:tid > 123145502474240] event.c(1985): (70007)The timeout specified has expired: AH: > pollset returned listener_may_exit=1 connection_count=1 > [Wed Jul 07 15:48:57.552658 2021] [mpm_event:trace1] [pid 48747:tid > 123145502474240] event.c(1809): All workers are busy or dying, will close 0 > keep-alive connections > > > > Btw. I opened a bottle of nice wine - this is entertaining. ;-) > > - Stefan > > > Am 07.07.2021 um 17:48 schrieb Yann Ylavic : > > > > On Wed, Jul 7, 2021 at 5:45 PM Stefan Eissing > > wrote: > >> > >>> Am 07.07.2021 um 17:39 schrieb Ruediger Pluem : > >>> > >>> Hm. The ap_log_error statically writes listener_may_exit=0. Can you put > >>> the actual value of listener_may_exit in the log message? > >> > >> I put it statically since that was the value in the "if ()" 6 lines above. > >> it should have been visible then and there to work. > > > > The above if () does not always break. > > > > > > Cheers; > > Yann.

Index: server/mpm/event/event.c
===================================================================
--- server/mpm/event/event.c	(revision 1891217)
+++ server/mpm/event/event.c	(working copy)
@@ -1940,15 +1940,6 @@ static void * APR_THREAD_FUNC listener_thread(apr_
         rc = apr_pollset_poll(event_pollset, timeout_interval, &num, &out_pfd);
         if (rc != APR_SUCCESS) {
             if (APR_STATUS_IS_EINTR(rc)) {
-                /* Woken up, if we are exiting or listeners are disabled we
-                 * must fall through to kill kept-alive connections or test
-                 * whether listeners should be re-enabled. Otherwise we only
-                 * need to update timeouts (logic is above, so simply restart
-                 * the loop).
-                 */
-                if (!listener_may_exit && !listeners_disabled()) {
-                    continue;
-                }
                 timeout_time = 0;
             }
             else if (!APR_STATUS_IS_TIMEUP(rc)) {
Re: child exit on self-proxy
> Am 07.07.2021 um 18:34 schrieb Yann Ylavic : > > Does this attached patch help? This is to verify an hypothesis where > killed keepalive connections are still in the defer_linger_chain > without a worker to handle them. > > I don't see "interrupted, but continuing" triggering in your log but > possibly it happened earlier.. [Wed Jul 07 16:48:04.237865 2021] [mpm_event:debug] [pid 55134:tid 4643098112] event.c(599): AH: wakeup_listener: start [Wed Jul 07 16:48:04.237895 2021] [mpm_event:debug] [pid 55134:tid 4643098112] event.c(603): AH: wakeup_listener: apr_pollset_wakeup [Wed Jul 07 16:48:04.238042 2021] [mpm_event:debug] [pid 55134:tid 4643098112] event.c(610): AH: wakeup_listener: ap_queue_info_term [Wed Jul 07 16:48:04.238224 2021] [mpm_event:debug] [pid 55134:tid 123145362530304] event.c(1971): (4)Interrupted system call: AH: pollset returned listener_may_exit=1 connection_count=1 [Wed Jul 07 16:48:04.238381 2021] [mpm_event:trace1] [pid 55134:tid 123145362530304] event.c(1809): All workers are busy or dying, will close 1 keep-alive connections [Wed Jul 07 16:48:04.238196 2021] [mpm_event:debug] [pid 55134:tid 4643098112] event.c(629): AH: wakeup_listener: pthread_kill [Wed Jul 07 16:48:04.238513 2021] [mpm_event:debug] [pid 55134:tid 123145362530304] event.c(1971): (4)Interrupted system call: AH: pollset returned listener_may_exit=1 connection_count=1 [Wed Jul 07 16:48:04.238506 2021] [mpm_event:debug] [pid 55134:tid 4643098112] event.c(635): AH: wakeup_listener: end [Wed Jul 07 16:48:04.238555 2021] [mpm_event:trace1] [pid 55134:tid 4643098112] event.c(2928): graceful termination received, join workers [Wed Jul 07 16:48:08.443130 2021] [mpm_event:debug] [pid 55134:tid 123145362530304] event.c(1971): (70007)The timeout specified has expired: AH: pollset returned listener_may_exit=1 connection_count=1 [Wed Jul 07 16:48:08.443149 2021] [mpm_event:trace1] [pid 55134:tid 123145362530304] event.c(1809): All workers are busy or dying, will close 0 keep-alive 
connections > > On Wed, Jul 7, 2021 at 5:51 PM Stefan Eissing > wrote: >> >> The things I do for you... >> >> [Wed Jul 07 15:48:52.591983 2021] [mpm_event:debug] [pid 48747:tid >> 4385197568] event.c(599): AH: wakeup_listener: start >> [Wed Jul 07 15:48:52.592000 2021] [mpm_event:debug] [pid 48747:tid >> 4385197568] event.c(603): AH: wakeup_listener: apr_pollset_wakeup >> [Wed Jul 07 15:48:52.592046 2021] [mpm_event:debug] [pid 48747:tid >> 4385197568] event.c(610): AH: wakeup_listener: ap_queue_info_term >> [Wed Jul 07 15:48:52.592188 2021] [mpm_event:debug] [pid 48747:tid >> 4385197568] event.c(629): AH: wakeup_listener: pthread_kill >> [Wed Jul 07 15:48:52.592166 2021] [mpm_event:debug] [pid 48747:tid >> 123145502474240] event.c(1985): (4)Interrupted system call: AH: pollset >> returned listener_may_exit=1 connection_count=2 >> [Wed Jul 07 15:48:52.592523 2021] [mpm_event:debug] [pid 48747:tid >> 4385197568] event.c(635): AH: wakeup_listener: end >> [Wed Jul 07 15:48:52.592727 2021] [mpm_event:trace1] [pid 48747:tid >> 4385197568] event.c(2942): graceful termination received, join workers >> [Wed Jul 07 15:48:52.592515 2021] [mpm_event:trace1] [pid 48747:tid >> 123145502474240] event.c(1809): All workers are busy or dying, will close 2 >> keep-alive connections >> [Wed Jul 07 15:48:52.593381 2021] [mpm_event:debug] [pid 48747:tid >> 123145502474240] event.c(1985): AH: pollset returned listener_may_exit=1 >> connection_count=2 >> [Wed Jul 07 15:48:52.593478 2021] [mpm_event:debug] [pid 48747:tid >> 123145502474240] event.c(1985): (4)Interrupted system call: AH: pollset >> returned listener_may_exit=1 connection_count=1 >> [Wed Jul 07 15:48:57.552623 2021] [mpm_event:debug] [pid 48747:tid >> 123145502474240] event.c(1985): (70007)The timeout specified has expired: >> AH: pollset returned listener_may_exit=1 connection_count=1 >> [Wed Jul 07 15:48:57.552658 2021] [mpm_event:trace1] [pid 48747:tid >> 123145502474240] event.c(1809): All workers are busy or dying, 
will close 0 >> keep-alive connections >> >> >> >> Btw. I opened a bottle of nice wine - this is entertaining. ;-) >> >> - Stefan >> >>> Am 07.07.2021 um 17:48 schrieb Yann Ylavic : >>> >>> On Wed, Jul 7, 2021 at 5:45 PM Stefan Eissing >>> wrote: > Am 07.07.2021 um 17:39 schrieb Ruediger Pluem : > > Hm. The ap_log_error statically writes listener_may_exit=0. Can you put > the actual value of listener_may_exit in the log message? I put it statically since that was the value in the "if ()" 6 lines above. it should have been visible then and there to work. >>> >>> The above if () does not always break. >>> >>> >>> Cheers; >>> Yann. >> >
Re: child exit on self-proxy
Some day, I knew I had to learn more about mpm_event. =)

Adding more DEBUGs, I see in the example below that 2 connections were idling at the start of the graceful restart and they get added to the linger_chain. 2 workers are then woken up and process the sockets. connection_count stays at 2, however. As I read it, that count drops when the connection pool is destroyed/cleaned up. This normally seems to happen via the worker_queue_info, but in this example it just does not happen.

Is this a correct read?

[Thu Jul 08 09:39:32.423198 2021] [mpm_event:trace1] [pid 9272:tid 4494732800] event.c(2973): graceful termination received, join workers
[Thu Jul 08 09:39:32.423032 2021] [mpm_event:debug] [pid 9272:tid 123145324126208] event.c(2250): (4)Interrupted system call: AH: listener: process defer_linger_chain have_idle_worker=1 listener_may_exit=1 connection_count=2
[Thu Jul 08 09:39:32.422765 2021] [mpm_event:debug] [pid 9272:tid 123145321443328] event.c(2431): (4)Interrupted system call: AH: worker pop_something interrupted, goto worker_pop
[Thu Jul 08 09:39:32.422251 2021] [mpm_event:debug] [pid 9272:tid 123145311784960] event.c(2431): (4)Interrupted system call: AH: worker pop_something interrupted, goto worker_pop
[Thu Jul 08 09:39:32.421638 2021] [mpm_event:debug] [pid 9272:tid 123145318760448] event.c(2431): (4)Interrupted system call: AH: worker pop_something interrupted, goto worker_pop
[Thu Jul 08 09:39:32.423541 2021] [mpm_event:debug] [pid 9272:tid 123145324126208] event.c(2259): (4)Interrupted system call: AH: listener: process defer_linger_chain pushed to worker, listener_may_exit=1 connection_count=2
[Thu Jul 08 09:39:32.422212 2021] [mpm_event:debug] [pid 9272:tid 123145311248384] event.c(2408): (4)Interrupted system call: AH: worker pop_something, workers_may_exit=0, dying=1
[Thu Jul 08 09:39:32.421806 2021] [mpm_event:debug] [pid 9272:tid 123145312858112] event.c(2431): (4)Interrupted system call: AH: worker pop_something interrupted, goto worker_pop
[Thu Jul 08 09:39:32.422789 2021] [mpm_event:debug] [pid 9272:tid 123145323053056] event.c(2431): (4)Interrupted system call: AH: worker pop_something interrupted, goto worker_pop
[Thu Jul 08 09:39:32.423346 2021] [mpm_event:debug] [pid 9272:tid 123145317150720] event.c(2431): (4)Interrupted system call: AH: worker pop_something interrupted, goto worker_pop
[Thu Jul 08 09:39:32.423755 2021] [mpm_event:debug] [pid 9272:tid 123145311248384] event.c(2431): (4)Interrupted system call: AH: worker pop_something interrupted, goto worker_pop
[Thu Jul 08 09:39:32.424230 2021] [mpm_event:debug] [pid 9272:tid 123145317150720] event.c(2402): (4)Interrupted system call: AH: worker read worker_queue, workers_may_exit=0, dying=1
[Thu Jul 08 09:39:32.424373 2021] [mpm_event:debug] [pid 9272:tid 123145311248384] event.c(2402): (4)Interrupted system call: AH: worker read worker_queue, workers_may_exit=0, dying=1
[Thu Jul 08 09:39:32.424384 2021] [mpm_event:debug] [pid 9272:tid 123145317150720] event.c(2408): AH: worker pop_something, workers_may_exit=0, dying=1
[Thu Jul 08 09:39:32.424477 2021] [mpm_event:debug] [pid 9272:tid 123145317150720] event.c(2338): AH: process_defer_linger, processing socket, connection_count=2
[Thu Jul 08 09:39:32.424653 2021] [mpm_event:debug] [pid 9272:tid 123145317150720] event.c(2342): AH: process_defer_linger, processed socket, connection_count=2
[Thu Jul 08 09:39:32.424664 2021] [mpm_event:debug] [pid 9272:tid 123145317150720] event.c(2338): AH: process_defer_linger, processing socket, connection_count=2
[Thu Jul 08 09:39:32.424680 2021] [mpm_event:debug] [pid 9272:tid 123145317150720] event.c(2342): AH: process_defer_linger, processed socket, connection_count=2
[Thu Jul 08 09:39:32.424708 2021] [mpm_event:debug] [pid 9272:tid 123145317150720] event.c(2319): AH: process_defer_linger, nothing more to do
[Thu Jul 08 09:39:32.424754 2021] [mpm_event:debug] [pid 9272:tid 123145317150720] event.c(2402): AH: worker read worker_queue, workers_may_exit=0, dying=1
[Thu Jul 08 09:39:32.424756 2021] [mpm_event:debug] [pid 9272:tid 123145324126208] event.c(1978): AH: pollset returned listener_may_exit=1 connection_count=2
[Thu Jul 08 09:39:32.424830 2021] [mpm_event:debug] [pid 9272:tid 123145311248384] event.c(2408): AH: worker pop_something, workers_may_exit=0, dying=1
[Thu Jul 08 09:39:32.424910 2021] [mpm_event:debug] [pid 9272:tid 123145311248384] event.c(808): AH: decrement_connection_count: last_connection=0 listener_may_exit=1 connection_count=1
[Thu Jul 08 09:39:32.424924 2021] [mpm_event:debug] [pid 9272:tid 123145311248384] event.c(816): AH: decrement_connection_count: pollset_wakeup
[Thu Jul 08 09:39:32.424958 2021] [mpm_event:debug] [pid 9272:tid 123145311248384] event.c(2319): AH: process_defer_linger, nothing more to do
[Thu Jul 08 09:39:32.424960 2021] [mpm_event:debug] [pid 9272:tid 123145324126208] event.c(1978): (4)Interrupted system call: AH: pollset returned listener_may_exit=1 conn
Re: child exit on self-proxy
On Thu, Jul 8, 2021 at 11:47 AM Stefan Eissing wrote: > > Some day, I knew I had to learn more about mpm_event. =) > > Adding more DEBUGs, I see in the example below that 2 connections were idling > at start of the graceful and the get added to the linger_chain. 2 workers are > then woken up and process the socket. connection_count stays at 2 however. As > I read it, that count drops when the connection pool is destroyed/cleanup'ed. > This normally seem to happen on the worker_queue_info, but in this example > this just does not happen. > > Is this a correct read? You proxy to a local server, so the 2 connections are the incoming ones on the local proxy vhost and the local server vhost. But mod_proxy backend connections are recycled and end up in a reslist which is not checked until reuse. So when on graceful restart the local server starts lingering close with its kept alive connection, it's just ignored by mod_proxy and the MAX_SECS_TO_LINGER timeout (30s) applies. I think we shouldn't do lingering on keepalive connections when they expire or get killed (graceful or max workers), this does not help the client anyway because any data sent on the connection is doomed, the sooner we RESET the faster it will try on another connection. So how about the attached patch that closes the connections when their keepalive expires (including when it's shortened)? There are other changes in there, like more trace logs that helped me debug things, they are worth it too IMHO. Also I changed reqevents = POLLIN to POLLIN|POLLHUP (by generalizing and using update_reqevents_from_sense() in more places), because I'm not sure that we catch the connections closed remotely otherwise (connection_count should go down to 1 more quickly in the local proxy+server case because the local server's client connection should have responded to the lingering close almost instantly, but this one seems to timeout too). Cheers; Yann. 
Index: server/mpm/event/event.c
===================================================================
--- server/mpm/event/event.c	(revision 1891217)
+++ server/mpm/event/event.c	(working copy)
@@ -554,7 +554,7 @@ static APR_INLINE int connections_above_limit(int
         return 1;
 }
 
-static void abort_socket_nonblocking(apr_socket_t *csd)
+static void close_socket_nonblocking(apr_socket_t *csd)
 {
     apr_status_t rv;
     apr_socket_timeout_set(csd, 0);
@@ -573,7 +573,7 @@ static void close_worker_sockets(void)
         apr_socket_t *csd = worker_sockets[i];
         if (csd) {
             worker_sockets[i] = NULL;
-            abort_socket_nonblocking(csd);
+            close_socket_nonblocking(csd);
         }
     }
     for (;;) {
@@ -587,12 +587,16 @@ static void close_worker_sockets(void)
             continue;
         }
         cs->chain = NULL;
-        abort_socket_nonblocking(cs->pfd.desc.s);
+        close_socket_nonblocking(cs->pfd.desc.s);
     }
 }
 
-static void wakeup_listener(void)
+static void shutdown_listener(void)
 {
+    ap_log_error(APLOG_MARK, APLOG_DEBUG, 0, ap_server_conf,
+                 "shutting down listener (%s)",
+                 listener_may_exit ? "again" : "first");
+
     listener_may_exit = 1;
     disable_listensocks();
@@ -643,7 +647,7 @@ static void signal_threads(int mode)
     /* in case we weren't called from the listener thread, wake up the
      * listener thread */
-    wakeup_listener();
+    shutdown_listener();
 
     /* for ungraceful termination, let the workers exit now;
      * for graceful termination, the listener thread will notify the
@@ -779,6 +783,8 @@ static apr_status_t decrement_connection_count(voi
 {
     int is_last_connection;
     event_conn_state_t *cs = cs_;
+    ap_log_cerror(APLOG_MARK, APLOG_TRACE4, 0, cs->c,
+                  "cleanup connection in state %i", (int)cs->pub.state);
     switch (cs->pub.state) {
         case CONN_STATE_LINGER_NORMAL:
         case CONN_STATE_LINGER_SHORT:
@@ -817,6 +823,22 @@ static void notify_resume(event_conn_state_t *cs,
     ap_run_resume_connection(cs->c, cs->r);
 }
 
+static void kill_connection(event_conn_state_t *cs)
+{
+    ap_log_cerror(APLOG_MARK, APLOG_TRACE4, 0, cs->c,
+                  "kill connection in state %i", (int)cs->pub.state);
+    close_socket_nonblocking(cs->pfd.desc.s);
+    ap_queue_info_push_pool(worker_queue_info, cs->p);
+}
+
+static int kill_connection_nonblocking(event_conn_state_t *cs)
+{
+    kill_connection(cs);
+    if (dying)
+        ap_queue_interrupt_one(worker_queue);
+    return 1;
+}
+
 /*
  * Close our side of the connection, flushing data to the client first.
  * Pre-condition: cs is not in any timeout queue and not in the pollset,
@@ -827,12 +849,11 @@ static void notify_resume(event_conn_state_t *cs,
  */
 static int start_lingering_close_blocking(event_conn_state_t *cs)
 {
-    apr_socket_t *csd = cs->pfd.desc.s;
-
+    ap_log_cerror(APLOG_MARK, APLOG_TRACE4, 0, cs->c,
+                  "lingering close in state %i", (int)cs->pub.state
Re: child exit on self-proxy
> Am 08.07.2021 um 14:14 schrieb Yann Ylavic : > > On Thu, Jul 8, 2021 at 11:47 AM Stefan Eissing > wrote: >> >> Some day, I knew I had to learn more about mpm_event. =) >> >> Adding more DEBUGs, I see in the example below that 2 connections were >> idling at start of the graceful and the get added to the linger_chain. 2 >> workers are then woken up and process the socket. connection_count stays at >> 2 however. As I read it, that count drops when the connection pool is >> destroyed/cleanup'ed. This normally seem to happen on the worker_queue_info, >> but in this example this just does not happen. >> >> Is this a correct read? > > You proxy to a local server, so the 2 connections are the incoming > ones on the local proxy vhost and the local server vhost. > But mod_proxy backend connections are recycled and end up in a reslist > which is not checked until reuse. > So when on graceful restart the local server starts lingering close > with its kept alive connection, it's just ignored by mod_proxy and the > MAX_SECS_TO_LINGER timeout (30s) applies. > > I think we shouldn't do lingering on keepalive connections when they > expire or get killed (graceful or max workers), this does not help the > client anyway because any data sent on the connection is doomed, the > sooner we RESET the faster it will try on another connection. > > So how about the attached patch that closes the connections when their > keepalive expires (including when it's shortened)? > There are other changes in there, like more trace logs that helped me > debug things, they are worth it too IMHO. 
> Also I changed reqevents = POLLIN to POLLIN|POLLHUP (by generalizing > and using update_reqevents_from_sense() in more places), because I'm > not sure that we catch the connections closed remotely otherwise > (connection_count should go down to 1 more quickly in the local > proxy+server case because the local server's client connection should > have responded to the lingering close almost instantly, but this one > seems to timeout too). > > Cheers; > Yann.

This seems to be it! Yann strikes again! And I learned something... \o/

I needed to make small tweaks, because not all previous closes checked the return value and I got assert failures with your patch. I attach my version below. Instead of the param, there could also be 2 functions, one without checks. A matter of taste.

So, I try to summarise:

- the effect was triggered by the keep-alive connections' remote endpoints slumbering in a proxy connection pool where no one was monitoring the sockets and reacting to TCP FIN packets.
- This triggered a long LINGER timeout that blocked the listener from exiting, ultimately leading to the parent killing the child.
- This situation could also happen, I assume, when a client drops from the network, e.g. a cell phone entering bad coverage, where no one in the network path really knows what to do? Or a network connectivity issue, or a computer crashing behind a NAT?

Cheers, Stefan

[Attachment: event_ka_no_lingerv2.diff]
Re: child exit on self-proxy
On Thu, Jul 8, 2021 at 2:55 PM Stefan Eissing wrote: > > This seems to be it! Yann strikes again! And I learned something...\o/ Thanks Stefan :) > > I needed to make small tweaks, because no all previous close checked the > return value and I got assert failures on your patch. Interesting, what error value was that? EBADF, EAGAIN? > I attach my version > below. Instead of the param, there could also be 2 functions, one without > checks. A matter of taste. Depending on the error cases, we might want to fix something or ignore or relax the assertion maybe. Let's see.. > > So, I try to summarise: > > - the effect was triggered by the keep-alive connections remote endpoints > slumbering in a proxy connection pool where no one was monitoring the > sockets and reacting to TCP FIN packets. > - This triggered a long LINGER timeout that blocked the listener from > exiting. Ultimately leading to the parent killing the child. > - This situation could also happen, I assume, when a client drops from > the network, e.g. a cell phone entering bad coverage, where no one > in the network path really knows what to do? Or a network connectivity > issue or a comp crashing behind a NAT? Exactly, I think that closing kept alive connections is the right thing to do here, lingering close helps neither MPM event nor the client. I first tried to use the short linger timeout instead of the long one, but I don't see a reason to linger those connections, even a little. Other opinions maybe? Cheers; Yann.
Re: child exit on self-proxy
> Am 08.07.2021 um 15:33 schrieb Yann Ylavic : > > On Thu, Jul 8, 2021 at 2:55 PM Stefan Eissing > wrote: >> >> This seems to be it! Yann strikes again! And I learned something...\o/ > > Thanks Stefan :) > >> >> I needed to make small tweaks, because no all previous close checked the >> return value and I got assert failures on your patch. > > Interesting, what error value was that? EBADF, EAGAIN? EBADF is what I saw. > >> I attach my version >> below. Instead of the param, there could also be 2 functions, one without >> checks. A matter of taste. > > Depending on the error cases, we might want to fix something or ignore > or relax the assertion maybe. > Let's see.. > >> >> So, I try to summarise: >> >> - the effect was triggered by the keep-alive connections remote endpoints >> slumbering in a proxy connection pool where no one was monitoring the >> sockets and reacting to TCP FIN packets. >> - This triggered a long LINGER timeout that blocked the listener from >> exiting. Ultimately leading to the parent killing the child. >> - This situation could also happen, I assume, when a client drops from >> the network, e.g. a cell phone entering bad coverage, where no one >> in the network path really knows what to do? Or a network connectivity >> issue or a comp crashing behind a NAT? > > Exactly, I think that closing kept alive connections is the right > thing to do here, lingering close helps neither MPM event nor the > client. > I first tried to use the short linger timeout instead of the long one, > but I don't see a reason to linger those connections, even a little. > Other opinions maybe? > > > Cheers; > Yann.
Re: child exit on self-proxy
> I first tried to use the short linger timeout instead of the long one,
> but I don't see a reason to linger those connections, even a little.

IIUC this would only be useful for a KeepAliveTimeout of something like 2s or less, which is kind of academic. I would go with whatever is simpler [to maintain].
Re: child exit on self-proxy
On Thu, Jul 8, 2021 at 3:35 PM Stefan Eissing wrote:
> >>
> >> I needed to make small tweaks, because no all previous close checked the
> >> return value and I got assert failures on your patch.
> >
> > Interesting, what error value was that? EBADF, EAGAIN?
>
> EBADF is what I saw.

Could you track somehow where the first close happened? Maybe:

static void close_socket_nonblocking_ex(apr_socket_t *csd, const char *from)
{
    apr_status_t rv;
    apr_os_sock_t fd = -1;

    apr_os_sock_get(&fd, csd);
    apr_socket_timeout_set(csd, 0);
    rv = apr_socket_close(csd);
    if (rv != APR_SUCCESS) {
        ap_log_error(APLOG_MARK, APLOG_ERR, rv, ap_server_conf, APLOGNO(00468)
                     "error closing socket %i/%pp from %s", (int)fd, csd, from);
        /*AP_DEBUG_ASSERT(0);*/
    }
    else {
        ap_log_error(APLOG_MARK, APLOG_DEBUG, rv, ap_server_conf,
                     "socket %i/%pp closed from %s", (int)fd, csd, from);
    }
}
#define close_socket_nonblocking(csd) close_socket_nonblocking_ex(csd, __FUNCTION__)

to see if close_socket_nonblocking() is called twice or something.
Re: child exit on self-proxy
I see many of those:

[Thu Jul 08 14:16:55.301670 2021] [mpm_event:error] [pid 81101:tid 123145411510272] (9)Bad file descriptor: AH00468: error closing socket -1/7ff9cf0086b0 from process_socket

which come from event.c#1263: rc = start_lingering_close_blocking(cs);
calling event.c#864: kill_connection_ex(cs, from);
and event.c#837: close_socket_nonblocking_ex(cs->pfd.desc.s, from);

Patch attached.

[Attachment: event_ka_no_lingerv3.diff]

> Am 08.07.2021 um 15:46 schrieb Yann Ylavic :
>
> On Thu, Jul 8, 2021 at 3:35 PM Stefan Eissing wrote:
>>>> I needed to make small tweaks, because no all previous close checked the
>>>> return value and I got assert failures on your patch.
>>>
>>> Interesting, what error value was that? EBADF, EAGAIN?
>>
>> EBADF is what I saw.
>
> Could you track somehow where the first close happened?
> Maybe:
> static void close_socket_nonblocking_ex(apr_socket_t *csd, const char *from)
> {
>     apr_status_t rv;
>     apr_os_sock_t fd = -1;
>     apr_os_sock_get(&fd, csd);
>     apr_socket_timeout_set(csd, 0);
>     rv = apr_socket_close(csd);
>     if (rv != APR_SUCCESS) {
>         ap_log_error(APLOG_MARK, APLOG_ERR, rv, ap_server_conf, APLOGNO(00468)
>                      "error closing socket %i/%pp from %s", (int)fd, csd, from);
>         /*AP_DEBUG_ASSERT(0);*/
>     }
>     else {
>         ap_log_error(APLOG_MARK, APLOG_DEBUG, rv, ap_server_conf,
>                      "socket %i/%pp closed from %s", (int)fd, csd, from);
>     }
> }
> #define close_socket_nonblocking(csd) close_socket_nonblocking_ex(csd, __FUNCTION__)
>
> to see if close_socket_nonblocking() is called twice or something.
Re: child exit on self-proxy
On Thu, Jul 8, 2021 at 4:21 PM Stefan Eissing wrote: > > I see many of those: > [Thu Jul 08 14:16:55.301670 2021] [mpm_event:error] [pid 81101:tid > 123145411510272] (9)Bad file descriptor: AH00468: error closing socket > -1/7ff9cf0086b0 from process_socket > > which come from event.c#1263: rc = start_lingering_close_blocking(cs); > calling event.c#864: kill_connection_ex(cs, from); > and event.c#837: close_socket_nonblocking_ex(cs->pfd.desc.s, from);

OK, so I think this could be addressed by the attached patch. This is the same as v0 plus some changes in ap_start_lingering_close() (from server/connection.c) to not close the socket on error. ap_start_lingering_close() did not close the socket consistently, so the caller had to call apr_socket_close() anyway and sometimes fail. The close on failure is moved to ap_lingering_close() which is the one that should care.

Still the AP_DEBUG_ASSERT(0) triggering with this?

Index: server/mpm/event/event.c
===================================================================
--- server/mpm/event/event.c	(revision 1891217)
+++ server/mpm/event/event.c	(working copy)
@@ -554,9 +554,10 @@ static APR_INLINE int connections_above_limit(int
         return 1;
 }
 
-static void abort_socket_nonblocking(apr_socket_t *csd)
+static void close_socket_nonblocking(apr_socket_t *csd)
 {
     apr_status_t rv;
+
     apr_socket_timeout_set(csd, 0);
     rv = apr_socket_close(csd);
     if (rv != APR_SUCCESS) {
@@ -573,7 +574,7 @@ static void close_worker_sockets(void)
         apr_socket_t *csd = worker_sockets[i];
         if (csd) {
             worker_sockets[i] = NULL;
-            abort_socket_nonblocking(csd);
+            close_socket_nonblocking(csd);
         }
     }
     for (;;) {
@@ -587,12 +588,16 @@ static void close_worker_sockets(void)
             continue;
         }
         cs->chain = NULL;
-        abort_socket_nonblocking(cs->pfd.desc.s);
+        close_socket_nonblocking(cs->pfd.desc.s);
     }
 }
 
-static void wakeup_listener(void)
+static void shutdown_listener(void)
 {
+    ap_log_error(APLOG_MARK, APLOG_DEBUG, 0, ap_server_conf,
+                 "shutting down listener (%s)",
+                 listener_may_exit ? "again" : "first");
+
     listener_may_exit = 1;
     disable_listensocks();
@@ -643,7 +648,7 @@ static void signal_threads(int mode)
     /* in case we weren't called from the listener thread, wake up the
      * listener thread */
-    wakeup_listener();
+    shutdown_listener();
 
     /* for ungraceful termination, let the workers exit now;
      * for graceful termination, the listener thread will notify the
@@ -779,6 +784,8 @@ static apr_status_t decrement_connection_count(voi
 {
     int is_last_connection;
     event_conn_state_t *cs = cs_;
+    ap_log_cerror(APLOG_MARK, APLOG_TRACE4, 0, cs->c,
+                  "cleanup connection in state %i", (int)cs->pub.state);
     switch (cs->pub.state) {
     case CONN_STATE_LINGER_NORMAL:
     case CONN_STATE_LINGER_SHORT:
@@ -817,6 +824,22 @@ static void notify_resume(event_conn_state_t *cs,
     ap_run_resume_connection(cs->c, cs->r);
 }
 
+static void kill_connection(event_conn_state_t *cs)
+{
+    ap_log_cerror(APLOG_MARK, APLOG_TRACE4, 0, cs->c,
+                  "kill connection in state %i", (int)cs->pub.state);
+    close_socket_nonblocking(cs->pfd.desc.s);
+    ap_queue_info_push_pool(worker_queue_info, cs->p);
+}
+
+static int kill_connection_nonblocking(event_conn_state_t *cs)
+{
+    kill_connection(cs);
+    if (dying)
+        ap_queue_interrupt_one(worker_queue);
+    return 1;
+}
+
 /*
  * Close our side of the connection, flushing data to the client first.
  * Pre-condition: cs is not in any timeout queue and not in the pollset,
@@ -827,12 +850,11 @@ static void notify_resume(event_conn_state_t *cs,
  */
 static int start_lingering_close_blocking(event_conn_state_t *cs)
 {
-    apr_socket_t *csd = cs->pfd.desc.s;
-
+    ap_log_cerror(APLOG_MARK, APLOG_TRACE4, 0, cs->c,
+                  "lingering close in state %i", (int)cs->pub.state);
     if (ap_start_lingering_close(cs->c)) {
         notify_suspend(cs);
-        apr_socket_close(csd);
-        ap_queue_info_push_pool(worker_queue_info, cs->p);
+        kill_connection(cs);
         return DONE;
     }
@@ -839,11 +861,11 @@ static int start_lingering_close_blocking(event_co
 #ifdef AP_DEBUG
     {
         apr_status_t rv;
-        rv = apr_socket_timeout_set(csd, 0);
+        rv = apr_socket_timeout_set(cs->pfd.desc.s, 0);
         AP_DEBUG_ASSERT(rv == APR_SUCCESS);
     }
 #else
-    apr_socket_timeout_set(csd, 0);
+    apr_socket_timeout_set(cs->pfd.desc.s, 0);
 #endif
 
     cs->queue_timestamp = apr_time_now();
@@ -874,9 +896,11 @@ static int start_lingering_close_blocking(event_co
  */
 static int start_lingering_close_nonblocking(event_conn_state_t *cs)
 {
-    event_conn_state_t *chain;
+    ap_log_cerror(APLOG_MARK, APLOG_TRACE4, 0, cs->c,
+                  "lingering close nonblocking in
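The patched close_worker_sockets() above clears each worker_sockets slot before closing the socket it held, so of two racing shutdown paths only the one that swapped the value out performs the close. A rough Python sketch of that swap-then-close idiom (FakeSocket and all names here are illustrative, not httpd code):

```python
import threading

class FakeSocket:
    """Stand-in for apr_socket_t: counts how often close() runs."""
    def __init__(self):
        self.close_calls = 0

    def close(self):
        self.close_calls += 1

_slot_lock = threading.Lock()

def close_all(slots):
    """Clear each slot before closing its socket, so a concurrent
    caller never closes the same socket twice."""
    for i in range(len(slots)):
        with _slot_lock:
            sock, slots[i] = slots[i], None  # take ownership of the slot
        if sock is not None:
            sock.close()

# Two racers over the same table close each socket exactly once.
socks = [FakeSocket() for _ in range(100)]
slots = list(socks)
threads = [threading.Thread(target=close_all, args=(slots,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The real code cannot use a lock on this path (it runs from a signal-ish shutdown context), which is why the later v6 patch adds an fd == -1 check as a second line of defense.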
Re: child exit on self-proxy
*stands in the pentagram of blood drawn by Yann* "I'll do thy bidding and report back!" > Am 08.07.2021 um 20:17 schrieb Yann Ylavic : > > On Thu, Jul 8, 2021 at 4:21 PM Stefan Eissing > wrote: >> >> I see many of those: >> [Thu Jul 08 14:16:55.301670 2021] [mpm_event:error] [pid 81101:tid >> 123145411510272] (9)Bad file descriptor: AH00468: error closing socket >> -1/7ff9cf0086b0 from process_socket >> >> which come from event.c#1263: rc = start_lingering_close_blocking(cs); >> calling event.c#864: kill_connection_ex(cs, from); >> and event.c#837: close_socket_nonblocking_ex(cs->pfd.desc.s, from); > > OK, so I think this could be addressed by the attached patch. > This is the same as v0 plus some changes in ap_start_lingering_close() > (from server/connection.c) to not close the socket on error. > ap_start_lingering_close() did not close the socket consistently, so > the caller had to call apr_socket_close() anyway and sometimes fail.. > The close on failure is moved to ap_lingering_close() which is the one > that should care. > > Still the AP_DEBUG_ASSERT(0) triggering with this? >
Re: child exit on self-proxy
> Am 08.07.2021 um 20:17 schrieb Yann Ylavic : > > On Thu, Jul 8, 2021 at 4:21 PM Stefan Eissing > wrote: >> >> I see many of those: >> [Thu Jul 08 14:16:55.301670 2021] [mpm_event:error] [pid 81101:tid >> 123145411510272] (9)Bad file descriptor: AH00468: error closing socket >> -1/7ff9cf0086b0 from process_socket >> >> which come from event.c#1263: rc = start_lingering_close_blocking(cs); >> calling event.c#864: kill_connection_ex(cs, from); >> and event.c#837: close_socket_nonblocking_ex(cs->pfd.desc.s, from); > > OK, so I think this could be addressed by the attached patch. > This is the same as v0 plus some changes in ap_start_lingering_close() > (from server/connection.c) to not close the socket on error. > ap_start_lingering_close() did not close the socket consistently, so > the caller had to call apr_socket_close() anyway and sometimes fail.. > The close on failure is moved to ap_lingering_close() which is the one > that should care. > > Still the AP_DEBUG_ASSERT(0) triggering with this? > No assertion failures or mpm_event:error logs with this one, however (if it was this patch or already on one of them before), the "pre_close" hook does not run during shutdown. This results in logs as: [Fri Jul 09 07:23:12.256163 2021] [http2:warn] [pid 97909:tid 123145335771136] [remote 127.0.0.1:59875] AH10020: h2_session(146,IDLE,0): session cleanup triggered by pool cleanup. this should have happened earlier already. [Fri Jul 09 07:23:12.256197 2021] [http2:warn] [pid 97909:tid 123145335771136] [remote 127.0.0.1:59875] AH03199: h2_session(146,IDLE,0): connection disappeared without proper goodbye, clients will be confused, should not happen I think omitting the last GOAWAY frame in case of a server restart, is not a tragedy. For the sake of an easier to maintain shutdown, I think we can live with that. Or, if you can think of a better interface to a pre_close hook, I could adjust. 
Maybe it would be easier for pre_close to get passed a bucket brigade where last bytes can be added to? In case we say we live with the incorrect h2 shutdown, I would amend the log level on a stopping server. - Stefan
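Stefan's brigade idea could look roughly like the following Python sketch, with entirely invented names (httpd's real hook machinery is C and quite different): each pre_close hook appends its farewell bytes to a shared "brigade", and the closing path flushes that brigade before the socket goes away.

```python
def h2_pre_close(brigade):
    """Hypothetical pre_close hook: queue a final GOAWAY-style farewell."""
    brigade.append(b"GOAWAY")

pre_close_hooks = [h2_pre_close]

def lingering_close(send):
    """Run every pre_close hook against a shared brigade, flush it,
    and only then consider the connection closed."""
    brigade = []
    for hook in pre_close_hooks:
        hook(brigade)
    for chunk in brigade:
        send(chunk)  # the last bytes leave before the socket is closed
    return b"".join(brigade)

sent = []
flushed = lingering_close(sent.append)
```

The point of the interface is ordering: hooks never touch the socket themselves, so the MPM keeps sole authority over when the fd actually closes.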
Re: child exit on self-proxy
Rarely seeing a close_socket_nonblocking + 83 (event.c:566) assertion failure. :-/ > Am 09.07.2021 um 09:38 schrieb Stefan Eissing : > > > >> Am 08.07.2021 um 20:17 schrieb Yann Ylavic : >> >> On Thu, Jul 8, 2021 at 4:21 PM Stefan Eissing >> wrote: >>> >>> I see many of those: >>> [Thu Jul 08 14:16:55.301670 2021] [mpm_event:error] [pid 81101:tid >>> 123145411510272] (9)Bad file descriptor: AH00468: error closing socket >>> -1/7ff9cf0086b0 from process_socket >>> >>> which come from event.c#1263: rc = start_lingering_close_blocking(cs); >>> calling event.c#864: kill_connection_ex(cs, from); >>> and event.c#837: close_socket_nonblocking_ex(cs->pfd.desc.s, from); >> >> OK, so I think this could be addressed by the attached patch. >> This is the same as v0 plus some changes in ap_start_lingering_close() >> (from server/connection.c) to not close the socket on error. >> ap_start_lingering_close() did not close the socket consistently, so >> the caller had to call apr_socket_close() anyway and sometimes fail.. >> The close on failure is moved to ap_lingering_close() which is the one >> that should care. >> >> Still the AP_DEBUG_ASSERT(0) triggering with this? >> > > No assertion failures or mpm_event:error logs with this one, however (if it > was this patch or already on one of them before), the "pre_close" hook does > not run during shutdown. This results in logs as: > > [Fri Jul 09 07:23:12.256163 2021] [http2:warn] [pid 97909:tid > 123145335771136] [remote 127.0.0.1:59875] AH10020: h2_session(146,IDLE,0): > session cleanup triggered by pool cleanup. this should have happened earlier > already. > [Fri Jul 09 07:23:12.256197 2021] [http2:warn] [pid 97909:tid > 123145335771136] [remote 127.0.0.1:59875] AH03199: h2_session(146,IDLE,0): > connection disappeared without proper goodbye, clients will be confused, > should not happen > > I think omitting the last GOAWAY frame in case of a server restart, is not a > tragedy. 
For the sake of an easier to maintain shutdown, I think we can live > with that. Or, if you can think of a better interface to a pre_close hook, I > could adjust. Maybe it would be easier for pre_close to get passed a bucket > brigade where last bytes can be added to? > > In case we say, we live with the incorrect h2 shutdown, I would amend the log > level on a stopping server. > > - Stefan >
Re: child exit on self-proxy
Coming back to this: I vote for applying this to trunk. On a fresh setup, I do not see any assert failures in my test. - Stefan > Am 08.07.2021 um 20:17 schrieb Yann Ylavic : > > On Thu, Jul 8, 2021 at 4:21 PM Stefan Eissing > wrote: >> >> I see many of those: >> [Thu Jul 08 14:16:55.301670 2021] [mpm_event:error] [pid 81101:tid >> 123145411510272] (9)Bad file descriptor: AH00468: error closing socket >> -1/7ff9cf0086b0 from process_socket >> >> which come from event.c#1263: rc = start_lingering_close_blocking(cs); >> calling event.c#864: kill_connection_ex(cs, from); >> and event.c#837: close_socket_nonblocking_ex(cs->pfd.desc.s, from); > > OK, so I think this could be addressed by the attached patch. > This is the same as v0 plus some changes in ap_start_lingering_close() > (from server/connection.c) to not close the socket on error. > ap_start_lingering_close() did not close the socket consistently, so > the caller had to call apr_socket_close() anyway and sometimes fail.. > The close on failure is moved to ap_lingering_close() which is the one > that should care. > > Still the AP_DEBUG_ASSERT(0) triggering with this? >
Re: child exit on self-proxy
Found one condition where the assertion failure happens. The test case I have does this:

    sock = socket.create_connection((host, int(env.https_port)))
    try:
        sock.settimeout(0.5)
        sock.recv(1024)
        assert False
    except Exception as ex:
        print(f"as expected: {ex}")
    sock.close()

Which is all that's needed to trigger the assertion. The server is not stopped.

The log for this thread only gives:
[Tue Jul 13 13:51:29.476747 2021] [mpm_event:trace4] [pid 31582:tid 123145480441856] event.c(858): [client 127.0.0.1:51768] lingering close in state 5
[Tue Jul 13 13:51:29.476769 2021] [mpm_event:trace4] [pid 31582:tid 123145480441856] event.c(833): [remote 127.0.0.1:51768] kill connection in state 5
[Tue Jul 13 13:51:29.476775 2021] [mpm_event:trace4] [pid 31582:tid 123145480441856] event.c(563): closing socket -1/7fa24681d0b0 from process_socket
[Tue Jul 13 13:51:29.476779 2021] [mpm_event:error] [pid 31582:tid 123145480441856] (9)Bad file descriptor: AH00468: error closing socket -1/7fa24681d0b0 from process_socket
[Tue Jul 13 13:51:29.476799 2021] [core:crit] [pid 31582:tid 123145480441856] AH00102: [Tue Jul 13 13:51:29 2021] file event.c, line 569, assertion "0" failed

I added a trace4 "closing socket" before the apr_socket_close(), grepping for that in the log shows:
[Tue Jul 13 13:51:28.467677 2021] [mpm_event:trace4] [pid 31582:tid 123145479905280] event.c(563): closing socket 15/7fa24681d0a0 from process_lingering_close
[Tue Jul 13 13:51:29.476775 2021] [mpm_event:trace4] [pid 31582:tid 123145480441856] event.c(563): closing socket -1/7fa24681d0b0 from process_socket
[Tue Jul 13 13:51:29.476779 2021] [mpm_event:error] [pid 31582:tid 123145480441856] (9)Bad file descriptor: AH00468: error closing socket -1/7fa24681d0b0 from process_socket
[Tue Jul 13 13:51:29.984132 2021] [mpm_event:trace4] [pid 31584:tid 123145470783488] event.c(563): closing socket -1/7fa24681f0b0 from process_socket
[Tue Jul 13 13:51:29.984138 2021] [mpm_event:error] [pid 31584:tid 123145470783488] (9)Bad
file descriptor: AH00468: error closing socket -1/7fa24681f0b0 from process_socket Thread 21 Crashed: 0 libsystem_kernel.dylib 0x7fff202e292e __pthread_kill + 10 1 libsystem_pthread.dylib 0x7fff203115bd pthread_kill + 263 2 libsystem_c.dylib 0x7fff202664ab __abort + 139 3 libsystem_c.dylib 0x7fff20266420 abort + 135 4 httpd 0x00010dbfdb72 ap_log_assert + 130 (log.c:1649) 5 mod_mpm_event.so0x00010e5154bf close_socket_nonblocking_ex + 223 (event.c:569) 6 mod_mpm_event.so0x00010e515158 kill_connection_ex + 67 (event.c:835) [inlined] 7 mod_mpm_event.so0x00010e515158 start_lingering_close_blocking_ex + 184 (event.c:862) 8 mod_mpm_event.so0x00010e514da9 process_socket + 1289 (event.c:1261) 9 mod_mpm_event.so0x00010e514769 worker_thread + 809 (event.c:2394) 10 libapr-1.0.dylib0x00010ddb9341 dummy_worker + 17 (thread.c:147) 11 libsystem_pthread.dylib 0x7fff203118fc _pthread_start + 224 12 libsystem_pthread.dylib 0x7fff2030d443 thread_start + 15 - Stefan > Am 13.07.2021 um 12:39 schrieb Stefan Eissing : > > Coming back to this: I vote for applying this to trunk. On a fresh setup, I > do not see and assert failures in my test. > > - Stefan > >> Am 08.07.2021 um 20:17 schrieb Yann Ylavic : >> >> On Thu, Jul 8, 2021 at 4:21 PM Stefan Eissing >> wrote: >>> >>> I see many of those: >>> [Thu Jul 08 14:16:55.301670 2021] [mpm_event:error] [pid 81101:tid >>> 123145411510272] (9)Bad file descriptor: AH00468: error closing socket >>> -1/7ff9cf0086b0 from process_socket >>> >>> which come from event.c#1263: rc = start_lingering_close_blocking(cs); >>> calling event.c#864: kill_connection_ex(cs, from); >>> and event.c#837: close_socket_nonblocking_ex(cs->pfd.desc.s, from); >> >> OK, so I think this could be addressed by the attached patch. >> This is the same as v0 plus some changes in ap_start_lingering_close() >> (from server/connection.c) to not close the socket on error. 
>> ap_start_lingering_close() did not close the socket consistently, so >> the caller had to call apr_socket_close() anyway and sometimes fail.. >> The close on failure is moved to ap_lingering_close() which is the one >> that should care. >> >> Still the AP_DEBUG_ASSERT(0) triggering with this? >> >
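For reference, the client behavior that triggers the failure above is easy to stage outside httpd: connect, wait briefly for data that never comes, then close. A self-contained variant against a throwaway local server (assumed setup; the real test in this thread talks to httpd on env.https_port):

```python
import socket
import threading

# Throwaway server: accept one connection, keep it open, send nothing.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]

held = []  # keep the accepted connection alive so no FIN is sent

def serve():
    conn, _ = srv.accept()
    held.append(conn)

threading.Thread(target=serve, daemon=True).start()

sock = socket.create_connection(("127.0.0.1", port))
timed_out = False
try:
    sock.settimeout(0.5)
    sock.recv(1024)  # nothing ever arrives
except socket.timeout:
    timed_out = True  # the 0.5s timeout fires, as in the original test
sock.close()
```

The server side here deliberately never responds; on the httpd side this leaves the connection in a state where the worker goes through lingering close, which is the code path under discussion.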
Re: child exit on self-proxy
Wait a minute...need to confirm that I am not lying and used an older patch version... > Am 13.07.2021 um 16:01 schrieb Stefan Eissing : > > Found one condition where the assertion failure happens. The test case I have > does this: > > sock = socket.create_connection((host, int(env.https_port))) > try: >sock.settimeout(0.5) >sock.recv(1024) >assert False > except Exception as ex: >print(f"as expected: {ex}") > sock.close() > > Which is all that's needed to trigger the assertion. The server is not > stopped. > > The log for this thread only gives: > [Tue Jul 13 13:51:29.476747 2021] [mpm_event:trace4] [pid 31582:tid > 123145480441856] event.c(858): [client 127.0.0.1:51768] lingering close in > state 5 > [Tue Jul 13 13:51:29.476769 2021] [mpm_event:trace4] [pid 31582:tid > 123145480441856] event.c(833): [remote 127.0.0.1:51768] kill connection in > state 5 > [Tue Jul 13 13:51:29.476775 2021] [mpm_event:trace4] [pid 31582:tid > 123145480441856] event.c(563): closing socket -1/7fa24681d0b0 from > process_socket > [Tue Jul 13 13:51:29.476779 2021] [mpm_event:error] [pid 31582:tid > 123145480441856] (9)Bad file descriptor: AH00468: error closing socket > -1/7fa24681d0b0 from process_socket > [Tue Jul 13 13:51:29.476799 2021] [core:crit] [pid 31582:tid 123145480441856] > AH00102: [Tue Jul 13 13:51:29 2021] file event.c, line 569, assertion "0" > failed > > I added a trace4 "closing socket" before the apr_socket_close(), grepping for > that in the log shows: > [Tue Jul 13 13:51:28.467677 2021] [mpm_event:trace4] [pid 31582:tid > 123145479905280] event.c(563): closing socket 15/7fa24681d0a0 from > process_lingering_close > [Tue Jul 13 13:51:29.476775 2021] [mpm_event:trace4] [pid 31582:tid > 123145480441856] event.c(563): closing socket -1/7fa24681d0b0 from > process_socket > [Tue Jul 13 13:51:29.476779 2021] [mpm_event:error] [pid 31582:tid > 123145480441856] (9)Bad file descriptor: AH00468: error closing socket > -1/7fa24681d0b0 from process_socket > [Tue Jul 13 
13:51:29.984132 2021] [mpm_event:trace4] [pid 31584:tid > 123145470783488] event.c(563): closing socket -1/7fa24681f0b0 from > process_socket > [Tue Jul 13 13:51:29.984138 2021] [mpm_event:error] [pid 31584:tid > 123145470783488] (9)Bad file descriptor: AH00468: error closing socket > -1/7fa24681f0b0 from process_socket > > > > Thread 21 Crashed: > 0 libsystem_kernel.dylib0x7fff202e292e __pthread_kill + 10 > 1 libsystem_pthread.dylib 0x7fff203115bd pthread_kill + 263 > 2 libsystem_c.dylib 0x7fff202664ab __abort + 139 > 3 libsystem_c.dylib 0x7fff20266420 abort + 135 > 4 httpd 0x00010dbfdb72 ap_log_assert + 130 > (log.c:1649) > 5 mod_mpm_event.so 0x00010e5154bf > close_socket_nonblocking_ex + 223 (event.c:569) > 6 mod_mpm_event.so 0x00010e515158 kill_connection_ex + > 67 (event.c:835) [inlined] > 7 mod_mpm_event.so 0x00010e515158 > start_lingering_close_blocking_ex + 184 (event.c:862) > 8 mod_mpm_event.so 0x00010e514da9 process_socket + > 1289 (event.c:1261) > 9 mod_mpm_event.so 0x00010e514769 worker_thread + 809 > (event.c:2394) > 10 libapr-1.0.dylib 0x00010ddb9341 dummy_worker + 17 > (thread.c:147) > 11 libsystem_pthread.dylib 0x7fff203118fc _pthread_start + 224 > 12 libsystem_pthread.dylib 0x7fff2030d443 thread_start + 15 > > > > - Stefan > >> Am 13.07.2021 um 12:39 schrieb Stefan Eissing : >> >> Coming back to this: I vote for applying this to trunk. On a fresh setup, I >> do not see and assert failures in my test. 
>> >> - Stefan >> >>> Am 08.07.2021 um 20:17 schrieb Yann Ylavic : >>> >>> On Thu, Jul 8, 2021 at 4:21 PM Stefan Eissing >>> wrote: I see many of those: [Thu Jul 08 14:16:55.301670 2021] [mpm_event:error] [pid 81101:tid 123145411510272] (9)Bad file descriptor: AH00468: error closing socket -1/7ff9cf0086b0 from process_socket which come from event.c#1263: rc = start_lingering_close_blocking(cs); calling event.c#864: kill_connection_ex(cs, from); and event.c#837: close_socket_nonblocking_ex(cs->pfd.desc.s, from); >>> >>> OK, so I think this could be addressed by the attached patch. >>> This is the same as v0 plus some changes in ap_start_lingering_close() >>> (from server/connection.c) to not close the socket on error. >>> ap_start_lingering_close() did not close the socket consistently, so >>> the caller had to call apr_socket_close() anyway and sometimes fail.. >>> The close on failure is moved to ap_lingering_close() which is the one >>> that should care. >>> >>> Still the AP_DEBUG_ASSERT(0) triggering with this? >>> >> >
Re: child exit on self-proxy
In Yann's v3 of the patch, it triggers only when I stop the server while the test case is ongoing, Thread 2 Crashed: 0 libsystem_kernel.dylib 0x7fff202e292e __pthread_kill + 10 1 libsystem_pthread.dylib 0x7fff203115bd pthread_kill + 263 2 libsystem_c.dylib 0x7fff202664ab __abort + 139 3 libsystem_c.dylib 0x7fff20266420 abort + 135 4 httpd 0x00010625bb72 ap_log_assert + 130 (log.c:1649) 5 mod_mpm_event.so0x000106b73549 close_socket_nonblocking + 217 (event.c:569) 6 mod_mpm_event.so0x000106b731f2 kill_connection + 64 (event.c:834) [inlined] 7 mod_mpm_event.so0x000106b731f2 start_lingering_close_blocking + 178 (event.c:860) 8 mod_mpm_event.so0x000106b72e52 process_socket + 1282 (event.c:1258) 9 mod_mpm_event.so0x000106b72819 worker_thread + 809 (event.c:2390) 10 libapr-1.0.dylib0x000106412341 dummy_worker + 17 (thread.c:147) 11 libsystem_pthread.dylib 0x7fff203118fc _pthread_start + 224 12 libsystem_pthread.dylib 0x7fff2030d443 thread_start + 15 [Tue Jul 13 14:12:18.800125 2021] [mpm_event:trace4] [pid 47786:tid 123145484374016] event.c(563): closing socket 15/7f95ba07b4a0 [Tue Jul 13 14:12:19.551039 2021] [mpm_event:trace4] [pid 47787:tid 12314548824] event.c(1322): closing listeners (connection_count=0) [Tue Jul 13 14:12:19.551146 2021] [mpm_event:trace4] [pid 47786:tid 123145497251840] event.c(1322): closing listeners (connection_count=1) [Tue Jul 13 14:12:19.551176 2021] [mpm_event:trace4] [pid 47788:tid 12314548824] event.c(1322): closing listeners (connection_count=0) [Tue Jul 13 14:12:19.551209 2021] [mpm_event:trace4] [pid 47786:tid 4476272128] event.c(563): closing socket 15/7f95ba07b4b0 [Tue Jul 13 14:12:19.552305 2021] [mpm_event:trace4] [pid 47792:tid 12314548824] event.c(1322): closing listeners (connection_count=0) [Tue Jul 13 14:12:19.805283 2021] [mpm_event:trace4] [pid 47786:tid 123145484910592] event.c(563): closing socket -1/7f95ba07b4b0 [Tue Jul 13 14:12:19.805289 2021] [mpm_event:error] [pid 47786:tid 123145484910592] (9)Bad file descriptor: 
AH00468: error closing socket -1/7f95ba07b4b0

You can see that socket 7f95ba07b4b0 is closed twice. I attach my log and v5 patch (which is Yann's plus some extra log around closing).

[Attachment: error_log]
[Attachment: event_ka_no_lingerv5.diff]

> Am 13.07.2021 um 16:03 schrieb Stefan Eissing : > > Wait a minute...need to confirm that I am not lying and used an older patch > version... > >> Am 13.07.2021 um 16:01 schrieb Stefan Eissing : >> >> Found one condition where the assertion failure happens. The test case I >> have does this: >> >> sock = socket.create_connection((host, int(env.https_port))) >> try: >> sock.settimeout(0.5) >> sock.recv(1024) >> assert False >> except Exception as ex: >> print(f"as expected: {ex}") >> sock.close() >> >> Which is all that's needed to trigger the assertion. The server is not >> stopped. >> >> The log for this thread only gives: >> [Tue Jul 13 13:51:29.476747 2021] [mpm_event:trace4] [pid 31582:tid >> 123145480441856] event.c(858): [client 127.0.0.1:51768] lingering close in >> state 5 >> [Tue Jul 13 13:51:29.476769 2021] [mpm_event:trace4] [pid 31582:tid >> 123145480441856] event.c(833): [remote 127.0.0.1:51768] kill connection in >> state 5 >> [Tue Jul 13 13:51:29.476775 2021] [mpm_event:trace4] [pid 31582:tid >> 123145480441856] event.c(563): closing socket -1/7fa24681d0b0 from >> process_socket >> [Tue Jul 13 13:51:29.476779 2021] [mpm_event:error] [pid 31582:tid >> 123145480441856] (9)Bad file descriptor: AH00468: error closing socket >> -1/7fa24681d0b0 from process_socket >> [Tue Jul 13 13:51:29.476799 2021] [core:crit] [pid 31582:tid >> 123145480441856] AH00102: [Tue Jul 13 13:51:29 2021] file event.c, line 569, >> assertion "0" failed >> >> I added a trace4 "closing socket" before the apr_socket_close(), grepping >> for that in the log shows: >> [Tue Jul 13 13:51:28.467677 2021] [mpm_event:trace4] [pid 31582:tid >> 123145479905280] event.c(563): closing socket 15/7fa24681d0a0 from >>
process_lingering_close >> [Tue Jul 13 13:51:29.476775 2021] [mpm_event:trace4] [pid 31582:tid >> 123145480441856] event.c(563): closing socket -1/7fa24681d0b0 from >> process_socket >> [Tue Jul 13 13:51:29.476779 2021] [mpm_event:error] [pid 31582:tid >> 123145480441856] (9)Bad file descriptor: AH00468: error closing socket >> -1/7fa24681d0b0 from process_socket >> [Tue Jul 13 13:51:29.984132 2021] [mpm_event:trace4] [pid 31584:tid >> 123145470783488] event.c(563): closing socket -1/7fa24681f0b0 from >> process_socket >> [Tue Jul 13 13:51:29.984138 2021] [mpm_event:error] [pid 31584:tid >> 123145470783488] (9)Bad file descr
Re: child exit on self-proxy
On Tue, Jul 13, 2021 at 4:16 PM Stefan Eissing wrote: > > In Yann's v3 of the patch, it triggers only when I stop the server while the > test case is ongoing, OK thanks, I have a new v6 which should address this and also still call pre_close hooks in any case. The patch is a bit "verbose" (some cosmetics/comments changes that helped my workflow, sorry about that) but in the end I think it's simpler to maintain (as Eric asked?) which may be worth it.. > > [Tue Jul 13 14:12:18.800125 2021] [mpm_event:trace4] [pid 47786:tid > 123145484374016] event.c(563): closing socket 15/7f95ba07b4a0 > [Tue Jul 13 14:12:19.551039 2021] [mpm_event:trace4] [pid 47787:tid > 12314548824] event.c(1322): closing listeners (connection_count=0) > [Tue Jul 13 14:12:19.551146 2021] [mpm_event:trace4] [pid 47786:tid > 123145497251840] event.c(1322): closing listeners (connection_count=1) > [Tue Jul 13 14:12:19.551176 2021] [mpm_event:trace4] [pid 47788:tid > 12314548824] event.c(1322): closing listeners (connection_count=0) > [Tue Jul 13 14:12:19.551209 2021] [mpm_event:trace4] [pid 47786:tid > 4476272128] event.c(563): closing socket 15/7f95ba07b4b0 > [Tue Jul 13 14:12:19.552305 2021] [mpm_event:trace4] [pid 47792:tid > 12314548824] event.c(1322): closing listeners (connection_count=0) > [Tue Jul 13 14:12:19.805283 2021] [mpm_event:trace4] [pid 47786:tid > 123145484910592] event.c(563): closing socket -1/7f95ba07b4b0 > [Tue Jul 13 14:12:19.805289 2021] [mpm_event:error] [pid 47786:tid > 123145484910592] (9)Bad file descriptor: AH00468: error closing socket > -1/7f95ba07b4b0 > > You can see that socket 7f95ba07b4b0 is closed twice. Indeed, where does the first close come from (the caller)? If you see this with the new patch (LogLevel mpm_event:trace8), can you confirm that it's from close_worker_sockets() (i.e. for ungraceful restart) only? Cheers; Yann. 
Index: server/mpm/event/event.c
===================================================================
--- server/mpm/event/event.c	(revision 1891501)
+++ server/mpm/event/event.c	(working copy)
@@ -254,6 +254,8 @@ struct event_conn_state_t {
     conn_state_t pub;
     /** chaining in defer_linger_chain */
     struct event_conn_state_t *chain;
+    /** Is lingering close from defer_lingering_close()? */
+    int deferred_linger;
 };
 
 APR_RING_HEAD(timeout_head_t, event_conn_state_t);
@@ -289,9 +291,10 @@ static volatile apr_time_t queues_next_expiry;
  * Macros for accessing struct timeout_queue.
  * For TO_QUEUE_APPEND and TO_QUEUE_REMOVE, timeout_mutex must be held.
  */
-static void TO_QUEUE_APPEND(struct timeout_queue *q, event_conn_state_t *el)
+static void TO_QUEUE_APPEND(struct timeout_queue *q, event_conn_state_t *el,
+                            int do_wakeup)
 {
-    apr_time_t q_expiry;
+    apr_time_t elem_expiry;
     apr_time_t next_expiry;
 
     APR_RING_INSERT_TAIL(&q->head, el, event_conn_state_t, timeout_list);
@@ -298,16 +301,16 @@ static volatile apr_time_t queues_next_expiry;
     ++*q->total;
     ++q->count;
 
-    /* Cheaply update the overall queues' next expiry according to the
-     * first entry of this queue (oldest), if necessary.
+    /* Cheaply update the global queues_next_expiry with the one of the
+     * first entry of this queue (oldest) if it expires before.
      */
     el = APR_RING_FIRST(&q->head);
-    q_expiry = el->queue_timestamp + q->timeout;
+    elem_expiry = el->queue_timestamp + q->timeout;
     next_expiry = queues_next_expiry;
-    if (!next_expiry || next_expiry > q_expiry + TIMEOUT_FUDGE_FACTOR) {
-        queues_next_expiry = q_expiry;
+    if (!next_expiry || next_expiry > elem_expiry + TIMEOUT_FUDGE_FACTOR) {
+        queues_next_expiry = elem_expiry;
         /* Unblock the poll()ing listener for it to update its timeout.
          */
-        if (listener_is_wakeable) {
+        if (do_wakeup && listener_is_wakeable) {
             apr_pollset_wakeup(event_pollset);
         }
     }
@@ -554,9 +557,20 @@ static APR_INLINE int connections_above_limit(int
         return 1;
 }
 
-static void abort_socket_nonblocking(apr_socket_t *csd)
+static void close_socket_nonblocking_(apr_socket_t *csd,
+                                      const char *from, int line)
 {
     apr_status_t rv;
+    apr_os_sock_t fd = -1;
+
+    /* close_worker_sockets() may have closed it already */
+    rv = apr_os_sock_get(&fd, csd);
+    ap_log_error(APLOG_MARK, APLOG_TRACE8, 0, ap_server_conf,
+                 "closing socket %i/%pp from %s:%i", (int)fd, csd, from, line);
+    if (rv == APR_SUCCESS && fd == -1) {
+        return;
+    }
+
     apr_socket_timeout_set(csd, 0);
     rv = apr_socket_close(csd);
     if (rv != APR_SUCCESS) {
@@ -565,6 +579,8 @@ static APR_INLINE int connections_above_limit(int
         AP_DEBUG_ASSERT(0);
     }
 }
+#define close_socket_nonblocking(csd) \
+    close_socket_nonblocking_(csd, __FUNCTION__, __LINE__)
 
 static void close_worker_sockets(void)
 {
@@ -573,26 +589,16 @@ static void close_worker_s
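The hunk above avoids the double close by asking APR for the OS-level fd (apr_os_sock_get()) and returning early when it is already -1. Python's socket objects expose the same signal, fileno() reporting -1 once the socket is closed, so the guard can be sketched as (illustrative only, not httpd code):

```python
import socket

def close_socket_once(sock):
    """Close sock unless another path already did: a rough mirror of the
    patch's apr_os_sock_get() check, where fd == -1 means 'already closed'.
    Returns True if this call performed the close."""
    if sock.fileno() == -1:  # CPython reports -1 for a closed socket
        return False
    sock.close()
    return True

a, b = socket.socketpair()
first = close_socket_once(a)   # actually closes the socket
second = close_socket_once(a)  # sees fd == -1 and skips
b.close()
```

This only suppresses the benign close-after-close case; the from/line logging in the patch is what lets the thread pinpoint which caller closed first.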
Re: child exit on self-proxy
On Wed, Jul 14, 2021 at 3:42 PM Yann Ylavic wrote: > > On Tue, Jul 13, 2021 at 4:16 PM Stefan Eissing > wrote: > > > > In Yann's v3 of the patch, it triggers only when I stop the server while > > the test case is ongoing, > > OK thanks, I have a new v6 which should address this and also still > call pre_close hooks in any case. > The patch is a bit "verbose" (some cosmetics/comments changes that > helped my workflow, sorry about that) but in the end I think it's > simpler to maintain (as Eric asked?) which may be worth it.. I have staged this a bit in https://github.com/apache/httpd/pull/208
Re: child exit on self-proxy
> Am 14.07.2021 um 23:47 schrieb Yann Ylavic : > > On Wed, Jul 14, 2021 at 3:42 PM Yann Ylavic wrote: >> >> On Tue, Jul 13, 2021 at 4:16 PM Stefan Eissing >> wrote: >>> >>> In Yann's v3 of the patch, it triggers only when I stop the server while >>> the test case is ongoing, >> >> OK thanks, I have a new v6 which should address this and also still >> call pre_close hooks in any case. >> The patch is a bit "verbose" (some cosmetics/comments changes that >> helped my workflow, sorry about that) but in the end I think it's >> simpler to maintain (as Eric asked?) which may be worth it.. > > I have staged this a bit in https://github.com/apache/httpd/pull/208 Works flawless in my test scenarios! \o/ - Stefan
Re: child exit on self-proxy
On 14 Jul 2021, at 23:47, Yann Ylavic wrote: > On Wed, Jul 14, 2021 at 3:42 PM Yann Ylavic wrote: >> >> On Tue, Jul 13, 2021 at 4:16 PM Stefan Eissing >> wrote: >>> >>> In Yann's v3 of the patch, it triggers only when I stop the server while >>> the test case is ongoing, >> >> OK thanks, I have a new v6 which should address this and also still >> call pre_close hooks in any case. >> The patch is a bit "verbose" (some cosmetics/comments changes that >> helped my workflow, sorry about that) but in the end I think it's >> simpler to maintain (as Eric asked?) which may be worth it.. > > I have staged this a bit in https://github.com/apache/httpd/pull/208 Lovely - well done - this fixes things on our end. Dw