Re: child exit on self-proxy

2021-07-07 Thread Ruediger Pluem



On 7/7/21 11:45 AM, Stefan Eissing wrote:
> In my h2 test suite, I do a setup where I use proxy configs against the 
> server itself. We seem to have a problem performing a clean child exit with 
> that. Test in 2.4.48 and trunk:
> - run tests with several graceful restarts
> - no proxied request, clean exit
> - with proxied requests
>   AH00045: child process 53682 still did not exit, sending a SIGTERM
>   AH00045: child process 53682 still did not exit, sending a SIGTERM
>   [often stops here, sometimes]
>   ...
>   AH00046: child process 53682 still did not exit, sending a SIGKILL
> 
> Question:
> - is such a test setup doomed to fail in general?
> - are we sure that we cannot encounter such states in "normal" setups?
> 
> If someone wants at log at whatever LogLevels, let me know. It's seems highly 
> reproducible.

Do you have stack traces where these processes are hanging and a simple config 
that causes this?

Regards

Rüdiger



Re: child exit on self-proxy

2021-07-07 Thread Stefan Eissing



> Am 07.07.2021 um 11:49 schrieb Ruediger Pluem :
> 
> 
> 
> On 7/7/21 11:45 AM, Stefan Eissing wrote:
>> In my h2 test suite, I do a setup where I use proxy configs against the 
>> server itself. We seem to have a problem performing a clean child exit with 
>> that. Test in 2.4.48 and trunk:
>> - run tests with several graceful restarts
>> - no proxied request, clean exit
>> - with proxied requests
>>  AH00045: child process 53682 still did not exit, sending a SIGTERM
>>  AH00045: child process 53682 still did not exit, sending a SIGTERM
>>  [often stops here, sometimes]
>>  ...
>>  AH00046: child process 53682 still did not exit, sending a SIGKILL
>> 
>> Question:
>> - is such a test setup doomed to fail in general?
>> - are we sure that we cannot encounter such states in "normal" setups?
>> 
>> If someone wants at log at whatever LogLevels, let me know. It's seems 
>> highly reproducible.
> 
> Do you have stack traces where these processes are hanging and a simple 
> config that causes this?

I added a TRACE1 log in event.c before/after join_workers (line 2921) and see:

[Wed Jul 07 10:06:03.144044 2021] [mpm_event:trace1] [pid 72886:tid 4493635072] 
event.c(2921): graceful termination received, join workers
[Wed Jul 07 10:06:03.144213 2021] [mpm_event:trace1] [pid 72886:tid 
123145435672576] event.c(1799): All workers are busy or dying, will close 1 
keep-alive connections
[Wed Jul 07 10:06:08.079690 2021] [mpm_event:trace1] [pid 72886:tid 
123145435672576] event.c(1799): All workers are busy or dying, will close 0 
keep-alive connections
[Wed Jul 07 10:06:10.78 2021] [core:warn] [pid 72813:tid 4493635072] 
AH00045: child process 72886 still did not exit, sending a SIGTERM
[Wed Jul 07 10:06:12.789335 2021] [core:warn] [pid 72813:tid 4493635072] 
AH00045: child process 72886 still did not exit, sending a SIGTERM
[Wed Jul 07 10:06:14.791281 2021] [core:warn] [pid 72813:tid 4493635072] 
AH00045: child process 72886 still did not exit, sending a SIGTERM
[Wed Jul 07 10:06:16.792983 2021] [core:error] [pid 72813:tid 4493635072] 
AH00046: child process 72886 still did not exit, sending a SIGKILL

So, I assume the keep-alive connection is the mod_proxy_http connection to the 
sever itself. Since the join_workers() never returns, there seems to be a 
thread not finishing.

On another run, I got a stacktrace of the child:

Call graph:
264 Thread_36766542   DispatchQueue_1: com.apple.main-thread  (serial)
+ 264 start  (in libdyld.dylib) + 1  [0x7fff2032cf5d]
+   264 main  (in httpd) + 2278  [0x104729b86]  main.c:862
+ 264 ap_run_mpm  (in httpd) + 75  [0x10473917b]  mpm_common.c:100
+   264 event_run  (in mod_mpm_event.so) + 2994  [0x10502ae62]  
event.c:3398
+ 264 make_child  (in mod_mpm_event.so) + 436  [0x10502bbc4]  
event.c:2997
+   264 child_main  (in mod_mpm_event.so) + 1734  [0x10502c2e6]  
event.c:2924
+ 264 join_workers  (in mod_mpm_event.so) + 386  [0x10502cc72]  
event.c:2717
+   264 apr_thread_join  (in libapr-2.0.dylib) + 44  
[0x1048b347c]  thread.c:256
+ 264 _pthread_join  (in libsystem_pthread.dylib) + 362  
[0x7fff20312f60]
+   264 __ulock_wait  (in libsystem_kernel.dylib) + 10  
[0x7fff202dd9ee]
264 Thread_36766548
+ 264 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff2030d443]
+   264 _pthread_start  (in libsystem_pthread.dylib) + 224  [0x7fff203118fc]
+ 264 dummy_worker  (in libapr-2.0.dylib) + 30  [0x1048b33ee]  
thread.c:148
+   264 slot_run  (in mod_http2.so) + 189  [0x104f99a8d]  
h2_workers.c:260
+ 264 _pthread_cond_wait  (in libsystem_pthread.dylib) + 1298  
[0x7fff20311e49]
+   264 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  
[0x7fff202decde]
264 Thread_36766549
+ 264 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff2030d443]
+   264 _pthread_start  (in libsystem_pthread.dylib) + 224  [0x7fff203118fc]
+ 264 dummy_worker  (in libapr-2.0.dylib) + 30  [0x1048b33ee]  
thread.c:148
+   264 slot_run  (in mod_http2.so) + 189  [0x104f99a8d]  
h2_workers.c:260
+ 264 _pthread_cond_wait  (in libsystem_pthread.dylib) + 1298  
[0x7fff20311e49]
+   264 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  
[0x7fff202decde]
264 Thread_36766550
+ 264 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff2030d443]
+   264 _pthread_start  (in libsystem_pthread.dylib) + 224  [0x7fff203118fc]
+ 264 dummy_worker  (in libapr-2.0.dylib) + 30  [0x1048b33ee]  
thread.c:148
+   264 slot_run  (in mod_http2.so) + 189  [0x104f99a8d]  
h2_workers.c:260
+ 264 _pthread_cond_wait  (in libsystem_pthread.dylib) + 1298  
[0x7fff20311e49]
+   264 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  
[0x7fff202decde]
264 Thread_36766551
+ 264 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff2030d443]
+   

Re: child exit on self-proxy

2021-07-07 Thread Yann Ylavic
On Wed, Jul 7, 2021 at 12:16 PM Stefan Eissing
 wrote:
>
> I added a TRACE1 log in event.c before/after join_workers (line 2921) and see:
>
> [Wed Jul 07 10:06:03.144044 2021] [mpm_event:trace1] [pid 72886:tid 
> 4493635072] event.c(2921): graceful termination received, join workers
> [Wed Jul 07 10:06:03.144213 2021] [mpm_event:trace1] [pid 72886:tid 
> 123145435672576] event.c(1799): All workers are busy or dying, will close 1 
> keep-alive connections
> [Wed Jul 07 10:06:08.079690 2021] [mpm_event:trace1] [pid 72886:tid 
> 123145435672576] event.c(1799): All workers are busy or dying, will close 0 
> keep-alive connections
> [Wed Jul 07 10:06:10.78 2021] [core:warn] [pid 72813:tid 4493635072] 
> AH00045: child process 72886 still did not exit, sending a SIGTERM
> [Wed Jul 07 10:06:12.789335 2021] [core:warn] [pid 72813:tid 4493635072] 
> AH00045: child process 72886 still did not exit, sending a SIGTERM
> [Wed Jul 07 10:06:14.791281 2021] [core:warn] [pid 72813:tid 4493635072] 
> AH00045: child process 72886 still did not exit, sending a SIGTERM
> [Wed Jul 07 10:06:16.792983 2021] [core:error] [pid 72813:tid 4493635072] 
> AH00046: child process 72886 still did not exit, sending a SIGKILL
>
> So, I assume the keep-alive connection is the mod_proxy_http connection to 
> the sever itself. Since the join_workers() never returns, there seems to be a 
> thread not finishing.

No, MPM event doesn't handle mod_proxy kept alive connections, those
are client connections.

>
> On another run, I got a stacktrace of the child:
>
> Call graph:
> 264 Thread_36766542   DispatchQueue_1: com.apple.main-thread  (serial)
> + 264 start  (in libdyld.dylib) + 1  [0x7fff2032cf5d]
> +   264 main  (in httpd) + 2278  [0x104729b86]  main.c:862
> + 264 ap_run_mpm  (in httpd) + 75  [0x10473917b]  mpm_common.c:100
> +   264 event_run  (in mod_mpm_event.so) + 2994  [0x10502ae62]  
> event.c:3398
> + 264 make_child  (in mod_mpm_event.so) + 436  [0x10502bbc4]  
> event.c:2997
> +   264 child_main  (in mod_mpm_event.so) + 1734  [0x10502c2e6]  
> event.c:2924
> + 264 join_workers  (in mod_mpm_event.so) + 386  
> [0x10502cc72]  event.c:2717
> +   264 apr_thread_join  (in libapr-2.0.dylib) + 44  
> [0x1048b347c]  thread.c:256
> + 264 _pthread_join  (in libsystem_pthread.dylib) + 362  
> [0x7fff20312f60]
> +   264 __ulock_wait  (in libsystem_kernel.dylib) + 10  
> [0x7fff202dd9ee]

OK, that's where the child waits for worker threads.

> 264 Thread_36766548
> + 264 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff2030d443]
> +   264 _pthread_start  (in libsystem_pthread.dylib) + 224  
> [0x7fff203118fc]
> + 264 dummy_worker  (in libapr-2.0.dylib) + 30  [0x1048b33ee]  
> thread.c:148
> +   264 slot_run  (in mod_http2.so) + 189  [0x104f99a8d]  
> h2_workers.c:260
> + 264 _pthread_cond_wait  (in libsystem_pthread.dylib) + 1298  
> [0x7fff20311e49]
> +   264 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  
> [0x7fff202decde]
> 264 Thread_36766549
> + 264 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff2030d443]
> +   264 _pthread_start  (in libsystem_pthread.dylib) + 224  
> [0x7fff203118fc]
> + 264 dummy_worker  (in libapr-2.0.dylib) + 30  [0x1048b33ee]  
> thread.c:148
> +   264 slot_run  (in mod_http2.so) + 189  [0x104f99a8d]  
> h2_workers.c:260
> + 264 _pthread_cond_wait  (in libsystem_pthread.dylib) + 1298  
> [0x7fff20311e49]
> +   264 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  
> [0x7fff202decde]
> 264 Thread_36766550
> + 264 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff2030d443]
> +   264 _pthread_start  (in libsystem_pthread.dylib) + 224  
> [0x7fff203118fc]
> + 264 dummy_worker  (in libapr-2.0.dylib) + 30  [0x1048b33ee]  
> thread.c:148
> +   264 slot_run  (in mod_http2.so) + 189  [0x104f99a8d]  
> h2_workers.c:260
> + 264 _pthread_cond_wait  (in libsystem_pthread.dylib) + 1298  
> [0x7fff20311e49]
> +   264 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  
> [0x7fff202decde]
> 264 Thread_36766551
> + 264 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff2030d443]
> +   264 _pthread_start  (in libsystem_pthread.dylib) + 224  
> [0x7fff203118fc]
> + 264 dummy_worker  (in libapr-2.0.dylib) + 30  [0x1048b33ee]  
> thread.c:148
> +   264 slot_run  (in mod_http2.so) + 189  [0x104f99a8d]  
> h2_workers.c:260
> + 264 _pthread_cond_wait  (in libsystem_pthread.dylib) + 1298  
> [0x7fff20311e49]
> +   264 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  
> [0x7fff202decde]

The above mod_http2 threads are not accounted for by the MPM I suppose
(they should be stopped by the future child_exit hook, but still the
MPM will only care about its own worker threads, i.e. thos

Re: child exit on self-proxy

2021-07-07 Thread Stefan Eissing
Was busy on other things. Will try to get answers...

> Am 07.07.2021 um 13:51 schrieb Yann Ylavic :
> 
> On Wed, Jul 7, 2021 at 12:16 PM Stefan Eissing
>  wrote:
>> 
>> I added a TRACE1 log in event.c before/after join_workers (line 2921) and 
>> see:
>> 
>> [Wed Jul 07 10:06:03.144044 2021] [mpm_event:trace1] [pid 72886:tid 
>> 4493635072] event.c(2921): graceful termination received, join workers
>> [Wed Jul 07 10:06:03.144213 2021] [mpm_event:trace1] [pid 72886:tid 
>> 123145435672576] event.c(1799): All workers are busy or dying, will close 1 
>> keep-alive connections
>> [Wed Jul 07 10:06:08.079690 2021] [mpm_event:trace1] [pid 72886:tid 
>> 123145435672576] event.c(1799): All workers are busy or dying, will close 0 
>> keep-alive connections
>> [Wed Jul 07 10:06:10.78 2021] [core:warn] [pid 72813:tid 4493635072] 
>> AH00045: child process 72886 still did not exit, sending a SIGTERM
>> [Wed Jul 07 10:06:12.789335 2021] [core:warn] [pid 72813:tid 4493635072] 
>> AH00045: child process 72886 still did not exit, sending a SIGTERM
>> [Wed Jul 07 10:06:14.791281 2021] [core:warn] [pid 72813:tid 4493635072] 
>> AH00045: child process 72886 still did not exit, sending a SIGTERM
>> [Wed Jul 07 10:06:16.792983 2021] [core:error] [pid 72813:tid 4493635072] 
>> AH00046: child process 72886 still did not exit, sending a SIGKILL
>> 
>> So, I assume the keep-alive connection is the mod_proxy_http connection to 
>> the sever itself. Since the join_workers() never returns, there seems to be 
>> a thread not finishing.
> 
> No, MPM event doesn't handle mod_proxy kept alive connections, those
> are client connections.
> 
>> 
>> On another run, I got a stacktrace of the child:
>> 
>> Call graph:
>>264 Thread_36766542   DispatchQueue_1: com.apple.main-thread  (serial)
>>+ 264 start  (in libdyld.dylib) + 1  [0x7fff2032cf5d]
>>+   264 main  (in httpd) + 2278  [0x104729b86]  main.c:862
>>+ 264 ap_run_mpm  (in httpd) + 75  [0x10473917b]  mpm_common.c:100
>>+   264 event_run  (in mod_mpm_event.so) + 2994  [0x10502ae62]  
>> event.c:3398
>>+ 264 make_child  (in mod_mpm_event.so) + 436  [0x10502bbc4]  
>> event.c:2997
>>+   264 child_main  (in mod_mpm_event.so) + 1734  [0x10502c2e6]  
>> event.c:2924
>>+ 264 join_workers  (in mod_mpm_event.so) + 386  
>> [0x10502cc72]  event.c:2717
>>+   264 apr_thread_join  (in libapr-2.0.dylib) + 44  
>> [0x1048b347c]  thread.c:256
>>+ 264 _pthread_join  (in libsystem_pthread.dylib) + 362  
>> [0x7fff20312f60]
>>+   264 __ulock_wait  (in libsystem_kernel.dylib) + 10  
>> [0x7fff202dd9ee]
> 
> OK, that's where the child waits for worker threads.
> 
>>264 Thread_36766548
>>+ 264 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff2030d443]
>>+   264 _pthread_start  (in libsystem_pthread.dylib) + 224  
>> [0x7fff203118fc]
>>+ 264 dummy_worker  (in libapr-2.0.dylib) + 30  [0x1048b33ee]  
>> thread.c:148
>>+   264 slot_run  (in mod_http2.so) + 189  [0x104f99a8d]  
>> h2_workers.c:260
>>+ 264 _pthread_cond_wait  (in libsystem_pthread.dylib) + 1298  
>> [0x7fff20311e49]
>>+   264 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  
>> [0x7fff202decde]
>>264 Thread_36766549
>>+ 264 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff2030d443]
>>+   264 _pthread_start  (in libsystem_pthread.dylib) + 224  
>> [0x7fff203118fc]
>>+ 264 dummy_worker  (in libapr-2.0.dylib) + 30  [0x1048b33ee]  
>> thread.c:148
>>+   264 slot_run  (in mod_http2.so) + 189  [0x104f99a8d]  
>> h2_workers.c:260
>>+ 264 _pthread_cond_wait  (in libsystem_pthread.dylib) + 1298  
>> [0x7fff20311e49]
>>+   264 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  
>> [0x7fff202decde]
>>264 Thread_36766550
>>+ 264 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff2030d443]
>>+   264 _pthread_start  (in libsystem_pthread.dylib) + 224  
>> [0x7fff203118fc]
>>+ 264 dummy_worker  (in libapr-2.0.dylib) + 30  [0x1048b33ee]  
>> thread.c:148
>>+   264 slot_run  (in mod_http2.so) + 189  [0x104f99a8d]  
>> h2_workers.c:260
>>+ 264 _pthread_cond_wait  (in libsystem_pthread.dylib) + 1298  
>> [0x7fff20311e49]
>>+   264 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  
>> [0x7fff202decde]
>>264 Thread_36766551
>>+ 264 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff2030d443]
>>+   264 _pthread_start  (in libsystem_pthread.dylib) + 224  
>> [0x7fff203118fc]
>>+ 264 dummy_worker  (in libapr-2.0.dylib) + 30  [0x1048b33ee]  
>> thread.c:148
>>+   264 slot_run  (in mod_http2.so) + 189  [0x104f99a8d]  
>> h2_workers.c:260
>>+ 264 _pthread_cond_wait  (in libsystem_pthread.dylib) + 1298  
>> [0x7fff20311e49]
>>+   264 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  
>> [0x7fff202decde]
> 
> The above

Re: child exit on self-proxy

2021-07-07 Thread Stefan Eissing



> Am 07.07.2021 um 13:51 schrieb Yann Ylavic :
> 
> On Wed, Jul 7, 2021 at 12:16 PM Stefan Eissing
>  wrote:
>> 
>> I added a TRACE1 log in event.c before/after join_workers (line 2921) and 
>> see:
>> 
>> [Wed Jul 07 10:06:03.144044 2021] [mpm_event:trace1] [pid 72886:tid 
>> 4493635072] event.c(2921): graceful termination received, join workers
>> [Wed Jul 07 10:06:03.144213 2021] [mpm_event:trace1] [pid 72886:tid 
>> 123145435672576] event.c(1799): All workers are busy or dying, will close 1 
>> keep-alive connections
>> [Wed Jul 07 10:06:08.079690 2021] [mpm_event:trace1] [pid 72886:tid 
>> 123145435672576] event.c(1799): All workers are busy or dying, will close 0 
>> keep-alive connections
>> [Wed Jul 07 10:06:10.78 2021] [core:warn] [pid 72813:tid 4493635072] 
>> AH00045: child process 72886 still did not exit, sending a SIGTERM
>> [Wed Jul 07 10:06:12.789335 2021] [core:warn] [pid 72813:tid 4493635072] 
>> AH00045: child process 72886 still did not exit, sending a SIGTERM
>> [Wed Jul 07 10:06:14.791281 2021] [core:warn] [pid 72813:tid 4493635072] 
>> AH00045: child process 72886 still did not exit, sending a SIGTERM
>> [Wed Jul 07 10:06:16.792983 2021] [core:error] [pid 72813:tid 4493635072] 
>> AH00046: child process 72886 still did not exit, sending a SIGKILL
>> 
>> So, I assume the keep-alive connection is the mod_proxy_http connection to 
>> the sever itself. Since the join_workers() never returns, there seems to be 
>> a thread not finishing.
> 
> No, MPM event doesn't handle mod_proxy kept alive connections, those
> are client connections.
> 
>> 
>> On another run, I got a stacktrace of the child:
>> 
>> Call graph:
>>264 Thread_36766542   DispatchQueue_1: com.apple.main-thread  (serial)
>>+ 264 start  (in libdyld.dylib) + 1  [0x7fff2032cf5d]
>>+   264 main  (in httpd) + 2278  [0x104729b86]  main.c:862
>>+ 264 ap_run_mpm  (in httpd) + 75  [0x10473917b]  mpm_common.c:100
>>+   264 event_run  (in mod_mpm_event.so) + 2994  [0x10502ae62]  
>> event.c:3398
>>+ 264 make_child  (in mod_mpm_event.so) + 436  [0x10502bbc4]  
>> event.c:2997
>>+   264 child_main  (in mod_mpm_event.so) + 1734  [0x10502c2e6]  
>> event.c:2924
>>+ 264 join_workers  (in mod_mpm_event.so) + 386  
>> [0x10502cc72]  event.c:2717
>>+   264 apr_thread_join  (in libapr-2.0.dylib) + 44  
>> [0x1048b347c]  thread.c:256
>>+ 264 _pthread_join  (in libsystem_pthread.dylib) + 362  
>> [0x7fff20312f60]
>>+   264 __ulock_wait  (in libsystem_kernel.dylib) + 10  
>> [0x7fff202dd9ee]
> 
> OK, that's where the child waits for worker threads.
> 
>>264 Thread_36766548
>>+ 264 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff2030d443]
>>+   264 _pthread_start  (in libsystem_pthread.dylib) + 224  
>> [0x7fff203118fc]
>>+ 264 dummy_worker  (in libapr-2.0.dylib) + 30  [0x1048b33ee]  
>> thread.c:148
>>+   264 slot_run  (in mod_http2.so) + 189  [0x104f99a8d]  
>> h2_workers.c:260
>>+ 264 _pthread_cond_wait  (in libsystem_pthread.dylib) + 1298  
>> [0x7fff20311e49]
>>+   264 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  
>> [0x7fff202decde]
>>264 Thread_36766549
>>+ 264 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff2030d443]
>>+   264 _pthread_start  (in libsystem_pthread.dylib) + 224  
>> [0x7fff203118fc]
>>+ 264 dummy_worker  (in libapr-2.0.dylib) + 30  [0x1048b33ee]  
>> thread.c:148
>>+   264 slot_run  (in mod_http2.so) + 189  [0x104f99a8d]  
>> h2_workers.c:260
>>+ 264 _pthread_cond_wait  (in libsystem_pthread.dylib) + 1298  
>> [0x7fff20311e49]
>>+   264 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  
>> [0x7fff202decde]
>>264 Thread_36766550
>>+ 264 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff2030d443]
>>+   264 _pthread_start  (in libsystem_pthread.dylib) + 224  
>> [0x7fff203118fc]
>>+ 264 dummy_worker  (in libapr-2.0.dylib) + 30  [0x1048b33ee]  
>> thread.c:148
>>+   264 slot_run  (in mod_http2.so) + 189  [0x104f99a8d]  
>> h2_workers.c:260
>>+ 264 _pthread_cond_wait  (in libsystem_pthread.dylib) + 1298  
>> [0x7fff20311e49]
>>+   264 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  
>> [0x7fff202decde]
>>264 Thread_36766551
>>+ 264 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff2030d443]
>>+   264 _pthread_start  (in libsystem_pthread.dylib) + 224  
>> [0x7fff203118fc]
>>+ 264 dummy_worker  (in libapr-2.0.dylib) + 30  [0x1048b33ee]  
>> thread.c:148
>>+   264 slot_run  (in mod_http2.so) + 189  [0x104f99a8d]  
>> h2_workers.c:260
>>+ 264 _pthread_cond_wait  (in libsystem_pthread.dylib) + 1298  
>> [0x7fff20311e49]
>>+   264 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  
>> [0x7fff202decde]
> 
> The above mod_http2 threads are not accounted for by the MPM

Re: child exit on self-proxy

2021-07-07 Thread Stefan Eissing
In order to reproduce the logs I see on this weird behaviour, I'll attach the 
patch I made:



event_debug.diff
Description: Binary data


With this, I see in a hanging process:

[Wed Jul 07 15:20:23.014980 2021] [core:trace1] [pid 42471:tid 123145448968192] 
core_filters.c(429): (32)Broken pipe: [client 127.0.0.1:49824] 
core_output_filter: writing data to the network
[Wed Jul 07 15:20:23.044541 2021] [mpm_event:debug] [pid 42471:tid 4525448704] 
event.c(599): AH: wakeup_listener: start
[Wed Jul 07 15:20:23.044601 2021] [mpm_event:debug] [pid 42471:tid 4525448704] 
event.c(603): AH: wakeup_listener: apr_pollset_wakeup
[Wed Jul 07 15:20:23.044715 2021] [mpm_event:debug] [pid 42471:tid 4525448704] 
event.c(610): AH: wakeup_listener: ap_queue_info_term
[Wed Jul 07 15:20:23.044861 2021] [mpm_event:debug] [pid 42471:tid 
123145461309440] event.c(1985): (4)Interrupted system call: AH: pollset 
returned listener_may_exit=0
[Wed Jul 07 15:20:23.044817 2021] [mpm_event:debug] [pid 42471:tid 4525448704] 
event.c(629): AH: wakeup_listener: pthread_kill
[Wed Jul 07 15:20:23.044954 2021] [mpm_event:trace1] [pid 42471:tid 
123145461309440] event.c(1809): All workers are busy or dying, will close 1 
keep-alive connections
[Wed Jul 07 15:20:23.044987 2021] [mpm_event:debug] [pid 42471:tid 4525448704] 
event.c(635): AH: wakeup_listener: end
[Wed Jul 07 15:20:23.045154 2021] [mpm_event:trace1] [pid 42471:tid 4525448704] 
event.c(2940): graceful termination received, join workers
[Wed Jul 07 15:20:25.927437 2021] [mpm_event:debug] [pid 42471:tid 
123145461309440] event.c(1985): (70007)The timeout specified has expired: AH: 
pollset returned listener_may_exit=0
[Wed Jul 07 15:20:25.927468 2021] [mpm_event:trace1] [pid 42471:tid 
123145461309440] event.c(1809): All workers are busy or dying, will close 0 
keep-alive connections

Which says (to my eyes) that the pollset_wakeup happened, but the listener saw 
"listener_may_exit == 0". Is the volatile maybe not enough?


> Am 07.07.2021 um 16:56 schrieb Stefan Eissing :
> 
> 
> 
>> Am 07.07.2021 um 13:51 schrieb Yann Ylavic :
>> 
>> On Wed, Jul 7, 2021 at 12:16 PM Stefan Eissing
>>  wrote:
>>> 
>>> I added a TRACE1 log in event.c before/after join_workers (line 2921) and 
>>> see:
>>> 
>>> [Wed Jul 07 10:06:03.144044 2021] [mpm_event:trace1] [pid 72886:tid 
>>> 4493635072] event.c(2921): graceful termination received, join workers
>>> [Wed Jul 07 10:06:03.144213 2021] [mpm_event:trace1] [pid 72886:tid 
>>> 123145435672576] event.c(1799): All workers are busy or dying, will close 1 
>>> keep-alive connections
>>> [Wed Jul 07 10:06:08.079690 2021] [mpm_event:trace1] [pid 72886:tid 
>>> 123145435672576] event.c(1799): All workers are busy or dying, will close 0 
>>> keep-alive connections
>>> [Wed Jul 07 10:06:10.78 2021] [core:warn] [pid 72813:tid 4493635072] 
>>> AH00045: child process 72886 still did not exit, sending a SIGTERM
>>> [Wed Jul 07 10:06:12.789335 2021] [core:warn] [pid 72813:tid 4493635072] 
>>> AH00045: child process 72886 still did not exit, sending a SIGTERM
>>> [Wed Jul 07 10:06:14.791281 2021] [core:warn] [pid 72813:tid 4493635072] 
>>> AH00045: child process 72886 still did not exit, sending a SIGTERM
>>> [Wed Jul 07 10:06:16.792983 2021] [core:error] [pid 72813:tid 4493635072] 
>>> AH00046: child process 72886 still did not exit, sending a SIGKILL
>>> 
>>> So, I assume the keep-alive connection is the mod_proxy_http connection to 
>>> the sever itself. Since the join_workers() never returns, there seems to be 
>>> a thread not finishing.
>> 
>> No, MPM event doesn't handle mod_proxy kept alive connections, those
>> are client connections.
>> 
>>> 
>>> On another run, I got a stacktrace of the child:
>>> 
>>> Call graph:
>>>   264 Thread_36766542   DispatchQueue_1: com.apple.main-thread  (serial)
>>>   + 264 start  (in libdyld.dylib) + 1  [0x7fff2032cf5d]
>>>   +   264 main  (in httpd) + 2278  [0x104729b86]  main.c:862
>>>   + 264 ap_run_mpm  (in httpd) + 75  [0x10473917b]  mpm_common.c:100
>>>   +   264 event_run  (in mod_mpm_event.so) + 2994  [0x10502ae62]  
>>> event.c:3398
>>>   + 264 make_child  (in mod_mpm_event.so) + 436  [0x10502bbc4]  
>>> event.c:2997
>>>   +   264 child_main  (in mod_mpm_event.so) + 1734  [0x10502c2e6]  
>>> event.c:2924
>>>   + 264 join_workers  (in mod_mpm_event.so) + 386  
>>> [0x10502cc72]  event.c:2717
>>>   +   264 apr_thread_join  (in libapr-2.0.dylib) + 44  
>>> [0x1048b347c]  thread.c:256
>>>   + 264 _pthread_join  (in libsystem_pthread.dylib) + 362  
>>> [0x7fff20312f60]
>>>   +   264 __ulock_wait  (in libsystem_kernel.dylib) + 10  
>>> [0x7fff202dd9ee]
>> 
>> OK, that's where the child waits for worker threads.
>> 
>>>   264 Thread_36766548
>>>   + 264 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff2030d443]
>>>   +   264 _pthread_start  (in libsystem_pthread.dylib) + 224  
>>> [0x7fff203118fc]
>>>   + 264 dummy_worker  (in l

Re: child exit on self-proxy

2021-07-07 Thread Ruediger Pluem



On 7/7/21 5:25 PM, Stefan Eissing wrote:
> In order to reproduce the logs I see on this weird behaviour, I'll attach the 
> patch I made:
> 
> 
> 
> 
> With this, I see in a hanging process:
> 
> [Wed Jul 07 15:20:23.014980 2021] [core:trace1] [pid 42471:tid 
> 123145448968192] core_filters.c(429): (32)Broken pipe: [client 
> 127.0.0.1:49824] core_output_filter: writing data to the network
> [Wed Jul 07 15:20:23.044541 2021] [mpm_event:debug] [pid 42471:tid 
> 4525448704] event.c(599): AH: wakeup_listener: start
> [Wed Jul 07 15:20:23.044601 2021] [mpm_event:debug] [pid 42471:tid 
> 4525448704] event.c(603): AH: wakeup_listener: apr_pollset_wakeup
> [Wed Jul 07 15:20:23.044715 2021] [mpm_event:debug] [pid 42471:tid 
> 4525448704] event.c(610): AH: wakeup_listener: ap_queue_info_term
> [Wed Jul 07 15:20:23.044861 2021] [mpm_event:debug] [pid 42471:tid 
> 123145461309440] event.c(1985): (4)Interrupted system call: AH: pollset 
> returned listener_may_exit=0

Hm. The ap_log_error statically writes listener_may_exit=0. Can you put the 
actual value of listener_may_exit in the log message?

> [Wed Jul 07 15:20:23.044817 2021] [mpm_event:debug] [pid 42471:tid 
> 4525448704] event.c(629): AH: wakeup_listener: pthread_kill
> [Wed Jul 07 15:20:23.044954 2021] [mpm_event:trace1] [pid 42471:tid 
> 123145461309440] event.c(1809): All workers are busy or dying, will close 1 
> keep-alive connections
> [Wed Jul 07 15:20:23.044987 2021] [mpm_event:debug] [pid 42471:tid 
> 4525448704] event.c(635): AH: wakeup_listener: end
> [Wed Jul 07 15:20:23.045154 2021] [mpm_event:trace1] [pid 42471:tid 
> 4525448704] event.c(2940): graceful termination received, join workers
> [Wed Jul 07 15:20:25.927437 2021] [mpm_event:debug] [pid 42471:tid 
> 123145461309440] event.c(1985): (70007)The timeout specified has expired: AH: 
> pollset returned listener_may_exit=0

Looks like we get back in the poll and have to wait for the timeout.

> [Wed Jul 07 15:20:25.927468 2021] [mpm_event:trace1] [pid 42471:tid 
> 123145461309440] event.c(1809): All workers are busy or dying, will close 0 
> keep-alive connections
> 
> Which says (to my eyes) that the pollset_wakeup happened, but the listener 
> saw "listener_may_exit == 0". Is the volatile maybe not enough?

Regards

Rüdiger



Re: child exit on self-proxy

2021-07-07 Thread Stefan Eissing



> Am 07.07.2021 um 17:39 schrieb Ruediger Pluem :
> 
> 
> 
> On 7/7/21 5:25 PM, Stefan Eissing wrote:
>> In order to reproduce the logs I see on this weird behaviour, I'll attach 
>> the patch I made:
>> 
>> 
>> 
>> 
>> With this, I see in a hanging process:
>> 
>> [Wed Jul 07 15:20:23.014980 2021] [core:trace1] [pid 42471:tid 
>> 123145448968192] core_filters.c(429): (32)Broken pipe: [client 
>> 127.0.0.1:49824] core_output_filter: writing data to the network
>> [Wed Jul 07 15:20:23.044541 2021] [mpm_event:debug] [pid 42471:tid 
>> 4525448704] event.c(599): AH: wakeup_listener: start
>> [Wed Jul 07 15:20:23.044601 2021] [mpm_event:debug] [pid 42471:tid 
>> 4525448704] event.c(603): AH: wakeup_listener: apr_pollset_wakeup
>> [Wed Jul 07 15:20:23.044715 2021] [mpm_event:debug] [pid 42471:tid 
>> 4525448704] event.c(610): AH: wakeup_listener: ap_queue_info_term
>> [Wed Jul 07 15:20:23.044861 2021] [mpm_event:debug] [pid 42471:tid 
>> 123145461309440] event.c(1985): (4)Interrupted system call: AH: pollset 
>> returned listener_may_exit=0
> 
> Hm. The ap_log_error statically writes listener_may_exit=0. Can you put the 
> actual value of listener_may_exit in the log message?

I put it statically since that was the value in the "if ()" 6 lines above. it 
should have been visible then and there to work.

> 
>> [Wed Jul 07 15:20:23.044817 2021] [mpm_event:debug] [pid 42471:tid 
>> 4525448704] event.c(629): AH: wakeup_listener: pthread_kill
>> [Wed Jul 07 15:20:23.044954 2021] [mpm_event:trace1] [pid 42471:tid 
>> 123145461309440] event.c(1809): All workers are busy or dying, will close 1 
>> keep-alive connections
>> [Wed Jul 07 15:20:23.044987 2021] [mpm_event:debug] [pid 42471:tid 
>> 4525448704] event.c(635): AH: wakeup_listener: end
>> [Wed Jul 07 15:20:23.045154 2021] [mpm_event:trace1] [pid 42471:tid 
>> 4525448704] event.c(2940): graceful termination received, join workers
>> [Wed Jul 07 15:20:25.927437 2021] [mpm_event:debug] [pid 42471:tid 
>> 123145461309440] event.c(1985): (70007)The timeout specified has expired: 
>> AH: pollset returned listener_may_exit=0
> 
> Looks like we get back in the poll and have to wait for the timeout.
> 
>> [Wed Jul 07 15:20:25.927468 2021] [mpm_event:trace1] [pid 42471:tid 
>> 123145461309440] event.c(1809): All workers are busy or dying, will close 0 
>> keep-alive connections
>> 
>> Which says (to my eyes) that the pollset_wakeup happened, but the listener 
>> saw "listener_may_exit == 0". Is the volatile maybe not enough?
> 
> Regards
> 
> Rüdiger



Re: child exit on self-proxy

2021-07-07 Thread Yann Ylavic
On Wed, Jul 7, 2021 at 5:39 PM Ruediger Pluem  wrote:
>
>
>
> On 7/7/21 5:25 PM, Stefan Eissing wrote:
> > In order to reproduce the logs I see on this weird behaviour, I'll attach 
> > the patch I made:
> >
> >
> >
> >
> > With this, I see in a hanging process:
> >
> > [Wed Jul 07 15:20:23.014980 2021] [core:trace1] [pid 42471:tid 
> > 123145448968192] core_filters.c(429): (32)Broken pipe: [client 
> > 127.0.0.1:49824] core_output_filter: writing data to the network
> > [Wed Jul 07 15:20:23.044541 2021] [mpm_event:debug] [pid 42471:tid 
> > 4525448704] event.c(599): AH: wakeup_listener: start
> > [Wed Jul 07 15:20:23.044601 2021] [mpm_event:debug] [pid 42471:tid 
> > 4525448704] event.c(603): AH: wakeup_listener: apr_pollset_wakeup
> > [Wed Jul 07 15:20:23.044715 2021] [mpm_event:debug] [pid 42471:tid 
> > 4525448704] event.c(610): AH: wakeup_listener: ap_queue_info_term
> > [Wed Jul 07 15:20:23.044861 2021] [mpm_event:debug] [pid 42471:tid 
> > 123145461309440] event.c(1985): (4)Interrupted system call: AH: pollset 
> > returned listener_may_exit=0
>
> Hm. The ap_log_error statically writes listener_may_exit=0. Can you put the 
> actual value of listener_may_exit in the log message?

It would be interesting to log apr_atomic_read32(&connection_count) too.


Re: child exit on self-proxy

2021-07-07 Thread Yann Ylavic
On Wed, Jul 7, 2021 at 5:45 PM Stefan Eissing
 wrote:
>
> > Am 07.07.2021 um 17:39 schrieb Ruediger Pluem :
> >
> > Hm. The ap_log_error statically writes listener_may_exit=0. Can you put the 
> > actual value of listener_may_exit in the log message?
>
> I put it statically since that was the value in the "if ()" 6 lines above. it 
> should have been visible then and there to work.

The above if () does not always break.


Cheers;
Yann.


Re: child exit on self-proxy

2021-07-07 Thread Stefan Eissing
The things I do for you...

[Wed Jul 07 15:48:52.591983 2021] [mpm_event:debug] [pid 48747:tid 4385197568] 
event.c(599): AH: wakeup_listener: start
[Wed Jul 07 15:48:52.592000 2021] [mpm_event:debug] [pid 48747:tid 4385197568] 
event.c(603): AH: wakeup_listener: apr_pollset_wakeup
[Wed Jul 07 15:48:52.592046 2021] [mpm_event:debug] [pid 48747:tid 4385197568] 
event.c(610): AH: wakeup_listener: ap_queue_info_term
[Wed Jul 07 15:48:52.592188 2021] [mpm_event:debug] [pid 48747:tid 4385197568] 
event.c(629): AH: wakeup_listener: pthread_kill
[Wed Jul 07 15:48:52.592166 2021] [mpm_event:debug] [pid 48747:tid 
123145502474240] event.c(1985): (4)Interrupted system call: AH: pollset 
returned listener_may_exit=1 connection_count=2
[Wed Jul 07 15:48:52.592523 2021] [mpm_event:debug] [pid 48747:tid 4385197568] 
event.c(635): AH: wakeup_listener: end
[Wed Jul 07 15:48:52.592727 2021] [mpm_event:trace1] [pid 48747:tid 4385197568] 
event.c(2942): graceful termination received, join workers
[Wed Jul 07 15:48:52.592515 2021] [mpm_event:trace1] [pid 48747:tid 
123145502474240] event.c(1809): All workers are busy or dying, will close 2 
keep-alive connections
[Wed Jul 07 15:48:52.593381 2021] [mpm_event:debug] [pid 48747:tid 
123145502474240] event.c(1985): AH: pollset returned listener_may_exit=1 
connection_count=2
[Wed Jul 07 15:48:52.593478 2021] [mpm_event:debug] [pid 48747:tid 
123145502474240] event.c(1985): (4)Interrupted system call: AH: pollset 
returned listener_may_exit=1 connection_count=1
[Wed Jul 07 15:48:57.552623 2021] [mpm_event:debug] [pid 48747:tid 
123145502474240] event.c(1985): (70007)The timeout specified has expired: AH: 
pollset returned listener_may_exit=1 connection_count=1
[Wed Jul 07 15:48:57.552658 2021] [mpm_event:trace1] [pid 48747:tid 
123145502474240] event.c(1809): All workers are busy or dying, will close 0 
keep-alive connections



event_debugv2.diff
Description: Binary data


Btw. I opened a bottle of nice wine - this is entertaining. ;-)

- Stefan

> Am 07.07.2021 um 17:48 schrieb Yann Ylavic :
> 
> On Wed, Jul 7, 2021 at 5:45 PM Stefan Eissing
>  wrote:
>> 
>>> Am 07.07.2021 um 17:39 schrieb Ruediger Pluem :
>>> 
>>> Hm. The ap_log_error statically writes listener_may_exit=0. Can you put the 
>>> actual value of listener_may_exit in the log message?
>> 
>> I put it statically since that was the value in the "if ()" 6 lines above. 
>> it should have been visible then and there to work.
> 
> The above if () does not always break.
> 
> 
> Cheers;
> Yann.



Re: child exit on self-proxy

2021-07-07 Thread Yann Ylavic
Does this attached patch help? This is to verify an hypothesis where
killed keepalive connections are still in the defer_linger_chain
without a worker to handle them.

I don't see "interrupted, but continuing" triggering in your log but
possibly it happened earlier..

On Wed, Jul 7, 2021 at 5:51 PM Stefan Eissing
 wrote:
>
> The things I do for you...
>
> [Wed Jul 07 15:48:52.591983 2021] [mpm_event:debug] [pid 48747:tid 
> 4385197568] event.c(599): AH: wakeup_listener: start
> [Wed Jul 07 15:48:52.592000 2021] [mpm_event:debug] [pid 48747:tid 
> 4385197568] event.c(603): AH: wakeup_listener: apr_pollset_wakeup
> [Wed Jul 07 15:48:52.592046 2021] [mpm_event:debug] [pid 48747:tid 
> 4385197568] event.c(610): AH: wakeup_listener: ap_queue_info_term
> [Wed Jul 07 15:48:52.592188 2021] [mpm_event:debug] [pid 48747:tid 
> 4385197568] event.c(629): AH: wakeup_listener: pthread_kill
> [Wed Jul 07 15:48:52.592166 2021] [mpm_event:debug] [pid 48747:tid 
> 123145502474240] event.c(1985): (4)Interrupted system call: AH: pollset 
> returned listener_may_exit=1 connection_count=2
> [Wed Jul 07 15:48:52.592523 2021] [mpm_event:debug] [pid 48747:tid 
> 4385197568] event.c(635): AH: wakeup_listener: end
> [Wed Jul 07 15:48:52.592727 2021] [mpm_event:trace1] [pid 48747:tid 
> 4385197568] event.c(2942): graceful termination received, join workers
> [Wed Jul 07 15:48:52.592515 2021] [mpm_event:trace1] [pid 48747:tid 
> 123145502474240] event.c(1809): All workers are busy or dying, will close 2 
> keep-alive connections
> [Wed Jul 07 15:48:52.593381 2021] [mpm_event:debug] [pid 48747:tid 
> 123145502474240] event.c(1985): AH: pollset returned listener_may_exit=1 
> connection_count=2
> [Wed Jul 07 15:48:52.593478 2021] [mpm_event:debug] [pid 48747:tid 
> 123145502474240] event.c(1985): (4)Interrupted system call: AH: pollset 
> returned listener_may_exit=1 connection_count=1
> [Wed Jul 07 15:48:57.552623 2021] [mpm_event:debug] [pid 48747:tid 
> 123145502474240] event.c(1985): (70007)The timeout specified has expired: AH: 
> pollset returned listener_may_exit=1 connection_count=1
> [Wed Jul 07 15:48:57.552658 2021] [mpm_event:trace1] [pid 48747:tid 
> 123145502474240] event.c(1809): All workers are busy or dying, will close 0 
> keep-alive connections
>
>
>
> Btw. I opened a bottle of nice wine - this is entertaining. ;-)
>
> - Stefan
>
> > Am 07.07.2021 um 17:48 schrieb Yann Ylavic :
> >
> > On Wed, Jul 7, 2021 at 5:45 PM Stefan Eissing
> >  wrote:
> >>
> >>> Am 07.07.2021 um 17:39 schrieb Ruediger Pluem :
> >>>
> >>> Hm. The ap_log_error statically writes listener_may_exit=0. Can you put 
> >>> the actual value of listener_may_exit in the log message?
> >>
> >> I put it statically since that was the value in the "if ()" 6 lines above. 
> >> it should have been visible then and there to work.
> >
> > The above if () does not always break.
> >
> >
> > Cheers;
> > Yann.
>
Index: server/mpm/event/event.c
===
--- server/mpm/event/event.c	(revision 1891217)
+++ server/mpm/event/event.c	(working copy)
@@ -1940,15 +1940,6 @@ static void * APR_THREAD_FUNC listener_thread(apr_
 rc = apr_pollset_poll(event_pollset, timeout_interval, &num, &out_pfd);
 if (rc != APR_SUCCESS) {
 if (APR_STATUS_IS_EINTR(rc)) {
-/* Woken up, if we are exiting or listeners are disabled we
- * must fall through to kill kept-alive connections or test
- * whether listeners should be re-enabled. Otherwise we only
- * need to update timeouts (logic is above, so simply restart
- * the loop).
- */
-if (!listener_may_exit && !listeners_disabled()) {
-continue;
-}
 timeout_time = 0;
 }
 else if (!APR_STATUS_IS_TIMEUP(rc)) {


Re: child exit on self-proxy

2021-07-07 Thread Stefan Eissing



> Am 07.07.2021 um 18:34 schrieb Yann Ylavic :
> 
> Does this attached patch help? This is to verify an hypothesis where
> killed keepalive connections are still in the defer_linger_chain
> without a worker to handle them.
> 
> I don't see "interrupted, but continuing" triggering in your log but
> possibly it happened earlier..


[Wed Jul 07 16:48:04.237865 2021] [mpm_event:debug] [pid 55134:tid 4643098112] 
event.c(599): AH: wakeup_listener: start
[Wed Jul 07 16:48:04.237895 2021] [mpm_event:debug] [pid 55134:tid 4643098112] 
event.c(603): AH: wakeup_listener: apr_pollset_wakeup
[Wed Jul 07 16:48:04.238042 2021] [mpm_event:debug] [pid 55134:tid 4643098112] 
event.c(610): AH: wakeup_listener: ap_queue_info_term
[Wed Jul 07 16:48:04.238224 2021] [mpm_event:debug] [pid 55134:tid 
123145362530304] event.c(1971): (4)Interrupted system call: AH: pollset 
returned listener_may_exit=1 connection_count=1
[Wed Jul 07 16:48:04.238381 2021] [mpm_event:trace1] [pid 55134:tid 
123145362530304] event.c(1809): All workers are busy or dying, will close 1 
keep-alive connections
[Wed Jul 07 16:48:04.238196 2021] [mpm_event:debug] [pid 55134:tid 4643098112] 
event.c(629): AH: wakeup_listener: pthread_kill
[Wed Jul 07 16:48:04.238513 2021] [mpm_event:debug] [pid 55134:tid 
123145362530304] event.c(1971): (4)Interrupted system call: AH: pollset 
returned listener_may_exit=1 connection_count=1
[Wed Jul 07 16:48:04.238506 2021] [mpm_event:debug] [pid 55134:tid 4643098112] 
event.c(635): AH: wakeup_listener: end
[Wed Jul 07 16:48:04.238555 2021] [mpm_event:trace1] [pid 55134:tid 4643098112] 
event.c(2928): graceful termination received, join workers
[Wed Jul 07 16:48:08.443130 2021] [mpm_event:debug] [pid 55134:tid 
123145362530304] event.c(1971): (70007)The timeout specified has expired: AH: 
pollset returned listener_may_exit=1 connection_count=1
[Wed Jul 07 16:48:08.443149 2021] [mpm_event:trace1] [pid 55134:tid 
123145362530304] event.c(1809): All workers are busy or dying, will close 0 
keep-alive connections

> 
> On Wed, Jul 7, 2021 at 5:51 PM Stefan Eissing
>  wrote:
>> 
>> The things I do for you...
>> 
>> [Wed Jul 07 15:48:52.591983 2021] [mpm_event:debug] [pid 48747:tid 
>> 4385197568] event.c(599): AH: wakeup_listener: start
>> [Wed Jul 07 15:48:52.592000 2021] [mpm_event:debug] [pid 48747:tid 
>> 4385197568] event.c(603): AH: wakeup_listener: apr_pollset_wakeup
>> [Wed Jul 07 15:48:52.592046 2021] [mpm_event:debug] [pid 48747:tid 
>> 4385197568] event.c(610): AH: wakeup_listener: ap_queue_info_term
>> [Wed Jul 07 15:48:52.592188 2021] [mpm_event:debug] [pid 48747:tid 
>> 4385197568] event.c(629): AH: wakeup_listener: pthread_kill
>> [Wed Jul 07 15:48:52.592166 2021] [mpm_event:debug] [pid 48747:tid 
>> 123145502474240] event.c(1985): (4)Interrupted system call: AH: pollset 
>> returned listener_may_exit=1 connection_count=2
>> [Wed Jul 07 15:48:52.592523 2021] [mpm_event:debug] [pid 48747:tid 
>> 4385197568] event.c(635): AH: wakeup_listener: end
>> [Wed Jul 07 15:48:52.592727 2021] [mpm_event:trace1] [pid 48747:tid 
>> 4385197568] event.c(2942): graceful termination received, join workers
>> [Wed Jul 07 15:48:52.592515 2021] [mpm_event:trace1] [pid 48747:tid 
>> 123145502474240] event.c(1809): All workers are busy or dying, will close 2 
>> keep-alive connections
>> [Wed Jul 07 15:48:52.593381 2021] [mpm_event:debug] [pid 48747:tid 
>> 123145502474240] event.c(1985): AH: pollset returned listener_may_exit=1 
>> connection_count=2
>> [Wed Jul 07 15:48:52.593478 2021] [mpm_event:debug] [pid 48747:tid 
>> 123145502474240] event.c(1985): (4)Interrupted system call: AH: pollset 
>> returned listener_may_exit=1 connection_count=1
>> [Wed Jul 07 15:48:57.552623 2021] [mpm_event:debug] [pid 48747:tid 
>> 123145502474240] event.c(1985): (70007)The timeout specified has expired: 
>> AH: pollset returned listener_may_exit=1 connection_count=1
>> [Wed Jul 07 15:48:57.552658 2021] [mpm_event:trace1] [pid 48747:tid 
>> 123145502474240] event.c(1809): All workers are busy or dying, will close 0 
>> keep-alive connections
>> 
>> 
>> 
>> Btw. I opened a bottle of nice wine - this is entertaining. ;-)
>> 
>> - Stefan
>> 
>>> Am 07.07.2021 um 17:48 schrieb Yann Ylavic :
>>> 
>>> On Wed, Jul 7, 2021 at 5:45 PM Stefan Eissing
>>>  wrote:
 
> Am 07.07.2021 um 17:39 schrieb Ruediger Pluem :
> 
> Hm. The ap_log_error statically writes listener_may_exit=0. Can you put 
> the actual value of listener_may_exit in the log message?
 
 I put it statically since that was the value in the "if ()" 6 lines above. 
 it should have been visible then and there to work.
>>> 
>>> The above if () does not always break.
>>> 
>>> 
>>> Cheers;
>>> Yann.
>> 
> 



Re: child exit on self-proxy

2021-07-08 Thread Stefan Eissing
Some day, I knew I had to learn more about mpm_event. =)

Adding more DEBUGs, I see in the example below that 2 connections were idling 
at start of the graceful and the get added to the linger_chain. 2 workers are 
then woken up and process the socket. connection_count stays at 2 however. As I 
read it, that count drops when the connection pool is destroyed/cleanup'ed. 
This normally seem to happen on the worker_queue_info, but in this example this 
just does not happen.

Is this a correct read?

[Thu Jul 08 09:39:32.423198 2021] [mpm_event:trace1] [pid 9272:tid 4494732800] 
event.c(2973): graceful termination received, join workers
[Thu Jul 08 09:39:32.423032 2021] [mpm_event:debug] [pid 9272:tid 
123145324126208] event.c(2250): (4)Interrupted system call: AH: listener: 
process defer_linger_chain have_idle_worker=1 listener_may_exit=1 
connection_count=2
[Thu Jul 08 09:39:32.422765 2021] [mpm_event:debug] [pid 9272:tid 
123145321443328] event.c(2431): (4)Interrupted system call: AH: worker 
pop_something interrupted, goto worker_pop
[Thu Jul 08 09:39:32.422251 2021] [mpm_event:debug] [pid 9272:tid 
123145311784960] event.c(2431): (4)Interrupted system call: AH: worker 
pop_something interrupted, goto worker_pop
[Thu Jul 08 09:39:32.421638 2021] [mpm_event:debug] [pid 9272:tid 
123145318760448] event.c(2431): (4)Interrupted system call: AH: worker 
pop_something interrupted, goto worker_pop
[Thu Jul 08 09:39:32.423541 2021] [mpm_event:debug] [pid 9272:tid 
123145324126208] event.c(2259): (4)Interrupted system call: AH: listener: 
process defer_linger_chain pushed to worker, listener_may_exit=1 
connection_count=2
[Thu Jul 08 09:39:32.422212 2021] [mpm_event:debug] [pid 9272:tid 
123145311248384] event.c(2408): (4)Interrupted system call: AH: worker 
pop_something, workers_may_exit=0, dying=1
[Thu Jul 08 09:39:32.421806 2021] [mpm_event:debug] [pid 9272:tid 
123145312858112] event.c(2431): (4)Interrupted system call: AH: worker 
pop_something interrupted, goto worker_pop
[Thu Jul 08 09:39:32.422789 2021] [mpm_event:debug] [pid 9272:tid 
123145323053056] event.c(2431): (4)Interrupted system call: AH: worker 
pop_something interrupted, goto worker_pop
[Thu Jul 08 09:39:32.423346 2021] [mpm_event:debug] [pid 9272:tid 
123145317150720] event.c(2431): (4)Interrupted system call: AH: worker 
pop_something interrupted, goto worker_pop
[Thu Jul 08 09:39:32.423755 2021] [mpm_event:debug] [pid 9272:tid 
123145311248384] event.c(2431): (4)Interrupted system call: AH: worker 
pop_something interrupted, goto worker_pop
[Thu Jul 08 09:39:32.424230 2021] [mpm_event:debug] [pid 9272:tid 
123145317150720] event.c(2402): (4)Interrupted system call: AH: worker read 
worker_queue, workers_may_exit=0, dying=1
[Thu Jul 08 09:39:32.424373 2021] [mpm_event:debug] [pid 9272:tid 
123145311248384] event.c(2402): (4)Interrupted system call: AH: worker read 
worker_queue, workers_may_exit=0, dying=1
[Thu Jul 08 09:39:32.424384 2021] [mpm_event:debug] [pid 9272:tid 
123145317150720] event.c(2408): AH: worker pop_something, workers_may_exit=0, 
dying=1
[Thu Jul 08 09:39:32.424477 2021] [mpm_event:debug] [pid 9272:tid 
123145317150720] event.c(2338): AH: process_defer_linger, processing socket, 
connection_count=2
[Thu Jul 08 09:39:32.424653 2021] [mpm_event:debug] [pid 9272:tid 
123145317150720] event.c(2342): AH: process_defer_linger, processed socket, 
connection_count=2
[Thu Jul 08 09:39:32.424664 2021] [mpm_event:debug] [pid 9272:tid 
123145317150720] event.c(2338): AH: process_defer_linger, processing socket, 
connection_count=2
[Thu Jul 08 09:39:32.424680 2021] [mpm_event:debug] [pid 9272:tid 
123145317150720] event.c(2342): AH: process_defer_linger, processed socket, 
connection_count=2
[Thu Jul 08 09:39:32.424708 2021] [mpm_event:debug] [pid 9272:tid 
123145317150720] event.c(2319): AH: process_defer_linger, nothing more to do
[Thu Jul 08 09:39:32.424754 2021] [mpm_event:debug] [pid 9272:tid 
123145317150720] event.c(2402): AH: worker read worker_queue, 
workers_may_exit=0, dying=1
[Thu Jul 08 09:39:32.424756 2021] [mpm_event:debug] [pid 9272:tid 
123145324126208] event.c(1978): AH: pollset returned listener_may_exit=1 
connection_count=2
[Thu Jul 08 09:39:32.424830 2021] [mpm_event:debug] [pid 9272:tid 
123145311248384] event.c(2408): AH: worker pop_something, workers_may_exit=0, 
dying=1
[Thu Jul 08 09:39:32.424910 2021] [mpm_event:debug] [pid 9272:tid 
123145311248384] event.c(808): AH: decrement_connection_count: 
last_connection=0 listener_may_exit=1 connection_count=1
[Thu Jul 08 09:39:32.424924 2021] [mpm_event:debug] [pid 9272:tid 
123145311248384] event.c(816): AH: decrement_connection_count: pollset_wakeup
[Thu Jul 08 09:39:32.424958 2021] [mpm_event:debug] [pid 9272:tid 
123145311248384] event.c(2319): AH: process_defer_linger, nothing more to do
[Thu Jul 08 09:39:32.424960 2021] [mpm_event:debug] [pid 9272:tid 
123145324126208] event.c(1978): (4)Interrupted system call: AH: pollset 
returned listener_may_exit=1 conn

Re: child exit on self-proxy

2021-07-08 Thread Yann Ylavic
On Thu, Jul 8, 2021 at 11:47 AM Stefan Eissing
 wrote:
>
> Some day, I knew I had to learn more about mpm_event. =)
>
> Adding more DEBUGs, I see in the example below that 2 connections were idling 
> at start of the graceful and the get added to the linger_chain. 2 workers are 
> then woken up and process the socket. connection_count stays at 2 however. As 
> I read it, that count drops when the connection pool is destroyed/cleanup'ed. 
> This normally seem to happen on the worker_queue_info, but in this example 
> this just does not happen.
>
> Is this a correct read?

You proxy to a local server, so the 2 connections are the incoming
ones on the local proxy vhost and the local server vhost.
But mod_proxy backend connections are recycled and end up in a reslist
which is not checked until reuse.
So when on graceful restart the local server starts lingering close
with its kept alive connection, it's just ignored by mod_proxy and the
MAX_SECS_TO_LINGER timeout (30s) applies.

I think we shouldn't do lingering on keepalive connections when they
expire or get killed (graceful or max workers), this does not help the
client anyway because any data sent on the connection is doomed, the
sooner we RESET the faster it will try on another connection.

So how about the attached patch that closes the connections when their
keepalive expires (including when it's shortened)?
There are other changes in there, like more trace logs that helped me
debug things, they are worth it too IMHO.
Also I changed reqevents = POLLIN to POLLIN|POLLHUP (by generalizing
and using update_reqevents_from_sense() in more places), because I'm
not sure that we catch the connections closed remotely otherwise
(connection_count should go down to 1 more quickly in the local
proxy+server case because the local server's client connection should
have responded to the lingering close almost instantly, but this one
seems to timeout too).

Cheers;
Yann.
Index: server/mpm/event/event.c
===
--- server/mpm/event/event.c	(revision 1891217)
+++ server/mpm/event/event.c	(working copy)
@@ -554,7 +554,7 @@ static APR_INLINE int connections_above_limit(int
 return 1;
 }
 
-static void abort_socket_nonblocking(apr_socket_t *csd)
+static void close_socket_nonblocking(apr_socket_t *csd)
 {
 apr_status_t rv;
 apr_socket_timeout_set(csd, 0);
@@ -573,7 +573,7 @@ static void close_worker_sockets(void)
 apr_socket_t *csd = worker_sockets[i];
 if (csd) {
 worker_sockets[i] = NULL;
-abort_socket_nonblocking(csd);
+close_socket_nonblocking(csd);
 }
 }
 for (;;) {
@@ -587,12 +587,16 @@ static void close_worker_sockets(void)
 continue;
 }
 cs->chain = NULL;
-abort_socket_nonblocking(cs->pfd.desc.s);
+close_socket_nonblocking(cs->pfd.desc.s);
 }
 }
 
-static void wakeup_listener(void)
+static void shutdown_listener(void)
 {
+ap_log_error(APLOG_MARK, APLOG_DEBUG, 0, ap_server_conf,
+ "shutting down listener (%s)",
+ listener_may_exit ? "again" : "first");
+
 listener_may_exit = 1;
 disable_listensocks();
 
@@ -643,7 +647,7 @@ static void signal_threads(int mode)
 /* in case we weren't called from the listener thread, wake up the
  * listener thread
  */
-wakeup_listener();
+shutdown_listener();
 
 /* for ungraceful termination, let the workers exit now;
  * for graceful termination, the listener thread will notify the
@@ -779,6 +783,8 @@ static apr_status_t decrement_connection_count(voi
 {
 int is_last_connection;
 event_conn_state_t *cs = cs_;
+ap_log_cerror(APLOG_MARK, APLOG_TRACE4, 0, cs->c,
+  "cleanup connection in state %i", (int)cs->pub.state);
 switch (cs->pub.state) {
 case CONN_STATE_LINGER_NORMAL:
 case CONN_STATE_LINGER_SHORT:
@@ -817,6 +823,22 @@ static void notify_resume(event_conn_state_t *cs,
 ap_run_resume_connection(cs->c, cs->r);
 }
 
+static void kill_connection(event_conn_state_t *cs)
+{
+ap_log_cerror(APLOG_MARK, APLOG_TRACE4, 0, cs->c,
+  "kill connection in state %i", (int)cs->pub.state);
+close_socket_nonblocking(cs->pfd.desc.s);
+ap_queue_info_push_pool(worker_queue_info, cs->p);
+}
+
+static int kill_connection_nonblocking(event_conn_state_t *cs)
+{
+kill_connection(cs);
+if (dying)
+ap_queue_interrupt_one(worker_queue);
+return 1;
+}
+
 /*
  * Close our side of the connection, flushing data to the client first.
  * Pre-condition: cs is not in any timeout queue and not in the pollset,
@@ -827,12 +849,11 @@ static void notify_resume(event_conn_state_t *cs,
  */
 static int start_lingering_close_blocking(event_conn_state_t *cs)
 {
-apr_socket_t *csd = cs->pfd.desc.s;
-
+ap_log_cerror(APLOG_MARK, APLOG_TRACE4, 0, cs->c,
+  "lingering close in state %i", (int)cs->pub.state

Re: child exit on self-proxy

2021-07-08 Thread Stefan Eissing


> Am 08.07.2021 um 14:14 schrieb Yann Ylavic :
> 
> On Thu, Jul 8, 2021 at 11:47 AM Stefan Eissing
>  wrote:
>> 
>> Some day, I knew I had to learn more about mpm_event. =)
>> 
>> Adding more DEBUGs, I see in the example below that 2 connections were 
>> idling at start of the graceful and the get added to the linger_chain. 2 
>> workers are then woken up and process the socket. connection_count stays at 
>> 2 however. As I read it, that count drops when the connection pool is 
>> destroyed/cleanup'ed. This normally seem to happen on the worker_queue_info, 
>> but in this example this just does not happen.
>> 
>> Is this a correct read?
> 
> You proxy to a local server, so the 2 connections are the incoming
> ones on the local proxy vhost and the local server vhost.
> But mod_proxy backend connections are recycled and end up in a reslist
> which is not checked until reuse.
> So when on graceful restart the local server starts lingering close
> with its kept alive connection, it's just ignored by mod_proxy and the
> MAX_SECS_TO_LINGER timeout (30s) applies.
> 
> I think we shouldn't do lingering on keepalive connections when they
> expire or get killed (graceful or max workers), this does not help the
> client anyway because any data sent on the connection is doomed, the
> sooner we RESET the faster it will try on another connection.
> 
> So how about the attached patch that closes the connections when their
> keepalive expires (including when it's shortened)?
> There are other changes in there, like more trace logs that helped me
> debug things, they are worth it too IMHO.
> Also I changed reqevents = POLLIN to POLLIN|POLLHUP (by generalizing
> and using update_reqevents_from_sense() in more places), because I'm
> not sure that we catch the connections closed remotely otherwise
> (connection_count should go down to 1 more quickly in the local
> proxy+server case because the local server's client connection should
> have responded to the lingering close almost instantly, but this one
> seems to timeout too).
> 
> Cheers;
> Yann.
> 


This seems to be it! Yann strikes again! And I learned something...\o/

I needed to make small tweaks, because no all previous close checked the
return value and I got assert failures on your patch. I attach my version
below. Instead of the param, there could also be 2 functions, one without
checks. A matter of taste.

So, I try to summarise:

- the effect was triggered by the keep-alive connections remote endpoints
  slumbering in a proxy connection pool where no one was monitoring the 
  sockets and reacting to TCP FIN packets.
- This triggered a long LINGER timeout that blocked the listener from
  exiting. Ultimately leading to the parent killing the child.
- This situation could also happen, I assume, when a client drops from
  the network, e.g. a cell phone entering bad coverage, where no one
  in the network path really knows what to do? Or a network connectivity
  issue or a comp crashing behind a NAT?

Cheers, Stefan


event_ka_no_lingerv2.diff
Description: Binary data


Re: child exit on self-proxy

2021-07-08 Thread Yann Ylavic
On Thu, Jul 8, 2021 at 2:55 PM Stefan Eissing
 wrote:
>
> This seems to be it! Yann strikes again! And I learned something...\o/

Thanks Stefan :)

>
> I needed to make small tweaks, because no all previous close checked the
> return value and I got assert failures on your patch.

Interesting, what error value was that? EBADF, EAGAIN?

> I attach my version
> below. Instead of the param, there could also be 2 functions, one without
> checks. A matter of taste.

Depending on the error cases, we might want to fix something or ignore
or relax the assertion maybe.
Let's see..

>
> So, I try to summarise:
>
> - the effect was triggered by the keep-alive connections remote endpoints
>   slumbering in a proxy connection pool where no one was monitoring the
>   sockets and reacting to TCP FIN packets.
> - This triggered a long LINGER timeout that blocked the listener from
>   exiting. Ultimately leading to the parent killing the child.
> - This situation could also happen, I assume, when a client drops from
>   the network, e.g. a cell phone entering bad coverage, where no one
>   in the network path really knows what to do? Or a network connectivity
>   issue or a comp crashing behind a NAT?

Exactly, I think that closing kept alive connections is the right
thing to do here, lingering close helps neither MPM event nor the
client.
I first tried to use the short linger timeout instead of the long one,
but I don't see a reason to linger those connections, even a little.
Other opinions maybe?


Cheers;
Yann.


Re: child exit on self-proxy

2021-07-08 Thread Yann Ylavic
On Thu, Jul 8, 2021 at 3:33 PM Yann Ylavic  wrote:
>
> On Thu, Jul 8, 2021 at 2:55 PM Stefan Eissing
>  wrote:
> >
> > This seems to be it! Yann strikes again! And I learned something...\o/
>
> Thanks Stefan :)
>
> >
> > I needed to make small tweaks, because no all previous close checked the
> > return value and I got assert failures on your patch.
>
> Interesting, what error value was that? EBADF, EAGAIN?
>
> > I attach my version
> > below. Instead of the param, there could also be 2 functions, one without
> > checks. A matter of taste.
>
> Depending on the error cases, we might want to fix something or ignore
> or relax the assertion maybe.
> Let's see..
>
> >
> > So, I try to summarise:
> >
> > - the effect was triggered by the keep-alive connections remote endpoints
> >   slumbering in a proxy connection pool where no one was monitoring the
> >   sockets and reacting to TCP FIN packets.
> > - This triggered a long LINGER timeout that blocked the listener from
> >   exiting. Ultimately leading to the parent killing the child.
> > - This situation could also happen, I assume, when a client drops from
> >   the network, e.g. a cell phone entering bad coverage, where no one
> >   in the network path really knows what to do? Or a network connectivity
> >   issue or a comp crashing behind a NAT?
>
> Exactly, I think that closing kept alive connections is the right
> thing to do here, lingering close helps neither MPM event nor the
> client.
> I first tried to use the short linger timeout instead of the long one,
> but I don't see a reason to linger those connections, even a little.
> Other opinions maybe?
>
>
> Cheers;
> Yann.


Re: child exit on self-proxy

2021-07-08 Thread Stefan Eissing



> Am 08.07.2021 um 15:33 schrieb Yann Ylavic :
> 
> On Thu, Jul 8, 2021 at 2:55 PM Stefan Eissing
>  wrote:
>> 
>> This seems to be it! Yann strikes again! And I learned something...\o/
> 
> Thanks Stefan :)
> 
>> 
>> I needed to make small tweaks, because no all previous close checked the
>> return value and I got assert failures on your patch.
> 
> Interesting, what error value was that? EBADF, EAGAIN?

EBADF is what I saw.

> 
>> I attach my version
>> below. Instead of the param, there could also be 2 functions, one without
>> checks. A matter of taste.
> 
> Depending on the error cases, we might want to fix something or ignore
> or relax the assertion maybe.
> Let's see..
> 
>> 
>> So, I try to summarise:
>> 
>> - the effect was triggered by the keep-alive connections remote endpoints
>>  slumbering in a proxy connection pool where no one was monitoring the
>>  sockets and reacting to TCP FIN packets.
>> - This triggered a long LINGER timeout that blocked the listener from
>>  exiting. Ultimately leading to the parent killing the child.
>> - This situation could also happen, I assume, when a client drops from
>>  the network, e.g. a cell phone entering bad coverage, where no one
>>  in the network path really knows what to do? Or a network connectivity
>>  issue or a comp crashing behind a NAT?
> 
> Exactly, I think that closing kept alive connections is the right
> thing to do here, lingering close helps neither MPM event nor the
> client.
> I first tried to use the short linger timeout instead of the long one,
> but I don't see a reason to linger those connections, even a little.
> Other opinions maybe?
> 
> 
> Cheers;
> Yann.



Re: child exit on self-proxy

2021-07-08 Thread Eric Covener
> I first tried to use the short linger timeout instead of the long one,
> but I don't see a reason to linger those connections, even a little.

IIUC this would only be useful for KeepAliveTimeout of something like
2s or less which is kind of academic.
I would go with whatever is simpler [to maintain].


Re: child exit on self-proxy

2021-07-08 Thread Yann Ylavic
On Thu, Jul 8, 2021 at 3:35 PM Stefan Eissing
 wrote:
> >>
> >> I needed to make small tweaks, because no all previous close checked the
> >> return value and I got assert failures on your patch.
> >
> > Interesting, what error value was that? EBADF, EAGAIN?
>
> EBADF is what I saw.

Could you track somehow where the first close happened?
Maybe:
static void close_socket_nonblocking_ex(apr_socket_t *csd, const char *from)
{
apr_status_t rv;
apr_os_sock_t fd = -1;
apr_os_sock_get(&fd, csd);
apr_socket_timeout_set(csd, 0);
rv = apr_socket_close(csd);
if (rv != APR_SUCCESS) {
ap_log_error(APLOG_MARK, APLOG_ERR, rv, ap_server_conf, APLOGNO(00468)
 "error closing socket %i/%pp from %s", (int)fd, csd, from);
/*AP_DEBUG_ASSERT(0);*/
}
else {
ap_log_error(APLOG_MARK, APLOG_DEBUG, rv, ap_server_conf,
 "socket %i/%pp closed from %s", (int)fd, csd, from);
}
}
#define close_socket_nonblocking(csd) close_socket_nonblocking_ex(csd,
__FUNCTION__)

to see if close_socket_nonblocking() is called twice or something.


Re: child exit on self-proxy

2021-07-08 Thread Stefan Eissing
I see many of those:
[Thu Jul 08 14:16:55.301670 2021] [mpm_event:error] [pid 81101:tid 
123145411510272] (9)Bad file descriptor: AH00468: error closing socket 
-1/7ff9cf0086b0 from process_socket

which come from event.c#1263: rc = start_lingering_close_blocking(cs);
calling event.c#864: kill_connection_ex(cs, from);
and event.c#837: close_socket_nonblocking_ex(cs->pfd.desc.s, from);

Patch attached.



event_ka_no_lingerv3.diff
Description: Binary data

> Am 08.07.2021 um 15:46 schrieb Yann Ylavic :
> 
> On Thu, Jul 8, 2021 at 3:35 PM Stefan Eissing
>  wrote:
 
 I needed to make small tweaks, because no all previous close checked the
 return value and I got assert failures on your patch.
>>> 
>>> Interesting, what error value was that? EBADF, EAGAIN?
>> 
>> EBADF is what I saw.
> 
> Could you track somehow where the first close happened?
> Maybe:
> static void close_socket_nonblocking_ex(apr_socket_t *csd, const char *from)
> {
>apr_status_t rv;
>apr_os_sock_t fd = -1;
>apr_os_sock_get(&fd, csd);
>apr_socket_timeout_set(csd, 0);
>rv = apr_socket_close(csd);
>if (rv != APR_SUCCESS) {
>ap_log_error(APLOG_MARK, APLOG_ERR, rv, ap_server_conf, APLOGNO(00468)
> "error closing socket %i/%pp from %s", (int)fd, csd, 
> from);
>/*AP_DEBUG_ASSERT(0);*/
>}
>else {
>ap_log_error(APLOG_MARK, APLOG_DEBUG, rv, ap_server_conf,
> "socket %i/%pp closed from %s", (int)fd, csd, from);
>}
> }
> #define close_socket_nonblocking(csd) close_socket_nonblocking_ex(csd,
> __FUNCTION__)
> 
> to see if close_socket_nonblocking() is called twice or something.



Re: child exit on self-proxy

2021-07-08 Thread Yann Ylavic
On Thu, Jul 8, 2021 at 4:21 PM Stefan Eissing
 wrote:
>
> I see many of those:
> [Thu Jul 08 14:16:55.301670 2021] [mpm_event:error] [pid 81101:tid 
> 123145411510272] (9)Bad file descriptor: AH00468: error closing socket 
> -1/7ff9cf0086b0 from process_socket
>
> which come from event.c#1263: rc = start_lingering_close_blocking(cs);
> calling event.c#864: kill_connection_ex(cs, from);
> and event.c#837: close_socket_nonblocking_ex(cs->pfd.desc.s, from);

OK, so I think this could be addressed by the attached patch.
This is the same as v0 plus some changes in ap_start_lingering_close()
(from server/connection.c) to not close the socket on error.
ap_start_lingering_close() did not close the socket consistently, so
the caller had to call apr_socket_close() anyway and sometimes fail..
The close on failure is moved to ap_lingering_close() which is the one
that should care.

Still the AP_DEBUG_ASSERT(0) triggering with this?
Index: server/mpm/event/event.c
===
--- server/mpm/event/event.c	(revision 1891217)
+++ server/mpm/event/event.c	(working copy)
@@ -554,9 +554,10 @@ static APR_INLINE int connections_above_limit(int
 return 1;
 }
 
-static void abort_socket_nonblocking(apr_socket_t *csd)
+static void close_socket_nonblocking(apr_socket_t *csd)
 {
 apr_status_t rv;
+
 apr_socket_timeout_set(csd, 0);
 rv = apr_socket_close(csd);
 if (rv != APR_SUCCESS) {
@@ -573,7 +574,7 @@ static void close_worker_sockets(void)
 apr_socket_t *csd = worker_sockets[i];
 if (csd) {
 worker_sockets[i] = NULL;
-abort_socket_nonblocking(csd);
+close_socket_nonblocking(csd);
 }
 }
 for (;;) {
@@ -587,12 +588,16 @@ static void close_worker_sockets(void)
 continue;
 }
 cs->chain = NULL;
-abort_socket_nonblocking(cs->pfd.desc.s);
+close_socket_nonblocking(cs->pfd.desc.s);
 }
 }
 
-static void wakeup_listener(void)
+static void shutdown_listener(void)
 {
+ap_log_error(APLOG_MARK, APLOG_DEBUG, 0, ap_server_conf,
+ "shutting down listener (%s)",
+ listener_may_exit ? "again" : "first");
+
 listener_may_exit = 1;
 disable_listensocks();
 
@@ -643,7 +648,7 @@ static void signal_threads(int mode)
 /* in case we weren't called from the listener thread, wake up the
  * listener thread
  */
-wakeup_listener();
+shutdown_listener();
 
 /* for ungraceful termination, let the workers exit now;
  * for graceful termination, the listener thread will notify the
@@ -779,6 +784,8 @@ static apr_status_t decrement_connection_count(voi
 {
 int is_last_connection;
 event_conn_state_t *cs = cs_;
+ap_log_cerror(APLOG_MARK, APLOG_TRACE4, 0, cs->c,
+  "cleanup connection in state %i", (int)cs->pub.state);
 switch (cs->pub.state) {
 case CONN_STATE_LINGER_NORMAL:
 case CONN_STATE_LINGER_SHORT:
@@ -817,6 +824,22 @@ static void notify_resume(event_conn_state_t *cs,
 ap_run_resume_connection(cs->c, cs->r);
 }
 
+static void kill_connection(event_conn_state_t *cs)
+{
+ap_log_cerror(APLOG_MARK, APLOG_TRACE4, 0, cs->c,
+  "kill connection in state %i", (int)cs->pub.state);
+close_socket_nonblocking(cs->pfd.desc.s);
+ap_queue_info_push_pool(worker_queue_info, cs->p);
+}
+
+static int kill_connection_nonblocking(event_conn_state_t *cs)
+{
+kill_connection(cs);
+if (dying)
+ap_queue_interrupt_one(worker_queue);
+return 1;
+}
+
 /*
  * Close our side of the connection, flushing data to the client first.
  * Pre-condition: cs is not in any timeout queue and not in the pollset,
@@ -827,12 +850,11 @@ static void notify_resume(event_conn_state_t *cs,
  */
 static int start_lingering_close_blocking(event_conn_state_t *cs)
 {
-apr_socket_t *csd = cs->pfd.desc.s;
-
+ap_log_cerror(APLOG_MARK, APLOG_TRACE4, 0, cs->c,
+  "lingering close in state %i", (int)cs->pub.state);
 if (ap_start_lingering_close(cs->c)) {
 notify_suspend(cs);
-apr_socket_close(csd);
-ap_queue_info_push_pool(worker_queue_info, cs->p);
+kill_connection(cs);
 return DONE;
 }
 
@@ -839,11 +861,11 @@ static int start_lingering_close_blocking(event_co
 #ifdef AP_DEBUG
 {
 apr_status_t rv;
-rv = apr_socket_timeout_set(csd, 0);
+rv = apr_socket_timeout_set(cs->pfd.desc.s, 0);
 AP_DEBUG_ASSERT(rv == APR_SUCCESS);
 }
 #else
-apr_socket_timeout_set(csd, 0);
+apr_socket_timeout_set(cs->pfd.desc.s, 0);
 #endif
 
 cs->queue_timestamp = apr_time_now();
@@ -874,9 +896,11 @@ static int start_lingering_close_blocking(event_co
  */
 static int start_lingering_close_nonblocking(event_conn_state_t *cs)
 {
-event_conn_state_t *chain;
+ap_log_cerror(APLOG_MARK, APLOG_TRACE4, 0, cs->c,
+  "lingering close nonblocking in 

Re: child exit on self-proxy

2021-07-09 Thread Stefan Eissing
*stands in the pentagram of blood drawn by Yann*

"I'll do thy bidding and report back!"

> Am 08.07.2021 um 20:17 schrieb Yann Ylavic :
> 
> On Thu, Jul 8, 2021 at 4:21 PM Stefan Eissing
>  wrote:
>> 
>> I see many of those:
>> [Thu Jul 08 14:16:55.301670 2021] [mpm_event:error] [pid 81101:tid 
>> 123145411510272] (9)Bad file descriptor: AH00468: error closing socket 
>> -1/7ff9cf0086b0 from process_socket
>> 
>> which come from event.c#1263: rc = start_lingering_close_blocking(cs);
>> calling event.c#864: kill_connection_ex(cs, from);
>> and event.c#837: close_socket_nonblocking_ex(cs->pfd.desc.s, from);
> 
> OK, so I think this could be addressed by the attached patch.
> This is the same as v0 plus some changes in ap_start_lingering_close()
> (from server/connection.c) to not close the socket on error.
> ap_start_lingering_close() did not close the socket consistently, so
> the caller had to call apr_socket_close() anyway and sometimes fail..
> The close on failure is moved to ap_lingering_close() which is the one
> that should care.
> 
> Still the AP_DEBUG_ASSERT(0) triggering with this?
> 



Re: child exit on self-proxy

2021-07-09 Thread Stefan Eissing



> Am 08.07.2021 um 20:17 schrieb Yann Ylavic :
> 
> On Thu, Jul 8, 2021 at 4:21 PM Stefan Eissing
>  wrote:
>> 
>> I see many of those:
>> [Thu Jul 08 14:16:55.301670 2021] [mpm_event:error] [pid 81101:tid 
>> 123145411510272] (9)Bad file descriptor: AH00468: error closing socket 
>> -1/7ff9cf0086b0 from process_socket
>> 
>> which come from event.c#1263: rc = start_lingering_close_blocking(cs);
>> calling event.c#864: kill_connection_ex(cs, from);
>> and event.c#837: close_socket_nonblocking_ex(cs->pfd.desc.s, from);
> 
> OK, so I think this could be addressed by the attached patch.
> This is the same as v0 plus some changes in ap_start_lingering_close()
> (from server/connection.c) to not close the socket on error.
> ap_start_lingering_close() did not close the socket consistently, so
> the caller had to call apr_socket_close() anyway and sometimes fail..
> The close on failure is moved to ap_lingering_close() which is the one
> that should care.
> 
> Still the AP_DEBUG_ASSERT(0) triggering with this?
> 

No assertion failures or mpm_event:error logs with this one, however (if it was 
this patch or already on one of them before), the "pre_close" hook does not run 
during shutdown. This results in logs as:

[Fri Jul 09 07:23:12.256163 2021] [http2:warn] [pid 97909:tid 123145335771136] 
[remote 127.0.0.1:59875] AH10020: h2_session(146,IDLE,0): session cleanup 
triggered by pool cleanup. this should have happened earlier already.
[Fri Jul 09 07:23:12.256197 2021] [http2:warn] [pid 97909:tid 123145335771136] 
[remote 127.0.0.1:59875] AH03199: h2_session(146,IDLE,0): connection 
disappeared without proper goodbye, clients will be confused, should not happen

I think omitting the last GOAWAY frame in case of a server restart, is not a 
tragedy. For the sake of an easier to maintain shutdown, I think we can live 
with that. Or, if you can think of a better interface to a pre_close hook, I 
could adjust. Maybe it would be easier for pre_close to get passed a bucket 
brigade where last bytes can be added to?

In case we say, we live with the incorrect h2 shutdown, I would amend the log 
level on a stopping server.

- Stefan



Re: child exit on self-proxy

2021-07-09 Thread Stefan Eissing
Rarely seeing a

close_socket_nonblocking + 83 (event.c:566) 

assertion failure. :-/

> Am 09.07.2021 um 09:38 schrieb Stefan Eissing :
> 
> 
> 
>> Am 08.07.2021 um 20:17 schrieb Yann Ylavic :
>> 
>> On Thu, Jul 8, 2021 at 4:21 PM Stefan Eissing
>>  wrote:
>>> 
>>> I see many of those:
>>> [Thu Jul 08 14:16:55.301670 2021] [mpm_event:error] [pid 81101:tid 
>>> 123145411510272] (9)Bad file descriptor: AH00468: error closing socket 
>>> -1/7ff9cf0086b0 from process_socket
>>> 
>>> which come from event.c#1263: rc = start_lingering_close_blocking(cs);
>>> calling event.c#864: kill_connection_ex(cs, from);
>>> and event.c#837: close_socket_nonblocking_ex(cs->pfd.desc.s, from);
>> 
>> OK, so I think this could be addressed by the attached patch.
>> This is the same as v0 plus some changes in ap_start_lingering_close()
>> (from server/connection.c) to not close the socket on error.
>> ap_start_lingering_close() did not close the socket consistently, so
>> the caller had to call apr_socket_close() anyway and sometimes fail..
>> The close on failure is moved to ap_lingering_close() which is the one
>> that should care.
>> 
>> Still the AP_DEBUG_ASSERT(0) triggering with this?
>> 
> 
> No assertion failures or mpm_event:error logs with this one, however (if it 
> was this patch or already on one of them before), the "pre_close" hook does 
> not run during shutdown. This results in logs as:
> 
> [Fri Jul 09 07:23:12.256163 2021] [http2:warn] [pid 97909:tid 
> 123145335771136] [remote 127.0.0.1:59875] AH10020: h2_session(146,IDLE,0): 
> session cleanup triggered by pool cleanup. this should have happened earlier 
> already.
> [Fri Jul 09 07:23:12.256197 2021] [http2:warn] [pid 97909:tid 
> 123145335771136] [remote 127.0.0.1:59875] AH03199: h2_session(146,IDLE,0): 
> connection disappeared without proper goodbye, clients will be confused, 
> should not happen
> 
> I think omitting the last GOAWAY frame in case of a server restart, is not a 
> tragedy. For the sake of an easier to maintain shutdown, I think we can live 
> with that. Or, if you can think of a better interface to a pre_close hook, I 
> could adjust. Maybe it would be easier for pre_close to get passed a bucket 
> brigade where last bytes can be added to?
> 
> In case we say, we live with the incorrect h2 shutdown, I would amend the log 
> level on a stopping server.
> 
> - Stefan
> 



Re: child exit on self-proxy

2021-07-13 Thread Stefan Eissing
Coming back to this: I vote for applying this to trunk. On a fresh setup, I do 
not see and assert failures in my test.

- Stefan

> Am 08.07.2021 um 20:17 schrieb Yann Ylavic :
> 
> On Thu, Jul 8, 2021 at 4:21 PM Stefan Eissing
>  wrote:
>> 
>> I see many of those:
>> [Thu Jul 08 14:16:55.301670 2021] [mpm_event:error] [pid 81101:tid 
>> 123145411510272] (9)Bad file descriptor: AH00468: error closing socket 
>> -1/7ff9cf0086b0 from process_socket
>> 
>> which come from event.c#1263: rc = start_lingering_close_blocking(cs);
>> calling event.c#864: kill_connection_ex(cs, from);
>> and event.c#837: close_socket_nonblocking_ex(cs->pfd.desc.s, from);
> 
> OK, so I think this could be addressed by the attached patch.
> This is the same as v0 plus some changes in ap_start_lingering_close()
> (from server/connection.c) to not close the socket on error.
> ap_start_lingering_close() did not close the socket consistently, so
> the caller had to call apr_socket_close() anyway and sometimes fail..
> The close on failure is moved to ap_lingering_close() which is the one
> that should care.
> 
> Still the AP_DEBUG_ASSERT(0) triggering with this?
> 



Re: child exit on self-proxy

2021-07-13 Thread Stefan Eissing
Found one condition where the assertion failure happens. The test case I have 
does this:

sock = socket.create_connection((host, int(env.https_port)))
try:
sock.settimeout(0.5)
sock.recv(1024)
assert False
except Exception as ex:
print(f"as expected: {ex}")
sock.close()

Which is all that's needed to trigger the assertion. The server is not stopped.

The log for this thread only gives:
[Tue Jul 13 13:51:29.476747 2021] [mpm_event:trace4] [pid 31582:tid 
123145480441856] event.c(858): [client 127.0.0.1:51768] lingering close in 
state 5
[Tue Jul 13 13:51:29.476769 2021] [mpm_event:trace4] [pid 31582:tid 
123145480441856] event.c(833): [remote 127.0.0.1:51768] kill connection in 
state 5
[Tue Jul 13 13:51:29.476775 2021] [mpm_event:trace4] [pid 31582:tid 
123145480441856] event.c(563): closing socket -1/7fa24681d0b0 from 
process_socket
[Tue Jul 13 13:51:29.476779 2021] [mpm_event:error] [pid 31582:tid 
123145480441856] (9)Bad file descriptor: AH00468: error closing socket 
-1/7fa24681d0b0 from process_socket
[Tue Jul 13 13:51:29.476799 2021] [core:crit] [pid 31582:tid 123145480441856] 
AH00102: [Tue Jul 13 13:51:29 2021] file event.c, line 569, assertion "0" failed

I added a trace4 "closing socket" before the apr_socket_close(), grepping for 
that in the log shows:
[Tue Jul 13 13:51:28.467677 2021] [mpm_event:trace4] [pid 31582:tid 
123145479905280] event.c(563): closing socket 15/7fa24681d0a0 from 
process_lingering_close
[Tue Jul 13 13:51:29.476775 2021] [mpm_event:trace4] [pid 31582:tid 
123145480441856] event.c(563): closing socket -1/7fa24681d0b0 from 
process_socket
[Tue Jul 13 13:51:29.476779 2021] [mpm_event:error] [pid 31582:tid 
123145480441856] (9)Bad file descriptor: AH00468: error closing socket 
-1/7fa24681d0b0 from process_socket
[Tue Jul 13 13:51:29.984132 2021] [mpm_event:trace4] [pid 31584:tid 
123145470783488] event.c(563): closing socket -1/7fa24681f0b0 from 
process_socket
[Tue Jul 13 13:51:29.984138 2021] [mpm_event:error] [pid 31584:tid 
123145470783488] (9)Bad file descriptor: AH00468: error closing socket 
-1/7fa24681f0b0 from process_socket



Thread 21 Crashed:
0   libsystem_kernel.dylib  0x7fff202e292e __pthread_kill + 10
1   libsystem_pthread.dylib 0x7fff203115bd pthread_kill + 263
2   libsystem_c.dylib   0x7fff202664ab __abort + 139
3   libsystem_c.dylib   0x7fff20266420 abort + 135
4   httpd   0x00010dbfdb72 ap_log_assert + 130 
(log.c:1649)
5   mod_mpm_event.so0x00010e5154bf 
close_socket_nonblocking_ex + 223 (event.c:569)
6   mod_mpm_event.so0x00010e515158 kill_connection_ex + 
67 (event.c:835) [inlined]
7   mod_mpm_event.so0x00010e515158 
start_lingering_close_blocking_ex + 184 (event.c:862)
8   mod_mpm_event.so0x00010e514da9 process_socket + 
1289 (event.c:1261)
9   mod_mpm_event.so0x00010e514769 worker_thread + 809 
(event.c:2394)
10  libapr-1.0.dylib0x00010ddb9341 dummy_worker + 17 
(thread.c:147)
11  libsystem_pthread.dylib 0x7fff203118fc _pthread_start + 224
12  libsystem_pthread.dylib 0x7fff2030d443 thread_start + 15



- Stefan

> Am 13.07.2021 um 12:39 schrieb Stefan Eissing :
> 
> Coming back to this: I vote for applying this to trunk. On a fresh setup, I 
> do not see and assert failures in my test.
> 
> - Stefan
> 
>> Am 08.07.2021 um 20:17 schrieb Yann Ylavic :
>> 
>> On Thu, Jul 8, 2021 at 4:21 PM Stefan Eissing
>>  wrote:
>>> 
>>> I see many of those:
>>> [Thu Jul 08 14:16:55.301670 2021] [mpm_event:error] [pid 81101:tid 
>>> 123145411510272] (9)Bad file descriptor: AH00468: error closing socket 
>>> -1/7ff9cf0086b0 from process_socket
>>> 
>>> which come from event.c#1263: rc = start_lingering_close_blocking(cs);
>>> calling event.c#864: kill_connection_ex(cs, from);
>>> and event.c#837: close_socket_nonblocking_ex(cs->pfd.desc.s, from);
>> 
>> OK, so I think this could be addressed by the attached patch.
>> This is the same as v0 plus some changes in ap_start_lingering_close()
>> (from server/connection.c) to not close the socket on error.
>> ap_start_lingering_close() did not close the socket consistently, so
>> the caller had to call apr_socket_close() anyway and sometimes fail..
>> The close on failure is moved to ap_lingering_close() which is the one
>> that should care.
>> 
>> Still the AP_DEBUG_ASSERT(0) triggering with this?
>> 
> 



Re: child exit on self-proxy

2021-07-13 Thread Stefan Eissing
Wait a minute...need to confirm that I am not lying and used an older patch 
version...

> Am 13.07.2021 um 16:01 schrieb Stefan Eissing :
> 
> Found one condition where the assertion failure happens. The test case I have 
> does this:
> 
> sock = socket.create_connection((host, int(env.https_port)))
> try:
>sock.settimeout(0.5)
>sock.recv(1024)
>assert False
> except Exception as ex:
>print(f"as expected: {ex}")
> sock.close()
> 
> Which is all that's needed to trigger the assertion. The server is not 
> stopped.
> 
> The log for this thread only gives:
> [Tue Jul 13 13:51:29.476747 2021] [mpm_event:trace4] [pid 31582:tid 
> 123145480441856] event.c(858): [client 127.0.0.1:51768] lingering close in 
> state 5
> [Tue Jul 13 13:51:29.476769 2021] [mpm_event:trace4] [pid 31582:tid 
> 123145480441856] event.c(833): [remote 127.0.0.1:51768] kill connection in 
> state 5
> [Tue Jul 13 13:51:29.476775 2021] [mpm_event:trace4] [pid 31582:tid 
> 123145480441856] event.c(563): closing socket -1/7fa24681d0b0 from 
> process_socket
> [Tue Jul 13 13:51:29.476779 2021] [mpm_event:error] [pid 31582:tid 
> 123145480441856] (9)Bad file descriptor: AH00468: error closing socket 
> -1/7fa24681d0b0 from process_socket
> [Tue Jul 13 13:51:29.476799 2021] [core:crit] [pid 31582:tid 123145480441856] 
> AH00102: [Tue Jul 13 13:51:29 2021] file event.c, line 569, assertion "0" 
> failed
> 
> I added a trace4 "closing socket" before the apr_socket_close(), grepping for 
> that in the log shows:
> [Tue Jul 13 13:51:28.467677 2021] [mpm_event:trace4] [pid 31582:tid 
> 123145479905280] event.c(563): closing socket 15/7fa24681d0a0 from 
> process_lingering_close
> [Tue Jul 13 13:51:29.476775 2021] [mpm_event:trace4] [pid 31582:tid 
> 123145480441856] event.c(563): closing socket -1/7fa24681d0b0 from 
> process_socket
> [Tue Jul 13 13:51:29.476779 2021] [mpm_event:error] [pid 31582:tid 
> 123145480441856] (9)Bad file descriptor: AH00468: error closing socket 
> -1/7fa24681d0b0 from process_socket
> [Tue Jul 13 13:51:29.984132 2021] [mpm_event:trace4] [pid 31584:tid 
> 123145470783488] event.c(563): closing socket -1/7fa24681f0b0 from 
> process_socket
> [Tue Jul 13 13:51:29.984138 2021] [mpm_event:error] [pid 31584:tid 
> 123145470783488] (9)Bad file descriptor: AH00468: error closing socket 
> -1/7fa24681f0b0 from process_socket
> 
> 
> 
> Thread 21 Crashed:
> 0   libsystem_kernel.dylib0x7fff202e292e __pthread_kill + 10
> 1   libsystem_pthread.dylib   0x7fff203115bd pthread_kill + 263
> 2   libsystem_c.dylib 0x7fff202664ab __abort + 139
> 3   libsystem_c.dylib 0x7fff20266420 abort + 135
> 4   httpd 0x00010dbfdb72 ap_log_assert + 130 
> (log.c:1649)
> 5   mod_mpm_event.so  0x00010e5154bf 
> close_socket_nonblocking_ex + 223 (event.c:569)
> 6   mod_mpm_event.so  0x00010e515158 kill_connection_ex + 
> 67 (event.c:835) [inlined]
> 7   mod_mpm_event.so  0x00010e515158 
> start_lingering_close_blocking_ex + 184 (event.c:862)
> 8   mod_mpm_event.so  0x00010e514da9 process_socket + 
> 1289 (event.c:1261)
> 9   mod_mpm_event.so  0x00010e514769 worker_thread + 809 
> (event.c:2394)
> 10  libapr-1.0.dylib  0x00010ddb9341 dummy_worker + 17 
> (thread.c:147)
> 11  libsystem_pthread.dylib   0x7fff203118fc _pthread_start + 224
> 12  libsystem_pthread.dylib   0x7fff2030d443 thread_start + 15
> 
> 
> 
> - Stefan
> 
>> Am 13.07.2021 um 12:39 schrieb Stefan Eissing :
>> 
>> Coming back to this: I vote for applying this to trunk. On a fresh setup, I 
>> do not see and assert failures in my test.
>> 
>> - Stefan
>> 
>>> Am 08.07.2021 um 20:17 schrieb Yann Ylavic :
>>> 
>>> On Thu, Jul 8, 2021 at 4:21 PM Stefan Eissing
>>>  wrote:
 
 I see many of those:
 [Thu Jul 08 14:16:55.301670 2021] [mpm_event:error] [pid 81101:tid 
 123145411510272] (9)Bad file descriptor: AH00468: error closing socket 
 -1/7ff9cf0086b0 from process_socket
 
 which come from event.c#1263: rc = start_lingering_close_blocking(cs);
 calling event.c#864: kill_connection_ex(cs, from);
 and event.c#837: close_socket_nonblocking_ex(cs->pfd.desc.s, from);
>>> 
>>> OK, so I think this could be addressed by the attached patch.
>>> This is the same as v0 plus some changes in ap_start_lingering_close()
>>> (from server/connection.c) to not close the socket on error.
>>> ap_start_lingering_close() did not close the socket consistently, so
>>> the caller had to call apr_socket_close() anyway and sometimes fail..
>>> The close on failure is moved to ap_lingering_close() which is the one
>>> that should care.
>>> 
>>> Still the AP_DEBUG_ASSERT(0) triggering with this?
>>> 
>> 
> 



Re: child exit on self-proxy

2021-07-13 Thread Stefan Eissing
In Yann's v3 of the patch, it triggers only when I stop the server while the 
test case is ongoing,

Thread 2 Crashed:
0   libsystem_kernel.dylib  0x7fff202e292e __pthread_kill + 10
1   libsystem_pthread.dylib 0x7fff203115bd pthread_kill + 263
2   libsystem_c.dylib   0x7fff202664ab __abort + 139
3   libsystem_c.dylib   0x7fff20266420 abort + 135
4   httpd   0x00010625bb72 ap_log_assert + 130 
(log.c:1649)
5   mod_mpm_event.so0x000106b73549 
close_socket_nonblocking + 217 (event.c:569)
6   mod_mpm_event.so0x000106b731f2 kill_connection + 64 
(event.c:834) [inlined]
7   mod_mpm_event.so0x000106b731f2 
start_lingering_close_blocking + 178 (event.c:860)
8   mod_mpm_event.so0x000106b72e52 process_socket + 
1282 (event.c:1258)
9   mod_mpm_event.so0x000106b72819 worker_thread + 809 
(event.c:2390)
10  libapr-1.0.dylib0x000106412341 dummy_worker + 17 
(thread.c:147)
11  libsystem_pthread.dylib 0x7fff203118fc _pthread_start + 224
12  libsystem_pthread.dylib 0x7fff2030d443 thread_start + 15


[Tue Jul 13 14:12:18.800125 2021] [mpm_event:trace4] [pid 47786:tid 
123145484374016] event.c(563): closing socket 15/7f95ba07b4a0
[Tue Jul 13 14:12:19.551039 2021] [mpm_event:trace4] [pid 47787:tid 
12314548824] event.c(1322): closing listeners (connection_count=0)
[Tue Jul 13 14:12:19.551146 2021] [mpm_event:trace4] [pid 47786:tid 
123145497251840] event.c(1322): closing listeners (connection_count=1)
[Tue Jul 13 14:12:19.551176 2021] [mpm_event:trace4] [pid 47788:tid 
12314548824] event.c(1322): closing listeners (connection_count=0)
[Tue Jul 13 14:12:19.551209 2021] [mpm_event:trace4] [pid 47786:tid 4476272128] 
event.c(563): closing socket 15/7f95ba07b4b0
[Tue Jul 13 14:12:19.552305 2021] [mpm_event:trace4] [pid 47792:tid 
12314548824] event.c(1322): closing listeners (connection_count=0)
[Tue Jul 13 14:12:19.805283 2021] [mpm_event:trace4] [pid 47786:tid 
123145484910592] event.c(563): closing socket -1/7f95ba07b4b0
[Tue Jul 13 14:12:19.805289 2021] [mpm_event:error] [pid 47786:tid 
123145484910592] (9)Bad file descriptor: AH00468: error closing socket 
-1/7f95ba07b4b0

You can see that socket 7f95ba07b4b0 is closed twice. I attach my log and v5 
patch (which is Yann's plus some extra log around closing).


error_log
Description: Binary data


event_ka_no_lingerv5.diff
Description: Binary data

> Am 13.07.2021 um 16:03 schrieb Stefan Eissing :
> 
> Wait a minute...need to confirm that I am not lying and used an older patch 
> version...
> 
>> Am 13.07.2021 um 16:01 schrieb Stefan Eissing :
>> 
>> Found one condition where the assertion failure happens. The test case I 
>> have does this:
>> 
>> sock = socket.create_connection((host, int(env.https_port)))
>> try:
>>   sock.settimeout(0.5)
>>   sock.recv(1024)
>>   assert False
>> except Exception as ex:
>>   print(f"as expected: {ex}")
>> sock.close()
>> 
>> Which is all that's needed to trigger the assertion. The server is not 
>> stopped.
>> 
>> The log for this thread only gives:
>> [Tue Jul 13 13:51:29.476747 2021] [mpm_event:trace4] [pid 31582:tid 
>> 123145480441856] event.c(858): [client 127.0.0.1:51768] lingering close in 
>> state 5
>> [Tue Jul 13 13:51:29.476769 2021] [mpm_event:trace4] [pid 31582:tid 
>> 123145480441856] event.c(833): [remote 127.0.0.1:51768] kill connection in 
>> state 5
>> [Tue Jul 13 13:51:29.476775 2021] [mpm_event:trace4] [pid 31582:tid 
>> 123145480441856] event.c(563): closing socket -1/7fa24681d0b0 from 
>> process_socket
>> [Tue Jul 13 13:51:29.476779 2021] [mpm_event:error] [pid 31582:tid 
>> 123145480441856] (9)Bad file descriptor: AH00468: error closing socket 
>> -1/7fa24681d0b0 from process_socket
>> [Tue Jul 13 13:51:29.476799 2021] [core:crit] [pid 31582:tid 
>> 123145480441856] AH00102: [Tue Jul 13 13:51:29 2021] file event.c, line 569, 
>> assertion "0" failed
>> 
>> I added a trace4 "closing socket" before the apr_socket_close(), grepping 
>> for that in the log shows:
>> [Tue Jul 13 13:51:28.467677 2021] [mpm_event:trace4] [pid 31582:tid 
>> 123145479905280] event.c(563): closing socket 15/7fa24681d0a0 from 
>> process_lingering_close
>> [Tue Jul 13 13:51:29.476775 2021] [mpm_event:trace4] [pid 31582:tid 
>> 123145480441856] event.c(563): closing socket -1/7fa24681d0b0 from 
>> process_socket
>> [Tue Jul 13 13:51:29.476779 2021] [mpm_event:error] [pid 31582:tid 
>> 123145480441856] (9)Bad file descriptor: AH00468: error closing socket 
>> -1/7fa24681d0b0 from process_socket
>> [Tue Jul 13 13:51:29.984132 2021] [mpm_event:trace4] [pid 31584:tid 
>> 123145470783488] event.c(563): closing socket -1/7fa24681f0b0 from 
>> process_socket
>> [Tue Jul 13 13:51:29.984138 2021] [mpm_event:error] [pid 31584:tid 
>> 123145470783488] (9)Bad file descr

Re: child exit on self-proxy

2021-07-14 Thread Yann Ylavic
On Tue, Jul 13, 2021 at 4:16 PM Stefan Eissing
 wrote:
>
> In Yann's v3 of the patch, it triggers only when I stop the server while the 
> test case is ongoing,

OK thanks, I have a new v6 which should address this and also still
call pre_close hooks in any case.
The patch is a bit "verbose" (some cosmetics/comments changes that
helped my workflow, sorry about that) but in the end I think it's
simpler to maintain (as Eric asked?) which may be worth it..

>
> [Tue Jul 13 14:12:18.800125 2021] [mpm_event:trace4] [pid 47786:tid 
> 123145484374016] event.c(563): closing socket 15/7f95ba07b4a0
> [Tue Jul 13 14:12:19.551039 2021] [mpm_event:trace4] [pid 47787:tid 
> 12314548824] event.c(1322): closing listeners (connection_count=0)
> [Tue Jul 13 14:12:19.551146 2021] [mpm_event:trace4] [pid 47786:tid 
> 123145497251840] event.c(1322): closing listeners (connection_count=1)
> [Tue Jul 13 14:12:19.551176 2021] [mpm_event:trace4] [pid 47788:tid 
> 12314548824] event.c(1322): closing listeners (connection_count=0)
> [Tue Jul 13 14:12:19.551209 2021] [mpm_event:trace4] [pid 47786:tid 
> 4476272128] event.c(563): closing socket 15/7f95ba07b4b0
> [Tue Jul 13 14:12:19.552305 2021] [mpm_event:trace4] [pid 47792:tid 
> 12314548824] event.c(1322): closing listeners (connection_count=0)
> [Tue Jul 13 14:12:19.805283 2021] [mpm_event:trace4] [pid 47786:tid 
> 123145484910592] event.c(563): closing socket -1/7f95ba07b4b0
> [Tue Jul 13 14:12:19.805289 2021] [mpm_event:error] [pid 47786:tid 
> 123145484910592] (9)Bad file descriptor: AH00468: error closing socket 
> -1/7f95ba07b4b0
>
> You can see that socket 7f95ba07b4b0 is closed twice.

Indeed, where does the first close come from (the caller)?
If you see this with the new patch (LogLevel mpm_event:trace8), can
you confirm that it's from close_worker_sockets() (i.e. for ungraceful
restart) only?

Cheers;
Yann.
Index: server/mpm/event/event.c
===
--- server/mpm/event/event.c	(revision 1891501)
+++ server/mpm/event/event.c	(working copy)
@@ -254,6 +254,8 @@ struct event_conn_state_t {
 conn_state_t pub;
 /** chaining in defer_linger_chain */
 struct event_conn_state_t *chain;
+/** Is lingering close from defer_lingering_close()? */
+int deferred_linger;
 };
 
 APR_RING_HEAD(timeout_head_t, event_conn_state_t);
@@ -289,9 +291,10 @@ static volatile apr_time_t  queues_next_expiry;
  * Macros for accessing struct timeout_queue.
  * For TO_QUEUE_APPEND and TO_QUEUE_REMOVE, timeout_mutex must be held.
  */
-static void TO_QUEUE_APPEND(struct timeout_queue *q, event_conn_state_t *el)
+static void TO_QUEUE_APPEND(struct timeout_queue *q, event_conn_state_t *el,
+int do_wakeup)
 {
-apr_time_t q_expiry;
+apr_time_t elem_expiry;
 apr_time_t next_expiry;
 
 APR_RING_INSERT_TAIL(&q->head, el, event_conn_state_t, timeout_list);
@@ -298,16 +301,16 @@ static volatile apr_time_t  queues_next_expiry;
 ++*q->total;
 ++q->count;
 
-/* Cheaply update the overall queues' next expiry according to the
- * first entry of this queue (oldest), if necessary.
+/* Cheaply update the global queues_next_expiry with the one of the
+ * first entry of this queue (oldest) if it expires before.
  */
 el = APR_RING_FIRST(&q->head);
-q_expiry = el->queue_timestamp + q->timeout;
+elem_expiry = el->queue_timestamp + q->timeout;
 next_expiry = queues_next_expiry;
-if (!next_expiry || next_expiry > q_expiry + TIMEOUT_FUDGE_FACTOR) {
-queues_next_expiry = q_expiry;
+if (!next_expiry || next_expiry > elem_expiry + TIMEOUT_FUDGE_FACTOR) {
+queues_next_expiry = elem_expiry;
 /* Unblock the poll()ing listener for it to update its timeout. */
-if (listener_is_wakeable) {
+if (do_wakeup && listener_is_wakeable) {
 apr_pollset_wakeup(event_pollset);
 }
 }
@@ -554,9 +557,20 @@ static APR_INLINE int connections_above_limit(int
 return 1;
 }
 
-static void abort_socket_nonblocking(apr_socket_t *csd)
+static void close_socket_nonblocking_(apr_socket_t *csd,
+  const char *from, int line)
 {
 apr_status_t rv;
+apr_os_sock_t fd = -1;
+
+/* close_worker_sockets() may have closed it already */
+rv = apr_os_sock_get(&fd, csd);
+ap_log_error(APLOG_MARK, APLOG_TRACE8, 0, ap_server_conf,
+"closing socket %i/%pp from %s:%i", (int)fd, csd, from, line);
+if (rv == APR_SUCCESS && fd == -1) {
+return;
+}
+
 apr_socket_timeout_set(csd, 0);
 rv = apr_socket_close(csd);
 if (rv != APR_SUCCESS) {
@@ -565,6 +579,8 @@ static APR_INLINE int connections_above_limit(int
 AP_DEBUG_ASSERT(0);
 }
 }
+#define close_socket_nonblocking(csd) \
+close_socket_nonblocking_(csd, __FUNCTION__, __LINE__)
 
 static void close_worker_sockets(void)
 {
@@ -573,26 +589,16 @@ static void close_worker_s

Re: child exit on self-proxy

2021-07-14 Thread Yann Ylavic
On Wed, Jul 14, 2021 at 3:42 PM Yann Ylavic  wrote:
>
> On Tue, Jul 13, 2021 at 4:16 PM Stefan Eissing
>  wrote:
> >
> > In Yann's v3 of the patch, it triggers only when I stop the server while 
> > the test case is ongoing,
>
> OK thanks, I have a new v6 which should address this and also still
> call pre_close hooks in any case.
> The patch is a bit "verbose" (some cosmetics/comments changes that
> helped my workflow, sorry about that) but in the end I think it's
> simpler to maintain (as Eric asked?) which may be worth it..

I have staged this a bit in https://github.com/apache/httpd/pull/208


Re: child exit on self-proxy

2021-07-15 Thread Stefan Eissing



> Am 14.07.2021 um 23:47 schrieb Yann Ylavic :
> 
> On Wed, Jul 14, 2021 at 3:42 PM Yann Ylavic  wrote:
>> 
>> On Tue, Jul 13, 2021 at 4:16 PM Stefan Eissing
>>  wrote:
>>> 
>>> In Yann's v3 of the patch, it triggers only when I stop the server while 
>>> the test case is ongoing,
>> 
>> OK thanks, I have a new v6 which should address this and also still
>> call pre_close hooks in any case.
>> The patch is a bit "verbose" (some cosmetics/comments changes that
>> helped my workflow, sorry about that) but in the end I think it's
>> simpler to maintain (as Eric asked?) which may be worth it..
> 
> I have staged this a bit in https://github.com/apache/httpd/pull/208

Works flawless in my test scenarios! \o/

- Stefan

Re: child exit on self-proxy

2021-07-15 Thread Dirk-Willem van Gulik


On 14 Jul 2021, at 23:47, Yann Ylavic  wrote:
> On Wed, Jul 14, 2021 at 3:42 PM Yann Ylavic  wrote:
>> 
>> On Tue, Jul 13, 2021 at 4:16 PM Stefan Eissing
>>  wrote:
>>> 
>>> In Yann's v3 of the patch, it triggers only when I stop the server while 
>>> the test case is ongoing,
>> 
>> OK thanks, I have a new v6 which should address this and also still
>> call pre_close hooks in any case.
>> The patch is a bit "verbose" (some cosmetics/comments changes that
>> helped my workflow, sorry about that) but in the end I think it's
>> simpler to maintain (as Eric asked?) which may be worth it..
> 
> I have staged this a bit in https://github.com/apache/httpd/pull/208

Lovely - well done - this fixes things on our end.

Dw