Re: [Patch] possible mutex starvation issue affects all nginx Linux versions.

2013-12-11 Thread itpp2012
From the patch author:

Hi Maxim,
I apologize for my late reply: I only now had time to sort this out.

The short answer to your remarks is:
the first patch is partially correct, just incomplete, and could easily be
completed by adding a call to ngx_disable_accept_events(...), which
addresses the issue of two workers accepting connections at the same time.
The second one is windows-only and is as correct and atomic as the unix-only
ngx_unlock_mutexes(); it addresses the very same issue that function exists
for (which, by the way, showed up during my tests) while being just one line
long.

The long answer follows and, well, is quite long.


So, before everything else, and for the sake of clarity:

the accept mutex patching was performed on a codebase derived from nginx
1.5.6, after properly re-enabling the accept_mutex by removing the following
disabling lines in ngx_event.c:

#if (NGX_WIN32)

/*
 * disable accept mutex on win32 as it may cause deadlock if
 * grabbed by a process which can't accept connections
 */

ngx_use_accept_mutex = 0;

#endif

Thus the submitted patch lines are not useless: they are part of a larger
patch included in Caterpillar 1.5.7.1.
The latter (Caterpillar) fully works, correctly distributing the workload
across any number of configured worker processes (not just one of them).

Now getting to specific proposed patches:

A) About the added line ngx_accept_mutex_held=0 in ngx_event.c

 259 if (ngx_accept_mutex_held) {
 --+     ngx_accept_mutex_held = 0;
 260     ngx_shmtx_unlock(&ngx_accept_mutex);
 261 }

That is not wrong, it is just incomplete.
First of all, it pairs with the similar one in

src/event/ngx_event_accept.c, line 402

which was patched in Caterpillar and which you too later found, in your
1.5.7.
The problem with this one (line 402) was that ngx_accept_mutex_held, a
variable *local* to each process, and the accept_mutex, *global* to all
nginx processes (it is allocated in shared memory), got out of sync.
This resulted (as tests showed) in more than one worker process locally
'thinking' it owned the accept_mutex at the same time.
That in turn interfered with the calls to their respective
ngx_disable_accept_events(...), leading, because of nasty timing, to no
worker ever being able to re-enable accept events, and so to all workers
(hence the whole server) being unable to handle any further connection.

The line added between 259 and 260 tried to fix this too, but there a
call to ngx_disable_accept_events(...) is missing.
To be completely correct and not to result in the partially incorrect
behaviour you rightly pointed out, such a call must be added as well.
This can be done by moving the fully patched 'if' code

if (ngx_accept_mutex_held) {
    ngx_disable_accept_events(cycle);
    ngx_accept_mutex_held = 0;
    ngx_shmtx_unlock(&ngx_accept_mutex);
}

to ngx_event_accept.c (since ngx_disable_accept_events(...) has internal
linkage) in a dedicated function, for example

void ngx_release_accept_mutex(ngx_cycle_t *cycle) { /* the 'if' above */ }

which then gets called at line 259 in ngx_event.c instead of the 'if'
itself.
By the way, to make the patch complete, the 'if' in ngx_trylock_accept_mutex
where ngx_disable_accept_events() is called should be removed, since it is
superseded by that ngx_release_accept_mutex() call.
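
Spelled out, a minimal sketch of such a helper could look as follows (this
assumes the nginx 1.5.x internals named above; ngx_release_accept_mutex is
only the name I propose here, not an existing nginx function):

/* src/event/ngx_event_accept.c -- sketch only, not an official patch */

void
ngx_release_accept_mutex(ngx_cycle_t *cycle)
{
    if (ngx_accept_mutex_held) {
        /* stop accepting before another worker can grab the mutex,
         * closing the window described below */
        ngx_disable_accept_events(cycle);
        ngx_accept_mutex_held = 0;
        ngx_shmtx_unlock(&ngx_accept_mutex);
    }
}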

All of this is to avoid the partially incorrect behaviour that the 'if', as
it is now,

 259 if (ngx_accept_mutex_held) {
 260     ngx_shmtx_unlock(&ngx_accept_mutex);
 261 }

causes at the end of an iteration in which the accept mutex was held:
the mutex is released in the 'if' above, but ngx_disable_accept_events(...)
won't be called until the next iteration's ngx_trylock_accept_mutex(...).
Thus, in the meanwhile, another worker could succeed in acquiring the accept
mutex via ngx_trylock_accept_mutex(...) and start to accept connections as
well (ngx_enable_accept_events() is called in ngx_trylock_accept_mutex when
the mutex is acquired successfully).
This means that two workers can accept connections at the same time, which
defeats the very purpose of the accept mutex; it is reasonable to assume
this is not the intended behaviour.
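
To make the window concrete, here is a condensed paraphrase of the relevant
part of ngx_process_events_and_timers() (nginx 1.5.x; simplified, see
src/event/ngx_event.c for the exact code):

void
ngx_process_events_and_timers(ngx_cycle_t *cycle)
{
    ngx_uint_t  flags;
    ngx_msec_t  timer;

    /* ... timer and flags computation omitted ... */

    if (ngx_use_accept_mutex) {
        /* enables accept events when the mutex is acquired;
         * disables them when acquisition fails and they were
         * enabled on the previous iteration */
        (void) ngx_trylock_accept_mutex(cycle);
    }

    (void) ngx_process_events(cycle, timer, flags);

    if (ngx_posted_accept_events) {
        ngx_event_process_posted(cycle, &ngx_posted_accept_events);
    }

    if (ngx_accept_mutex_held) {
        ngx_shmtx_unlock(&ngx_accept_mutex);
        /* the window described above opens here: the mutex is free,
         * yet this worker's listening sockets stay armed until the
         * next iteration reaches ngx_trylock_accept_mutex() */
    }

    /* ... timer expiry and remaining posted events omitted ... */
}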


B) About the second proposed patch:
the 3 lines

if (*ngx_accept_mutex.lock == ngx_processes[n].pid) {
    *ngx_accept_mutex.lock = 0;
}

make a patch that prevents the server's accept_mutex from deadlocking when
the worker process owning it dies unexpectedly (a program crash while
processing a request/response, or an abrupt termination from the task
manager).
This scenario occurred several times while testing and, whenever it showed
up, left the server's workers unable to accept any further connection.

Given that, unlike in the unix version with its ngx_unlock_mutexes(), this
scenario is not addressed in the windows version, the above lines are, as
said, meant to fix it.
They are correct and thread-safe (the operation is atomic) *in the specific
scenario in which they are meant to be executed* because:


1) they are 

Re: [Patch] possible mutex starvation issue affects all nginx Linux versions.

2013-12-11 Thread Maxim Dounin
Hello!

On Wed, Dec 11, 2013 at 07:57:32AM -0500, itpp2012 wrote:

[...]

 A) About the added line ngx_accept_mutex_held=0 in ngx_event.c

[...]

 The line added between 259 and 260 tried to fix this too, but there a
 call to ngx_disable_accept_events(...) is missing.
 To be completely correct and not to result in the partially incorrect
 behaviour you rightly pointed out, such a call must be added as well.

This is wrong as well.  There is no reason to disable accept 
events until we are sure that we weren't able to re-acquire the 
accept mutex before going into the kernel to wait for events.  
It's just a waste of resources.

You are misunderstanding the code, probably because the 
ngx_accept_mutex_held variable has a somewhat misleading name.  It 
is not expected to mean we have the accept mutex locked; it means 
we had the accept mutex locked on the previous iteration, and we 
have to disable accept events if we aren't able to re-acquire it.

There is no need to fix it, it's not broken.
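
For reference, a condensed paraphrase of ngx_trylock_accept_mutex() 
from the 1.5.x sources (simplified; see src/event/ngx_event_accept.c 
for the exact code) shows these semantics:

ngx_int_t
ngx_trylock_accept_mutex(ngx_cycle_t *cycle)
{
    if (ngx_shmtx_trylock(&ngx_accept_mutex)) {

        /* we also held it on the previous iteration:
         * accept events are already enabled */
        if (ngx_accept_mutex_held && ngx_accept_events == 0) {
            return NGX_OK;
        }

        if (ngx_enable_accept_events(cycle) == NGX_ERROR) {
            ngx_shmtx_unlock(&ngx_accept_mutex);
            return NGX_ERROR;
        }

        ngx_accept_events = 0;
        ngx_accept_mutex_held = 1;

        return NGX_OK;
    }

    /* failed to re-acquire: disable accept events only if we
     * held the mutex on the previous iteration */
    if (ngx_accept_mutex_held) {
        if (ngx_disable_accept_events(cycle) == NGX_ERROR) {
            return NGX_ERROR;
        }

        ngx_accept_mutex_held = 0;
    }

    return NGX_OK;
}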

[...]

 B) About the second proposed patch:

[...]

 Hopefully at this point it should be clear enough that releasing the
 accept_mutex directly with
 
 *ngx_accept_mutex.lock=0;
 
 is as safe as calling ngx_shmtx_force_unlock(); the choice between the two
 I deem merely cosmetic (they're functionally equivalent), so my choice went
 for the more performant, straight variable assignment.

Even considering the code currently there in the win32 version, there 
is no guarantee that reading *ngx_accept_mutex.lock is atomic.  
While it usually is, in theory it can be non-atomic, and the code 
will start doing wrong things due to the if() check unexpectedly 
succeeding.
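
Compare ngx_shmtx_force_unlock() (src/core/ngx_shmtx.c; shown here as 
a condensed paraphrase), which performs the check and the reset in a 
single atomic compare-and-swap, so the two cannot be torn apart:

ngx_uint_t
ngx_shmtx_force_unlock(ngx_shmtx_t *mtx, ngx_pid_t pid)
{
    /* atomically: if (*mtx->lock == pid) { *mtx->lock = 0; }
     * either both the test and the store happen, or neither */
    if (ngx_atomic_cmp_set(mtx->lock, pid, 0)) {
        return 1;
    }

    return 0;
}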

 Not wanting to make this post longer than it is, I conclude by just
 mentioning that, as of now, the accept mutex is the only mutex living in
 shared memory, so unlocking other possibly shared mutexes is likewise not
 required.

That's not true.  Something like

$ grep -r shmtx_lock src/

will give you an idea where shared memory mutexes are used in 
nginx.

While it's tricky to get all of this working on modern Windows 
versions with ASLR, the accept mutex is certainly not the only 
shared memory mutex used in nginx.

As I already wrote, I don't object to adding an unlock here, but I 
would like to see code which is correct and in line with the unix 
version.

-- 
Maxim Dounin
http://nginx.org/



Re: [Patch] possible mutex starvation issue affects all nginx Linux versions.

2013-12-02 Thread Maxim Dounin
Hello!

On Mon, Dec 02, 2013 at 07:15:52AM -0500, itpp2012 wrote:

 Here is a patch for a possible mutex starvation issue which affects all
 nginx Linux versions.
 Already solved for Windows since nginx 1.5.7.1 Caterpillar.
 Can be reproduced when nginx reloads the config and the worker holding the
 mutex dies or hangs.
 
 Fixed by Vittorio Francesco Digilio, commercially sponsored solution by
 ITPP.
 target source mainline 1.5.8 - 30-11-2013.
 src/event/ngx_event_accept.c, line 402 was correctly found (starvation fix)
 but missed in:
 src/event/ngx_event.c, lines 259-260, 1 line added;
 255 if (ngx_posted_accept_events) {
 256     ngx_event_process_posted(cycle, &ngx_posted_accept_events);
 257 }
 258
 259 if (ngx_accept_mutex_held) {
 --+     ngx_accept_mutex_held = 0;
 260     ngx_shmtx_unlock(&ngx_accept_mutex);
 261 }
 262
 263 if (delta) {
 264     ngx_event_expire_timers();
 265 }

This patch is wrong.  The ngx_accept_mutex_held doesn't need to be 
unset here.  The ngx_accept_mutex_held is used as an indicator that 
on the previous iteration the lock was held by the process, and 
unsetting it will result in incorrect behaviour.

 Also applies to: src/os/win32/ngx_process_cycle.c,
 lines 507-508, 3 lines added;
 504 if (ngx_processes[n].handle != h) {
 505     continue;
 506 }
 507
 --+ if (*ngx_accept_mutex.lock == ngx_processes[n].pid) {
 --+     *ngx_accept_mutex.lock = 0;
 --+ }
 508 if (GetExitCodeProcess(h, &code) == 0) {
 509     ngx_log_error(NGX_LOG_ALERT, cycle->log, ngx_errno,
 510                   "GetExitCodeProcess(%P) failed",
 511                   ngx_processes[n].pid);
 512 }

This patch is also wrong.  In the current state of the win32 
version it is not assumed that the accept mutex is used at all.  But 
if it is, the correct approach to unlocking shared memory mutexes on 
abnormal process termination is to port (or move to a 
platform-independent place) the ngx_unlock_mutexes() function from 
src/os/unix/ngx_process.c.  It correctly uses 
ngx_shmtx_force_unlock() to properly unlock shared memory mutexes 
using atomic operations.
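
For reference, a condensed paraphrase of that function (nginx 1.5.x; 
see src/os/unix/ngx_process.c for the exact code):

static void
ngx_unlock_mutexes(ngx_pid_t pid)
{
    ngx_uint_t        i;
    ngx_shm_zone_t   *shm_zone;
    ngx_list_part_t  *part;
    ngx_slab_pool_t  *sp;

    /* unlock the accept mutex if the dead process held it */
    if (ngx_accept_mutex_ptr) {
        (void) ngx_shmtx_force_unlock(&ngx_accept_mutex, pid);
    }

    /* walk all configured shared memory zones and force-unlock
     * each slab pool mutex owned by the dead process */
    part = (ngx_list_part_t *) &ngx_cycle->shared_memory.part;
    shm_zone = part->elts;

    for (i = 0; /* void */; i++) {

        if (i >= part->nelts) {
            if (part->next == NULL) {
                break;
            }
            part = part->next;
            shm_zone = part->elts;
            i = 0;
        }

        sp = (ngx_slab_pool_t *) shm_zone[i].shm.addr;

        if (ngx_shmtx_force_unlock(&sp->mutex, pid)) {
            ngx_log_error(NGX_LOG_ALERT, ngx_cycle->log, 0,
                          "shared memory zone \"%V\" was locked by %P",
                          &shm_zone[i].shm.name, pid);
        }
    }
}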

Note well that unlocking shared memory mutexes on abnormal 
process termination is an emergency mechanism.  If it actually 
happens, it indicates that something is really wrong in other 
places of the system.

Please also consider reading 
http://nginx.org/en/docs/contributing_changes.html.

-- 
Maxim Dounin
http://nginx.org/en/donation.html
