Re: Late(r) stop of children processes on restart
Thanks for the headroom explanation Yann, good reading!

Rainer

On 25.08.2021 at 13:23, Yann Ylavic wrote:
> On Tue, Jun 29, 2021 at 3:00 PM Rainer Jung wrote:
> >
> > On 29.06.2021 at 14:31, Stefan Eissing wrote:
> > > Can't really comment on the diff, but totally agree on the goal to minimize
> > > the unresponsive time and make graceful less disruptive.
> > >
> > > So +1 for that.
> >
> > +1 on the intention as well.
>
> Checked in trunk (r1892587 + r1892595).
>
> >
> > Not sure whether that means people would need more headroom in the
> > scoreboard (which would probably warrant a sentence in CHANGES or docs
> > about that) or whether it just means the duration during which that
> > headroom is used changes (which I wouldn't care about).
>
> The restart delay between stop and start is now minimal (no reload in
> between), but the headroom needed does not change AIUI. We still have
> the situation where connections (worker threads) are active for both
> the new and old generations of child processes, and its duration
> depends mainly on the actual lifetime of the connections. So the
> current tunings still hold I think.
>
> What changes now is that for both graceful and ungraceful restarts the
> main process fully consumes one CPU (to reload) while children are
> actively running (the old generation keeps accepting/processing
> connections during reload), whereas before the children were tearing
> down thus easing the CPUs (but filling the socket backlogs,
> potentially until exhaustion..). So there might be a greater load
> spike (overall) than before on reload.
>
> A note on the headroom while at it: mpm_event possibly consumes fewer
> children (hence scoreboard slots) on restart, because when a child is
> dying it stops (and thus doesn't account for) the worker threads above
> the remaining number of connections, so children of the new generation
> are created accurately, to scale.
> mpm_worker never stops threads (this improvement never made it there
> AFAICT), thus by accounting for inactive threads as active it will
> end up creating more children of the new generation as connections
> arrive (eventually reaching the limits earlier, or blocking/waiting
> for worker threads in the new generation of children overflowed by
> incoming connections which the main process thinks are evenly
> distributed across all the children, including the old generation's).
> I don't know how hard/worthwhile it is to align mpm_worker with
> mpm_event on this, just a note..
>
> Cheers;
> Yann.
Re: Late(r) stop of children processes on restart
On Tue, Jun 29, 2021 at 3:00 PM Rainer Jung wrote:
>
> On 29.06.2021 at 14:31, Stefan Eissing wrote:
> > Can't really comment on the diff, but totally agree on the goal to minimize
> > the unresponsive time and make graceful less disruptive.
> >
> > So +1 for that.
>
> +1 on the intention as well.

Checked in trunk (r1892587 + r1892595).

>
> Not sure whether that means people would need more headroom in the
> scoreboard (which would probably warrant a sentence in CHANGES or docs
> about that) or whether it just means the duration during which that
> headroom is used changes (which I wouldn't care about).

The restart delay between stop and start is now minimal (no reload in
between), but the headroom needed does not change AIUI. We still have
the situation where connections (worker threads) are active for both
the new and old generations of child processes, and its duration
depends mainly on the actual lifetime of the connections. So the
current tunings still hold I think.

What changes now is that for both graceful and ungraceful restarts the
main process fully consumes one CPU (to reload) while children are
actively running (the old generation keeps accepting/processing
connections during reload), whereas before the children were tearing
down thus easing the CPUs (but filling the socket backlogs,
potentially until exhaustion..). So there might be a greater load
spike (overall) than before on reload.

A note on the headroom while at it: mpm_event possibly consumes fewer
children (hence scoreboard slots) on restart, because when a child is
dying it stops (and thus doesn't account for) the worker threads above
the remaining number of connections, so children of the new generation
are created accurately, to scale.
mpm_worker never stops threads (this improvement never made it there
AFAICT), thus by accounting for inactive threads as active it will
end up creating more children of the new generation as connections
arrive (eventually reaching the limits earlier, or blocking/waiting
for worker threads in the new generation of children overflowed by
incoming connections which the main process thinks are evenly
distributed across all the children, including the old generation's).
I don't know how hard/worthwhile it is to align mpm_worker with
mpm_event on this, just a note..

Cheers;
Yann.
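
The accounting difference described in the note above can be illustrated with a toy model. This is only a sketch under stated assumptions: the function name, its parameters and the sample numbers are hypothetical, and it does not reproduce httpd's actual scoreboard or maintenance code. It only counts how many worker threads of dying old-generation children would still be accounted for under each behaviour.

```python
def accounted_threads(remaining_conns, threads_per_child, stops_idle_threads):
    """Worker threads still accounted for by dying old-generation children.

    Toy model only, not httpd's scoreboard logic.
    remaining_conns: open connections left in each dying child (each >= 1).
    stops_idle_threads: True for the mpm_event-like behaviour from the note,
    False for the mpm_worker-like behaviour.
    """
    total = 0
    for conns in remaining_conns:
        if stops_idle_threads:
            # Threads above the remaining connection count have been stopped.
            total += min(conns, threads_per_child)
        else:
            # No thread is stopped early, so every thread still counts.
            total += threads_per_child
    return total


# Hypothetical numbers: three dying children, 25 threads per child.
old_children = [3, 1, 10]
event_like = accounted_threads(old_children, 25, stops_idle_threads=True)
worker_like = accounted_threads(old_children, 25, stops_idle_threads=False)
print(event_like, worker_like)
```

In this model the event-like accounting shrinks with the remaining connections, while the worker-like accounting stays at the full thread count per child until the child exits, which is one way to picture why the latter may reach the limits earlier.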
Re: Late(r) stop of children processes on restart
On 29.06.2021 at 14:31, Stefan Eissing wrote:
> Can't really comment on the diff, but totally agree on the goal to minimize
> the unresponsive time and make graceful less disruptive.
>
> So +1 for that.

+1 on the intention as well.

Not sure whether that means people would need more headroom in the
scoreboard (which would probably warrant a sentence in CHANGES or docs
about that) or whether it just means the duration during which that
headroom is used changes (which I wouldn't care about).

Thanks and regards,

Rainer

> > On 28.06.2021 at 16:25, Yann Ylavic wrote:
> >
> > When the MPM event/worker is restarting, it first signals the
> > child processes to stop (via POD), then reloads the configuration,
> > and finally starts the new generation.
> >
> > This may be problematic when the reload takes some time to complete
> > because incoming connections are no longer processed.
> > A module at day $job is loading quite a few regexes and JSON schemas
> > for each vhost, and I have seen restarts take tens of seconds to
> > complete with a large number of vhosts. I suppose this can happen with
> > many RewriteRules too.
> >
> > How about we wait for the reload to complete before stopping the old
> > generation, like in the attached patch (MPM event only for now,
> > changes in worker would be quite similar)?
> >
> > This is achieved by creating the PODs and listener buckets from a
> > generation pool (gen_pool), with a different lifetime than pconf.
> > gen_pool survives restarts and is created/cleared after the old
> > generation is stopped, entirely in the run_mpm hook, so the stop and
> > PODs and buckets handling is moved there (most changes are cut/paste).
> >
> > WDYT?
> >
> > Regards;
> > Yann.
Re: Late(r) stop of children processes on restart
Can't really comment on the diff, but totally agree on the goal to minimize
the unresponsive time and make graceful less disruptive.

So +1 for that.

> On 28.06.2021 at 16:25, Yann Ylavic wrote:
>
> When the MPM event/worker is restarting, it first signals the
> child processes to stop (via POD), then reloads the configuration,
> and finally starts the new generation.
>
> This may be problematic when the reload takes some time to complete
> because incoming connections are no longer processed.
> A module at day $job is loading quite a few regexes and JSON schemas
> for each vhost, and I have seen restarts take tens of seconds to
> complete with a large number of vhosts. I suppose this can happen with
> many RewriteRules too.
>
> How about we wait for the reload to complete before stopping the old
> generation, like in the attached patch (MPM event only for now,
> changes in worker would be quite similar)?
>
> This is achieved by creating the PODs and listener buckets from a
> generation pool (gen_pool), with a different lifetime than pconf.
> gen_pool survives restarts and is created/cleared after the old
> generation is stopped, entirely in the run_mpm hook, so the stop and
> PODs and buckets handling is moved there (most changes are cut/paste).
>
> WDYT?
>
> Regards;
> Yann.
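
The two restart orderings discussed in this thread can be compared with a small timeline sketch. It is a toy model, not code from the patch: the function name, the `late_stop` flag and the assumption that starting the new generation takes a configurable (by default zero) amount of time are all hypothetical. It only shows how the window between stopping the old generation and starting the new one depends on where the reload sits.

```python
def restart_timeline(reload_seconds, late_stop, start_seconds=0.0):
    """Return (events, gap) for a restart in a toy model.

    late_stop=False: stop the old generation, reload, then start the new one.
    late_stop=True:  reload first, then stop the old generation and start
                     the new one (the ordering proposed in the thread).
    gap is the time between stopping the old generation and the new
    generation being started.
    """
    t = 0.0
    events = []
    if not late_stop:
        events.append((t, "signal old generation to stop"))
        stop_at = t
    t += reload_seconds
    events.append((t, "configuration reloaded"))
    if late_stop:
        events.append((t, "signal old generation to stop"))
        stop_at = t
    t += start_seconds
    events.append((t, "new generation started"))
    return events, t - stop_at


# Hypothetical 30-second reload, e.g. many vhosts with heavy per-vhost setup.
events_early, gap_early = restart_timeline(30.0, late_stop=False)
events_late, gap_late = restart_timeline(30.0, late_stop=True)
print(gap_early, gap_late)
```

With the early stop, the whole reload time falls inside the gap; with the late stop, only the (here assumed negligible) start time does, at the cost of the reload running while the old generation is still serving.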