Re: [systemd-devel] systemctl stop going through timeout even though all processes have exited

2023-10-23 Thread Martin Schwenke
Hi Mantas,

/sys/fs/cgroup/systemd/release_agent contains
/usr/lib/systemd/systemd-cgroups-agent.  This is inside the container.
I'm not sure if systemd (inside the container) would see anything
different.  I'm also not sure if this is correct.

I used:

  bpftrace -e 'tracepoint:syscalls:sys_enter_exec*{ printf("pid: %d, comm: %s, 
args: ", pid, comm); join(args->argv); }'

to trace exec()s.

Although I see other executables being run, though nothing related to
systemd, I don't see systemd-cgroups-agent being run.

Thanks again for any ideas...

peace & happiness,
martin

On Fri, 13 Oct 2023 12:01:33 +0300, Mantas Mikulėnas
 wrote:

> What value do you have in /sys/fs/cgroup/systemd/release_agent, as seen by
> systemd? Does it point to the correct executable?
> 
> Does e.g. forkstat (or execsnoop or similar) show that the executable is
> being run when the cgroup empties?
> 
> On Fri, Oct 13, 2023, 04:20 Martin Schwenke  wrote:
> 
> > Hi Mantas,
> >
> > Yes, it looks like cgroups v1.
> >
> > Would this be a kernel bug?  systemd bug?
> >
> > Thanks...
> >
> > peace & happiness,
> > martin
> >
> > On Wed, 11 Oct 2023 08:19:59 +0300, Mantas Mikulėnas
> >  wrote:
> >  
> > > Is this with cgroups v1 or v2? If cgroups v1 is involved (thanks  
> > Docker), I  
> > > recall it was a bit complex for systemd to get notified when the cgroup
> > > actually empties – via /sys/fs/cgroup/systemd/release_agent that  
> > specifies  
> > > a helper executable that the kernel runs... I wonder if that mechanism is
> > > broken on your system.
> > >
> > > On Wed, Oct 11, 2023 at 7:38 AM Martin Schwenke   
> > wrote:  
> > >  
> > > > I'm seeing "systemctl stop " for several services taking a
> > > > long time because it goes through the timeout process, even though all
> > > > relevant processes have exited.
> > > >
> > > > I'll give 2 examples.  Both examples are running inside a privileged
> > > > Rocky Linux 8.8 Docker container on a Rocky Linux 8.8 host.  The
> > > > systemd version, reported by "systemctl --version" in the container
> > > > is:
> > > >
> > > >   systemd 239 (239-74.el8_8.5)
> > > >
> > > > Here is ctdb.system:
> > > >
> > > >   [Unit]
> > > >   Description=CTDB
> > > >   Documentation=man:ctdbd(1) man:ctdb(7)
> > > >   After=network-online.target time-sync.target
> > > >   ConditionFileNotEmpty=/etc/ctdb/nodes
> > > >
> > > >   [Service]
> > > >   Type=forking
> > > >   LimitCORE=infinity
> > > >   LimitNOFILE=1048576
> > > >   TasksMax=4096
> > > >   PIDFile=/var/run/ctdb/ctdbd.pid
> > > >   ExecStart=/usr/sbin/ctdbd
> > > >   ExecStop=/usr/bin/ctdb shutdown
> > > >   KillMode=control-group
> > > >   Restart=no
> > > >
> > > >   [Install]
> > > >   WantedBy=multi-user.target
> > > >
> > > > "/usr/bin/ctdb shutdown" causes a controlled shutdown.  In many cases,
> > > > starting and then stopping using systemctl works fine.  However, many
> > > > times it takes >90s to stop, as per TimeoutStopSec.  If I reduce that
> > > > value then the duration reduces accordingly.  I can confirm using both
> > > > "ps auxfww" and "systemd-cgls" that within the container there are no
> > > > relevant processes a moment after "systemctl stop ctdb" is run.  In
> > > > particular, in systemd-cgls ctdb.service is gone but "systemctl stop
> > > > ctdb" is still waiting.
> > > >
> > > > Before attempting to stop, the service is successfully started:
> > > >
> > > >   Oct 11 00:56:44 rocky1 systemd[710741]: ctdb.service: Executing:
> > > > /usr/sbin/ctdbd
> > > >   Oct 11 00:56:44 rocky1 ctdbd[710741]: CTDB logging to location
> > > > file:/var/log/log.ctdb
> > > >   Oct 11 00:56:44 rocky1 systemd[1]: Received SIGCHLD from PID 710741
> > > > (ctdbd).
> > > >   Oct 11 00:56:44 rocky1 systemd[1]: Child 710741 (ctdbd) died
> > > > (code=exited, status=0/SUCCESS)
> > > >   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Child 710741  
> > belongs to  
> > > > ctdb.service.
> > > >   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Control process  
> > exited,  
> > > > code=exited status=0
> > > >   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Got final SIGCHLD  
> > for  
> > > > state start.
> > > >   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: New main PID 710742
> > > > belongs to service, we are happy.
> > > >   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Main PID loaded:  
> > 710742  
> > > >   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Changed start ->  
> > running  
> > > >   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Job  
> > ctdb.service/start  
> > > > finished, result=done
> > > >   Oct 11 00:56:44 rocky1 systemd[1]: Started CTDB.
> > > >   -- Subject: Unit ctdb.service has finished start-up
> > > >   -- Defined-By: systemd
> > > >   -- Support:  
> > https://lists.freedesktop.org/mailman/listinfo/systemd-devel  
> > > >   --
> > > >   -- Unit ctdb.service has finished starting up.
> > > >   --
> > > >   -- The start-up result is done.
> > > >
> > > > The relevant part of the log while 

Re: [systemd-devel] systemctl stop going through timeout even though all processes have exited

2023-10-13 Thread Mantas Mikulėnas
What value do you have in /sys/fs/cgroup/systemd/release_agent, as seen by
systemd? Does it point to the correct executable?

Does e.g. forkstat (or execsnoop or similar) show that the executable is
being run when the cgroup empties?

On Fri, Oct 13, 2023, 04:20 Martin Schwenke  wrote:

> Hi Mantas,
>
> Yes, it looks like cgroups v1.
>
> Would this be a kernel bug?  systemd bug?
>
> Thanks...
>
> peace & happiness,
> martin
>
> On Wed, 11 Oct 2023 08:19:59 +0300, Mantas Mikulėnas
>  wrote:
>
> > Is this with cgroups v1 or v2? If cgroups v1 is involved (thanks
> Docker), I
> > recall it was a bit complex for systemd to get notified when the cgroup
> > actually empties – via /sys/fs/cgroup/systemd/release_agent that
> specifies
> > a helper executable that the kernel runs... I wonder if that mechanism is
> > broken on your system.
> >
> > On Wed, Oct 11, 2023 at 7:38 AM Martin Schwenke 
> wrote:
> >
> > > I'm seeing "systemctl stop " for several services taking a
> > > long time because it goes through the timeout process, even though all
> > > relevant processes have exited.
> > >
> > > I'll give 2 examples.  Both examples are running inside a privileged
> > > Rocky Linux 8.8 Docker container on a Rocky Linux 8.8 host.  The
> > > systemd version, reported by "systemctl --version" in the container
> > > is:
> > >
> > >   systemd 239 (239-74.el8_8.5)
> > >
> > > Here is ctdb.system:
> > >
> > >   [Unit]
> > >   Description=CTDB
> > >   Documentation=man:ctdbd(1) man:ctdb(7)
> > >   After=network-online.target time-sync.target
> > >   ConditionFileNotEmpty=/etc/ctdb/nodes
> > >
> > >   [Service]
> > >   Type=forking
> > >   LimitCORE=infinity
> > >   LimitNOFILE=1048576
> > >   TasksMax=4096
> > >   PIDFile=/var/run/ctdb/ctdbd.pid
> > >   ExecStart=/usr/sbin/ctdbd
> > >   ExecStop=/usr/bin/ctdb shutdown
> > >   KillMode=control-group
> > >   Restart=no
> > >
> > >   [Install]
> > >   WantedBy=multi-user.target
> > >
> > > "/usr/bin/ctdb shutdown" causes a controlled shutdown.  In many cases,
> > > starting and then stopping using systemctl works fine.  However, many
> > > times it takes >90s to stop, as per TimeoutStopSec.  If I reduce that
> > > value then the duration reduces accordingly.  I can confirm using both
> > > "ps auxfww" and "systemd-cgls" that within the container there are no
> > > relevant processes a moment after "systemctl stop ctdb" is run.  In
> > > particular, in systemd-cgls ctdb.service is gone but "systemctl stop
> > > ctdb" is still waiting.
> > >
> > > Before attempting to stop, the service is successfully started:
> > >
> > >   Oct 11 00:56:44 rocky1 systemd[710741]: ctdb.service: Executing:
> > > /usr/sbin/ctdbd
> > >   Oct 11 00:56:44 rocky1 ctdbd[710741]: CTDB logging to location
> > > file:/var/log/log.ctdb
> > >   Oct 11 00:56:44 rocky1 systemd[1]: Received SIGCHLD from PID 710741
> > > (ctdbd).
> > >   Oct 11 00:56:44 rocky1 systemd[1]: Child 710741 (ctdbd) died
> > > (code=exited, status=0/SUCCESS)
> > >   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Child 710741
> belongs to
> > > ctdb.service.
> > >   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Control process
> exited,
> > > code=exited status=0
> > >   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Got final SIGCHLD
> for
> > > state start.
> > >   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: New main PID 710742
> > > belongs to service, we are happy.
> > >   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Main PID loaded:
> 710742
> > >   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Changed start ->
> running
> > >   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Job
> ctdb.service/start
> > > finished, result=done
> > >   Oct 11 00:56:44 rocky1 systemd[1]: Started CTDB.
> > >   -- Subject: Unit ctdb.service has finished start-up
> > >   -- Defined-By: systemd
> > >   -- Support:
> https://lists.freedesktop.org/mailman/listinfo/systemd-devel
> > >   --
> > >   -- Unit ctdb.service has finished starting up.
> > >   --
> > >   -- The start-up result is done.
> > >
> > > The relevant part of the log while stopping seems to be:
> > >
> > >   Oct 11 00:56:47 rocky1 systemd[1]: Received SIGCHLD from PID 710743
> > > (ctdb-eventd).
> > >   Oct 11 00:56:47 rocky1 systemd[1]: Child 710742 (ctdbd) died
> > > (code=exited, status=0/SUCCESS)
> > >   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710742
> belongs to
> > > ctdb.service.
> > >   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Can't open PID file
> > > /var/run/ctdb/ctdbd.pid (yet?) after stop: No such file or directory
> > >   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Main process exited,
> > > code=exited, status=0/SUCCESS
> > >   Oct 11 00:56:47 rocky1 systemd[1]: Sent message type=signal
> > > sender=org.freedesktop.systemd1 destination=n/a
> > > path=/org/freedesktop/systemd1/unit/ctdb_2eservice
> > > interface=org.freedesktop.DBus.Properties member=PropertiesChanged
> > > cookie=54 reply_cookie=0 signature=sa{sv}as 

Re: [systemd-devel] systemctl stop going through timeout even though all processes have exited

2023-10-12 Thread Martin Schwenke
Hi Mantas,

Yes, it looks like cgroups v1.

Would this be a kernel bug?  systemd bug?

Thanks...

peace & happiness,
martin

On Wed, 11 Oct 2023 08:19:59 +0300, Mantas Mikulėnas
 wrote:

> Is this with cgroups v1 or v2? If cgroups v1 is involved (thanks Docker), I
> recall it was a bit complex for systemd to get notified when the cgroup
> actually empties – via /sys/fs/cgroup/systemd/release_agent that specifies
> a helper executable that the kernel runs... I wonder if that mechanism is
> broken on your system.
> 
> On Wed, Oct 11, 2023 at 7:38 AM Martin Schwenke  wrote:
> 
> > I'm seeing "systemctl stop " for several services taking a
> > long time because it goes through the timeout process, even though all
> > relevant processes have exited.
> >
> > I'll give 2 examples.  Both examples are running inside a privileged
> > Rocky Linux 8.8 Docker container on a Rocky Linux 8.8 host.  The
> > systemd version, reported by "systemctl --version" in the container
> > is:
> >
> >   systemd 239 (239-74.el8_8.5)
> >
> > Here is ctdb.system:
> >
> >   [Unit]
> >   Description=CTDB
> >   Documentation=man:ctdbd(1) man:ctdb(7)
> >   After=network-online.target time-sync.target
> >   ConditionFileNotEmpty=/etc/ctdb/nodes
> >
> >   [Service]
> >   Type=forking
> >   LimitCORE=infinity
> >   LimitNOFILE=1048576
> >   TasksMax=4096
> >   PIDFile=/var/run/ctdb/ctdbd.pid
> >   ExecStart=/usr/sbin/ctdbd
> >   ExecStop=/usr/bin/ctdb shutdown
> >   KillMode=control-group
> >   Restart=no
> >
> >   [Install]
> >   WantedBy=multi-user.target
> >
> > "/usr/bin/ctdb shutdown" causes a controlled shutdown.  In many cases,
> > starting and then stopping using systemctl works fine.  However, many
> > times it takes >90s to stop, as per TimeoutStopSec.  If I reduce that
> > value then the duration reduces accordingly.  I can confirm using both
> > "ps auxfww" and "systemd-cgls" that within the container there are no
> > relevant processes a moment after "systemctl stop ctdb" is run.  In
> > particular, in systemd-cgls ctdb.service is gone but "systemctl stop
> > ctdb" is still waiting.
> >
> > Before attempting to stop, the service is successfully started:
> >
> >   Oct 11 00:56:44 rocky1 systemd[710741]: ctdb.service: Executing:
> > /usr/sbin/ctdbd
> >   Oct 11 00:56:44 rocky1 ctdbd[710741]: CTDB logging to location
> > file:/var/log/log.ctdb
> >   Oct 11 00:56:44 rocky1 systemd[1]: Received SIGCHLD from PID 710741
> > (ctdbd).
> >   Oct 11 00:56:44 rocky1 systemd[1]: Child 710741 (ctdbd) died
> > (code=exited, status=0/SUCCESS)
> >   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Child 710741 belongs to
> > ctdb.service.
> >   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Control process exited,
> > code=exited status=0
> >   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Got final SIGCHLD for
> > state start.
> >   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: New main PID 710742
> > belongs to service, we are happy.
> >   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Main PID loaded: 710742
> >   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Changed start -> running
> >   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Job ctdb.service/start
> > finished, result=done
> >   Oct 11 00:56:44 rocky1 systemd[1]: Started CTDB.
> >   -- Subject: Unit ctdb.service has finished start-up
> >   -- Defined-By: systemd
> >   -- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
> >   --
> >   -- Unit ctdb.service has finished starting up.
> >   --
> >   -- The start-up result is done.
> >
> > The relevant part of the log while stopping seems to be:
> >
> >   Oct 11 00:56:47 rocky1 systemd[1]: Received SIGCHLD from PID 710743
> > (ctdb-eventd).
> >   Oct 11 00:56:47 rocky1 systemd[1]: Child 710742 (ctdbd) died
> > (code=exited, status=0/SUCCESS)
> >   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710742 belongs to
> > ctdb.service.
> >   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Can't open PID file
> > /var/run/ctdb/ctdbd.pid (yet?) after stop: No such file or directory
> >   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Main process exited,
> > code=exited, status=0/SUCCESS
> >   Oct 11 00:56:47 rocky1 systemd[1]: Sent message type=signal
> > sender=org.freedesktop.systemd1 destination=n/a
> > path=/org/freedesktop/systemd1/unit/ctdb_2eservice
> > interface=org.freedesktop.DBus.Properties member=PropertiesChanged
> > cookie=54 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
> >   Oct 11 00:56:47 rocky1 systemd[1]: Sent message type=signal
> > sender=org.freedesktop.systemd1 destination=n/a
> > path=/org/freedesktop/systemd1/unit/ctdb_2eservice
> > interface=org.freedesktop.DBus.Properties member=PropertiesChanged
> > cookie=55 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
> >   Oct 11 00:56:47 rocky1 systemd[1]: Child 710743 (ctdb-eventd) died
> > (code=exited, status=0/SUCCESS)
> >   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710743 

Re: [systemd-devel] systemctl stop going through timeout even though all processes have exited

2023-10-10 Thread Mantas Mikulėnas
Is this with cgroups v1 or v2? If cgroups v1 is involved (thanks Docker), I
recall it was a bit complex for systemd to get notified when the cgroup
actually empties – via /sys/fs/cgroup/systemd/release_agent that specifies
a helper executable that the kernel runs... I wonder if that mechanism is
broken on your system.

On Wed, Oct 11, 2023 at 7:38 AM Martin Schwenke  wrote:

> I'm seeing "systemctl stop " for several services taking a
> long time because it goes through the timeout process, even though all
> relevant processes have exited.
>
> I'll give 2 examples.  Both examples are running inside a privileged
> Rocky Linux 8.8 Docker container on a Rocky Linux 8.8 host.  The
> systemd version, reported by "systemctl --version" in the container
> is:
>
>   systemd 239 (239-74.el8_8.5)
>
> Here is ctdb.system:
>
>   [Unit]
>   Description=CTDB
>   Documentation=man:ctdbd(1) man:ctdb(7)
>   After=network-online.target time-sync.target
>   ConditionFileNotEmpty=/etc/ctdb/nodes
>
>   [Service]
>   Type=forking
>   LimitCORE=infinity
>   LimitNOFILE=1048576
>   TasksMax=4096
>   PIDFile=/var/run/ctdb/ctdbd.pid
>   ExecStart=/usr/sbin/ctdbd
>   ExecStop=/usr/bin/ctdb shutdown
>   KillMode=control-group
>   Restart=no
>
>   [Install]
>   WantedBy=multi-user.target
>
> "/usr/bin/ctdb shutdown" causes a controlled shutdown.  In many cases,
> starting and then stopping using systemctl works fine.  However, many
> times it takes >90s to stop, as per TimeoutStopSec.  If I reduce that
> value then the duration reduces accordingly.  I can confirm using both
> "ps auxfww" and "systemd-cgls" that within the container there are no
> relevant processes a moment after "systemctl stop ctdb" is run.  In
> particular, in systemd-cgls ctdb.service is gone but "systemctl stop
> ctdb" is still waiting.
>
> Before attempting to stop, the service is successfully started:
>
>   Oct 11 00:56:44 rocky1 systemd[710741]: ctdb.service: Executing:
> /usr/sbin/ctdbd
>   Oct 11 00:56:44 rocky1 ctdbd[710741]: CTDB logging to location
> file:/var/log/log.ctdb
>   Oct 11 00:56:44 rocky1 systemd[1]: Received SIGCHLD from PID 710741
> (ctdbd).
>   Oct 11 00:56:44 rocky1 systemd[1]: Child 710741 (ctdbd) died
> (code=exited, status=0/SUCCESS)
>   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Child 710741 belongs to
> ctdb.service.
>   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Control process exited,
> code=exited status=0
>   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Got final SIGCHLD for
> state start.
>   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: New main PID 710742
> belongs to service, we are happy.
>   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Main PID loaded: 710742
>   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Changed start -> running
>   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Job ctdb.service/start
> finished, result=done
>   Oct 11 00:56:44 rocky1 systemd[1]: Started CTDB.
>   -- Subject: Unit ctdb.service has finished start-up
>   -- Defined-By: systemd
>   -- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
>   --
>   -- Unit ctdb.service has finished starting up.
>   --
>   -- The start-up result is done.
>
> The relevant part of the log while stopping seems to be:
>
>   Oct 11 00:56:47 rocky1 systemd[1]: Received SIGCHLD from PID 710743
> (ctdb-eventd).
>   Oct 11 00:56:47 rocky1 systemd[1]: Child 710742 (ctdbd) died
> (code=exited, status=0/SUCCESS)
>   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710742 belongs to
> ctdb.service.
>   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Can't open PID file
> /var/run/ctdb/ctdbd.pid (yet?) after stop: No such file or directory
>   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Main process exited,
> code=exited, status=0/SUCCESS
>   Oct 11 00:56:47 rocky1 systemd[1]: Sent message type=signal
> sender=org.freedesktop.systemd1 destination=n/a
> path=/org/freedesktop/systemd1/unit/ctdb_2eservice
> interface=org.freedesktop.DBus.Properties member=PropertiesChanged
> cookie=54 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
>   Oct 11 00:56:47 rocky1 systemd[1]: Sent message type=signal
> sender=org.freedesktop.systemd1 destination=n/a
> path=/org/freedesktop/systemd1/unit/ctdb_2eservice
> interface=org.freedesktop.DBus.Properties member=PropertiesChanged
> cookie=55 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
>   Oct 11 00:56:47 rocky1 systemd[1]: Child 710743 (ctdb-eventd) died
> (code=exited, status=0/SUCCESS)
>   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710743 belongs to
> ctdb.service.
>   Oct 11 00:56:47 rocky1 systemd[1]: systemd-journald.service: Received
> EPOLLHUP on stored fd 18 (stored), closing.
>   Oct 11 00:56:47 rocky1 systemd[1]: Received SIGCHLD from PID 710860
> (ctdb).
>   Oct 11 00:56:47 rocky1 systemd[1]: Child 710860 (ctdb) died
> (code=exited, status=0/SUCCESS)
>   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710860 belongs