Re: [linux-lvm] [PATCH 1/1] pvscan: wait for udevd

2021-02-19 Thread Zdenek Kabelac

Dne 19. 02. 21 v 17:37 David Teigland napsal(a):

On Thu, Feb 18, 2021 at 04:19:01PM +0100, Martin Wilck wrote:

Feb 10 17:24:26 archlinux lvm[643]:   pvscan[643] VG sys run
autoactivation.
Feb 10 17:24:26 archlinux lvm[643]:   /usr/bin/dmeventd: stat failed:
No such file or directory


What's going on here? pvscan trying to start dmeventd ? Why ? There's a
dedicated service for starting dmeventd (lvm2-monitor.service). I can
see that running dmeventd makes sense as you have thin pools, but I'm
at a loss why it has to be started at that early stage during boot
already.

This is a curious message, it looks as if pvscan was running from an
environment (initramfs??) where dmeventd wasn't available. The message
is repeated, and after that, pvscan appears to hang...


I've found that when pvscan activates a VG, there's a bit of code that
attempts to monitor any LVs that are already active in the VG.  Monitoring
means interacting with dmeventd.  I don't know why it's doing that, it
seems strange, but the logic around monitoring in lvm seems ad hoc and in
need of serious reworking.  In this case I'm guessing there's already an
LV active in "sys", perhaps from direct activation in initrd, and when
pvscan activates that VG it attempts to monitor the already active LV.


The existing design for lvm2 rootfs using was like:

Activate 'LV' within ramdisk by dracut - which discovers rootfs VG/LV
and activates it (by rather 'brute-force' naive approach).

Such activation is WITHOUT monitoring - as ramdisk is without 'dmeventd'
and we do not want to 'lock' the binary from ramdisk into memory.

So once the system switches to rootfs - 'vgchange --monitor y' enables 
monitoring for all activated LVs from ramdisk and process continues.


Event based activation within ramdisk is a 3rd. party initiative by Arch linux 
and thus needs to be 'reinvented' with its own problems that arise from this.


So far - in lvm2 the current dracut method is more maintainable.


Another missing piece in lvm monitoring is that we don't have a way to
start lvm2-monitor/dmeventd at the right time (I'm not sure anyone even
knows when the right time is), so we get random behavior depending on if
it's running or not at a given point.  In this case, it looks like it
happens to not be running yet.  I sometimes suggest disabling lvm2-monitor
and starting it manually once the system is up, to avoid having it
interfere during startup.


Right time is when switch is finished and we have rootfs with /usr
available - should be ensured by  lvm2-monitor.service and it dependencies.

Zdenek

___
linux-lvm mailing list
linux-lvm@redhat.com
https://listman.redhat.com/mailman/listinfo/linux-lvm
read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/



Re: [linux-lvm] [PATCH 1/1] pvscan: wait for udevd

2021-02-19 Thread David Teigland
On Thu, Feb 18, 2021 at 04:19:01PM +0100, Martin Wilck wrote:
> > Feb 10 17:24:26 archlinux lvm[643]:   pvscan[643] VG sys run
> > autoactivation.
> > Feb 10 17:24:26 archlinux lvm[643]:   /usr/bin/dmeventd: stat failed:
> > No such file or directory
> 
> What's going on here? pvscan trying to start dmeventd ? Why ? There's a
> dedicated service for starting dmeventd (lvm2-monitor.service). I can
> see that running dmeventd makes sense as you have thin pools, but I'm
> at a loss why it has to be started at that early stage during boot
> already.
> 
> This is a curious message, it looks as if pvscan was running from an
> environment (initramfs??) where dmeventd wasn't available. The message
> is repeated, and after that, pvscan appears to hang...

I've found that when pvscan activates a VG, there's a bit of code that
attempts to monitor any LVs that are already active in the VG.  Monitoring
means interacting with dmeventd.  I don't know why it's doing that, it
seems strange, but the logic around monitoring in lvm seems ad hoc and in
need of serious reworking.  In this case I'm guessing there's already an
LV active in "sys", perhaps from direct activation in initrd, and when
pvscan activates that VG it attempts to monitor the already active LV.

Another missing piece in lvm monitoring is that we don't have a way to
start lvm2-monitor/dmeventd at the right time (I'm not sure anyone even
knows when the right time is), so we get random behavior depending on if
it's running or not at a given point.  In this case, it looks like it
happens to not be running yet.  I sometimes suggest disabling lvm2-monitor
and starting it manually once the system is up, to avoid having it
interfere during startup.

> > Feb 10 17:24:26 archlinux lvm[643]:   /usr/bin/dmeventd: stat failed:
> > No such file or directory
> > Feb 10 17:24:26 archlinux lvm[643]:   WARNING: Failed to monitor
> > sys/pool.
> > Feb 10 17:24:26 archlinux systemd[1]: Stopping LVM event activation
> > on device 9:0...

The unwanted failed monitoring seems to have caused the pvscan command to
exit with an error, which then leads to further mess and confusion where
systemd then thinks it should stop or kill the pvscan service, whatever
that means.  pvscan should probably never exit with an error.

Dave

___
linux-lvm mailing list
linux-lvm@redhat.com
https://listman.redhat.com/mailman/listinfo/linux-lvm
read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/



Re: [linux-lvm] [PATCH 1/1] pvscan: wait for udevd

2021-02-19 Thread Martin Wilck
On Wed, 2021-02-17 at 14:38 +0100, Oleksandr Natalenko wrote:
> Hi.
> 

Thanks for the logs!

> I'm not sure this issue is reproducible with any kind of LVM layout.
> What I have is thin-LVM-on-LUKS-on-LVM:

I saw MD in your other logs...?

More comments below.

> With regard to the journal, here it is (from the same machine in the
> Arch bugreport; matches the second layout above):
> 
> 
> [~]> LC_TIME=C sudo journalctl -b -10 -u lvm2-pvscan@\*
> -- Journal begins at Fri 2020-12-18 16:33:22 CET, ends at Wed 2021-
> 02-17 14:28:05 CET. --
> Feb 10 17:24:17 archlinux systemd[1]: Starting LVM event activation
> on device 9:0...
> Feb 10 17:24:17 archlinux lvm[463]:   pvscan[463] PV /dev/md0 online,
> VG base is complete.
> Feb 10 17:24:17 archlinux lvm[463]:   pvscan[463] VG base run
> autoactivation.
> Feb 10 17:24:17 archlinux lvm[463]:   2 logical volume(s) in volume
> group "base" now active
> Feb 10 17:24:17 archlinux systemd[1]: Finished LVM event activation
> on device 9:0.
> Feb 10 17:24:26 archlinux systemd[1]: Starting LVM event activation
> on device 253:2...
> Feb 10 17:24:26 archlinux lvm[643]:   pvscan[643] PV /dev/mapper/sys
> online, VG sys is complete.

All good up to here, but then...

> Feb 10 17:24:26 archlinux lvm[643]:   pvscan[643] VG sys run
> autoactivation.
> Feb 10 17:24:26 archlinux lvm[643]:   /usr/bin/dmeventd: stat failed:
> No such file or directory

What's going on here? pvscan trying to start dmeventd ? Why ? There's a
dedicated service for starting dmeventd (lvm2-monitor.service). I can
see that running dmeventd makes sense as you have thin pools, but I'm
at a loss why it has to be started at that early stage during boot
already.

This is a curious message, it looks as if pvscan was running from an
environment (initramfs??) where dmeventd wasn't available. The message
is repeated, and after that, pvscan appears to hang...


> Feb 10 17:24:26 archlinux lvm[643]:   /usr/bin/dmeventd: stat failed:
> No such file or directory
> Feb 10 17:24:26 archlinux lvm[643]:   WARNING: Failed to monitor
> sys/pool.
> Feb 10 17:24:26 archlinux systemd[1]: Stopping LVM event activation
> on device 9:0...

Here I suppose systemd is switching root, and trying to stop jobs,
including the pvscan job.


> Feb 10 17:24:26 archlinux lvm[720]:   pvscan[720] PV /dev/md0 online.
> Feb 10 17:24:26 archlinux lvm[643]:   /usr/bin/dmeventd: stat failed:
> No such file or directory
> Feb 10 17:24:26 archlinux lvm[643]:   WARNING: Failed to monitor
> sys/pool.
> Feb 10 17:24:56 spock systemd[1]: lvm2-pvscan@253:2.service: State
> 'stop-sigterm' timed out. Killing.
> Feb 10 17:24:56 spock systemd[1]: lvm2-pvscan@253:2.service: Killing
> process 643 (lvm) with signal SIGKILL.
> Feb 10 17:24:56 spock systemd[1]: lvm2-pvscan@253:2.service: Main
> process exited, code=killed, status=9/KILL
> Feb 10 17:24:56 spock systemd[1]: lvm2-pvscan@253:2.service: Failed
> with result 'timeout'.
> Feb 10 17:24:56 spock systemd[1]: Stopped LVM event activation on
> device 253:2.

So what's timing out here is the attempt to _stop_ pvscan. That's
curious. It looks like a problem in pvscan to me, not having reacted to
a TERM signal for 30s.

It's also worth noting that the parallel pvscan process for device 9:0
terminated correctly (didn't hang).

> 
> [~]> LC_TIME=C sudo journalctl -b -10 --grep pvscan
> -- Journal begins at Fri 2020-12-18 16:33:22 CET, ends at Wed 2021-
> 02-17 14:31:27 CET. --
> Feb 10 17:24:17 archlinux systemd[1]: Created slice system-
> lvm2\x2dpvscan.slice.
> Feb 10 17:24:17 archlinux lvm[463]:   pvscan[463] PV /dev/md0 online,
> VG base is complete.
> Feb 10 17:24:17 archlinux lvm[463]:   pvscan[463] VG base run
> autoactivation.
> Feb 10 17:24:17 archlinux audit[1]: SERVICE_START pid=1 uid=0
> auid=4294967295 ses=4294967295 msg='unit=lvm2-pvscan@9:0
> comm="systemd" exe="/init" hostname=? addr=? terminal=? res=success'
> Feb 10 17:24:17 archlinux kernel: audit: type=1130
> audit(1612974257.986:6): pid=1 uid=0 auid=4294967295 ses=4294967295 
> msg='unit=lvm2-pvscan@9:0 comm="systemd" exe="/init" hostname=?
> addr=? terminal=? res=success'
> Feb 10 17:24:26 archlinux lvm[643]:   pvscan[643] PV /dev/mapper/sys
> online, VG sys is complete.
> Feb 10 17:24:26 archlinux lvm[643]:   pvscan[643] VG sys run
> autoactivation.
> Feb 10 17:24:26 archlinux lvm[720]:   pvscan[720] PV /dev/md0 online.
> Feb 10 17:24:27 spock systemd[1]: lvm2-pvscan@9:0.service: Control
> process exited, code=killed, status=15/TERM
> Feb 10 17:24:27 spock systemd[1]: lvm2-pvscan@9:0.service: Failed
> with result 'signal'.
> Feb 10 17:24:26 spock audit[1]: SERVICE_STOP pid=1 uid=0
> auid=4294967295 ses=4294967295 msg='unit=lvm2-pvscan@9:0
> comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=?
> terminal=? res=failed'
> Feb 10 17:24:27 spock systemd[1]: Requested transaction contradicts
> existing jobs: Transaction for lvm2-pvscan@253:2.service/start is
> destructive (lvm2-pvscan@253:2.service has 'stop' job queued, but
> 

Re: [linux-lvm] [PATCH 1/1] pvscan: wait for udevd

2021-02-19 Thread Oleksandr Natalenko
Hi.

(will comment only on what I can comment, dropping the rest)

On Thu, Feb 18, 2021 at 04:19:01PM +0100, Martin Wilck wrote:
> > I'm not sure this issue is reproducible with any kind of LVM layout.
> > What I have is thin-LVM-on-LUKS-on-LVM:
> 
> I saw MD in your other logs...?

FWIW, one of the machines have soft RAID, another does not, the issue
is reproducible regardless of whether there's soft RAID or not.

> > Feb 10 17:24:26 archlinux lvm[643]:   pvscan[643] VG sys run
> > autoactivation.
> > Feb 10 17:24:26 archlinux lvm[643]:   /usr/bin/dmeventd: stat failed:
> > No such file or directory
> 
> What's going on here? pvscan trying to start dmeventd ? Why ? There's a
> dedicated service for starting dmeventd (lvm2-monitor.service). I can
> see that running dmeventd makes sense as you have thin pools, but I'm
> at a loss why it has to be started at that early stage during boot
> already.
> 
> This is a curious message, it looks as if pvscan was running from an
> environment (initramfs??) where dmeventd wasn't available. The message
> is repeated, and after that, pvscan appears to hang...

Not sure either. FWIW, real root is on a thin volume (everything is,
in fact, except /boot and swap).

> > Feb 10 17:24:26 archlinux lvm[720]:   pvscan[720] PV /dev/md0 online.
> > Feb 10 17:24:26 archlinux lvm[643]:   /usr/bin/dmeventd: stat failed:
> > No such file or directory
> > Feb 10 17:24:26 archlinux lvm[643]:   WARNING: Failed to monitor
> > sys/pool.
> > Feb 10 17:24:56 spock systemd[1]: lvm2-pvscan@253:2.service: State
> > 'stop-sigterm' timed out. Killing.
> > Feb 10 17:24:56 spock systemd[1]: lvm2-pvscan@253:2.service: Killing
> > process 643 (lvm) with signal SIGKILL.
> > Feb 10 17:24:56 spock systemd[1]: lvm2-pvscan@253:2.service: Main
> > process exited, code=killed, status=9/KILL
> > Feb 10 17:24:56 spock systemd[1]: lvm2-pvscan@253:2.service: Failed
> > with result 'timeout'.
> > Feb 10 17:24:56 spock systemd[1]: Stopped LVM event activation on
> > device 253:2.
> 
> So what's timing out here is the attempt to _stop_ pvscan. That's
> curious. It looks like a problem in pvscan to me, not having reacted to
> a TERM signal for 30s.
> 
> It's also worth noting that the parallel pvscan process for device 9:0
> terminated correctly (didn't hang).

Yes, pvscan seems to not react to SIGTERM. I have
DefaultTimeoutStopSec=30s, if I set this to 90s, pvscan hangs for 90s
respectively.

-- 
  Best regards,
Oleksandr Natalenko (post-factum)
Principal Software Maintenance Engineer

___
linux-lvm mailing list
linux-lvm@redhat.com
https://listman.redhat.com/mailman/listinfo/linux-lvm
read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/