Hi Heming,
Thanks for the analysis and tying things together for us so clearly, and I
like the ideas you've outlined.
On Sun, Jun 06, 2021 at 02:15:23PM +0800, heming.z...@suse.com wrote:
> I'm sending this mail about a well-known performance issue:
> when a system has a huge number of devices attached (i.e. 1000+ disks),
> lvm2-pvscan@.service takes too long, systemd easily times out, and the
> system ends up in the emergency shell.
>
> This performance topic has been discussed here several times, and the issue
> has persisted for many years. In the latest lvm2 code it still isn't
> completely fixed. The latest code adds a new function, _pvscan_aa_quick(),
> which greatly reduces the boot time but still doesn't fix the issue entirely.
>
> In my test env (x86 qemu-kvm machine, 6 vcpus, 22GB mem, 1015 pv/vg/lv),
> comparing without/with the _pvscan_aa_quick() code, the boot time drops from
> "9min 51s" to "2min 6s". But after switching to direct activation, the boot
> time is 8.7s (for the longest lvm2 service: lvm2-activation-early.service).
Interesting, it's good to see the "quick" optimization is so effective.
Another optimization that should be helping in many cases is the
"vgs_online" file which will prevent concurrent pvscans from all
attempting to autoactivate a VG.
> The hot spot of event activation is dev_cache_scan, whose time complexity is
> O(n^2). At the same time, systemd-udev workers generate/run
> lvm2-pvscan@.service for every detected disk. So the overall cost is O(n^3).
>
> ```
> dev_cache_scan //order: O(n^2)
> + _insert_dirs //O(n)
> | if obtain_device_list_from_udev() true
> | _insert_udev_dir //O(n)
> |
> + dev_cache_index_devs //O(n)
>
> There are 'n' lvm2-pvscan@.service running: O(n)
> Overall: O(n) * O(n^2) => O(n^3)
> ```
I knew the dev_cache_scan was inefficient, but didn't realize it was
having such a negative impact, especially since it isn't reading devices.
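To put rough numbers on the scaling argument, here is a back-of-the-envelope
model in Python. It is only a sketch under the assumptions quoted above (each
dev_cache_scan costs about n^2 steps, and one pvscan service runs per device);
the function names are mine and do not correspond to lvm2 code.

```python
def dev_cache_scan_cost(n_devices: int) -> int:
    """Assumed cost of one dev_cache_scan: quadratic in the device count."""
    return n_devices ** 2


def event_activation_cost(n_devices: int) -> int:
    """One pvscan service per device, each doing a full dev_cache_scan."""
    return n_devices * dev_cache_scan_cost(n_devices)


def direct_activation_cost(n_devices: int) -> int:
    """Assumed model: a single activation command that scans once."""
    return dev_cache_scan_cost(n_devices)
```

In this model the ratio between the two costs is simply n, so with about 1000
devices the event-based path does roughly 1000 times the scanning work. Real
timings will differ, since the services run concurrently and other costs are
ignored.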
Some details I'm interested to look at more closely (and perhaps you
already have some answers here):
1. Does obtain_device_list_from_udev=0 improve things? I recently noticed
that 0 appeared to be faster (anecdotally), and proposed we change the
default to 0 (also because I'm biased toward avoiding udev whenever
possible.)
2. We should probably move or improve the "index_devs" step; it's not the
main job of dev_cache_scan and I suspect this could be done more
efficiently, or avoided in many cases.
3. pvscan --cache is supposed to be scalable because it only (usually)
reads the single device that is passed to it, until activation is needed,
at which point all devices are read to perform a proper VG activation.
However, pvscan does not attempt to reduce dev_cache_scan since I didn't
know it was a problem. It probably makes sense to avoid a full
dev_cache_scan when pvscan is only processing one device (e.g.
setup_device() rather than setup_devices().)
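A toy Python model of point 3, to show the intent rather than the
implementation: the names setup_device/setup_devices come from above, but
their bodies here are hypothetical and only count how many device entries get
touched.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DevCache:
    """Hypothetical stand-in for the device cache; counts entries added."""
    entries: Dict[str, dict] = field(default_factory=dict)
    adds: int = 0

    def add(self, path: str) -> None:
        self.adds += 1
        self.entries[path] = {"path": path}


def setup_devices(cache: DevCache, all_paths: List[str]) -> None:
    """Full scan: every device on the system goes into the cache."""
    for path in all_paths:
        cache.add(path)


def setup_device(cache: DevCache, path: str) -> None:
    """Single-device setup: only the device named in the uevent."""
    cache.add(path)


def pvscan_cache(dev: str, all_paths: List[str], need_activation: bool) -> int:
    """Return how many cache entries this (modelled) pvscan had to create."""
    cache = DevCache()
    if need_activation:
        setup_devices(cache, all_paths)  # activation needs the whole picture
    else:
        setup_device(cache, dev)         # common case: one device only
    return cache.adds
```

With 1000 devices, a modelled pvscan that does not need to activate touches
one entry instead of 1000, which is the saving I'd hope for from avoiding the
full dev_cache_scan.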
> Question/topic:
> Could we find a final solution that performs well and scales well under
> event-based activation?
First, you might not have seen my recently added udev rule for
autoactivation. I apologize that it's been sitting in the "dev-next" branch,
since we've not figured out a good branching strategy for this change.
We just began getting some feedback on this change last week:
https://sourceware.org/git/?p=lvm2.git;a=blob;f=udev/69-dm-lvm.rules.in;h=03c8fbbd6870bbd925c123d66b40ac135b295574;hb=refs/heads/dev-next
There's a similar change I'm working on for dracut:
https://github.com/dracutdevs/dracut/pull/1506
Each device uevent still triggers a pvscan --cache, reading just the one
device, but when a VG is complete, the udev rule runs systemd-run vgchange
-aay VG. Since it's not changing dev_cache_scan usage, the issues you're
describing will still need to be looked at.
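The flow can be sketched with a small Python simulation. This is an
illustration only: the class and its in-memory sets are hypothetical, VG
membership is supplied up front (the real pvscan learns it by reading the
device), and the command string is the simplified form mentioned above, not
the exact line from the rule.

```python
from typing import Dict, List, Optional, Set


class AutoactivationModel:
    """Toy model: one pvscan per uevent, one vgchange per completed VG."""

    def __init__(self, vg_members: Dict[str, Set[str]]) -> None:
        self.vg_members = vg_members        # VG name -> PVs it needs
        self.pvs_online: Set[str] = set()   # PVs seen so far
        self.vgs_online: Set[str] = set()   # VGs already handed off
        self.commands: List[str] = []       # activation commands issued

    def on_uevent(self, pv: str) -> Optional[str]:
        """Record one PV; return an activation command if its VG just completed."""
        self.pvs_online.add(pv)
        for vg, members in self.vg_members.items():
            if pv in members and members <= self.pvs_online:
                if vg in self.vgs_online:
                    return None             # someone already activated it
                self.vgs_online.add(vg)
                cmd = f"systemd-run vgchange -aay {vg}"
                self.commands.append(cmd)
                return cmd
        return None
```

The point of the model is that the activation command is issued once per VG,
on the event that completes it, while every other event only records its own
device.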
> Maybe two solutions (Martin & I discussed):
>
> 1. During the boot phase, lvm2 automatically switches to direct activation
> mode ("event_activation = 0"). After booting, it switches back to event
> activation mode.
>
> Booting is a special stage. *During boot*, we could "pretend" that direct
> activation (event_activation=0) is set, and rely on lvm2-activation-*.service
> for PV detection. Once lvm2-activation-net.service has finished, we could
> "switch on" event activation.
>
> More precisely: pvscan --cache would look at some file under /run,
> e.g. /run/lvm2/boot-finished, and quit immediately if the file doesn't exist
> (as if event_activation=0 was set). In lvm2-activation-net.service, we would
> add something like:
>
> ```
> ExecStartPost=/bin/touch /run/lvm2/boot-finished
> ```
>
> ... so that, from this point in time onward, "pvscan --cache" would _not_ quit
> immediately any more, but run normally (assuming that the global
> event_activation setting is 1). This way we'd get the benefit of using the
> static activation services during boot (good performance) while still being
> able to react to udev events after