Re: [systemd-devel] Please help: timeout waiting for /dev/tty* console device

2023-01-09 Thread Martin Wilck
On Fri, 2023-01-06 at 19:32 -0500, Gabriel L. Somlo wrote:
> 
> I *can* run any tests y'all might suggest to further debug the state
> of the system. But at this point I really do believe there is (or
> should be :) a way to extend the timeout during initial boot to force
> the system to wait for /dev/ttyLXU0 to become available (via udev?).
> 

Have you tried systemd.default_timeout_start_sec= ?
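For example (a sketch; the value is illustrative, and the same knob also
exists as DefaultTimeoutStartSec= in /etc/systemd/system.conf):

  # append to the kernel command line, e.g. via /etc/default/grub
  systemd.default_timeout_start_sec=300s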

Martin



Re: [systemd-devel] eth2: Failed to rename network interface 6 from 'eth2' to 'eno1': File exists

2022-01-05 Thread Martin Wilck
On Wed, 2022-01-05 at 08:39 +0100, Harald Dunkel wrote:
> On 2022-01-04 16:14:16, Andrei Borzenkov wrote:
> > 
> > You have two interfaces which export the same onboard interface
> > index.
> > There is not much udev can do here; the only option is to disable
> > onboard interface name policy. The attributes that are used by udev
> > are "acpi_index"  and "index". Check values of these attributes for
> > all interfaces.
> > 
> 
> I will check, but please note that I didn't enable this. AFAIU Debian
> uses the settings according to the guidelines of upstream.

This is default behavior. To disable it, you need to use
"net.ifnames=0". If you see the same value multiple times for either
"acpi_index" or "index", it'd be a firmware problem. I suppose it can
happen that one device has acpi_index==1 and another one has index==1
(IIRC the first is derived from ACPI _DSM, the second from SMBIOS / DMI
type 41).
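A quick way to dump both attributes for all interfaces (a sketch; the
attributes live on the PCI parent device and may simply be absent):

  for i in /sys/class/net/*; do
      printf '%s: acpi_index=%s index=%s\n' "${i##*/}" \
          "$(cat "$i/device/acpi_index" 2>/dev/null)" \
          "$(cat "$i/device/index" 2>/dev/null)"
  done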


Martin



Re: [systemd-devel] Antw: [EXT] Re: [systemd‑devel] Run reboot as normal user

2021-12-01 Thread Martin Wilck
On Wed, 2021-12-01 at 10:24 +0100, Ulrich Windl wrote:
> > > 
> 
> And I wonder what's wrong with allowing the shutdown command for the
> user in
> sudoers.
> (sudo $(which shutdown) -r now)

Sure. I thought sudo might not be installed on that embedded system,
either. If it is, I'd prefer it over other solutions simply because
it's more transparent. Capability bits tend to go unnoticed.
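For example, a sudoers sketch ("someuser" and the shutdown path are
placeholders; edit with visudo):

  # /etc/sudoers.d/reboot
  someuser ALL=(root) NOPASSWD: /usr/sbin/shutdown -r now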

Martin





Re: [systemd-devel] Run reboot as normal user

2021-12-01 Thread Martin Wilck
On Tue, 2021-11-30 at 14:11 +0100, Mohamed Ali Fodha wrote:
> Thanks, but I think using setuid has a security risk for attackers,
> so I understand there is no so much granularity to manage
> unprivileged access to systemd in case the polkit is not used.

You could use setcap to set CAP_SYS_ADMIN capabilities on the
executable you start for rebooting. I don't see a big difference wrt
using AmbientCapabilities in a systemd service, as long as you restrict
the program to be executable only by a certain user or group. Polkit
can't do much more, either. Its main purpose is to serve logged-in
users who want to perform certain privileged actions, like mounting a
volume or installing software, and to trigger pop-ups that ask for
either the user or admin password. IIUC it's over-engineered for what you're trying to do,
unless you want to ask for a password or some other extra
authorization.

OTOH, if you use CAP_SYS_ADMIN, you might as well use setuid. Same
argument - if you restrict the program properly, it comes down to
exactly the same thing that polkit would do, just far simpler.
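A rough sketch of the setcap variant (paths and the group are placeholders,
and the capability is the one named above, not a tested recommendation):

  setcap cap_sys_admin+ep /usr/local/sbin/reboot-helper
  chgrp rebootusers /usr/local/sbin/reboot-helper
  chmod 0750 /usr/local/sbin/reboot-helper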

Martin



Re: [systemd-devel] [dm-devel] RFC: one more time: SCSI device identification

2021-04-28 Thread Martin Wilck
On Wed, 2021-04-28 at 11:01 +1000, Erwin van Londen wrote:
> 
> The way out of this is to chuck the array in the bin. As I mentioned
> in one of my other emails when a scenario happens as you described
> above and the array does not inform the initiator it goes against the
> SAM-5 standard.
> 
> That standard shows:
> 5.14 Unit attention conditions
> 5.14.1 Unit attention conditions that are not coalesced
> Each logical unit shall establish a unit attention condition whenever
> one of the following events occurs:
>   a) a power on (see 6.3.1), hard reset (see 6.3.2), logical
> unit reset (see 6.3.3), I_T nexus loss (see 6.3.4), or power loss
> expected (see 6.3.5) occurs;
>   b) commands received on this I_T nexus have been cleared by
> a command or a task management function associated with another I_T
> nexus and the TAS bit was set to zero in the Control mode page
> associated with this I_T nexus (see 5.6);
>   c) the portion of the logical unit inventory that consists
> of administrative logical units and hierarchical logical units has
> been changed (see 4.6.18.1); or
>   d) any other event requiring the attention of the SCSI
> initiator device.
> 
> Especially the I_T nexus loss under a is an important trigger.
> 
> ---
> 6.3.4 I_T nexus loss
> An I_T nexus loss is a SCSI device condition resulting from:
> 
>  a) a hard reset condition (see 6.3.2);
>  b) an I_T nexus loss event (e.g., logout) indicated by a Nexus Loss
> event notification (see 6.4);
>  c) indication that an I_T NEXUS RESET task management request (see
> 7.6) has been processed; or
>  d) an indication that a REMOVE I_T NEXUS command (see SPC-4) has
> been processed.
> An I_T nexus loss event is an indication from the SCSI transport
> protocol to the SAL that an I_T nexus no
> longer exists. SCSI transport protocols may define I_T nexus loss
> events.
> 
> Each SCSI transport protocol standard that defines I_T nexus loss
> events should specify when those events
> result in the delivery of a Nexus Loss event notification to the SAL.
> 
> The I_T nexus loss condition applies to both SCSI initiator devices
> and SCSI target devices.
> 
> If a SCSI target port detects an I_T nexus loss, then a Nexus Loss
> event notification shall be delivered to
> each logical unit to which the I_T nexus has access.
> 
> In response to an I_T nexus loss condition a logical unit shall take
> the following actions:
> a) abort all commands received on the I_T nexus as described in 5.6;
> b) abort all background third-party copy operations (see SPC-4) that
> are using the I_T nexus;
> c) terminate all task management functions received on the I_T nexus;
> d) clear all ACA conditions (see 5.9.5) associated with the I_T
> nexus;
> e) establish a unit attention condition for the SCSI initiator port
> associated with the I_T nexus (see 5.14
> and 6.2); and
> f) perform any additional functions required by the applicable
> command standards.
> ---
> 
> This does also mean that any underlying transport protocol issues
> like on FC or TCP for iSCSI will very often trigger aborted commands
> or UAs as well, which will be picked up by the kernel/respective
> drivers.

Thanks a lot. I'm not quite certain which of these paragraphs would
apply to the situation I had in mind (administrator remapping an
existing LUN on a storage array to a different volume). That scenario
wouldn't necessarily involve transport-level errors, or an I_T nexus
loss. 5.14.1 c) or d) might apply, is that what you meant?

Regards
Martin

-- 
Dr. Martin Wilck , Tel. +49 (0)911 74053 2107
SUSE Software Solutions Germany GmbH
HRB 36809, AG Nürnberg GF: Felix Imendörffer




Re: [systemd-devel] RFC: one more time: SCSI device identification

2021-04-28 Thread Martin Wilck
On Tue, 2021-04-27 at 16:41 -0400, Ewan D. Milne wrote:
> On Tue, 2021-04-27 at 20:33 +0000, Martin Wilck wrote:
> > On Tue, 2021-04-27 at 16:14 -0400, Ewan D. Milne wrote:
> > > 
> > > There's no way to do that, in principle.  Because there could be
> > > other I/Os in flight.  You might (somehow) avoid retrying an I/O
> > > that got a UA until you figured out if something changed, but other
> > > I/Os can already have been sent to the target, or issued before you
> > > get to look at the status.
> > 
> > Right. But in practice, a WWID change will hardly happen under full
> > IO
> > load. The storage side will probably have to block IO while this
> > happens, at least for a short time period. So blocking and quiescing
> > the queue upon an UA might still work, most of the time. Even if we
> > were too late already, the sooner we stop the queue, the better.
> > 
> > The current algorithm in multipath-tools needs to detect a path going
> > down and being reinstated. The time interval during which a WWID
> > change
> > will go unnoticed is one or more path checker intervals, typically on
> > the order of 5-30 seconds. If we could decrease this interval to a
> > sub-
> > second or even millisecond range by blocking the queue in the kernel
> > quickly, we'd have made a big step forward.
> 
> Yes, and in many situations this may help.  But in the general case
> we can't protect against a storage array misconfiguration,
> where something like this can happen.  So I worry about people
> believing the host software will protect them against a mistake,
> when we can't really do that.

I agree. I expressed a similar notion in the following thread about
multipathd's WWID change detection capabilities in the face of really
bad mistakes on the administrator's (or, for that matter, the storage array's) part:
https://listman.redhat.com/archives/dm-devel/2021-February/msg00248.html
But others stressed that nonetheless we should try our best to
avoid customer data corruption (which I agree with, too), and thus we
settled on the current algorithm, which suited the needs at least of
the affected user(s) in that specific case.

Personally, I think that the current "5-30s" time period for WWID change
detection in multipathd is unsafe both theoretically and practically,
and may lure users into a false sense of safety. Therefore I'd
strongly welcome a kernel-side solution that might still not be safe
theoretically, but would cover most practical problem scenarios much
better than we currently do.
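For reference, this is the kind of mismatch I mean (sdX is a placeholder,
and sg_vpd comes from sg3_utils):

  cat /sys/block/sdX/device/wwid                                  # cached sysfs value
  /lib/udev/scsi_id --whitelisted --page=0x83 --device=/dev/sdX   # fresh INQUIRY
  sg_vpd --page=di /dev/sdX                                       # full VPD 0x83 page

If the first two disagree, we are in exactly the situation described above.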

Regards
Martin

-- 
Dr. Martin Wilck , Tel. +49 (0)911 74053 2107
SUSE Software Solutions Germany GmbH
HRB 36809, AG Nürnberg GF: Felix Imendörffer




Re: [systemd-devel] RFC: one more time: SCSI device identification

2021-04-27 Thread Martin Wilck
On Tue, 2021-04-27 at 16:14 -0400, Ewan D. Milne wrote:
> On Mon, 2021-04-26 at 13:16 +0000, Martin Wilck wrote:
> > On Mon, 2021-04-26 at 13:14 +0200, Ulrich Windl wrote:
> > > > > 
> > > > 
> > > > While we're at it, I'd like to mention another issue: WWID
> > > > changes.
> > > > 
> > > > This is a big problem for multipathd. The gist is that the
> > > > device
> > > > identification attributes in sysfs only change after rescanning
> > > > the
> > > > device. Thus if a user changes LUN assignments on a storage
> > > > system,
> > > > it can happen that a direct INQUIRY returns a different WWID
> > > > than the one in
> > > > sysfs, which is fatal. If we plan to rely more on sysfs for
> > > > device
> > > > identification in the future, the problem gets worse. 
> > > 
> > > I think many devices rely on the fact that they are identified by
> > > Vendor/model/serial_nr, because in most professional SAN storage
> > > systems you
> > > can pre-set the serial number to a custom value; so if you want a
> > > new
> > > disk
> > > (maybe a snapshot) to be compatible with the old one, just assign
> > > the
> > > same
> > > serial number. I guess that's the idea behind.
> > 
> > What you are saying sounds dangerous to me. If a snapshot has the
> > same
> > WWID as the device it's a snapshot of, it must not be exposed to
> > any
> > host(s) at the same time with its origin, otherwise the host may
> > happily combine it with the origin into one multipath map, and data
> > corruption will almost certainly result. 
> > 
> > My argument is about how the host is supposed to deal with a WWID
> > change if it happens. Here, "WWID change" means that a given
> > H:C:T:L
> > suddenly exposes different device designators than it used to,
> > while
> > this device is in use by a host. Here, too, data corruption is
> > imminent, and can happen in a blink of an eye. To avoid this,
> > several
> > things are needed:
> > 
> >  1) the host needs to get notified about the change (likely by an
> > UA
> > of
> > some sort)
> >  2) the kernel needs to react to the notification immediately, e.g.
> > by
> > blocking IO to the device,
> 
> There's no way to do that, in principle.  Because there could be
> other I/Os in flight.  You might (somehow) avoid retrying an I/O
> that got a UA until you figured out if something changed, but other
> I/Os can already have been sent to the target, or issued before you
> get to look at the status.

Right. But in practice, a WWID change will hardly happen under full IO
load. The storage side will probably have to block IO while this
happens, at least for a short time period. So blocking and quiescing
the queue upon an UA might still work, most of the time. Even if we
were too late already, the sooner we stop the queue, the better.

The current algorithm in multipath-tools needs to detect a path going
down and being reinstated. The time interval during which a WWID change
will go unnoticed is one or more path checker intervals, typically on
the order of 5-30 seconds. If we could decrease this interval to a sub-
second or even millisecond range by blocking the queue in the kernel
quickly, we'd have made a big step forward.

Regards
Martin



Re: [systemd-devel] [dm-devel] RFC: one more time: SCSI device identification

2021-04-27 Thread Martin Wilck
On Tue, 2021-04-27 at 13:48 +1000, Erwin van Londen wrote:
> > 
> > Wrt 1), we can only hope that it's the case. But 2) and 3) need work,
> > afaics.
> > 
> In my view the WWID should never change. 

In an ideal world, perhaps not. But in the dm-multipath realm, we know
that WWID changes can happen with certain storage arrays. See 
https://listman.redhat.com/archives/dm-devel/2021-February/msg00116.html 
and follow-ups, for example.

Regards,
Martin

-- 
Dr. Martin Wilck , Tel. +49 (0)911 74053 2107
SUSE Software Solutions Germany GmbH
HRB 36809, AG Nürnberg GF: Felix Imendörffer




Re: [systemd-devel] RFC: one more time: SCSI device identification

2021-04-26 Thread Martin Wilck
On Mon, 2021-04-26 at 13:14 +0200, Ulrich Windl wrote:
> > > 
> > 
> > While we're at it, I'd like to mention another issue: WWID changes.
> > 
> > This is a big problem for multipathd. The gist is that the device
> > identification attributes in sysfs only change after rescanning the
> > device. Thus if a user changes LUN assignments on a storage system,
> > it can happen that a direct INQUIRY returns a different WWID than the one in
> > sysfs, which is fatal. If we plan to rely more on sysfs for device
> > identification in the future, the problem gets worse. 
> 
> I think many devices rely on the fact that they are identified by
> Vendor/model/serial_nr, because in most professional SAN storage
> systems you
> can pre-set the serial number to a custom value; so if you want a new
> disk
> (maybe a snapshot) to be compatible with the old one, just assign the
> same
> serial number. I guess that's the idea behind.

What you are saying sounds dangerous to me. If a snapshot has the same
WWID as the device it's a snapshot of, it must not be exposed to any
host(s) at the same time with its origin, otherwise the host may
happily combine it with the origin into one multipath map, and data
corruption will almost certainly result. 

My argument is about how the host is supposed to deal with a WWID
change if it happens. Here, "WWID change" means that a given H:C:T:L
suddenly exposes different device designators than it used to, while
this device is in use by a host. Here, too, data corruption is
imminent, and can happen in a blink of an eye. To avoid this, several
things are needed:

 1) the host needs to get notified about the change (likely by an UA of
some sort)
 2) the kernel needs to react to the notification immediately, e.g. by
blocking IO to the device,
 3) userspace tooling such as udev or multipathd needs to figure out how
to deal with the situation cleanly, and eventually unblock it.

Wrt 1), we can only hope that it's the case. But 2) and 3) need work,
afaics.

Martin

-- 
Dr. Martin Wilck , Tel. +49 (0)911 74053 2107
SUSE Software Solutions Germany GmbH
HRB 36809, AG Nürnberg GF: Felix Imendörffer




Re: [systemd-devel] RFC: one more time: SCSI device identification

2021-04-23 Thread Martin Wilck
On Thu, 2021-04-22 at 21:40 -0400, Martin K. Petersen wrote:
> 
> Martin,
> 
> > I suppose 99.9% of users never bother with customizing the udev
> > rules.
> 
> Except for the other 99.9%, perhaps? :) We definitely have many users
> that tweak udev storage rules for a variety of reasons. Including
> being
> able to use RII for LUN naming purposes.
> 
> > But we can actually combine both approaches. If "wwid" yields a
> > good
> > value most of the time (which is true IMO), we could make user
> > space
> > rely on it by default, and make it possible to set an udev property
> > (e.g. ENV{ID_LEGACY}="1") to tell udev rules to determine WWID
> > differently. User-space apps like multipath could check the
> > ID_LEGACY
> > property to determine whether or not reading the "wwid" attribute
> > would
> > be consistent with udev. That would simplify matters a lot for us
> > (Ben,
> > do you agree?), without the need of adding endless BLIST entries.
> 
> That's fine with me.
> 
> > AFAICT, no major distribution uses "wwid" for this purpose (yet).
> 
> We definitely have users that currently rely on wwid, although
> probably
> not through standard distro scripts.
> 
> > In a recent discussion with Hannes, the idea came up that the
> > priority
> > of "SCSI name string" designators should actually depend on their
> > subtype. "naa." name strings should map to the respective NAA
> > descriptors, and "eui.", likewise (only "iqn." descriptors have no
> > binary counterpart; we thought they should rather be put below NAA,
> > prio-wise).
> 
> I like what NVMe did wrt. to exporting eui, nguid, uuid separately
> from
> the best-effort wwid. That's why I suggested separate sysfs files for
> the various page 0x83 descriptors. I like the idea of being able to
> explicitly ask for an eui if that's what I need. But that appears to
> be
> somewhat orthogonal to your request.
> 
> > I wonder if you'd agree with a change made that way for "wwid". I
> > suppose you don't. I'd then propose to add a new attribute
> > following
> > this logic. It could simply be an additional attribute with a
> > different
> > name. Or this new attribute could be a property of the block device
> > rather than the SCSI device, like NVMe does it
> > (/sys/block/nvme0n2/wwid).
> 
> That's fine. I am not a big fan of the idea that block/foo/wwid and
> block/foo/device/wwid could end up being different. But I do think
> that
> from a userland tooling perspective the consistency with NVMe is more
> important.

OK, then here's the plan: change SCSI (block) device identification to
work similarly to NVMe (in addition to what we have now).

 1. add a new sysfs attribute for SCSI block devices as
/sys/block/sd$X/wwid, with the value derived similarly to the current
"wwid" SCSI device attribute, but using the same priority for SCSI name
strings as for their binary counterparts, as described above.

 2. add "naa" and "eui" attributes, too, for user-space applications
that are interested in these specific attributes. 
Fixme: should we differentiate between different "naa" or eui subtypes,
like "naa_regext", "eui64" or similar? If the device defines multiple
"naa" designators, which one should we choose?

 3. Change udev rules such that they primarily look at the attribute from
1.) on new installations, and introduce a variable ID_LEGACY to tell the
rules to fall back to the current algorithm (a rough rule sketch follows
below). I suppose it makes sense to have at least ID_VENDOR and
ID_PRODUCT available when making this decision, so that it doesn't have
to be a global setting on a given host.
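To illustrate 3., a rough rule sketch; ID_LEGACY and ID_WWID are proposed
names from this thread, and the block-level "wwid" attribute from 1. doesn't
exist yet:

  # prefer the kernel-provided block-device wwid unless ID_LEGACY is set
  ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd*", \
      ENV{ID_LEGACY}!="1", ENV{ID_WWID}="$attr{wwid}"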

While we're at it, I'd like to mention another issue: WWID changes.

This is a big problem for multipathd. The gist is that the device
identification attributes in sysfs only change after rescanning the
device. Thus if a user changes LUN assignments on a storage system, 
it can happen that a direct INQUIRY returns a different WWID as in
sysfs, which is fatal. If we plan to rely more on sysfs for device
identification in the future, the problem gets worse. 

I wonder if there's a chance that future kernels would automatically
update the attributes if a corresponding UNIT ATTENTION condition such
as INQUIRY DATA HAS CHANGED is received (*), or if we can find some
other way to avoid data corruption resulting from writing to the wrong
device.

Regards,
Martin

(*) I've been told that WWID changes can happen even without receiving
an UA. But in that case I'm inclined to put the blame on the storage.

-- 
Dr. Martin Wilck , Tel. +49 (0)911 74053 2107
SUSE Software Solutions Germany GmbH
HRB 36809, AG Nürnberg GF: Felix Imendörffer




Re: [systemd-devel] RFC: one more time: SCSI device identification

2021-04-22 Thread Martin Wilck
On Wed, 2021-04-21 at 22:46 -0400, Martin K. Petersen wrote:
> 
> Martin,
> 
> > Hm, it sounds intriguing, but it has issues in its own right. For
> > years to come, user space will have to probe whether these attributes
> > exist, and fall back to the current ones ("wwid", "vpd_pg83")
> > otherwise. So user space can't be simplified any time soon. Speaking
> > for an important user space consumer of WWIDs (multipathd), I doubt
> > that this would improve matters for us. We'd be happy if the kernel
> > could just pick the "best" designator for us. But I understand that
> > the kernel can't guarantee a good choice (user space can't either).
> 
> But user space can be adapted at runtime to pick one designator over
> the
> other (ha!).

And that's exactly the problem. Effectively, all user space relies on
udev today, because that's where this "adaptation" is taking place. It
happens

 1) either in systemd's scsi_id built-in 
   
(https://github.com/systemd/systemd/blob/7feb1dd6544d1bf373dbe13dd33cf563ed16f891/src/udev/scsi_id/scsi_serial.c#L37)
 2) or in the udev rules coming with sg3_utils 
   
(https://github.com/hreinecke/sg3_utils/blob/master/scripts/55-scsi-sg3_id.rules)

1) is just as opaque and un-"adaptable" as the kernel, and its logic is
suboptimal. 2) is of course "adaptable", but that becomes a problem in
practice if udev fails to provide a WWID: multipath-tools goes through
various twists in that case to figure out a "fallback" WWID, guessing
whether that fallback matches what udev would have returned if it had
worked.

That's the gist of it - the general frustration about udev among some
of its heaviest users (talk to the LVM2 maintainers).

I suppose 99.9% of users never bother with customizing the udev rules.
IOW, these users might as well just use a kernel-provided value. But
the remaining 0.1% causes headaches for user-space applications, which
can't make solid assumptions about the rules. Thus, in a way, the
flexibility of the rules does more harm than good.

> We could do that in the kernel too, of course, but I'm afraid what
> the
> resulting BLIST changes would end up looking like over time.

That's something we want to avoid, sure.

But we can actually combine both approaches. If "wwid" yields a good
value most of the time (which is true IMO), we could make user space
rely on it by default, and make it possible to set an udev property
(e.g. ENV{ID_LEGACY}="1") to tell udev rules to determine WWID
differently. User-space apps like multipath could check the ID_LEGACY
property to determine whether or not reading the "wwid" attribute would
be consistent with udev. That would simplify matters a lot for us (Ben,
do you agree?), without the need of adding endless BLIST entries.


> I am also very concerned about changing what the kernel currently
> exports in a given variable like "wwid". A seemingly innocuous change
> to
> the reported value could lead to a system no longer booting after
> updating the kernel.

AFAICT, no major distribution uses "wwid" for this purpose (yet). I
just recently realized that the kernel's ALUA code refers to it. (*)

In a recent discussion with Hannes, the idea came up that the priority
of "SCSI name string" designators should actually depend on their
subtype. "naa." name strings should map to the respective NAA
descriptors, and "eui.", likewise (only "iqn." descriptors have no
binary counterpart; we thought they should rather be put below NAA,
prio-wise).

I wonder if you'd agree with a change made that way for "wwid". I
suppose you don't. I'd then propose to add a new attribute following
this logic. It could simply be an additional attribute with a different
name. Or this new attribute could be a property of the block device
rather than the SCSI device, like NVMe does it
(/sys/block/nvme0n2/wwid).

I don't like the idea of having separate sysfs attributes for
designators of different types; that's impractical for user space.

> But taking a step back: Other than "it's not what userland currently
> does", what specifically is the problem with designator_prio()? We've
> picked the priority list once and for all. If we promise never to
> change
> it, what is the issue?

If the prioritization in kernel and user space were the same, we could
migrate away from udev more easily without risking boot failure.

Thanks,
Martin

(*) which is an argument for using "wwid" in user space too - just to
be consistent with the kernel's internal logic.

-- 
Dr. Martin Wilck , Tel. +49 (0)911 74053 2107
SUSE Software Solutions Germany GmbH
HRB 36809, AG Nürnberg GF: Felix Imendörffer




Re: [systemd-devel] RFC: one more time: SCSI device identification

2021-04-16 Thread Martin Wilck
Hello Martin,

Sorry for the late response, still recovering from a week out of
office.

On Tue, 2021-04-06 at 00:47 -0400, Martin K. Petersen wrote:
> 
> Martin,
> 
> > The kernel's preference for type 8 designators (see below) is in
> > contrast with the established user space algorithms, which
> > determine
> > SCSI WWIDs on productive systems in practice. User space can try to
> > adapt to the kernel logic, but it will necessarily be a slow and
> > painful path if we want to avoid breaking user setups.
> 
> I was concerned when you changed the kernel prioritization a while
> back
> and I still don't think that we should tweak that code any further.

Ok.

> If the kernel picks one ID over another, that should be for the
> kernel's
> use. Letting the kernel decide which ID is best for userland is not a
> good approach.

Well, the kernel itself doesn't make any use of this property currently
(and user space doesn't use it much either, AFAIK).


> So I think my inclination would be to leave the current wwid as-is to
> avoid the risk of breaking things. And then export all ID descriptors
> reported in sysfs. Even though vpd83 is already exported in its
> entirety, I don't have any particular concerns about the individual
> values being exported separately. That makes many userland things so
> much easier. And I think the kernel is in a good position to
> disseminate
> information reported by the hardware.
> 
> This puts the prioritization entirely in the distro/udev/scripting
> domain. Taking the kernel out of the picture will make migration
> easier. And it allows a user to pick their descriptor of choice
> should a
> device report something completely unusable in type 8.

Hm, it sounds intriguing, but it has issues in its own right. For years
to come, user space will have to probe whether these attribute exist,
and fall back to the current ones ("wwid", "vpd_pg83") otherwise. So
user space can't be simplified any time soon. Speaking for an important
user space consumer of WWIDs (multipathd), I doubt that this would
improve matters for us. We'd be happy if the kernel could just pick the
"best" designator for us. But I understand that the kernel can't
guarantee a good choice (user space can't either).

What is your idea of how these new sysfs attributes should be named? Just
enumerate, or name them by type somehow?
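For comparison, NVMe already names them by type per namespace, in addition
to a best-effort "wwid" (the device name is a placeholder):

  cat /sys/block/nvme0n1/wwid
  cat /sys/block/nvme0n1/eui /sys/block/nvme0n1/nguid /sys/block/nvme0n1/uuid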

Thanks,
Martin

-- 
Dr. Martin Wilck , Tel. +49 (0)911 74053 2107
SUSE Software Solutions Germany GmbH
HRB 36809, AG Nürnberg GF: Felix Imendörffer




[systemd-devel] RFC: one more time: SCSI device identification

2021-03-29 Thread Martin Wilck
Hello,

[sorry for cross-posting, I think this is relevant to multiple
communities.]

I'm referring to the recent discussion about SCSI device identification
for multipath-tools 
(https://listman.redhat.com/archives/dm-devel/2021-March/msg00332.html)

As you all know, there are different designators to identify SCSI LUNs,
and the specs don't mandate priorities for devices that support
multiple designator types. There are various implementations for device
identification, which use different priorities (summarized below).

It's highly desirable to clean up this confusion and settle on a single
instance and a unique priority order. I believe this instance should be
the kernel.

OTOH, changing device WWIDs is highly dangerous for productive systems.
The WWID is prominently used in multipath-tools, but also in lots of
other important places such as fstab, grub.cfg, dracut, etc. No doubt
that we'll be stuck with the different algorithms for years, especially
for LTS distributions. But perhaps we can figure out a long-term exit
strategy?

The kernel's preference for type 8 designators (see below) is in
contrast with the established user space algorithms, which determine
SCSI WWIDs on productive systems in practice. User space can try to
adapt to the kernel logic, but it will necessarily be a slow and
painful path if we want to avoid breaking user setups.

In principle, I believe the kernel is "right" to prefer type 8. But
because the "wwid" attribute isn't actually used for device
identification today, changing the kernel logic would be less prone to
regressions than changing user space, even if it violates the principle
that the kernel's user space API must remain stable.

Would it be an option to modify the kernel logic?

If we can't, I think we should start with making the "wwid" attribute
part of the udev rule logic, and letting distros configure whether the
kernel logic or the traditional udev logic would be used.

Please tell me your thoughts on this matter.

Regards,
Martin

PS: Incomplete list of algorithms for SCSI designator priorities:

The kernel ("wwid" sysfs attribute) prefers "SCSI name string" (type 8)
designators over other types
(https://elixir.bootlin.com/linux/latest/A/ident/designator_prio).

The current set of udev rules in sg3_utils
(https://github.com/hreinecke/sg3_utils/blob/master/scripts/55-scsi-sg3_id.rules)
don't use the kernel's wwid attribute; they parse VPD 83 and 80
instead and prioritize types 36, 35, 32, and 2 over type 8.

udev's "scsi_id" tool, historically the first attempt to implement a
priority for this, doesn't look at the SCSI name attribute at all:
https://github.com/systemd/systemd/blob/main/src/udev/scsi_id/scsi_serial.c

There's a "fallback" logic in multipath-tools in case udev doesn't
provide a WWID:
https://github.com/opensvc/multipath-tools/blob/a41a61e8482def33e3ca8c9e3639ad2c37611551/libmultipath/discovery.c#L1040
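For reference, a quick way to compare the kernel's pick with what udev
currently exports on a given system (sdX is a placeholder):

  cat /sys/block/sdX/device/wwid
  udevadm info --query=property /dev/sdX | grep -E '^ID_(SERIAL|WWN)'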

-- 
Dr. Martin Wilck , Tel. +49 (0)911 74053 2107
SUSE Software Solutions Germany GmbH
HRB 36809, AG Nürnberg GF: Felix Imendörffer




Re: [systemd-devel] Questions about systemd's "root storage daemon" concept

2021-01-28 Thread Martin Wilck
Hi Lennart,

thanks again.

On Wed, 2021-01-27 at 23:56 +0100, Lennart Poettering wrote:
> On Mi, 27.01.21 21:51, Martin Wilck (mwi...@suse.com) wrote:
> 
> if you want the initrd environment to fully continue to exist,

I don't. I just need /sys and /dev (and perhaps /proc and /run, too) to
remain accessible. I believe most root storage daemons will need this.

> consider creating a new mount namespace, bind mount the initrd root
> into it recursively to some new dir you created. Then afterwards mark
> that mount MS_PRIVATE. then pivot_root()+chroot()+chdir() into your
> new old world.

And on exit, I'd need to tear all that down again, right? I don't want
my daemon to block shutdown because some file systems haven't been
cleanly unmounted.

Regards,
Martin




Re: [systemd-devel] Questions about systemd's "root storage daemon" concept

2021-01-27 Thread Martin Wilck
On Tue, 2021-01-26 at 11:33 +0100, Lennart Poettering wrote:
> 
> > 
> > [Unit]
> > Description=NVMe Event Monitor for Automatical Subsystem Connection
> > Documentation=man:nvme-monitor(1)
> > DefaultDependencies=false
> > Conflicts=shutdown.target
> > Requires=systemd-udevd-kernel.socket
> > After=systemd-udevd-kernel.socket
> 
> Why do you require this?
> 

Brain fart on my part. I need to connect to the kernel socket, but that
doesn't require the systemd unit.

> My guess: the socket unit gets shutdown, and since you have Requires=
> on it you thus go away too.

That was it, thanks a lot. So obvious in hindsight :-/

Meanwhile I've looked a bit deeper into the problems accessing "/dev"
that I talked about in my other post. scandir on "/" actually returns
an empty directory after switching root, and any path lookups for
absolute paths fail. I didn't expect that, because I thought systemd
removed the contents of the old root, and stopped on (bind) mounts.
Again, this is systemd-234.

If I chdir("/run") before switching root and chroot("..") afterwards
(*), I'm able to access everything just fine (**). However, if I do
this, I end up in the real root file system, which is what I wanted to
avoid in the first place.

So, I guess I'll have to create bind mounts for /dev, /sys etc. in the
old root, possibly after entering a private mount namespace?
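A rough shell-level sketch of that idea ("mydaemon" is a placeholder; a real
daemon would rather call unshare(2)/mount(2) itself):

  unshare --mount sh -c '
      mount --make-rprivate /     # keep later mount changes from propagating in
      exec /usr/sbin/mydaemon     # keeps the initrd view of /, /dev, /sys, /proc
  '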

The other option would be to save fd's for the file systems I need to
access and use openat() only. Right?

Regards,
Martin

(*) Michal suggested simply doing chroot(".") instead. That might work
as well; I haven't tried it yet.

(**) For notification about switching root, I used epoll(EPOLLPRI) on
/proc/self/mountinfo, because I read that inotify doesn't work on proc.
Polling for EPOLLPRI works just fine.




Re: [systemd-devel] Questions about systemd's "root storage daemon" concept

2021-01-26 Thread Martin Wilck
On Tue, 2021-01-26 at 11:30 +0100, Lennart Poettering wrote:
> 
> > Imagine two parallel instances of systemd-udevd (IMO there are
> > reasons
> > to handle it like a "root storage daemon" in some distant future).
> 
> Hmm, wa? naahh.. udev is about discovery; it should not be required to
> maintain access to something you found.

True. But if udev ran without interruption, we could get rid of
coldplug after switching root. That could possibly save us a lot of
trouble.

Anyway, it's just a thought I find tempting.

Regards
Martin




Re: [systemd-devel] Questions about systemd's "root storage daemon" concept

2021-01-25 Thread Martin Wilck
On Mon, 2021-01-25 at 18:33 +0100, Lennart Poettering wrote:
> 
> Consider using IgnoreOnIsolate=.
> 

I fail to make this work. Installed this to the initrd (note the
ExecStop "command"):

[Unit]
Description=NVMe Event Monitor for Automatical Subsystem Connection
Documentation=man:nvme-monitor(1)
DefaultDependencies=false
Conflicts=shutdown.target
Requires=systemd-udevd-kernel.socket
After=systemd-udevd-kernel.socket
Before=sysinit.target systemd-udev-trigger.service nvmefc-boot-connections.service
RequiresMountsFor=/sys
IgnoreOnIsolate=true

[Service]
Type=simple
ExecStart=/usr/sbin/nvme monitor $NVME_MONITOR_OPTIONS
ExecStop=-/usr/bin/systemctl show -p IgnoreOnIsolate %N
KillMode=mixed

[Install]
WantedBy=sysinit.target

I verified (in a pre-pivot shell) that systemd had seen the
IgnoreOnIsolate property. But when initrd-switch-root.target is
isolated, the unit is cleanly stopped nonetheless.

[  192.832127] dolin systemd[1]: initrd-switch-root.target: Trying to enqueue 
job initrd-switch-root.target/start/isolate
[  192.836697] dolin systemd[1]: nvme-monitor.service: Installed new job 
nvme-monitor.service/stop as 98
[  193.027182] dolin systemctl[3751]: IgnoreOnIsolate=yes
[  193.029124] dolin systemd[1]: nvme-monitor.service: Changed running -> 
stop-sigterm
[  193.029353] dolin nvme[768]: monitor_main_loop: monitor: exit signal received
[  193.029535] dolin systemd[1]: Stopping NVMe Event Monitor for Automatical 
Subsystem Connection...
[  193.065746] dolin systemd[1]: Child 768 (nvme) died (code=exited, 
status=0/SUCCESS)
[  193.065905] dolin systemd[1]: nvme-monitor.service: Child 768 belongs to 
nvme-monitor.service
[  193.066073] dolin systemd[1]: nvme-monitor.service: Main process exited, 
code=exited, status=0/SUCCESS
[  193.066241] dolin systemd[1]: nvme-monitor.service: Changed stop-sigterm -> 
dead
[  193.066403] dolin systemd[1]: nvme-monitor.service: Job 
nvme-monitor.service/stop finished, result=done
[  193.066571] dolin systemd[1]: Stopped NVMe Event Monitor for Automatical 
Subsystem Connection.
[  193.500010] dolin systemd[1]: initrd-switch-root.target: Job 
initrd-switch-root.target/start finished, result=done
[  193.500188] dolin systemd[1]: Reached target Switch Root.

After boot, the service actually remains running when isolating e.g. 
"rescue.target". But when switching root,
it doesn't work.

dolin:~/:[141]# systemctl show -p IgnoreOnIsolate nvme-monitor.service
IgnoreOnIsolate=yes

Tested only with systemd-234 so far. Any ideas what I'm getting wrong?

Martin




Re: [systemd-devel] Questions about systemd's "root storage daemon" concept

2021-01-25 Thread Martin Wilck
On Mon, 2021-01-25 at 18:33 +0100, Lennart Poettering wrote:
> On Sa, 23.01.21 02:44, Martin Wilck (mwi...@suse.com) wrote:
> 
> > Hi
> > 
> > I'm experimenting with systemd's root storage daemon concept
> > (https://systemd.io/ROOT_STORAGE_DAEMONS/).
> > 
> > I'm starting my daemon from a service unit in the initrd, and
> > I set argv[0][0] to '@', as suggested in the text.
> > 
> > So far so good, the daemon isn't killed. 
> > 
> > But a lot more is necessary to make this actually *work*. Here's a
> > list
> > of issues I found, and what ideas I've had so far how to deal with
> > them. I'd appreciate some guidance.
> > 
> > 1) Even if a daemon is exempted from being killed by killall(), the
> > unit it belongs to will be stopped when initrd-switch-root.target
> > is
> > isolated, and that will normally cause the daemon to be stopped,
> > too.
> > AFAICS, the only way to ensure the daemon is not killed is by
> > setting
> > "KillMode=none" in the unit file. Right? Any other mode would send
> > SIGKILL sooner or later even if my daemon was smart enough to
> > ignore
> > SIGTERM when running in the initrd.
> 
> Consider using IgnoreOnIsolate=.

Ah, thanks a lot. IIUC that would actually make systemd realize that
the unit continues to run after switching root, which is good.

Like I remarked for KillMode=none, IgnoreOnIsolate=true would be
suitable only for the "root storage daemon" instance, not for a
possible other instance serving data volumes only.
I suppose there's no way to make this directive conditional on being
run from the initrd, so I'd need two different unit files,
or use a drop-in in the initrd.
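E.g. a drop-in sketch that would be shipped only inside the initrd image
("mydaemon" is a placeholder):

  # /etc/systemd/system/mydaemon.service.d/initrd.conf
  [Unit]
  IgnoreOnIsolate=true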

Is there any way for the daemon to get notified if root is switched?

> 
> > 3) The daemon that has been started in the initrd's root file
> > system
> > is unable to access e.g. the /dev file system after switching
> > root. I haven't yet systematically analyzed which file systems are
> > available.   I suppose this must be handled by creating bind
> > mounts,
> > but I need guidance how to do this. Or would it be
> > possible/advisable for the daemon to also re-execute itself under
> > the real root, like systemd itself? I thought the root storage
> > daemon idea was developed to prevent exactly that.
> 
> Not sure why it wouldn't be able to access /dev after switching. We
> do
> not allocate any new instance of that, it's always the same devtmpfs
> instance.

I haven't dug deeper yet; I just saw "No such file or directory"
error messages trying to access device nodes that I knew existed, so I
concluded there were issues with /dev.

> Do not reexec onto the host fs, that's really not how this should be
> done.

Would there be a potential security issue because the daemon keeps a
reference to the initrd root FS?

> 
> > 4) Most daemons that might qualify as "root storage daemon" also
> > have
> > a "normal" mode, when the storage they serve is _not_ used as root
> > FS,
> > just for data storage. In that case, it's probably preferable to
> > run
> > them from inside the root FS rather than as root storage daemon.
> > That
> > has various advantages, e.g. the possibility to update the software
> > without rebooting. It's not clear to me yet how to handle the two
> > options (root and non-root) cleanly with unit files.
> 
> option one: have two unit files? i.e. two instances of the subsystem,
> one managing the root storage, and one the rest.

Hm, that looks clumsy to me. It could be done e.g. for multipath by
using separate configuration files and setting up appropriate
blacklists, but it would cause a lot of work to be done twice; e.g.
uevents would be received by both daemons and acted upon
simultaneously. Generally ruling out race conditions wouldn't be easy.

Imagine two parallel instances of systemd-udevd (IMO there are reasons
to handle it like a "root storage daemon" in some distant future).

> option two: if you cannot have multiple instances of your subsystem,
> then the only option is to make the initrd version manage
> everything. But of course, that sucks, but there's little one can do
> about that.

Why would it be so bad? I would actually prefer a single instance for
most subsystems. But maybe I'm missing something.

Thanks,
Martin



[systemd-devel] Questions about systemd's "root storage daemon" concept

2021-01-22 Thread Martin Wilck
Hi

I'm experimenting with systemd's root storage daemon concept
(https://systemd.io/ROOT_STORAGE_DAEMONS/).

I'm starting my daemon from a service unit in the initrd, and
I set argv[0][0] to '@', as suggested in the text.

So far so good, the daemon isn't killed. 

But a lot more is necessary to make this actually *work*. Here's a list
of issues I found, and what ideas I've had so far how to deal with
them. I'd appreciate some guidance.

1) Even if a daemon is exempted from being killed by killall(), the
unit it belongs to will be stopped when initrd-switch-root.target is
isolated, and that will normally cause the daemon to be stopped, too. 
AFAICS, the only way to ensure the daemon is not killed is by setting
"KillMode=none" in the unit file. Right? Any other mode would send
SIGKILL sooner or later even if my daemon was smart enough to ignore
SIGTERM when running in the initrd.

2) KillMode=none will make systemd consider the respective unit
stopped, even if the daemon is still running. That feels wrong. Are
there better options?

3) The daemon that has been started in the initrd's root file system is
unable to access e.g. the /dev file system after switching root. I
haven't yet systematically analyzed which file systems are available. 
I suppose this must be handled by creating bind mounts, but I need
guidance how to do this. Or would it be possible/advisable for the
daemon to also re-execute itself under the real root, like systemd
itself? I thought the root storage daemon idea was developed to prevent
exactly that.

4) Most daemons that might qualify as "root storage daemon" also have
a "normal" mode, when the storage they serve is _not_ used as root FS,
just for data storage. In that case, it's probably preferable to run
them from inside the root FS rather than as root storage daemon. That
has various advantages, e.g. the possibility to update the software
without rebooting. It's not clear to me yet how to handle the two
options (root and non-root) cleanly with unit files. 

 - if (for "root storage daemon" mode) I simply put the enabled unit
file in the initrd, systemd will start the daemon twice, at least if
it's a "simple" service. I considered working with conditions, such as 

   ConditionPathExists=!/run/my-daemon/my-pidfile

(where the pidfile would have been created by the initrd-based daemon)
but that would cause the unit in the root FS to fail, which is ugly.

 - I could (for root mode) add the enabled unit file to the initrd
and afterwards disable it in the root fs, thus avoiding two copies being
started. But that would cause issues whenever the initrd must be
rebuilt. I suppose it could be handled with a dracut module.

- I could create two different unit files mydaemon.service and
mydaemon-initrd.service and have them conflict. dracut doesn't support
this out of the box. A separate dracut module would be necessary, too.

- Some settings such as KillMode=none make sense for the service in the
initrd environment, but not for the one running in the root FS, and
vice versa. This is another argument for having separate unit files, or
initrd-specific drop-ins.

Bottom line for 4) is that a dracut module specific to the daemon at
hand must be written. That dracut module would need to figure out
whether the service is required for mounting root, and activate "root-
storage-daemon" mode by adding the service to the intird. The instance
in the root FS would then either need be disabled, or be smart enough
to detect situation and exit gracefully. Ideally, "systemctl status"
would show the service as running even thought the instance inside the
root FS isn't actually running. I am unsure if all this can be achieved
easily with the current sytemd functionality, please advise.

I hope this makes at least some sense.

Suggestions and Feedback welcome.

Regards
Martin






Re: [systemd-devel] SystemCallFilter

2019-05-28 Thread Martin Wilck
On Tue, 2019-05-28 at 11:43 +0200, Josef Moellers wrote:
> Hi,
> 
> We just had an issue with a partner who tried to filter out the
> "open"
> system call:
> 
> . This may, in general, not be a very clever idea because how is one
> to
> load a shared library to start with, but this example has revealed
> something problematic ...
>   SystemCallFilter=~open
> The problem the partner had was that the filter just didn't work. No
> matter what he tried, the test program ran to completion.
> It took us some time to figure out what caused this:
> The test program relied on the fact that when it called open(), that
> the
> "open" system call would be used, which it doesn't any more. It uses
> the
> "openat" system call instead (*).

AFAIK, glibc hardly ever uses open(2) any more, and it has been that way
for some time.

> Now it appears that this change is deliberate and so my question is
> what
> to do about these cases.
> Should one
> * also filter out "openat" if only "open" is required?

That looks wrong to me. Some people *might* want to filter open(2)
only, and would be even more surprised than you are now if it
implicitly filtered out openat(2) as well.

> * introduce a new group "@open" which filters both?

Fair, but then there are lots of XYat() syscalls which would need
to be treated the same way.
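In the meantime, the intent can be expressed by listing both variants
explicitly, e.g.:

  SystemCallFilter=~open openat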

> I regard "SystemCallFilter" as a security measure and if one cannot
> rely
> on mechanisms any more, what good is such a feature?

Have you seen this? https://lwn.net/Articles/738694/
IMO this is a question related to seccomp design; "SystemCallFilter"
is just a convenient helper for using seccomp.

Cheers,
Martin



[systemd-devel] RFC: temporarily deactivating udev rules during coldplug

2019-05-28 Thread Martin Wilck
We are facing problems during udev coldplug on certain very big systems
(read: > 1000 CPUs, several TiB RAM). Basically, the systems get
totally unresponsive immediately after coldplug starts, and remain so
for minutes, causing uevents to time out. Attempts to track it down
have shown that access to files on tmpfs (e.g. /run/udev/db) may take
a very long time. Limiting the maximum number of udev workers helps,
but doesn't seem to solve all problems we are seeing.
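For the record, a worker cap can be set on the kernel command line; the
value below is illustrative, and the exact option name is the one documented
in systemd-udevd(8) for the installed version:

  udev.children_max=16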

Among the things we observed was lots of activity running certain udev
rules which are executed for every device. One such example is the
"vpdupdate" rule on Linux-PowerPC systems:

https://sourceforge.net/p/linux-diag/libvpd/ci/master/tree/90-vpdupdate.rules

Another one is a SUSE-specific rule that is run on CPU or memory
events
(https://github.com/openSUSE/kdump/blob/master/70-kdump.rules.in).
It is triggered very often on large systems that may have 1000s of
memory devices.

These are rules that are worthwhile and necessary in a fully running
system to respond to actual hotplug events, but that need not be run
during coldplug, in particular not 1000s of times in a very short time
span. 

Therefore I'd like to propose a scheme to deactivate certain rules
during coldplug. The idea involves 2 new configuration directories:

 /etc/udev/pre-trigger.d:

   "*.rules" files in this directory are copied to /run/udev/rules.d
   before starting "udev trigger". Normally these would be 0-byte files
   with a name corresponding to an actual rule file from
   /usr/lib/udev/rules.d - by putting them to /run/udev/rules.d,
   the original rules are masked.
   After "udev settle" finishes, either successfully or not, the
   files are removed from /run/udev/rules.d again.

 /etc/udev/post-settle.d:

   "*.post" files in this directory are executed after udev settle
   finishes. The intention is to create a "cumulative action", ideally
   equivalent to having run the masked-out rules during coldplug.
   This may or may not be necessary, depending on the rules being
   masked out. For the vpdupdate rule above, this comes down to running
   "/bin/touch /run/run.vpdupdate" (see the example below).

The idea is implemented with a simple shell script and two unit files.
The 2nd unit file is necessary because simply using systemd-udev-
settle's  "ExecStartPost" doesn't work - unmasking must be done even if
"udev settle" fails or times out. "ExecStopPost" doesn't work either,
we don't want to run this when systemd-udev-settle.service is stopped
after having been started successfully.

See details below. Comments welcome.
Also, would this qualify for inclusion in the systemd code base?

Martin


Shell script: /usr/lib/udev/coldplug.sh

#! /bin/sh
PRE_DIR=/etc/udev/pre-trigger.d
POST_DIR=/etc/udev/post-settle.d
RULES_DIR=/run/udev/rules.d

[ -d "$PRE_DIR" ] || exit 0
[ -d "$RULES_DIR" ] || exit 0

case $1 in
mask)
cd "$PRE_DIR"
for fl in *.rules; do
[ -e "$fl" ] || break
cp "$fl" "$RULES_DIR"
done
;;
unmask)
cd "$PRE_DIR"
for fl in *.rules; do
[ -e "$fl" ] || break
rm -f "$RULES_DIR/$fl"
done
;;
post)
[ -d "$POST_DIR" ] || exit 0
cd "$POST_DIR"
for fl in *.post; do
[ -e "$fl" ] || break
[ -x "$fl" ] || continue
./"$fl"
done
;;
*) echo "usage: $0 [mask|unmask|post]" >&2; exit 1;;
esac



Unit file: systemd-udev-pre-coldplug.service

[Unit]
Description=Mask udev rules before coldplug
DefaultDependencies=No
Conflicts=shutdown.target
Before=systemd-udev-trigger.service
Wants=systemd-udev-post-coldplug.service
ConditionDirectoryNotEmpty=/etc/udev/pre-trigger.d
ConditionPathIsDirectory=/run/udev/rules.d
ConditionFileIsExecutable=/usr/lib/udev/coldplug.sh

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=-/usr/lib/udev/coldplug.sh mask
ExecStop=-/usr/lib/udev/coldplug.sh unmask

[Install]
WantedBy=sysinit.target



Unit file: systemd-udev-post-coldplug.service

[Unit]
Description=Reactivate udev rules after coldplug
DefaultDependencies=No
Conflicts=shutdown.target
After=systemd-udev-settle.service
ConditionDirectoryNotEmpty=/etc/udev/pre-trigger.d
ConditionPathIsDirectory=/run/udev/rules.d
ConditionFileIsExecutable=/usr/lib/udev/coldplug.sh

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=-/usr/bin/systemctl stop systemd-udev-pre-coldplug.service
ExecStart=-/usr/lib/udev/coldplug.sh post

[Install]
WantedBy=sysinit.target



Re: [systemd-devel] Possible race condition with LVM activation during boot

2019-02-07 Thread Martin Wilck
On Thu, 2019-02-07 at 19:13 +0100, suscrici...@gmail.com wrote:
> El Thu, 7 Feb 2019 11:18:40 +0100
> 
> There's been a reply in Arch Linux forums and at least I can apply
> some
> contingency measures. If it happens again I will provide more info
> following your advice.

The log shows clearly that the device was available first:

feb 06 12:07:09 systemd[1]: Starting File System Check on 
/dev/disk/by-uuid/cabdab31-983b-401f-be30-dda0ae462080...
feb 06 12:07:09 systemd-fsck[520]: multimedia: limpio, 1051/953984 ficheros, 
193506294/244189184 bloques
feb 06 12:07:09 systemd[1]: Started File System Check on 
/dev/disk/by-uuid/cabdab31-983b-401f-be30-dda0ae462080.

That wouldn't be possible without the device being visible to the system.
Shortly after that, you get

feb 06 12:07:09 mount[544]: mount: /mnt/multimedia: el dispositivo especial 
/dev/disk/by-uuid/cabdab31-983b-401f-be30-dda0ae462080 no existe.

... So the device that had already been visible must have disappeared 
temporarily.
After the mount failure, we see messages about the LV becoming active:

[...]
feb 06 12:07:09 lvm[483]:   1 logical volume(s) in volume group "storage" now 
active
[...]
feb 06 12:07:09 lvm[494]:   1 logical volume(s) in volume group "storage" now 
active

There are two "lvm pvscan" processes (483 and 494) that may be interfering 
with each other and/or with the "mount" process. These processes are running 
on 8:17 (/dev/sdb1) and 254:1. I couldn't figure out from your logs what this 
latter
device might be. 

Wild guess: the pvscan processes working on the VG while it's already
visible are causing the device to go offline for a short time span,
and if the mount is attempted in that time window, the error occurs.

There's another lvm process (340) started earlier by the lvm2-monitor
unit ("Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or
progress polling."). Immediately after termination of this process, the
device seems to be detected by systemd and the above fsck/mount
sequence begins, while the pvscan processes are still running. "lvm2-
monitor.service" runs "vgchange --monitor y". It almost looks as if
this had caused the devices to become visible to systemd, but that would be
wrong AFAICT.

Can you reproduce this with "udev.log-priority=debug"?
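E.g., switched on at runtime:

  udevadm control --log-priority=debug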

Regards,
Martin





Re: [systemd-devel] LibUdev: serial not displayed properly for scsi devices

2019-02-07 Thread Martin Wilck
On Thu, 2019-02-07 at 14:39 +, Sven Wiltink wrote:
> Hey all,
> 
> We've been running into the issue where lsblk -O outputs the wwn
> instead of the serial for scsi devices. After some debugging I've
> come to the conclusion that /lib/udev/scsi_id is the underlying
> cause. It outputs the wwn of a disk in the ID_SERIAL_SHORT field, but
> does also export the actual serial in a separate field
> (ID_SCSI_SERIAL).

This is not a bug.

Check the code in scsi_serial.c to see that this is done on purpose.
ID_SERIAL does *not* represent the hardware serial number, it is rather
something like "the best unique identifier for the device at hand".
This is typically an NAA identifier from the VPD page 0x83 ("device
identification"). VPD 0x83 may contain multiple identifiers, out of
which the code uses the "best" one according to a hard-coded internal
priority list (id_search_list in scsi_serial.c). ID_SERIAL_SHORT just
strips the first byte (NAA type) off ID_SERIAL.

You can override this behavior by using the parameter "-p 0x80" for
scsi_id, which forces scsi_id to use VPD page 0x80 ("serial number").
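
E.g. (the path to scsi_id and the device node are just examples,
distributions install it in different places):

  /usr/lib/udev/scsi_id --whitelisted --export --page=0x80 --device=/dev/sda

With --export it prints the ID_* variables it would set, with
ID_SERIAL/ID_SERIAL_SHORT now derived from VPD page 0x80.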

Don't be confused by the name ID_SERIAL, which is just historic AFAICT.
The variable name ID_SERIAL was introduced in 2005, in udev 59 (!).

https://git.kernel.org/pub/scm/linux/hotplug/udev.git/commit/?id=34129109a1f5dca72af2f3f2d3d14a9a0d0c43f6

The name might have been chosen after the variable name "serial" that
the code printed, but even at that time, the algorithm described above
was already followed.


> 
> I am unsure if the fix should be in udev or lsblk, but I wanted to
> bring it up for discussion because the serial not being displayed
> properly.

Yes, lsblk should be fixed (if anything). We can't change udev in this
regard. Too much code depends on the given semantics of ID_SERIAL.

Regards,
Martin




Re: [systemd-devel] udev “PROGRAM/RUN” command not working properly for “REMOVE” action

2019-02-05 Thread Martin Wilck
On Mon, 2019-02-04 at 13:19 +0100, Lennart Poettering wrote:
> 
> reading sysfs attrs is problematics from "remove" rules, as the sysfs
> device is likely to have vanished by then, as rules are executed
> asynchronously to the events they are run for.
> 
> udev will import the udev db from the last event it has seen on a
> device on "remove", but sysfs attrs are not stored in the udev
> db. hence, consider testing against udev db props here, not sysfs
> attrs.

Right. Then, maybe, udev should treat & report attempts to refer to
sysfs attributes in rules for "remove" events as errors in the first
place?
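
For the rule quoted in the original mail, something along these lines might
work (untested; it assumes the default rules ran the "usb_id" builtin on the
"add" event, so that ID_VENDOR_ID / ID_MODEL_ID ended up in the udev db and
are available again on "remove"):

  ACTION=="remove", SUBSYSTEM=="usb", ENV{ID_VENDOR_ID}=="1244", ENV{ID_MODEL_ID}=="206d", RUN+="/bin/touch /home/user/udev/%k"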

Regards,
Martin





Re: [systemd-devel] udev “PROGRAM/RUN” command not working properly for “REMOVE” action

2019-02-01 Thread Martin Wilck
On Thu, 2019-01-31 at 14:46 +0100, Ziemowit Podwysocki wrote:
> 
> ACTION=="remove", SUBSYSTEM=="usb", DRIVER=="usb",
> ATTRS{idVendor}=="1244", ATTRS{idProduct}=="206d", RUN+="/bin/touch
> /home/user/udev/%k"
> 
> This one suppose to create file named after "KERNEL" param of the
> device. This is also not happening! But for action "add" it works!

Have you tried udev debugging (udevadm control -l debug)?

I'd do that, and then I'd remove some of the conditions and see if one of
them is not (or only unreliably) set on remove events, thus causing your
rule not to be run. E.g. start with 

  ACTION=="remove", RUN+="..."

and then add the original conditions one by one.

Regards
Martin




Re: [systemd-devel] mount unit with special requirements

2018-09-10 Thread Martin Wilck
On Mon, 2018-09-10 at 09:55 +0200, Michael Hirmke wrote:
> 
> > > 
> > > > (I would just use `umount /var/backup`, however.)
> > > 
> > > Can't do that as long as the mount unit is under systemd control.
> > > A few seconds later systemd remounts it on its own.
> > > 
> > "noauto" mount option?
> 
> This would prevent it from being mounted at startup, which is
> necessary.

If you leave out "noauto", you're telling systemd to mount the file
system when it's ready. You said you didn't want that. From your
problem description, I'd infer that this file system needs to be
mounted only at certain times (while the backup is running). My
suggestion would be to create a dedicated script (or systemd service,
for that matter) that would mount the file system, start the backup,
and unmount / freeze the file system when the backup is done.
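
A very rough sketch of what I mean (untested; names and paths are
placeholders, and /var/backup would carry "noauto" in /etc/fstab):

  #!/bin/sh
  set -e
  mount /var/backup
  /usr/local/bin/copy-job      # placeholder for the actual backup command
  umount /var/backup

That could be run from cron, or wrapped in a oneshot service triggered by a
systemd timer.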

Regards,
Martin




Re: [systemd-devel] mount unit with special requirements

2018-09-10 Thread Martin Wilck
On Sat, 2018-09-08 at 19:55 +0200, Michael Hirmke wrote:
> Hi *,
> 
> [...]
> > > - The partition has to be mounted on boot.
> > > - It has to be unmounted before the nightly copy job, so that an
> > > fsck
> > >   can be performed.
> > > - After that it has to be mounted read only, so that during the
> > > copy
> > >   job no other machine can write to it.
> > > - After finishing the copy job, the partition has to be remounted
> > > read
> > >   write again.
> > > 
> > Isn't that commonly done using LVM? If it were on a logical volume,
> > you
> > could fsfreeze /var/backup (to suspend writes during snapshotting),
> > make a
> > LVM snapshot, thaw, mount the read-only snapshot elsewhere and
> > rsync off it.
> 
> I never used LVM and this system does not use an LVM partitioning.

fsfreeze should work without LVM. Of course you shouldn't be writing
tons of data to the file system while it's frozen, therefore LVM
snapshot + quick unfreeze would be more robust.
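
Roughly (mount point taken from your mail; the middle step is whatever
creates the snapshot or runs the copy):

  fsfreeze --freeze /var/backup
  # ... take the snapshot / run the copy job ...
  fsfreeze --unfreeze /var/backup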

> 
> > (I would just use `umount /var/backup`, however.)
> 
> Can't do that as long as the mount unit is under systemd control.
> A few seconds later systemd remounts it on its own.
> 

"noauto" mount option?

Martin




Re: [systemd-devel] About stable network interface names

2017-06-09 Thread Martin Wilck
On Tue, 2017-06-06 at 21:40 +0300, Andrei Borzenkov wrote:
> 
> Can device and function really change? My understanding is that
> device
> part is determined by bus physical wiring and function by PCI card
> design; this leaves bus as volatile run-time enumeration value.

For PCIe, that's only true for the "function" part.
https://superuser.com/questions/1060808/how-is-the-device-determined-in-pci-enumeration-bus-device-function

The systemd docs are a bit misleading for PCIe, as they talk about
"physical/geographical location" for the common enp$Xs$Yf$Z scheme,
which is actually just the BDF. The interface on my laptop is called
enp0s31f6 although the laptop doesn't have "slot 31". (1)
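
(What the naming code derives for a given interface can be checked with,
e.g.,

  udevadm test-builtin net_id /sys/class/net/enp0s31f6

which prints the ID_NET_NAME_* properties it would assign.)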

Martin

(1) Well, that's actually because the manufacturer skimped on implementing
the DMI BIOS data correctly: there is even a DMI type 41 entry for the
onboard LAN, but the PCI device number (sic!) is wrong.

-- 
Dr. Martin Wilck <mwi...@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)



Re: [systemd-devel] About stable network interface names

2017-06-06 Thread Martin Wilck
On Mon, 2017-05-29 at 02:35 +0200, Cesare Leonardi wrote:

> I ask because I've done several tests, with different motherboards, 
> adding and removing PCI-express cards and that expectation was not 
> satisfied in many cases.
> 
> For example, in one of those tests I initially had this setup:
> Integrated NIC: enp9s0
> PCIE1 (x1): dual port ethernet card [enp3s0, enp4s0]
> PCIE2 (x16): empty
> PCIE3 (x1): dual port ethernet card [enp7s0, enp8s0]
> 
> Then i inserted a SATA controller in the PCIE2 slot and three NICs
> got 
> renamed:
> Integrated NIC: enp10s0
> PCIE1 (x1): dual port ethernet card [enp3s0, enp4s0]
> PCIE2 (x16): empty
> PCIE3 (x1): dual port ethernet card [enp8s0, enp9s0]
> 
> Why?
> Didn't this interface naming scheme supposed to avoid this kind of
> renaming?
>  From what i've experimented network names are guaranteed to be
> stable 
> across reboots *and* if you doesn't add or remove hardware.

As others have remarked already, PCI bus-device-function is subject to
change.

Yet this highlights a problem with the current "predictable" network
device naming scheme. "biosdevname" uses information from various
sources, including ACPI _DSM method and SMBIOS type 41 and type 9. The
systemd device naming scheme (or the kernel code in pci-label.c, for
that matter) evaluates only _DSM and SMBIOS type 41, and not SMBIOS
type 9. The latter is necessary for mapping system PCI slot numbers to
bus-device-function tuples. Obviously, the physical slot a controller
is connected to is less likely to change than the bus-device-function
number, so exposing it might make a lot of sense.
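
(Whether the firmware provides this data at all can be checked with
dmidecode, e.g.

  dmidecode -t 9    # SMBIOS type 9: system slots, incl. bus address per slot
  dmidecode -t 41   # SMBIOS type 41: onboard devices extended information

run as root.)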

All of this requires support on the BIOS/Firmware side - without that,
none of the "predictable" schemes work correctly.

Regards,
Martin

-- 
Dr. Martin Wilck <mwi...@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)



Re: [systemd-devel] Best way to configure longer start timeout for .device units?

2017-05-08 Thread Martin Wilck
On Sat, 2017-04-29 at 13:40 +0200, Lennart Poettering wrote:
> On Fri, 28.04.17 09:36, Michal Sekletar (msekl...@redhat.com) wrote:
> 
> > Hi,
> > 
> > On big setups (read: a lot of multipathed disks), probing and
> > assembling storage may take significant amount of time. However, by
> > default systemd waits only 90s (DefaultTimeoutStartSec) for
> > "top-level" device unit to show up, i.e. one that is referenced in
> > /etc/fstab.
> > 
> > One possible solution is to change JobTimeout for device unit by
> > adding x-systemd.device-timeout= option to fstab entries. This is
> > kinda ugly.
> > 
> > Another option is to bump value of DefaultTimeoutStartSec, since
> > that
> > is what systemd uses as default timeout for device's unit start
> > job.
> > However, this has possible negative effect on all other units as
> > well,
> > e.g. service Exec* timeouts will be affected by this change.
> > 
> > I am looking for elegant solution that doesn't involve rewriting
> > automation scripts that manage /etc/fstab.
> > 
> > Is there any other way how to configure the timeout? Can't we
> > introduce new timeout value specifically for device units?
> > 
> > Any advice is much appreciated, thanks.
> 
> Note that x-systemd.device-tiemout= is implemented by simply writing
> out drop-in snippets for the .device unit. Hence, if you know the
> device unit names ahead you can write this out from any tool you
> like.
> 
> I am not overly keen on adding global but per-unit-type options for
> this, but then again I do see the usefulness, hence such a
> DefaultDeviceTimeoutStartSec= setting might be OK to add...

Wouldn't this, at least in part, be covered by using the newly introduced
"JobRunningTimeoutSec" for devices?

https://github.com/systemd/systemd/commit/db7076bf78bd8e466ae927b6d3ddf64190c8d299

https://github.com/systemd/systemd/pull/5164
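
E.g. a drop-in like the following (the device unit name and the value are
only examples):

  # /etc/systemd/system/dev-sdb1.device.d/timeout.conf
  [Unit]
  JobRunningTimeoutSec=10min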


Martin


-- 
Dr. Martin Wilck <mwi...@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)



Re: [systemd-devel] Should automount units for network filesystems be Before=local-fs.target?

2017-04-27 Thread Martin Wilck
On Thu, 2017-04-27 at 15:53 +1000, Michael Chapman wrote:
> Hello all,
> 
> At present, when systemd-fstab-generator creates an automount unit
> for an 
> fstab entry, it applies the dependencies that would have been put
> into the 
> mount unit into the automount unit instead.
> 
> For a local filesystem, this automount unit would be 
> Before=local-fs.target. For a network filesystem, it gets 
> Before=remote-fs.target. If the mount is not noauto, it also gets a 
> corresponding WantedBy= or RequiredBy= dependency.
> 
> Would it make more sense for the automount unit to be ordered before
> (and, 
> if not noauto, be pulled in by) local-fs.target, even for network 
> filesystems?

Please don't. We don't need additional failure points for local-
fs.target.

Martin

-- 
Dr. Martin Wilck <mwi...@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)



Re: [systemd-devel] Early testing for service enablement

2017-04-13 Thread Martin Wilck
On Thu, 2017-04-13 at 11:45 +0200, Lennart Poettering wrote:
> On Thu, 13.04.17 08:49, Mantas Mikulėnas (graw...@gmail.com) wrote:
> 
> > IIRC, enable/disable/is-enabled are implemented entirely via direct
> > filesystem access. Other than that, systemctl uses a private socket
> > when
> > running as root – it talks DBus but doesn't require dbus-daemon.
> 
> Correct, enable/disable/is-enabled can operate without PID 1, but
> they
> usually don't unless the tool detects it is being run in a chroot
> environment.
> 
> And yes, systemctl can communicate with PID 1 through a private
> communication socket that exists as long as PID 1 exists. dbus-daemon
> is not needed, except when your client is unprivileged.

If I interpret this answer correctly, you're saying that "systemctl is-
enabled xyz.service" *should* actually work, even if it's called right
after PID 1 is started. I'm pretty certain that that wasn't the case
for me. My client was running from an udev rule and thus not
unprivileged. That should be considered a bug, then?

My tests were done with systemd 228 a while ago.

Martin

-- 
Dr. Martin Wilck <mwi...@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)



[systemd-devel] Early testing for service enablement

2017-04-13 Thread Martin Wilck
Hi all,

is there a way to test whether a certain service is enabled (or is
going to be enabled) that would work even very early in the boot
process (in our case from udev rules called in the "udev trigger" phase
both in initrd and after switching root)?

I tried calling "systemctl is-enabled" but it obviously depends on some
services (dbus, I guess) being functional, and didn't provide reliable
results during early boot for us. 

We finally resorted to scanning *.wants directories ourselves, but that's of
course sub-optimal (a poor man's partial implementation of systemd's service
enablement logic). 
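
Roughly like this (heavily simplified; the service name is an example, and
this only catches units enabled under /etc):

  # crude test whether foo.service is wanted by some target
  if ls /etc/systemd/system/*.wants/foo.service >/dev/null 2>&1; then
      echo "foo.service appears to be enabled"
  fi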

Regards
Martin

-- 
Dr. Martin Wilck <mwi...@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)



Re: [systemd-devel] Errorneous detection of degraded array

2017-01-27 Thread Martin Wilck
> On 26.01.2017 21:02, Luke Pyzowski wrote:
> > Hello,
> > I have a large RAID6 device with 24 local drives on CentOS7.3.
> > Randomly (around 50% of the time) systemd will unmount my RAID
> > device thinking it is degraded after the mdadm-last-resort@.timer
> > expires, however the device is working normally by all accounts,
> > and I can immediately mount it manually upon boot completion. In
> > the logs below /share is the RAID device. I can increase the timer
> > in /usr/lib/systemd/system/mdadm-last-resort@.timer from 30 to 60
> > seconds, but this problem can randomly still occur.

It seems to me that you rather need to decrease the timer's timeout value, or
(more reasonably) increase x-systemd.device-timeout for the /share mount
point.
Unfortunately your log excerpt contains no time stamps, but I suppose you're
facing a race where the device times out before the "last resort" timer
starts it (and before the last devices appear).
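
I.e. something like this in /etc/fstab (device name, fs type and timeout
value are only examples):

  /dev/md0   /share   xfs   defaults,x-systemd.device-timeout=5min   0  0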

Martin

-- 
Dr. Martin Wilck <mwi...@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
