[systemd-devel] Stuck mount units

2022-09-06 Thread Dusty Mabe
Hey all,

This one is a bit of a long shot so I'm not too optimistic about
finding any resolution, but figured I'd try.

In Fedora CoreOS we've seen a mount unit get stuck and the system will
just stay there forever (in the initramfs). Does this sound crazy?
Anyone seen something similar before?

The crazy part here is that debug logging doesn't expose the problem
(not surprising I guess). The even crazier part is that if I do
anything later (while the system is stalled) the stalled unit gets
unstuck and continues. See my proposed workaround that works [1].

Anyway.. There are way more details over in [2]. I just didn't feel
like I had enough to open a proper bug here.

Thanks for any help!
Dusty

[1] https://github.com/coreos/fedora-coreos-config/pull/1961
[2] https://github.com/coreos/fedora-coreos-tracker/issues/1233#issuecomment-1238814171


Re: [systemd-devel] issues with systemd-cryptsetup@.service after in 251-rc3

2022-05-23 Thread Dusty Mabe



On 5/22/22 08:35, Zbigniew Jędrzejewski-Szmek wrote:
> On Fri, May 20, 2022 at 03:51:58PM +0200, Zbigniew Jędrzejewski-Szmek wrote:
>> On Thu, May 19, 2022 at 01:42:43PM +0200, Daan De Meyer wrote:
>>>> On 19.05.22 at 05:32, Dusty Mabe wrote:
>>>>> I'm requesting help to try to find a problematic commit between
>>>>> v251-rc2..v251-rc3.
>>>>>
>>>>> We have a test in Fedora CoreOS [1] that tests luks and this test
>>>>> started failing in our rawhide stream with the introduction of
>>>>> 251-rc3. Reverting back to 251-rc2 makes the failing test go away. I
>>>>> briefly looked at the commits from v251-rc2..v251-rc3 but nothing
>>>>> jumped out at me.
>>>>>
>>>>> Any commits in there that we think might be risky or ones we should
>>>>> look at closer?
>>>>
>>>> Please bisect. It’s the most efficient way, and you can do it yourself,
>>>> especially as you have a test to reproduce the issue.
>>>>
>>>>> Here's the original bug [2] I opened against Fedora CoreOS:
>>
>>>>> [1] https://github.com/coreos/fedora-coreos-config/tree/testing-devel/tests/kola/root-reprovision/luks
>>>>> [2] https://github.com/coreos/fedora-coreos-tracker/issues/1200
>>
>> No idea. There were a few changes to udev and sd-device, and it's
>> possible that they caused some regression. But I don't see anything
>> obvious.
> 
> https://koji.fedoraproject.org/koji/taskinfo?taskID=87360713 has a
> scratch build with one patch reverted.  Could you test if it fixes
> the issue?

Just tested a build of FCOS with that scratch build included. It passes our
tests now so it looks like it was that patch that was the problem.


Dusty


[systemd-devel] issues with systemd-cryptsetup@.service after in 251-rc3

2022-05-18 Thread Dusty Mabe
I'm requesting help to try to find a problematic commit between
v251-rc2..v251-rc3.

We have a test in Fedora CoreOS [1] that tests luks and this test started
failing in our rawhide stream with the introduction of 251-rc3. Reverting
back to 251-rc2 makes the failing test go away. I briefly looked at the
commits from v251-rc2..v251-rc3 but nothing jumped out at me.

Any commits in there that we think might be risky or ones we should look at
closer?

Here's the original bug [2] I opened against Fedora CoreOS:

Dusty

[1] https://github.com/coreos/fedora-coreos-config/tree/testing-devel/tests/kola/root-reprovision/luks
[2] https://github.com/coreos/fedora-coreos-tracker/issues/1200


Re: [systemd-devel] Should `MACAddressPolicy=persistent` for bridges/bonds/all-software-devices be reconsidered?

2022-05-12 Thread Dusty Mabe



On 5/12/22 13:36, Dan Streetman wrote:
> On Thu, May 12, 2022 at 11:11 AM Thomas Haller wrote:
>>
>> Hi Zbyszek,
>>
>>
>>
>> I must say, I personally don't care too much. NetworkManager is fine
>> either way.
>>
>> There is however the problem about RHEL8/9, which patches this
>> downstream. I don't like that deviation, but I'd also be uncomfortable
>> to push that change for RHEL(10) users.
>>
>> But let me play devil's advocate here...
>>
>>
>> On Mon, 2022-05-09 at 19:27 +0200, Zbigniew Jędrzejewski-Szmek wrote:
>>>
>>> FWIW, I still think it's a better _default_.
>>
>> I don't agree that it's clearly better. Your arguments don't seem
>> strong, arguably, neither are mine. Except, that it's not clear that
>> this solves an actual problem, while it clearly causes problems for
>> some people. Just look at the referenced issues from !3374.
>>
>>
>> Either
>>
>>   - a user doesn't care about the MAC address,
> 
> note that it's possible for a user not to care about the *specific*
> mac address, only that they want the mac to remain consistent.
> 
> for example, on my system i have br0 bridge with multiple interfaces
> attached, and my local DHCP server is configured to provide a static
> addr to br0's mac. If I replace an interface card (e.g. change from 1g
> to 10g, or just replace a failing nic) then the br0 mac *does not*
> change, if using systemd's default. If I had br0 inheriting its mac
> from one of the attached interfaces, it would change, and i'd have to
> update my dhcp server config.
> 
> I think the argument really comes down to, should the default be to
> have (without user configuration) a mac that is predictable or a mac
> that is consistent. Your argument is for predictability, which makes
> the field engineer's work deploying systems easier, but if you lose
> consistency then the life of the maintenance engineer gets harder.

Interesting discussion. Let's take it a bit further.

On your system how did your DHCP server get configured to provide an
addr to br0's MAC? You had to install the OS, and create br0 first
before you even knew the MAC to tell the network admin what the MAC
was. Now you're golden, the MAC will never change for the lifetime
of that OS install even if someone replaces a NIC, but it wasn't a great
first experience really.

On the other hand, what if you re-provision that server often (new machine-id)?
You get a new MAC and you get to dance with your network admin again. Or
what if you have a disk failure? You most likely backed up your critical data,
but did you back up the machine-id that your MAC is hashed from? Probably not,
and even if you did, would you want to duplicate that machine-id onto the new
install?

Barring other reasons, if we simplify it down to just the consistency argument,
one approach seems better if you replace NICs often and the other seems
better if you re-install your OS often.
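
To make that trade-off concrete, here is a rough Python sketch of the two
events under both policies. It is only an illustration: SHA-256 stands in
for whatever hash systemd really uses, "the bridge takes the first port's
MAC" is a simplification of the old behaviour, and the machine-ids and MAC
addresses are made-up values.

```python
import hashlib


def persistent_mac(machine_id: str, ifname: str) -> str:
    """Stand-in for MACAddressPolicy=persistent: hash machine-id + name.

    SHA-256 is an assumption for illustration, not systemd's real hash.
    """
    raw = bytearray(hashlib.sha256(f"{machine_id}/{ifname}".encode()).digest()[:6])
    raw[0] = (raw[0] & 0xFE) | 0x02  # unicast, locally administered
    return ":".join(f"{octet:02x}" for octet in raw)


def inherited_mac(port_macs: list[str]) -> str:
    """Simplified old behaviour: the bridge takes the first port's MAC."""
    return port_macs[0]


def bridge_macs(machine_id: str, port_macs: list[str]) -> dict[str, str]:
    """MAC that br0 would get under each policy for a given system state."""
    return {
        "persistent": persistent_mac(machine_id, "br0"),
        "inherited": inherited_mac(port_macs),
    }


# Hypothetical starting state: one OS install, one NIC.
baseline = bridge_macs("a" * 32, ["52:54:00:00:00:01"])
# Event 1: the NIC is replaced, the OS install (machine-id) stays.
nic_swapped = bridge_macs("a" * 32, ["52:54:00:00:00:99"])
# Event 2: the OS is re-installed (new machine-id), the NIC stays.
reinstalled = bridge_macs("b" * 32, ["52:54:00:00:00:01"])

for event, state in [("NIC replaced", nic_swapped), ("OS re-installed", reinstalled)]:
    changed = [policy for policy in state if state[policy] != baseline[policy]]
    print(f"{event}: MAC changes under {changed}")
```

In this sketch the inherited MAC changes only when the NIC is swapped, and
the hashed MAC changes only when the machine-id changes.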

Dusty




Re: [systemd-devel] Should `MACAddressPolicy=persistent` for bridges/bonds/all-software-devices be reconsidered?

2022-05-12 Thread Dusty Mabe



On 5/12/22 12:43, Lennart Poettering wrote:
> On Mo, 09.05.22 22:37, Dusty Mabe (du...@dustymabe.com) wrote:
> 
>>> This is true. But one can just as well argue that with
>>> MACAddressPolicy=persistent the address is even more predictable. If
>>> you know the machine-id and device name, you can calculate the address
>>> in advance, even before deciding if the device will e.g. have this or
>>> that card attached.
>>
>> Regarding machine-id, isn't that unique and set on first boot?
> 
> Not necessarily. We will initialize it from the ID passed in through
> DMI if we detect execution in a VM and the ID is not set yet. This
> means cloud providers can control the machine ID a system will use
> ahead of time.
> 

OK. But in practice, how often is that used versus machine-id just being
randomly generated on first boot?

Dusty


Re: [systemd-devel] Should `MACAddressPolicy=persistent` for bridges/bonds/all-software-devices be reconsidered?

2022-05-09 Thread Dusty Mabe



On 5/9/22 13:27, Zbigniew Jędrzejewski-Szmek wrote:
> On Mon, May 09, 2022 at 03:57:21PM +0200, Lennart Poettering wrote:
>> On Mo, 09.05.22 11:23, Thomas Haller (thal...@redhat.com) wrote:
>>
>>> Hi everybody,
>>>
>>> this email is for discussing MACAddressPolicy=persistent in
>>> /data/src/systemd/network/99-default.link
>>
>> I think this would be better discussed on a new github issue, as
>> suggested here:
> 
> I suggested systemd-devel for this… It's not a question of a bug,
> but instead whether the default MACAddressPolicy makes sense. So I think
> it's better to discuss this on the mailing list.
> 
> FWIW, I still think it's a better _default_. The patch that finally
> introduced this was my patch [1], so I'm obviously biased… Some more
> considerations:
> 
> 1. this allows bridge devices to be created without attached
> interfaces, and have a stable MAC address.
> 
> 2. the idea that all interfaces are always available and always in the
> same order is something that isn't necessarily true for modern systems
> where we want to react to hardware being detected dynamically.
> If we wait until late userspace to configure networking, then yeah, all
> devices will most likely have been detected by that time. But if we want
> to bring up networking in the initrd, not all hardware will necessarily have
> been detected by the time we start configuration.
> 
> 3. one of the reasons to use bridge/bond and similar devices is that
> the aggregate device can function when some of the child devices die.
> By requiring the presence of one of the devices, we're partially
> defeating this.
> 
> [1] https://github.com/systemd/systemd/commit/6d36464065



> 
>> 1) for bridge/bond interfaces, there is a special meaning of leaving
>> the MAC address unassigned. It causes kernel to automatically set the
>> MAC address when the first port gets attached. By setting a persistent
>> MAC address, that automatic behavior is no longer possible.
>>
>> The MAC address can matter, for example, if you configure the DHCP
>> server to hand out IP addresses based on the MAC address (or based on
>> the client-id, which in turn might be based on the MAC address). If you
>> boot many machines (e.g. in data center or cloud), you might know the
>> MAC address of the machines, and thereby can also determine the
>> assigned IP addresses. With MACAddressPolicy=persistent the MAC address
>> is not predictable.
> 
> This is true. But one can just as well argue that with
> MACAddressPolicy=persistent the address is even more predictable. If
> you know the machine-id and device name, you can calculate the address
> in advance, even before deciding if the device will e.g. have this or
> that card attached.


Regarding machine-id, isn't that unique and set on first boot? So you don't
really even know your "new mac address" until your system is up and running
(and now can't get DHCP because your net is locked down). 
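
As a rough illustration of why that matters, the sketch below derives a MAC
from a machine-id and a device name. It is NOT systemd's actual calculation
(this thread doesn't spell that out): SHA-256 is an assumed stand-in hash,
and derive_mac() and the example machine-id are hypothetical. The point is
only that the result is computable in advance when the machine-id is known
in advance, and not otherwise.

```python
import hashlib
import uuid


def derive_mac(machine_id: str, ifname: str) -> str:
    """Derive a stable, locally administered unicast MAC from two inputs.

    Illustration only: SHA-256 is a stand-in, not the hash systemd uses.
    """
    digest = hashlib.sha256(f"{machine_id}:{ifname}".encode()).digest()
    octets = bytearray(digest[:6])
    octets[0] &= 0xFE  # clear the multicast bit (unicast address)
    octets[0] |= 0x02  # set the locally administered bit
    return ":".join(f"{octet:02x}" for octet in octets)


# A machine-id known ahead of time gives the same MAC on every run, so it
# could be handed to the network admin before the machine ever boots.
known_id = "0123456789abcdef0123456789abcdef"  # hypothetical value
print("known machine-id :", derive_mac(known_id, "br0"))

# A machine-id generated randomly on first boot cannot be predicted, so
# neither can the MAC derived from it.
first_boot_id = uuid.uuid4().hex
print("random machine-id:", derive_mac(first_boot_id, "br0"))
```

The first line printed is the same on every run; the second differs from
run to run, which is the "you don't know it until the system is up" case.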

For me it's more about expectations and the user being surprised, which isn't
necessarily a bug. If I'm controlling a datacenter (or even my 10s of devices
on my home network) I generally know what hardware exists in the environment
and don't want to be surprised to see activity from "unexpected hardware". I
lock down my DHCP via MAC address of the known hardware NICs in the lab/DC.

Often times hardware gets racked and configured in the datacenter environment
before it can even be powered on. You get the MAC address from the documentation
(labels, stickers, etc) and have your network admins plumb through the necessary
bits.

I understand your arguments for the new behavior, but my preference would be
to keep the old behavior. 


> 
> I'm not sure if we expose this conveniently anywhere… Would it help if
> we document how to do this calculation with a Python one-liner or some
> helper?
> 
>> 2) udev changing the MAC address causes races with naive scripts/tools.
>> For example:
>>
>>   ip monitor link & 
>>   while : ; do 
>>     ip link del xxx
>>     ip link add name xxx type dummy \
>>     && ip link set xxx addr aa:00:00:00:00:00 \
>>     && ip link show xxx | grep -q aa:00:00:00:00:00 \
>>     || break
>>   done
> 
> Again, this is a question of expectations. With MACAP=p, 'dummy' gets
> the same address every time, and maybe there's no reason to set it
> manually.
> 
> It would be great if we could have 'ip link add' wait for udev to
> process the device. Maybe even by default?
> 
> We also discussed adding a kernel command-line switch similar to
> net.ifnames= to allow this to be configured globally. Would that
> help?

Right. I opened that request here: 
https://github.com/systemd/systemd/issues/23294

I think it helps. I would want to make sure it applied in the initrd as well.

> 
> The cases where the old behaviour is relied on seem to be cases like
> the data center described above. But in that case you're creating
> local configuration anyway, so dropping in an override should be
> acceptable.

At least for bonds I'd argue pretty hard that