Re: [systemd-devel] Erroneous detection of degraded array

2017-01-30 Thread Luke Pyzowski
> Does
>   systemctl  list-dependencies  sys-devices-virtual-block-md0.device
> report anything interesting?  I get
>
> sys-devices-virtual-block-md0.device
> ● └─mdmonitor.service

Nothing interesting, the same output as you have above.



> Could you try running with systemd.log_level=debug on the kernel command line and 
> upload the journal again? We can only hope that it will not skew the timings too 
> much, but it may prove my hypothesis.

I've uploaded the full debug logs to: 
https://gist.github.com/Kryai/8273322c8a61347e2300e476c70b4d05
In around 20 reboots the error appeared only twice; with debug enabled it is 
certainly rarer, but it does still occur. As you correctly guessed, debug logging 
does affect how the race condition manifests.
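
For reference, one way to add that parameter persistently on CentOS 7 (assuming 
grubby is available) is roughly:

# grubby --update-kernel=ALL --args="systemd.log_level=debug"
# reboot

and to remove it again afterwards:

# grubby --update-kernel=ALL --remove-args="systemd.log_level"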

Reminder of key things in the log:
# cat /etc/systemd/system/mdadm-last-resort@.timer 
[Unit]
Description=Timer to wait for more drives before activating degraded array.
DefaultDependencies=no
Conflicts=sys-devices-virtual-block-%i.device

[Timer]
OnActiveSec=30
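
For completeness, the matching service that this timer starts looks approximately 
like the following on this system (quoted from memory, so the exact ExecStart path 
may differ); the Conflicts= line in both units is what ties them to the md device 
unit:

# cat /usr/lib/systemd/system/mdadm-last-resort@.service
[Unit]
Description=Activate md array even though degraded
DefaultDependencies=no
Conflicts=sys-devices-virtual-block-%i.device

[Service]
Type=oneshot
ExecStart=/usr/sbin/mdadm --run /dev/%i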



# cat /etc/systemd/system/share.mount 
[Unit]
Description=Mount /share RAID partition explicitly
Before=nfs-server.service

[Mount]
What=/dev/disk/by-uuid/2b9114be-3d5a-41d7-8d4b-e5047d223129
Where=/share
Type=ext4
Options=defaults
TimeoutSec=120

[Install]
WantedBy=multi-user.target
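
If it helps, the dependencies systemd generated for this mount can be dumped with 
something like the following (the device unit for the by-uuid path should show up 
under BindsTo/After):

# systemctl show share.mount -p BindsTo -p After -p Requires
# systemctl list-dependencies share.mount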


Again, if any more information is needed, please let me know and I'll provide it.


Many thanks,
Luke Pyzowski


Re: [systemd-devel] Erroneous detection of degraded array

2017-01-27 Thread Luke Pyzowski
I have changed mdadm-last-resort@.timer to OnActiveSec=300 and TimeoutSec=320. This 
gives me enough time to log in to the system. During that time I can view the RAID 
and everything appears proper, yet 300 seconds later the mdadm-last-resort@.timer 
still expires with an error on /dev/md0.

Perhaps systemd is working normally, but then the question is: why is the RAID not 
being assembled properly, which is what triggers 
/usr/lib/systemd/system/mdadm-last-resort@.service?

From your suggestion, should I next move on to full udev debug logs? If this is a 
race condition, as it appears to be, can I delay the assembly of the RAID by the 
kernel? Should that delay go early in the systemd startup process, or elsewhere?
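
If full udev debugging is indeed the next step, I assume it would be something along 
the lines of adding udev.log_priority=debug next to systemd.log_level=debug on the 
kernel command line, or raising it at runtime with:

# udevadm control --log-priority=debug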

Thanks again,
Luke Pyzowski


[systemd-devel] Erroneous detection of degraded array

2017-01-26 Thread Luke Pyzowski
Hello,
I have a large RAID6 device with 24 local drives on CentOS 7.3. Randomly (around 
50% of the time) systemd will unmount my RAID device, thinking it is degraded, after 
mdadm-last-resort@.timer expires; however, the device is working normally by all 
accounts, and I can immediately mount it manually once boot completes. In the logs 
below /share is the RAID device. I can increase the timer in 
/usr/lib/systemd/system/mdadm-last-resort@.timer from 30 to 60 seconds, but the 
problem can still randomly occur.
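
Rather than editing the unit under /usr/lib directly, I believe the cleaner way to 
test a longer delay is a drop-in override, roughly:

# mkdir -p /etc/systemd/system/mdadm-last-resort@.timer.d
# cat > /etc/systemd/system/mdadm-last-resort@.timer.d/override.conf <<'EOF'
[Timer]
# clear the built-in value first, since OnActiveSec= accumulates
OnActiveSec=
OnActiveSec=60
EOF
# systemctl daemon-reload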

systemd[1]: Created slice system-mdadm\x2dlast\x2dresort.slice.
systemd[1]: Starting system-mdadm\x2dlast\x2dresort.slice.
systemd[1]: Starting Activate md array even though degraded...
systemd[1]: Stopped target Local File Systems.
systemd[1]: Stopping Local File Systems.
systemd[1]: Unmounting /share...
systemd[1]: Stopped (with error) /dev/md0.
systemd[1]: Started Activate md array even though degraded.
systemd[1]: Unmounted /share.

When the system boots normally the following is in the logs:
systemd[1]: Started Timer to wait for more drives before activating degraded 
array..
systemd[1]: Starting Timer to wait for more drives before activating degraded 
array..
...
systemd[1]: Stopped Timer to wait for more drives before activating degraded 
array..
systemd[1]: Stopping Timer to wait for more drives before activating degraded 
array..

The above occurs within the same second according to the timestamps, and the timer 
ends before any local filesystems are mounted; systemd properly detects that the 
RAID is valid and everything continues normally. The other RAID device, a RAID1 of 
2 disks containing swap and /, has never exhibited this failure.

My question is: under what conditions does systemd detect the RAID6 as degraded? It 
seems to be a race condition somewhere, but I am not sure what configuration, if 
any, should be modified. If needed I can provide more verbose logs; just let me know 
if they might be useful.
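
For example, I could capture something like the following right after one of the 
failing boots, if that would help:

# journalctl -b -o short-precise > boot-journal.txt
# cat /proc/mdstat
# mdadm --detail /dev/md0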

Many thanks,
Luke Pyzowski
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel