Hi,

For the last week I have been working on a fix for ataraid(4) related to the bug reported in kern/43986 and, partially, to kern/59130 (which hits the same issue due to a different bug).
Taylor has a good analysis and explanation of the issue at https://mail-index.netbsd.org/netbsd-bugs/2024/03/26/msg082202.html, which I also noticed while testing an ATA RAID setup on VIA controllers.

For short context, ataraid(4) configures the RAID array and all disk information in the vendor-specific ata_raid_<vendor>.c components, for each connected drive, using information from the RAID config blocks. The problem is that the code assumes that all initially configured RAID drives exist and are attached. However, if one drive is missing (removed, faulty, or skipped because of a code bug), configuration of that drive is skipped, leading to a failure at https://nxr.netbsd.org/xref/src/sys/dev/ata/ld_ataraid.c#229 because adi->adi_dev is NULL (or, more specifically, in device_xname(adi->adi_dev) at https://nxr.netbsd.org/xref/src/sys/dev/ata/ata_raid_subr.c#71).

After some discussion I ended up with the following patch: https://netbsd.org/~andvar/ata_raid_fix.diff. It checks that the disk status is online (adi->adi_status has the ADI_S_ONLINE status flag set); otherwise it treats the disk as if vnode_find returned NULL. That would handle the described situation a bit more gracefully and avoid the crash.

Initially it looked OK and I successfully tested the patch on VIA machines (by setting up RAID, removing one of the RAID components before next reboot, also deleting ). However, after analyzing the various RAID components I noticed it may not work for Promise and Intel RAIDs. Promise (https://nxr.netbsd.org/xref/src/sys/dev/ata/ata_raid_promise.c#194) and Intel (https://nxr.netbsd.org/xref/src/sys/dev/ata/ata_raid_intel.c#278) RAIDs may set the ADI_S_SPARE status, which removes the online flag. I don't have these controllers, but I assume my patch would incorrectly treat these drives as missing. Other RAID types use only ADI_S_ONLINE | ADI_S_ASSIGNED, so the patch would work for them.
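
To make the spare-drive concern concrete, here is a small userland sketch, not the kernel code. The flag names are the ones from ata_raidvar.h, but the bit values below are placeholders assumed for illustration, and mark_spare() and disk_is_online() are hypothetical helpers that only mimic the behaviour described above (a spare has its online bit cleared and its spare bit set; the first patch tests the online bit only).

```c
#include <stdbool.h>

/* Placeholder bit values, assumed for illustration only;
 * the real definitions live in ata_raidvar.h. */
#define ADI_S_ONLINE	0x01
#define ADI_S_ASSIGNED	0x02
#define ADI_S_SPARE	0x04

/* Hypothetical helper: mark a disk as spare the way the text
 * describes it (online flag removed, spare flag added). */
static int
mark_spare(int status)
{
	status &= ~ADI_S_ONLINE;
	status |= ADI_S_SPARE;
	return status;
}

/* Hypothetical helper: the online-only test used by the first patch. */
static bool
disk_is_online(int status)
{
	return (status & ADI_S_ONLINE) != 0;
}
```

Under these assumptions, disk_is_online(ADI_S_ONLINE | ADI_S_ASSIGNED) is true, while disk_is_online(mark_spare(ADI_S_ONLINE | ADI_S_ASSIGNED)) is false, so a present spare drive would be handled the same way as a missing one.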
Given that three statuses are defined for adi_status (https://nxr.netbsd.org/xref/src/sys/dev/ata/ata_raidvar.h#75), I probably need to check whether any of the flags is set (adi_status & (ADI_S_ONLINE | ADI_S_ASSIGNED | ADI_S_SPARE)) instead (https://netbsd.org/~andvar/ata_raid_fix2.diff). Another alternative is to check whether adi->adi_dev is NULL, as Taylor proposed in his analysis thread.

Please advise whether either of these two proposals would be good enough to solve the issue, or whether something else should be considered.

Thank you.

Regards,
Andrius Varanavicius