I rebooted to upgrade to kernel 3.4.1. I had accidentally left the
combination of uvesafb, nouveau KMS and nvidia-drivers enabled, which
caused my display to go blank after rebooting. I was not able to SSH
into the machine, so I used the magic SysRq REISUB sequence to reboot
into my previous kernel. When it booted into the previous kernel
(3.3.5), I saw a whole bunch of I/O error messages scrolling by, for
every disk in my RAID array. I had never seen these errors before, and
I hoped it was just some module confusion from booting a different
kernel. I was able to boot into my root filesystem, but the RAID did
not assemble. After blacklisting nouveau and rebooting into 3.4.1, the
I/O errors were gone, but mdraid failed with this message:
* Starting up RAID devices ...
* mdadm main: failed to get exclusive lock on mapfile
mdadm: /dev/md2 is already in use.
mdadm: /dev/md1 is already in use.
[ !! ]
Oh no! Heart beating quickly... terabytes of data... Googling these
messages turned up nothing useful.
My mdadm.conf has not changed at all, and no physical disks have been
added or removed in over a year. I have of course updated hundreds of
packages since my last reboot, including mdadm itself.
/proc/mdstat shows that not all of the member disks/partitions are
being detected:
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md1 : inactive sdb1[0](S)
      1048575868 blocks super 1.1

md2 : inactive sdf2[5](S)
      904938415 blocks super 1.1

unused devices: <none>
Those arrays normally include all of the disks sdb through sdf,
partitions 1 and 2 from each disk.
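For reference, that member layout can be enumerated with a short shell loop. This sketch only prints the device names and touches no devices; it simply spells out the same set that the pattern /dev/sd[bcdef][12] matches:

```shell
# Print the ten member partitions of the two arrays:
# disks sdb..sdf, partitions 1 and 2 on each.
for d in b c d e f; do
    for p in 1 2; do
        printf '/dev/sd%s%s\n' "$d" "$p"
    done
done
```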
My mdadm.conf has always had only two ARRAY lines (for /dev/md1 and
/dev/md2), each with the UUID of its array. Previously the member
disks were automatically detected and assembled when I booted and
started mdadm. Running mdadm --query --examine on the partitions
showed that they still contained valid RAID metadata, so I felt
confident in trying to reassemble the arrays.
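One sanity check before reassembling is that every member partition reports the same Array UUID in its superblock. This is a sketch of that check: the real input would come from mdadm --examine (shown in the comment), but here a captured-style sample with made-up UUIDs stands in so the filtering logic is self-contained:

```shell
# On the real system the input would come from (as root):
#   for p in /dev/sd[bcdef][12]; do mdadm --examine "$p"; done
# Sample stand-in output (UUIDs are illustrative, not real):
examine_output='     Array UUID : 49fdd0ef:aaaa1111:bbbb2222:cccc3333
     Array UUID : 49fdd0ef:aaaa1111:bbbb2222:cccc3333
     Array UUID : 49fdd0ef:aaaa1111:bbbb2222:cccc3333'

# Count the distinct UUIDs; exactly one means the members agree.
distinct=$(printf '%s\n' "$examine_output" \
    | awk -F' : ' '/Array UUID/ {print $2}' | sort -u | wc -l)
if [ "$distinct" -eq 1 ]; then
    echo "all members report the same Array UUID"
else
    echo "WARNING: $distinct different Array UUIDs found"
fi
```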
To fix, I did:
/etc/init.d/mdraid stop
to stop the arrays (I could also have run mdadm -Ss, which is what the
stop script does).
Then I edited mdadm.conf and added a device line:
DEVICE /dev/sd[bcdef][12]
So now I am telling it specifically where to look. I then restarted mdraid:
/etc/init.d/mdraid start
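For completeness, the resulting mdadm.conf looks something like the fragment below. The UUIDs here are placeholders, not the real ones from my arrays:

```
# Restrict scanning to the RAID member partitions:
DEVICE /dev/sd[bcdef][12]

# The two arrays, identified by UUID (placeholders shown):
ARRAY /dev/md1 UUID=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx
ARRAY /dev/md2 UUID=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx
```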
et voilà! My RAID was back and functioning. I don't know whether this
was the result of a change in kernel or mdadm behavior, or whether my
REISUB simply left the RAID in a strange state.