I forgot to mention that I'm running svn_111b. I let the memory test run for 9 hours with 4 tests and not a single error came back. I have strong doubts that it is the memory, but I think it is still remotely possible. Originally, when I got the motherboard, it came flashed with BIOS version 1611 and it was completely unstable; it wouldn't run any OS for more than 10 seconds! Apparently this board was extremely picky about memory before the BIOS was fixed and the 2001 version was released. Before flashing the BIOS to the new version, I ran memtest for about 10 minutes and it did not find any problems with the memory. So it seems conceivable (although I don't know how) that there could be some instability in the memory/motherboard/north bridge that memtest cannot detect. The only way to know convincingly would be to reflash the BIOS back to version 1611 and run memtest until it either encounters errors or several tests complete successfully.
I reseated the SATA cables and it made no difference. I also swapped the 500 watt power supply for a 750 watt unit, which has 12 SATA power connectors. This is a much better arrangement than before, because the old supply only had 3 SATA power connectors. Rescrubbing the drive still produced a handful of checksum errors. I'm not surprised, but the power supply is not the issue.

I also tried a small test to see whether switching from "IDE enhanced" to "IDE standard" in the BIOS would clear up the checksum errors. That made no difference either. I did find out that you cannot simply switch from IDE to AHCI in the BIOS once you have a ZFS RAID in place. It certainly found all of the drives when I tried, but Solaris reported the whole RAID as faulted. Maybe the apparent geometry of the disks changed?? So now I'm making a secondary backup of what is on the RAID under IDE mode, and I'll have to reconstruct the RAID by moving off the data, destroying the zpool, switching modes to AHCI, re-creating the zpool, and finally copying the data back. This won't be completed until tomorrow because of the amount of data I'm dealing with.

Before I nuke the current RAID, I do plan on swapping out the memory for some higher-end memory that I have in another machine. I'll rescrub the array and see if that makes any difference. I'm not very hopeful at this point. If switching modes from IDE to AHCI doesn't work either, there aren't too many things left to test:

1) I can reflash the BIOS and investigate memtest's ability to detect some weird motherboard/memory instability.

2) I can start applying a slight over-voltage to the south bridge (which has the Intel controller on it) and north bridge and see if that makes any difference. The BIOS has more than just a few options for tweaking things; the motherboard has "overclocker's heaven" written all over it. (And no, I'm not overclocking or over-/under-volting anything.)
Over-volting really only makes sense if memtest cannot detect the problem I was having with BIOS version 1611. Over-volting did make some difference before I switched to BIOS version 2001.

3) I can RMA the motherboard and get a new one.

4) I could buy memory that is on the QVL, but I find it highly unlikely to make any difference. The higher-end memory that I'll be trying is very similar to the corresponding entry on the QVL -- Kingston HyperX DDR2 1066MHz KHX8500D2K4/4G instead of KHX8500D2K2/2G. I actually think they might be the same, except the 4G kit (four 1GB modules instead of two) isn't made any more. However, it should be noted that this memory also did not work with the 1611 BIOS version, and I have no idea whether the QVL was updated after Asus released the 2001 BIOS.

5) I could try a different motherboard with a different chipset -- like the Intel X48 instead of the Intel P45. I'm not sure that would get me much, so I'm not very optimistic or enthusiastic about this option.

6) If 1-3 don't work, I guess I'll have to give up completely on the onboard SATA controllers and use something different -- maybe an LSI controller? If I get this far down the rabbit hole, I'm going to start suspecting buggy Solaris drivers.

As a side note, I found that if you start a scrub and then reboot before it completes, Solaris seems to restart the scrub. Not that it matters much, but I didn't see an obvious way of canceling a scrub.

Even once I finally get to the bottom of what is causing these checksum errors, I'm a little unsure whether I should trust the data being copied off of the current setup. zpool has probably *fixed* 40 checksum errors by this point, but it has not reported any unrecoverable errors. So I can either trust that it is working correctly and none of my data is bad, and use this data once things are working for real, OR I can use an older copy and lose a few days' worth of work.
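For the record, the move-off/rebuild plan would look roughly like this. This is only a sketch: the pool name `tank`, the backup pool `backup`, the snapshot name, and the `c7t*d0` device names are all placeholders for my actual setup, and the device names will almost certainly change once the controller is in AHCI mode.

```shell
# Take a recursive snapshot so the copy is point-in-time consistent
zfs snapshot -r tank@migrate

# Replicate the whole pool to the backup pool
zfs send -R tank@migrate | zfs recv -Fd backup

# Destroy the pool, then reboot and flip the BIOS from IDE to AHCI
zpool destroy tank

# Re-create the pool under AHCI (device names are placeholders)
zpool create tank raidz c7t0d0 c7t1d0 c7t2d0 c7t3d0

# Replicate the data back
zfs send -R backup@migrate | zfs recv -Fd tank
```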
Sadly, this older backup was created from the current ZFS RAID setup about a day after I had things working, so it probably also had a handful of checksum errors auto-magically fixed. The question is: should I REALLY trust that ZFS is working and dealing with this mystery problem?

-- This message posted from opensolaris.org
_______________________________________________
storage-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/storage-discuss
