I forgot to mention that I'm running snv_111b.

I let the memory test run for 9 hours with 4 tests and not a single error came 
back. I'm having strong doubts that it is the memory, but I think it is still 
remotely possible. Originally when I got the motherboard it  came flashed with 
version 1611 and it was completely unstable. It wouldn't run any OS for more 
than 10 seconds! Apparently this board was extremely picky about memory before 
the BIOS was fixed and the 2001 version was released. Before flashing the BIOS 
to the new version, I ran memtest for about 10 minutes and it did not find any 
problems with the memory. So it seems conceivable (although I don't know how) 
that there could be some instability with the memory/motherboard/north bridge 
and memtest cannot detect it. The only way to know for sure would be to 
reflash the BIOS back to version 1611 and run memtest until it either 
encounters errors or completes several successful passes.

I reseated the SATA cables and it made no difference. I swapped the 500 watt 
power supply for a 750 watt power supply, which has 12 SATA power connectors. 
This is a much better arrangement than the setup I had before, because the 
other power supply only had 3 SATA power connectors. Rescrubbing the array 
resulted in a handful of checksum errors again. I'm not surprised, but at 
least it rules out the power supply.

I also tried a small test to see if switching from "IDE enhanced" to "IDE 
standard" in the BIOS would clear up the checksum errors. That made no 
difference either.

I found out that you cannot simply switch from IDE to AHCI in the BIOS once 
you have a ZFS RAID in place. It certainly found all of the drives when I 
tried, but Solaris says the whole RAID is faulted. Maybe the geometry of the 
disks changed? So now I'm making a secondary backup of what is on the RAID 
under IDE mode. Then I'll have to reconstruct the RAID by moving off the 
data, destroying the zpool, switching modes to AHCI, re-creating the zpool, 
and finally copying the data back. This won't be completed until tomorrow 
because of the amount of data involved.
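The rebuild steps above can be sketched with zfs send/receive. This is an untested outline, not my exact procedure: the pool name "tank", the snapshot name, the backup path, and the device names are all hypothetical, and you should verify the backup before destroying anything.

```shell
# Take a recursive snapshot and stream the whole pool to a backup
# location on a separate disk with enough free space.
zfs snapshot -r tank@pre-ahci
zfs send -R tank@pre-ahci > /backup/tank.zsend

# Destroy the pool, reboot, and flip the BIOS from IDE to AHCI.
zpool destroy tank

# Back in Solaris, re-create the pool. Device names usually change
# under AHCI, so check them first with format(1M).
zpool create tank raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0

# Restore the replication stream into the new pool.
zfs receive -dF tank < /backup/tank.zsend
```

The -R/-d/-F combination follows the full-pool backup/restore pattern; a second pool and a pipe between send and receive would work just as well as a file.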

Before I nuke the current RAID, I do plan on swapping out the memory for some 
other higher end memory that I have in another machine. I'll rescrub the array 
and see if that makes any difference. I'm not very hopeful at this point.

If switching modes from IDE to AHCI doesn't work either, there aren't many 
things left to test.
1) I can reflash the BIOS back to version 1611 and investigate memtest's 
ability to detect some weird motherboard/memory instability. 
2) I can start applying a slight overvoltage to the south bridge (which has 
the Intel controller on it) and the north bridge and see if that makes any 
difference. The BIOS has more than just a few options for tweaking things; 
the motherboard has overclocker's heaven written all over it. And no, I'm not 
overclocking or over/under-volting anything. Overvolting really only makes 
sense if memtest cannot detect the problem I was having with BIOS version 
1611. Overvolting did make some difference before I switched to BIOS version 
2001.
3) I can RMA the motherboard and get a new one.
4) I could buy memory that is on the QVL, but I find it highly unlikely that 
this would make any difference. The other higher-end memory that I'll be 
trying is very similar to the corresponding entry on the QVL--Kingston HyperX 
DDR2 1066MHz KHX8500D2K4/4G instead of KHX8500D2K2/2G. I actually think they 
might be the same, except the 4G kit isn't made anymore; it was a kit of four 
1GB modules instead of two 1GB modules. However, it should be noted that this 
memory also did not work with the 1611 BIOS version. I have no idea if they 
updated the QVL after Asus released the 2001 BIOS version.
5) I could also try a different motherboard with a different chipset--say, 
the Intel X48 instead of the Intel P45. I'm not sure that would get me much, 
so I'm not very optimistic or enthusiastic about this option.
6) If 1-3 don't work, I guess I'll have to give up completely on the onboard 
SATA controllers and use something different--maybe an LSI controller? If I 
get this far down the rabbit hole, I'm going to start suspecting buggy 
Solaris drivers.

As a side note, I found that if you start a scrub and then reboot before it 
completes, Solaris seems to restart the scrub. Not that it matters much, but I 
didn't see an obvious way of canceling the scrub.
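For what it's worth, zpool does have a stop flag for scrubs (the pool name "tank" here is hypothetical):

```shell
# Cancel an in-progress scrub.
zpool scrub -s tank

# Confirm it stopped; the "scrub:" line shows progress or completion.
zpool status tank
```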

Even once I finally get to the bottom of what is causing these checksum 
errors, I am a little unsure whether I should trust the data being copied off 
of the current setup. zpool has probably *fixed* 40 checksum errors by this 
point, but it has not reported any unrecoverable errors. So I can either 
trust that it is working correctly and none of my data is bad, and use this 
data once things are working for real, OR I can use an older copy and lose a 
few days' worth of work. Sadly, that older backup was created from the 
current ZFS RAID setup about a day after I had things working, so it probably 
also had a handful of checksum errors auto-magically fixed. 
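One way to sanity-check the "fixed" errors is to ask zpool which errors, if any, were permanent (again, "tank" is a hypothetical pool name):

```shell
# -v lists any files with permanent (unrecoverable) damage; a trailing
# "errors: No known data errors" means every checksum error was
# repaired from the pool's redundancy.
zpool status -v tank
```

As long as that line stays clean, every bad read was caught by the checksum and reconstructed from a good copy, which is the core of the argument for trusting the data.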

The question is: should I REALLY trust that ZFS is working and dealing with 
this mystery problem?
-- 
This message posted from opensolaris.org
_______________________________________________
storage-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/storage-discuss
