Thank you, Marian, for the IDE to AHCI tip. Doing an export and then an import 
did the trick. And it was much quicker than having to copy everything off and 
then back.

The zfs scrub using AHCI went no better than using the IDE mode. It resulted in 
about the same number of checksum errors. It did seem to be a tiny bit faster, 
but not by all that much--1hr 20min versus 1hr 30min.

I next tried switching out the memory and ran a test using the DDR2 Kingston 
HyperX memory. This memory has been heavily tested in another machine 
overclocked and running two instances of Folding @ Home in a virtualized 
environment and 1 GPU client for over a year with no problem. The memory also 
passed a 24 hour prime95 torture test in this Folding @ Home machine. The 
memory is rated for 1066MHz, but I forced it to 800MHz for the test in my media 
server. 

Finally, this made a difference! Instead of roughly 10-12 checksum errors, it 
only came back with one error. So I double-checked all of the settings and 
changed several things related to the memory that were marked auto. I also 
forced the memory to 2.2V, since that is what it needs at 1066MHz. I have no 
idea what the default was doing or whether 2.2V is even required at 800MHz.

After manually setting the timings and voltage, I re-ran the scrub. No 
dice. It still came back with one zfs checksum error.

I now believe that the motherboard is bad or it is ridiculously picky about 
memory. The question is whether it is some controller instability or whether it 
is some other problem. Reluctantly I blew away my Solaris drive and installed 
Windows on it. (I don't have a spare drive.) I decided to REALLY test the 
memory subsystem by running Prime95. I have had very good luck using this as a 
memory tester during my overclocking foray last year. It easily catches memory 
problems that escape memtest86.

Sure enough, within 15 minutes the first Prime95 self test failed. I stopped it 
and tried again, and it failed again after about the same amount of time. So it 
seems that it is *almost* stable. I have never seen a failure once a Prime95 
run has made it past 50 minutes. Anyway, I plan on RMA'ing the motherboard and 
going ahead and buying 8GB of memory from Asus's qualified vendor list for the 
new motherboard.

You can be sure I will be testing this new motherboard and memory with Prime95 
before I even think about getting OpenSolaris up and going again.

I guess the moral of the story is that you REALLY need to extensively test your 
memory/motherboard/CPU even if you don't plan on overclocking anything. This 
mistake has cost me a lot of time and aggravation. Honestly, coupled with the 
other problems I have had with this motherboard (one of them a misunderstanding 
on my part), this is easily the worst motherboard I have ever owned!

Now that I have had quite a number of checksum errors caught by zfs, I 
really wish that zfs would give a bit more information on how it automatically 
fixed the problem. Since this is a memory error, the data and checksum on disk 
were presumably fine, and zfs only occasionally saw a mismatch after reading 
them into memory. I hope it tries to re-read the data and checksum from disk 
into a different memory location. If the re-read passes, the error should still 
be reported, but with an indication that nothing was written back to the disk. 
This could help someone infer whether the problem was memory/controller related 
or something with the disk.
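
To make the wish above a bit more concrete, here is a rough Python sketch of 
the reporting logic I have in mind. This is NOT how zfs is implemented; every 
name here (classify_read, make_flaky_reader, the OK/TRANSIENT/PERSISTENT 
labels) is made up for illustration, and SHA-256 simply stands in for whatever 
checksum the pool actually uses.

```python
import hashlib
from typing import Callable

# Hypothetical outcome labels, invented for this sketch.
OK = "ok"                  # first read verified
TRANSIENT = "transient"    # re-read verified: suspect memory/controller
PERSISTENT = "persistent"  # re-read failed too: suspect the data on disk


def checksum(data: bytes) -> bytes:
    """Stand-in checksum (SHA-256); the real pool checksum may differ."""
    return hashlib.sha256(data).digest()


def classify_read(read_block: Callable[[], bytes], expected: bytes) -> str:
    """Read a block, and on a checksum mismatch read it once more.

    Each call to read_block() returns a fresh bytes object, which plays the
    role of "a different memory location" in this toy model.
    """
    first = read_block()
    if checksum(first) == expected:
        return OK
    second = read_block()
    if checksum(second) == expected:
        # The on-disk copy verified on retry, so nothing needs rewriting.
        return TRANSIENT
    return PERSISTENT


def make_flaky_reader(good: bytes, bad_reads: int) -> Callable[[], bytes]:
    """Simulated device: the first `bad_reads` reads come back with one
    flipped bit, as a marginal memory subsystem might produce."""
    remaining = [bad_reads]

    def read_block() -> bytes:
        if remaining[0] > 0:
            remaining[0] -= 1
            corrupted = bytearray(good)
            corrupted[0] ^= 0x01  # flip the lowest bit of the first byte
            return bytes(corrupted)
        return good

    return read_block
```

With a simulated reader that corrupts only the first read, classify_read 
returns TRANSIENT; if both reads come back corrupted it returns PERSISTENT. 
That distinction in the error report is all I am really asking for.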
-- 
This message posted from opensolaris.org
_______________________________________________
storage-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/storage-discuss
