Re: [zfs-discuss] Please help! ZFS crash & burn in SXCE b70!
If SunVTS is installed, you may also want to consider running ramtest:

SunVTS 7.0:
  cd /usr/sunvts/bin/sparcv9   (or bin/64)
  ./ramtest -xo pass=2

HTH,
Marion

This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Please help! ZFS crash & burn in SXCE b70!
> If you don't test your RAM, how are you sure you have no problems (unless
> you exclusively use ECC memory)?

I usually keep an eagle eye on my personal systems. If something appears to be wrong, I usually spend considerable time diagnosing it. It goes as far as me running zpool status every time I think I've heard suspicious disk activity (like screeching, which usually ends up being some neighbor with a flex, and stuff like that :)

Professional systems naturally use ECC.

-mg
Re: [zfs-discuss] Please help! ZFS crash & burn in SXCE b70!
Pete Bentley wrote:
> Mario Goebbels wrote:
>> Heh, the last ever RAM problems I had was a broken 1MB memory stick on
>> that wannabe 486 from Cyrix like over a decade ago. And I never test my
>> machines for broken sticks :)
>
> If you don't test your RAM, how are you sure you have no problems (unless you
> exclusively use ECC memory)?

Even if you use ECC :-) though the probability that ECC will show an error is much better than simple parity or nothing.

WARNING: PC vendors are very cost sensitive. In most cases, they will not offer ECC. Try going to Fry's and asking for ECC memory, they will laugh at you (that is, if they even know what ECC is).

> For example, a friend recently built a new zfs home fileserver which appeared
> to work fine but a zpool scrub of a large raidz pool after copying lots of
> files into it would consistently return one or two errors. That turned out
> to be marginal RAM, showed up by a long memtest86 run. Swapped the RAM and
> the problem went away.
>
> So RAM problems may not manifest themselves very obviously without some kind
> of checksumming technology (either a zfs pool or ECC on the memory itself). I
> have often wondered how much of Windows' poor reputation for stability is
> actually due to uncorrected RAM errors on cheapo PCs.

A Microsoft paper says that memory-induced failures are now in the top-10 list of common failures. Microsoft is trying to create change, but since ECC DIMMs will always cost more than non-ECC DIMMs, the market has not shown any interest.

http://www.eetimes.com/news/latest/showArticle.jhtml?articleID=199601761

 -- richard
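To illustrate the remark about ECC versus simple parity, here is a minimal Python sketch (not taken from the thread; the byte value and the flipped bit positions are arbitrary assumptions). A single parity bit notices one flipped bit but is blind to two flips in the same word, which is one reason ECC memory, which typically corrects single-bit errors and detects double-bit errors, is the stronger safeguard.

```python
def parity(byte: int) -> int:
    """Even-parity bit for an 8-bit value: 1 if the number of set bits is odd."""
    return bin(byte).count("1") % 2

# Hypothetical stored byte and the parity bit recorded when it was written.
original = 0b10110010
stored_parity = parity(original)

# Simulate memory faults by flipping bits (positions chosen arbitrarily).
one_flip = original ^ 0b00000100    # one bit flipped
two_flips = original ^ 0b00010100   # two bits flipped

# A fault is detected only if the recomputed parity differs from the stored one.
one_flip_detected = parity(one_flip) != stored_parity
two_flips_detected = parity(two_flips) != stored_parity

print("single-bit error detected:", one_flip_detected)
print("double-bit error detected:", two_flips_detected)
```

The second case goes unnoticed because any even number of flips leaves the parity unchanged.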
Re: [zfs-discuss] Please help! ZFS crash & burn in SXCE b70!
Mario Goebbels wrote:
> Heh, the last ever RAM problems I had was a broken 1MB memory stick on
> that wannabe 486 from Cyrix like over a decade ago. And I never test my
> machines for broken sticks :)

If you don't test your RAM, how are you sure you have no problems (unless you exclusively use ECC memory)?

For example, a friend recently built a new zfs home fileserver which appeared to work fine but a zpool scrub of a large raidz pool after copying lots of files into it would consistently return one or two errors. That turned out to be marginal RAM, showed up by a long memtest86 run. Swapped the RAM and the problem went away.

So RAM problems may not manifest themselves very obviously without some kind of checksumming technology (either a zfs pool or ECC on the memory itself). I have often wondered how much of Windows' poor reputation for stability is actually due to uncorrected RAM errors on cheapo PCs.

Pete.
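The point about checksumming can be sketched in a few lines of Python. This is only an illustration of the idea, not how ZFS is implemented: SHA-256 from the standard library is used for convenience (ZFS supports several checksum algorithms), and the helper names and the block contents are hypothetical. A digest is recorded when a block is written; a single bit flipped afterwards, for example by marginal RAM, makes verification fail even though the data still looks plausible.

```python
import hashlib

def store(data: bytes):
    """Return the block together with a digest computed at write time."""
    return data, hashlib.sha256(data).digest()

def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted (simulated fault)."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)

def verify(data: bytes, digest: bytes) -> bool:
    """Recompute the digest and compare it with the stored one."""
    return hashlib.sha256(data).digest() == digest

# Hypothetical block contents; the flipped bit position is arbitrary.
block, digest = store(b"important file contents" * 100)
damaged = flip_bit(block, 1234)

print("intact block verifies: ", verify(block, digest))
print("damaged block verifies:", verify(damaged, digest))
```

Without the stored digest, nothing distinguishes the damaged block from the intact one, which is why such errors can go unnoticed on systems with neither checksumming nor ECC.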
Re: [zfs-discuss] Please help! ZFS crash & burn in SXCE b70!
more history (geezing) below...

Frank Leers wrote:
> MC wrote:
>>> Richard, thanks for the pointer to the tests in '/usr/sunvts', as this
>>> is the first I have heard of them. They look quite comprehensive.
>>> I will give them a trial when I have some free time.
>>> Thanks
>>> Nigel Smith
>>>
>>> pmemtest     - Physical Memory Test
>>> ramtest      - Memory DIMMs (RAM) Test
>>> vmemtest     - Virtual Memory Test
>>> cddvdtest    - Optical Disk Drive Test
>>> cputest      - CPUtest
>>> disktest     - Disk and Floppy Drives Test
>>> dtlbtest     - Data Translation Look-aside Buffer Test
>>> fputest      - Floating Point Unit Test
>>> l1dcachetest - Level 1 Data Cache Test
>>> l2sramtest   - Level 2 Cache Test
>>> netlbtest    - Net Loop Back Test
>>> nettest      - Network Hardware Test
>>> serialtest   - Serial Port Test
>>> tapetest     - Tape Drive Test
>>> usbtest      - USB Device Test
>>> systest      - System Test
>>> iobustest    - Test for the IO interconnects and the Components on the
>>>                IObus on high end Machines
>>
>> That is apparently one of those crazy hidden features in OpenSolaris that I
>> think Indiana should expose :)
>
> VTS has been around for many years, although may have been more widely
> deployed on SPARC hardware. VTS is Sun Services' tool of choice when
> 'validating' hardware (V_alidation T_est S_uite). Manufacturing also
> use the tool suite extensively to burn in hardware on their floor before
> shipping.

IIRC, SunVTS came from SunDiag (wow! http://docs.sun.com/app/docs/doc/801-6627 :-) which I first saw delivered on 1/2" tape :-)

SunVTS is still actively developed, and we (actually, my sibling group) are very interested in any bugs or RFEs. Test developers are always looking to improve test coverage. The more we can catch in development or the factory, the less you should see in the field.

 -- richard
Re: [zfs-discuss] Please help! ZFS crash & burn in SXCE b70!
> Yes, I'm not surprised. I thought it would be a RAM problem.
> I always recommend a 'memtest' on any new hardware.
> Murphy's law predicts that you only have RAM problems
> on PC's that you don't test!

Heh, the last ever RAM problems I had was a broken 1MB memory stick on that wannabe 486 from Cyrix like over a decade ago. And I never test my machines for broken sticks :)

-mg
Re: [zfs-discuss] Please help! ZFS crash & burn in SXCE b70!
MC wrote:
>> Richard, thanks for the pointer to the tests in '/usr/sunvts', as this
>> is the first I have heard of them. They look quite comprehensive.
>> I will give them a trial when I have some free time.
>> Thanks
>> Nigel Smith
>>
>> pmemtest     - Physical Memory Test
>> ramtest      - Memory DIMMs (RAM) Test
>> vmemtest     - Virtual Memory Test
>> cddvdtest    - Optical Disk Drive Test
>> cputest      - CPUtest
>> disktest     - Disk and Floppy Drives Test
>> dtlbtest     - Data Translation Look-aside Buffer Test
>> fputest      - Floating Point Unit Test
>> l1dcachetest - Level 1 Data Cache Test
>> l2sramtest   - Level 2 Cache Test
>> netlbtest    - Net Loop Back Test
>> nettest      - Network Hardware Test
>> serialtest   - Serial Port Test
>> tapetest     - Tape Drive Test
>> usbtest      - USB Device Test
>> systest      - System Test
>> iobustest    - Test for the IO interconnects and the Components on the
>>                IObus on high end Machines
>
> That is apparently one of those crazy hidden features in OpenSolaris that I
> think Indiana should expose :)

VTS has been around for many years, although may have been more widely deployed on SPARC hardware. VTS is Sun Services' tool of choice when 'validating' hardware (V_alidation T_est S_uite). Manufacturing also use the tool suite extensively to burn in hardware on their floor before shipping.
Re: [zfs-discuss] Please help! ZFS crash & burn in SXCE b70!
> Richard, thanks for the pointer to the tests in '/usr/sunvts', as this
> is the first I have heard of them. They look quite comprehensive.
> I will give them a trial when I have some free time.
> Thanks
> Nigel Smith
>
> pmemtest     - Physical Memory Test
> ramtest      - Memory DIMMs (RAM) Test
> vmemtest     - Virtual Memory Test
> cddvdtest    - Optical Disk Drive Test
> cputest      - CPUtest
> disktest     - Disk and Floppy Drives Test
> dtlbtest     - Data Translation Look-aside Buffer Test
> fputest      - Floating Point Unit Test
> l1dcachetest - Level 1 Data Cache Test
> l2sramtest   - Level 2 Cache Test
> netlbtest    - Net Loop Back Test
> nettest      - Network Hardware Test
> serialtest   - Serial Port Test
> tapetest     - Tape Drive Test
> usbtest      - USB Device Test
> systest      - System Test
> iobustest    - Test for the IO interconnects and the Components on the
>                IObus on high end Machines

That is apparently one of those crazy hidden features in OpenSolaris that I think Indiana should expose :)
Re: [zfs-discuss] Please help! ZFS crash & burn in SXCE b70!
Richard, thanks for the pointer to the tests in '/usr/sunvts', as this is the first I have heard of them. They look quite comprehensive. I will give them a trial when I have some free time.
Thanks
Nigel Smith

pmemtest     - Physical Memory Test
ramtest      - Memory DIMMs (RAM) Test
vmemtest     - Virtual Memory Test
cddvdtest    - Optical Disk Drive Test
cputest      - CPUtest
disktest     - Disk and Floppy Drives Test
dtlbtest     - Data Translation Look-aside Buffer Test
fputest      - Floating Point Unit Test
l1dcachetest - Level 1 Data Cache Test
l2sramtest   - Level 2 Cache Test
netlbtest    - Net Loop Back Test
nettest      - Network Hardware Test
serialtest   - Serial Port Test
tapetest     - Tape Drive Test
usbtest      - USB Device Test
systest      - System Test
iobustest    - Test for the IO interconnects and the Components on the
               IObus on high end Machines
Re: [zfs-discuss] Please help! ZFS crash & burn in SXCE b70!
Yes, I'm not surprised. I thought it would be a RAM problem. I always recommend a 'memtest' on any new hardware. Murphy's law predicts that you only have RAM problems on PCs that you don't test!

Regards
Nigel Smith
Re: [zfs-discuss] Please help! ZFS crash & burn in SXCE b70!
Ok, well, I fired up Memtest and had a failure on the first run. I've run it twice more and have yet to get it through a full run. Memory problem it is. :(

Sorry to bother everyone. Thanks for the help.
Re: [zfs-discuss] Please help! ZFS crash & burn in SXCE b70!
On Fri, 31 Aug 2007, Zeke wrote:
>> Are you sure your hardware is working without problems?
>> I would first check the RAM with memtest86+
>> http://www.memtest.org/
>
> I'll give this a shot tonight when I get home. I believe that Ubuntu liveCDs
> have a memtest boot option on them, if not I've got a Memtest disc somewhere.
> I'll run it at least 24h and let you know how it goes.

http://www.ultimatebootcd.com/ has memtest86.

IMHO you have some serious hardware issue(s) with this system. OpenSolaris tends to push the underlying system hardware pretty hard. I've seen systems fail to install (Open)Solaris while they installed and ran other OSes just fine.

In one case, where a system failed to load Solaris, I advised the OP to remove the CPU fan from the CPU cooler and visually inspect whether there was a layer of dust restricting airflow over the heatsink [1]. This turned out to be the issue, and after removing all the crap from the heatsink, he was able to load Solaris just fine.

A similar issue, when the CPU fan is 2+ years old, is that the fan bearings are foobarred and the fan slows down when the heatsink starts to warm up. In this case, when you pop the side cover off, everything appears to be working just fine. Ten minutes later, *after* you've replaced the covers, the fan slows down to almost nothing and your system starts to "mis-behave". Recommendation: replace the CPU cooler fan assembly if it's 2 years old or older.

PS: For a long time, the AMD factory coolers were completely unreliable. And the very thin spacing between the heatsink fins nicely facilitated the capture and buildup of a layer of dust. I always recommend replacement of older AMD factory coolers with Zalman (www.zalmanusa.com) parts. Email me offlist if you want specific part recommendations. On older systems I recommend the Zalman passive copper heatsink (CNPS6000-Cu) in conjunction with the (fan bracket) FB123 with one or more 92mm (Zalman) fans.

[1] you *must* remove the fan to do the inspection. You can't see the thin layer of crap with the fan in place.

>> How many megabytes of RAM do you have on this PC?
>> Can you get any other operating system, like Ubuntu
>> to work ok on this hardware?
>
> It's got 1GB of RAM, and Solaris is the first OS I've installed on this
> particular system. I ran an Ubuntu LiveCD and did notice some instability
> while attempting to install some extra packages (was trying to get a JDK
> installed to run the Solaris driver tool) though, so maybe it is the RAM.
>
>> I think it would be useful to know which chipset and
>> hence driver you are using to connect the sata drives.
>> I would guess it's the AHCI driver.
>> See this link to see how I answered this question for
>> my system:
>> http://mail.opensolaris.org/pipermail/zfs-discuss/2007-May/040562.html
>
> Again, I'll take a look at this when I get home. I strongly suspect that it
> is indeed the AHCI driver.

Regards,

Al Hopper
Logical Approach Inc, Plano, TX. [EMAIL PROTECTED]
Voice: 972.379.2133 Fax: 972.379.2134 Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
Re: [zfs-discuss] Please help! ZFS crash & burn in SXCE b70!
> Do you by chance mean Silicon Image with that "SI"?
> Their chipsets aren't exactly known for reliability and
> data safety. Just pointing that out as potential source
> of problems.

Indeed it is; however I'm not using that controller for anything at the moment, it's simply in the system with nothing hooked up to it. I had read somewhere that ZFS performance is better if you have disks in a RaidZ spread across controllers, and there are 2 on the motherboard already, so I was hoping to use 3 controllers for 3 disks. However, I noticed a message in the Solaris installer that there was an unrecognized controller which for some reason I thought was that card, so I didn't use it.
Re: [zfs-discuss] Please help! ZFS crash & burn in SXCE b70!
> Are you sure your hardware is working without problems?
> I would first check the RAM with memtest86+
> http://www.memtest.org/

I'll give this a shot tonight when I get home. I believe that Ubuntu liveCDs have a memtest boot option on them, if not I've got a Memtest disc somewhere. I'll run it at least 24h and let you know how it goes.

> How many megabytes of RAM do you have on this PC?
> Can you get any other operating system, like Ubuntu
> to work ok on this hardware?

It's got 1GB of RAM, and Solaris is the first OS I've installed on this particular system. I ran an Ubuntu LiveCD and did notice some instability while attempting to install some extra packages (was trying to get a JDK installed to run the Solaris driver tool) though, so maybe it is the RAM.

> I think it would be useful to know which chipset and
> hence driver you are using to connect the sata drives.
> I would guess it's the AHCI driver.
> See this link to see how I answered this question for
> my system:
> http://mail.opensolaris.org/pipermail/zfs-discuss/2007-May/040562.html

Again, I'll take a look at this when I get home. I strongly suspect that it is indeed the AHCI driver.
Re: [zfs-discuss] Please help! ZFS crash & burn in SXCE b70!
> This could be a hardware problem. Bad power supply for
> the load? Try removing 2 of the large disks.

I should have mentioned in my first post that this is the very first thing I thought of, and that I've already swapped the power supply with one I know can handle the load (an Enermax 400W which has powered a significantly more power-hungry system than this just fine). If all else fails (I'm going to run Memtest for 24h first, as suggested below), I'll remove two of the drives and try again.
Re: [zfs-discuss] Please help! ZFS crash & burn in SXCE b70!
> I added a SI 2 port PCI SATA controller, but it seemed to not be recognized
> so I am not using it.

Do you by chance mean Silicon Image with that "SI"? Their chipsets aren't exactly known for reliability and data safety. Just pointing that out as potential source of problems.

-mg
Re: [zfs-discuss] Please help! ZFS crash & burn in SXCE b70!
Nigel Smith wrote:
> Are you sure your hardware is working without problems?
> I would first check the RAM with memtest86+
> http://www.memtest.org/

Also, SunVTS should be in /usr/sunvts and includes memory and disk tests (plus others). This is the test suite we (Sun) use in manufacturing. Take care when using destructive tests :-)

 -- richard
Re: [zfs-discuss] Please help! ZFS crash & burn in SXCE b70!
Are you sure your hardware is working without problems?
I would first check the RAM with memtest86+
http://www.memtest.org/

How many megabytes of RAM do you have on this PC?
Can you get any other operating system, like Ubuntu, to work ok on this hardware?

I think it would be useful to know which chipset and hence driver you are using to connect the sata drives. I would guess it's the AHCI driver.
See this link to see how I answered this question for my system:
http://mail.opensolaris.org/pipermail/zfs-discuss/2007-May/040562.html

Regards
Nigel Smith
Re: [zfs-discuss] Please help! ZFS crash & burn in SXCE b70!
> The problems I'm experiencing are as follows:
> ZFS creates the storage pool just fine, sees no errors on the drives, and
> seems to work great...right up until I attempt to put data on the drives.
> After only a few moments of transfer, things start to go wrong. The system
> doesn't power off, it just beeps 4-5 times. The X session dies and the
> monitor turns off (doesn't drop back to a console). All network access dies.
> It seems that the system panics (is it called something else in
> solaris-land?). The HD access light stays on (though I can hear no drives
> doing anything strenuous), and the CD light blinks. This has happened two or
> three times, every time I've tried to start copying data to the ZFS pool.
> I've been transferring over the network, via SCP or NFS.

This could be a hardware problem. Bad power supply for the load? Try removing 2 of the large disks.