Re: [zfs-discuss] Please help! ZFS crash & burn in SXCE b70!

2007-11-01 Thread sunvts
If SunVTS is installed you may also
want to consider running ramtest:

SunVTS 7.0:
 cd /usr/sunvts/bin/sparcv9 ( or bin/64 )
 ./ramtest -xo pass=2
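
If you are not sure which of the two directories exists on your box, something like the following may save a little typing. It is only a sketch: find_ramtest and SUNVTS_BIN are names made up here, and the only SunVTS-specific parts (the install path and the -xo pass=2 options) are the ones quoted above.

```shell
#!/bin/sh
# Sketch: locate ramtest under a SunVTS install and run it.
# SUNVTS_BIN is an assumed variable name; the default path is the one above.
SUNVTS_BIN="${SUNVTS_BIN:-/usr/sunvts/bin}"

# Print the path of the first executable ramtest found in the two
# directories mentioned above (sparcv9 or 64); return non-zero if none.
find_ramtest() {
    for dir in "$1/sparcv9" "$1/64"; do
        if [ -x "$dir/ramtest" ]; then
            echo "$dir/ramtest"
            return 0
        fi
    done
    return 1
}

if ramtest_path=$(find_ramtest "$SUNVTS_BIN"); then
    # Same options as above: -xo pass=2
    (cd "$(dirname "$ramtest_path")" && ./ramtest -xo pass=2)
else
    echo "ramtest not found under $SUNVTS_BIN; is SunVTS installed?" >&2
fi
```

If SunVTS is not installed, the script just prints a note and does nothing else.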

HTH,
Marion
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Please help! ZFS crash & burn in SXCE b70!

2007-09-04 Thread me
> If you don't test your RAM, how are you sure you have no problems (unless
> you exclusively use ECC memory)?

I usually keep an eagle eye on my personal systems. If something appears
to be wrong, I usually spend considerable time diagnosing it. That goes as
far as running zpool status every time I think I've heard suspicious
disk activity (like screeching, which usually turns out to be some neighbor
with a flex, and stuff like that :)

Professional systems naturally use ECC.

-mg



Re: [zfs-discuss] Please help! ZFS crash & burn in SXCE b70!

2007-09-04 Thread Richard Elling
Pete Bentley wrote:
> Mario Goebbels wrote:
>> Heh, the last ever RAM problems I had was a broken 1MB memory stick on
>> that wannabe 486 from Cyrix like over a decade ago. And I never test my
>> machines for broken sticks :)
> 
> If you don't test your RAM, how are you sure you have no problems (unless you 
> exclusively use ECC memory)?

Even if you use ECC :-) though the probability that ECC will detect an error is
much better than with simple parity or nothing.

WARNING: PC vendors are very cost sensitive.  In most cases, they will not offer
ECC.  Try going to Fry's and asking for ECC memory; they will laugh at you (that
is, if they even know what ECC is).

> For example, a friend recently built a new zfs home fileserver which appeared 
> to work fine but a zpool scrub of a large raidz pool after copying lots of 
> files into it would consistently return one or two errors.  That turned out 
> to be marginal RAM, showed up by a long memtest86 run. Swapped the RAM and 
> the problem went away.
> 
> So RAM problems may not manifest themselves very obviously without some kind 
> of checksumming technology (either a zfs pool or ECC on the memory itself). I 
> have often wondered how much of Windows' poor reputation for stability is 
> actually due to uncorrected RAM errors on cheapo PCs.

A Microsoft paper says that memory-induced failures are now in the top-10 list
of common failures.  Microsoft is trying to create change, but since ECC DIMMs
will always cost more than non-ECC DIMMs, the market has not shown any interest.
http://www.eetimes.com/news/latest/showArticle.jhtml?articleID=199601761

  -- richard


Re: [zfs-discuss] Please help! ZFS crash & burn in SXCE b70!

2007-09-04 Thread Pete Bentley
Mario Goebbels wrote:
> Heh, the last ever RAM problems I had was a broken 1MB memory stick on
> that wannabe 486 from Cyrix like over a decade ago. And I never test my
> machines for broken sticks :)

If you don't test your RAM, how are you sure you have no problems (unless you 
exclusively use ECC memory)?

For example, a friend recently built a new zfs home fileserver which appeared 
to work fine, but a zpool scrub of a large raidz pool after copying lots of 
files into it would consistently return one or two errors.  That turned out 
to be marginal RAM, which showed up in a long memtest86 run.  Swapping the RAM 
made the problem go away.
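
The scrub-then-check routine can be scripted so that flaky hardware gets noticed sooner. A minimal sketch, assuming a pool called tank (a hypothetical name) and the usual NAME/STATE/READ/WRITE/CKSUM column layout in zpool status output; the helper name nonzero_error_counts is made up for illustration.

```shell
#!/bin/sh
# Print each pool/vdev/device whose READ, WRITE or CKSUM counter is
# non-zero, reading `zpool status` output on stdin.
nonzero_error_counts() {
    awk '
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }
        in_config && NF == 0          { in_config = 0 }
        in_config && NF >= 5 && ($3 + $4 + $5) > 0 {
            print $1, "read=" $3, "write=" $4, "cksum=" $5
        }
    '
}

POOL="${POOL:-tank}"   # hypothetical pool name, override as needed

# Start a scrub first with: zpool scrub "$POOL"
# (it runs in the background), then check once it has finished.
if command -v zpool >/dev/null 2>&1; then
    zpool status -v "$POOL" | nonzero_error_counts
fi
```

No output means every counter in the config table is zero. Repeated checksum errors on otherwise healthy disks are the kind of symptom described above.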

So RAM problems may not manifest themselves very obviously without some kind 
of checksumming technology (either a zfs pool or ECC on the memory itself). I 
have often wondered how much of Windows' poor reputation for stability is 
actually due to uncorrected RAM errors on cheapo PCs.

Pete.


Re: [zfs-discuss] Please help! ZFS crash & burn in SXCE b70!

2007-09-01 Thread Richard Elling
more history (geezing) below...

Frank Leers wrote:
> MC wrote:
>>> Richard, thanks for the pointer to the tests in
>>> '/usr/sunvts', as this
>>> is the first I have heard of them. They look quite
>>> comprehensive.
>>> I will give them a trial when I have some free time.
>>> Thanks
>>> Nigel Smith
>>>
>>> pmemtest      - Physical Memory Test
>>> ramtest       - Memory DIMMs (RAM) Test
>>> vmemtest      - Virtual Memory Test
>>> cddvdtest     - Optical Disk Drive Test
>>> cputest       - CPUtest
>>> disktest      - Disk and Floppy Drives Test
>>> dtlbtest      - Data Translation Look-aside Buffer Test
>>> fputest       - Floating Point Unit Test
>>> l1dcachetest  - Level 1 Data Cache Test
>>> l2sramtest    - Level 2 Cache Test
>>> netlbtest     - Net Loop Back Test
>>> nettest       - Network Hardware Test
>>> serialtest    - Serial Port Test
>>> tapetest      - Tape Drive Test
>>> usbtest       - USB Device Test
>>> systest       - System Test
>>> iobustest     - Test for the IO interconnects and the Components
>>>                 on the IObus on high end Machines
>>
>> That is apparently one of those crazy hidden features in OpenSolaris that I 
>> think Indiana should expose :)
>>  
> 
> VTS has been around for many years, although may have been more widely 
> deployed on SPARC hardware.  VTS is Sun Services' tool of choice when 
> 'validating' hardware (V_alidation T_est S_uite).  Manufacturing also 
> use the tool suite extensively to burn in hardware on their floor before 
> shipping.

IIRC, SunVTS came from SunDiag (wow! http://docs.sun.com/app/docs/doc/801-6627 :-)
which I first saw delivered on 1/2" tape :-)

SunVTS is still actively developed, and we (actually, my sibling group) are very
interested in any bugs or RFEs.  Test developers are always looking to improve
test coverage.  The more we can catch in development or the factory, the less
you should see in the field.
  -- richard


Re: [zfs-discuss] Please help! ZFS crash & burn in SXCE b70!

2007-09-01 Thread Mario Goebbels
> Yes, I'm not surprised. I thought it would be a RAM problem.
> I always recommend a 'memtest' on any new hardware.
> Murphy's law predicts that you only have RAM problems
> on PC's that you don't test!

Heh, the last RAM problem I ever had was a broken 1MB memory stick on
that wannabe 486 from Cyrix, over a decade ago. And I never test my
machines for broken sticks :)

-mg





Re: [zfs-discuss] Please help! ZFS crash & burn in SXCE b70!

2007-08-31 Thread Frank Leers
MC wrote:
>> Richard, thanks for the pointer to the tests in
>> '/usr/sunvts', as this
>> is the first I have heard of them. They look quite
>> comprehensive.
>> I will give them a trial when I have some free time.
>> Thanks
>> Nigel Smith
>>
>> pmemtest      - Physical Memory Test
>> ramtest       - Memory DIMMs (RAM) Test
>> vmemtest      - Virtual Memory Test
>> cddvdtest     - Optical Disk Drive Test
>> cputest       - CPUtest
>> disktest      - Disk and Floppy Drives Test
>> dtlbtest      - Data Translation Look-aside Buffer Test
>> fputest       - Floating Point Unit Test
>> l1dcachetest  - Level 1 Data Cache Test
>> l2sramtest    - Level 2 Cache Test
>> netlbtest     - Net Loop Back Test
>> nettest       - Network Hardware Test
>> serialtest    - Serial Port Test
>> tapetest      - Tape Drive Test
>> usbtest       - USB Device Test
>> systest       - System Test
>> iobustest     - Test for the IO interconnects and the Components
>>                 on the IObus on high end Machines
> 
> 
> That is apparently one of those crazy hidden features in OpenSolaris that I 
> think Indiana should expose :)
>  

VTS has been around for many years, although it may have been more widely 
deployed on SPARC hardware.  VTS is Sun Services' tool of choice when 
'validating' hardware (V_alidation T_est S_uite).  Manufacturing also 
uses the tool suite extensively to burn in hardware on their floor before 
shipping.


Re: [zfs-discuss] Please help! ZFS crash & burn in SXCE b70!

2007-08-31 Thread MC
> Richard, thanks for the pointer to the tests in
> '/usr/sunvts', as this
> is the first I have heard of them. They look quite
> comprehensive.
> I will give them a trial when I have some free time.
> Thanks
> Nigel Smith
> 
> pmemtest      - Physical Memory Test
> ramtest       - Memory DIMMs (RAM) Test
> vmemtest      - Virtual Memory Test
> cddvdtest     - Optical Disk Drive Test
> cputest       - CPUtest
> disktest      - Disk and Floppy Drives Test
> dtlbtest      - Data Translation Look-aside Buffer Test
> fputest       - Floating Point Unit Test
> l1dcachetest  - Level 1 Data Cache Test
> l2sramtest    - Level 2 Cache Test
> netlbtest     - Net Loop Back Test
> nettest       - Network Hardware Test
> serialtest    - Serial Port Test
> tapetest      - Tape Drive Test
> usbtest       - USB Device Test
> systest       - System Test
> iobustest     - Test for the IO interconnects and the Components
>                 on the IObus on high end Machines


That is apparently one of those crazy hidden features in OpenSolaris that I 
think Indiana should expose :)
 
 


Re: [zfs-discuss] Please help! ZFS crash & burn in SXCE b70!

2007-08-31 Thread Nigel Smith
Richard, thanks for the pointer to the tests in '/usr/sunvts', as this
is the first I have heard of them. They look quite comprehensive.
I will give them a trial when I have some free time.
Thanks
Nigel Smith

pmemtest      - Physical Memory Test
ramtest       - Memory DIMMs (RAM) Test
vmemtest      - Virtual Memory Test
cddvdtest     - Optical Disk Drive Test
cputest       - CPUtest
disktest      - Disk and Floppy Drives Test
dtlbtest      - Data Translation Look-aside Buffer Test
fputest       - Floating Point Unit Test
l1dcachetest  - Level 1 Data Cache Test
l2sramtest    - Level 2 Cache Test
netlbtest     - Net Loop Back Test
nettest       - Network Hardware Test
serialtest    - Serial Port Test
tapetest      - Tape Drive Test
usbtest       - USB Device Test
systest       - System Test
iobustest     - Test for the IO interconnects and the Components
                on the IObus on high end Machines
 
 


Re: [zfs-discuss] Please help! ZFS crash & burn in SXCE b70!

2007-08-31 Thread Nigel Smith
Yes, I'm not surprised. I thought it would be a RAM problem.
I always recommend a 'memtest' on any new hardware.
Murphy's law predicts that you only have RAM problems
on PC's that you don't test!
Regards
Nigel Smith
 
 


Re: [zfs-discuss] Please help! ZFS crash & burn in SXCE b70!

2007-08-31 Thread Zeke
Ok, well I fired up Memtest and had a failure on the first run.  I've run it 
twice more and have yet to manage to get it through a full run.  Memory problem 
it is.  :(

Sorry to bother everyone.  Thanks for the help.
 
 


Re: [zfs-discuss] Please help! ZFS crash & burn in SXCE b70!

2007-08-31 Thread Al Hopper
On Fri, 31 Aug 2007, Zeke wrote:

>> Are you sure your hardware is working without
>> problems?
>> I would first check the RAM with memtest86+
>> http://www.memtest.org/
>
> I'll give this a shot tonight when I get home.  I believe that Ubuntu liveCDs 
> have a memtest boot option on them, if not I've got a Memtest disc somewhere. 
>  I'll run it at least 24h and let you know how it goes.

http://www.ultimatebootcd.com/ has memtest86.

IMHO you have some serious hardware issue(s) with this system. 
OpenSolaris tends to push the underlying system hardware pretty hard. 
I've seen systems fail to install (Open)Solaris - while they installed 
and ran other OSes just fine.  In one case, where a system failed to 
load Solaris, I advised the OP to remove the CPU fan from the CPU 
cooler and visually inspect if there was a layer of dust restricting 
airflow over the heatsink [1].  This turned out to be the issue - and 
after removing all the crap from the heatsink, he was able to load 
Solaris just fine.  A similar issue, when the CPU fan is 2+ years old, 
is that the fan bearings are foobarred and the fan slows down when the 
heatsink starts to warm up.  In this case, when you pop the side cover 
off, everything appears to be working just fine.  Ten minutes later, 
*after* you've replaced the covers, the fan slows down to almost 
nothing and your system starts to "misbehave".  Recommendation: 
replace the CPU cooler fan assembly if it's 2 years or older.

PS: For a long time, the AMD factory coolers were completely 
unreliable.  And the very thin spacing between the heatsink fins 
nicely facilitated the capture and buildup of a layer of dust.  I 
always recommend replacement of older AMD factory coolers with Zalman 
(www.zalmanusa.com) parts.  Email me offlist if you want specific part 
recommendations.  On older systems I recommend the Zalman passive 
copper heatsink (CNPS6000-Cu) in conjunction with the (fan bracket) 
FB123 with one or more 92mm (Zalman) fans.

[1] you *must* remove the fan to do the inspection.  You can't see the 
thin layer of crap with the fan in place.

>> How many megabytes of RAM do you have on this PC?
>> Can you get any other operating system, like Ubuntu
>> to work ok on this hardware?
>
> It's got 1GB of RAM, and Solaris is the first OS I've installed on this 
> particular system.  I ran an Ubuntu LiveCD and did notice some instability 
> while attempting to install some extra packages (was trying to get a JDK 
> installed to run the Solaris driver tool) though, so maybe it is the RAM.
>
>> I think it would be useful to know which chipset and
>> hence driver you are using to connect the sata drives.
>> I would guess it's the AHCI driver.
>> See this link to see how I answered this question for
>> my system:
>> http://mail.opensolaris.org/pipermail/zfs-discuss/2007-May/040562.html
>
> Again, I'll take a look at this when I get home.  I strongly suspect that it 
> is indeed the AHCI driver.
>

Regards,

Al Hopper Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/


Re: [zfs-discuss] Please help! ZFS crash & burn in SXCE b70!

2007-08-31 Thread Zeke
> Do you by chance mean Silicon Image with that "SI"?
> Their chipsets aren't exactly known for reliability and
> data safety.  Just pointing that out as potential source
> of problems.

Indeed it is; however I'm not using that controller for anything at the moment, 
it's simply in the system with nothing hooked up to it.  I had read somewhere 
that ZFS performance is better if you have disks in a RaidZ spread across 
controllers, and there are 2 on the motherboard already, so I was hoping to use 
3 controllers for 3 disks.  However, I noticed a message in the Solaris 
installer that there was an unrecognized controller which for some reason I 
thought was that card, so I didn't use it.
 
 


Re: [zfs-discuss] Please help! ZFS crash & burn in SXCE b70!

2007-08-31 Thread Zeke
> Are you sure your hardware is working without
> problems?
> I would first check the RAM with memtest86+
> http://www.memtest.org/

I'll give this a shot tonight when I get home.  I believe that Ubuntu liveCDs 
have a memtest boot option on them, if not I've got a Memtest disc somewhere.  
I'll run it at least 24h and let you know how it goes.

> How many megabytes of RAM do you have on this PC?
> Can you get any other operating system, like Ubuntu
> to work ok on this hardware?

It's got 1GB of RAM, and Solaris is the first OS I've installed on this 
particular system.  I ran an Ubuntu LiveCD and did notice some instability 
while attempting to install some extra packages (was trying to get a JDK 
installed to run the Solaris driver tool) though, so maybe it is the RAM.

> I think it would be useful to know which chipset and
> hence driver you are using to connect the sata drives.
> I would guess it's the AHCI driver.
> See this link to see how I answered this question for
> my system:
> http://mail.opensolaris.org/pipermail/zfs-discuss/2007-May/040562.html

Again, I'll take a look at this when I get home.  I strongly suspect that it is 
indeed the AHCI driver.
 
 


Re: [zfs-discuss] Please help! ZFS crash & burn in SXCE b70!

2007-08-31 Thread Zeke
> This could be a hardware problem. Bad power supply for
> the load? Try removing 2 of the large disks.

I should have mentioned in my first post that this is the very first thing I 
thought, and that I've already swapped the power supply with one I know can 
handle the load (an Enermax 400W which has powered a significantly more 
power-hungry system than this just fine).

If all else fails (going to run Memtest for 24h as suggested below first), I'll 
remove two of the drives and try again.
 
 


Re: [zfs-discuss] Please help! ZFS crash & burn in SXCE b70!

2007-08-31 Thread Mario Goebbels
> I added a SI 2 port PCI SATA controller, but it seemed to not be recognized 
> so I am not using it.

Do you by chance mean Silicon Image with that "SI"? Their chipsets
aren't exactly known for reliability and data safety. Just pointing that
out as potential source of problems.

-mg





Re: [zfs-discuss] Please help! ZFS crash & burn in SXCE b70!

2007-08-30 Thread Richard Elling
Nigel Smith wrote:
> Are you sure your hardware is working without problems?
> I would first check the RAM with memtest86+
> http://www.memtest.org/

Also, SunVTS should be in /usr/sunvts and includes memory and disk
tests (plus others).  This is the test suite we (Sun) use in manufacturing.
Take care when using destructive tests :-)
  -- richard


Re: [zfs-discuss] Please help! ZFS crash & burn in SXCE b70!

2007-08-30 Thread Nigel Smith
Are you sure your hardware is working without problems?
I would first check the RAM with memtest86+
http://www.memtest.org/

How many megabytes of RAM do you have on this PC?
Can you get any other operating system, like Ubuntu to work ok on
this hardware?

I think it would be useful to know which chipset and hence driver you
are using to connect the sata drives. I would guess it's the AHCI driver.
See this link to see how I answered this question for my system:
http://mail.opensolaris.org/pipermail/zfs-discuss/2007-May/040562.html
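
If you just want a quick look at which driver is bound to the controller, prtconf -D prints the driver name next to each device node. A rough sketch follows; the nodes_for_driver helper is a made-up name, and it assumes the "(driver name: ...)" form that prtconf -D prints on the builds I have seen.

```shell
#!/bin/sh
# List device nodes bound to a given driver, reading `prtconf -D`
# output on stdin. Assumes lines of the form:
#   <node>, instance #N (driver name: <driver>)
nodes_for_driver() {
    grep "(driver name: $1)"
}

if command -v prtconf >/dev/null 2>&1; then
    prtconf -D | nodes_for_driver ahci || echo "no device nodes bound to ahci"
fi
```

If nothing is bound to ahci, the controller is probably running in a legacy/IDE mode or through a different driver, and that is worth knowing before blaming ZFS.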
Regards
Nigel Smith
 
 


Re: [zfs-discuss] Please help! ZFS crash & burn in SXCE b70!

2007-08-30 Thread Mattias Pantzare
> The problems I'm experiencing are as follows:
> ZFS creates the storage pool just fine, sees no errors on the drives, and 
> seems to work great...right up until I attempt to put data on the drives.  
> After only a few moments of transfer, things start to go wrong.  The system 
> doesn't power off, it just beeps 4-5 times.  The X session dies and the 
> monitor turns off (doesn't drop back to a console).  All network access dies. 
>  It seems that the system panics (is it called something else in 
> solaris-land?).  The HD access light stays on (though I can hear no drives 
> doing anything strenuous), and the CD light blinks.  This has happened two or 
> three times, every time I've tried to start copying data to the ZFS pool.   
> I've been transferring over the network, via SCP or NFS.

This could be a hardware problem. Bad power supply for the load? Try
removing 2 of the large disks.