Re: [zfs-discuss] hardware going bad

2010-10-27 Thread Toby Thain
On 27/10/10 3:14 PM, Harry Putnam wrote:
 It seems my hardware is going bad, and I can't keep the OS running
 for more than a few minutes before the machine shuts down.
 
 It will run 15 or 20 minutes and then shut down. I haven't found
 the exact reason for it.
 

One thing to try is a thorough memory test (few hours).

--Toby

 Or really anything in the logs that looks like a reason.
 
 Maybe it's just that I don't know what to look for.
 
 I have been having some trouble with corrupted data in one pool but
 I thought I'd gotten it cleared up and posted to that effect in
 another thread.
 
 zpool status on all pools shows thumbs up.
 
 What are some keywords I should be looking for in /var/adm/messages?
 
 On the next shutdown (the machine is currently running) I'm going
 into the BIOS to see what the temperatures are like... but passing my
 hand around the inside of the box turns up nothing unusual.
 
 I'm not sure how to query the OS for temperatures while it's running.
 
 But if heat were the problem, something would show up in
 /var/adm/messages, right?
 



Re: [zfs-discuss] hardware going bad

2010-10-27 Thread Harry Putnam
Toby Thain t...@telegraphics.com.au writes:

 One thing to try is a thorough memory test (few hours).


It does some kind of memory test on bootup.  I recall seeing something
about high memory, and it shows all of the 3 GB installed.

I just now saw, the last time it came down, that the CPU was at 134
degrees Fahrenheit.

And that would have been after it had cooled a couple of minutes.

I don't think that is astronomical, but it may have been a good bit
higher under load.  Still, wouldn't something show up in
/var/adm/messages if that were the problem?

Isn't there a list of standard things to grep for in the logs that
would indicate various troubles?  Surely system admins have some kind
of reporting tool to get ahead of serious trouble.

I've had one problem or another with this machine for a couple of
months now, so I'm thinking of scrapping it and putting a new setup in
that roomy midtower.

Where can I find a guide to help me understand how to build up a
machine and then plug my existing discs and data into the new OS?

I don't mean the hardware part, but the part particular to OpenSolaris
and ZFS.
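
From the little I've read so far, it sounds like the ZFS side mostly
boils down to zpool export/import.  Something like this, I'm guessing
(untested on my part, and `tank' here stands in for whatever a pool is
actually named):

  # On the old install, for each data pool (not the root pool):
  zpool export tank

  # Move the discs to the new box, then on the new install:
  zpool import          # lists pools found on the attached discs
  zpool import tank     # imports the named pool

But a pointer to a proper guide would still be welcome.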



Re: [zfs-discuss] hardware going bad

2010-10-27 Thread Krunal Desai
I believe he meant a memory stress test, i.e. booting with a
memtest86+ CD and seeing if it passed. Even if the memory is OK, the
stress from that test may expose defects in the power supply or other
components.

Your CPU temperature is 56C, which is not out of line for most modern
CPUs (you didn't state what type of CPU it is). Heck, 56C would be
positively cool for a NetBurst-based Xeon.
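
If you want to watch the temperature from the running OS rather than
the BIOS, that depends on what your board and drivers expose; on some
Solaris boxes the PICL tree has temperature sensors, and if there's a
BMC then ipmitool can read them.  No promises that either will show
anything on a commodity desktop board:

  /usr/sbin/prtpicl -v -c temperature-sensor
  ipmitool sensor | grep -i temp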


-- 
--khd


Re: [zfs-discuss] hardware going bad

2010-10-27 Thread Toby Thain
On 27/10/10 4:21 PM, Krunal Desai wrote:
 I believe he meant a memory stress test, i.e. booting with a
 memtest86+ CD and seeing if it passed. 

Correct. The POST tests are not adequate.

--Toby





Re: [zfs-discuss] hardware going bad

2010-10-27 Thread Harry Putnam
Krunal Desai mov...@gmail.com writes:

 Your CPU temperature is 56C, which is not out of line for most modern
 CPUs (you didn't state what type of CPU it is). Heck, 56C would be
 positively cool for a NetBurst-based Xeon.

I'm guessing it was probably more like 60 to 62 C under load.  The
temperature I posted was after something like 5 minutes of being
totally shut down, and the case has been open for a long while
(months if not years).

That would be a bit hot for this machine, which has run cool since I
built it some 6 years ago or so.

I agree about the heat not being all that remarkable, and said so in
my prior post.  Even my old P4s at 3.2 GHz quite often run hotter
than 56 C.

The hardware is an Athlon 64 3400+ at 2.2 GHz on an Abit mobo, maxed
out at 3 GB of RAM, with 3 mirrored pools (6 discs in all) and a
total of 1.7 TB of disc space.

The machine has consistently shut down 3 times today, each time after
15-20 minutes of uptime, during a fairly hefty rsync across local
discs, with a homegrown script checking the amount of data moved
every 5 minutes via du -sh $TARGET (sketched just below).
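
For reference, the monitor is nothing fancier than roughly this:

  # Crude progress check: report the target's size every 5 minutes.
  while sleep 300; do
      du -sh $TARGET
  done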

This follows a problem I had with corrupted data in one pool being
reported over and over.  That part I think I've finally gotten
straightened out by moving the data to a different pool, running
`zfs destroy -r' on the problem filesystems, and following up with a
scrub.

I recreated the filesystems and was putting the data back when I ran
into this repeated shutdown problem.
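
Concretely, the cleanup was roughly the following (from memory, with
`tank' and `tank/problemfs' standing in for the real pool and
filesystem names):

  zfs destroy -r tank/problemfs   # drop the filesystems reporting errors
  zpool scrub tank                # re-verify everything left in the pool
  zpool status -v tank            # confirm the scrub came back clean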



Re: [zfs-discuss] hardware going bad

2010-10-27 Thread Harry Putnam
Toby Thain t...@telegraphics.com.au writes:

 On 27/10/10 4:21 PM, Krunal Desai wrote:
 I believe he meant a memory stress test, i.e. booting with a
 memtest86+ CD and seeing if it passed. 

 Correct. The POST tests are not adequate.

Got it. Thank you.  

Short of doing such a test, I already have evidence that the machine
will predictably shut down after 15 to 20 minutes of uptime.

It seems there ought to be some kind of evidence or clue in the logs,
if only I knew how to look for it.

Isn't there some semi-standard set of keywords to grep for that would
give a clue as to the problem?



Re: [zfs-discuss] hardware going bad

2010-10-27 Thread Krunal Desai
With an A64, I think a thermal shutdown would instantly halt CPU
execution, removing any chance to write a log message.  memtest will
report any errors in RAM; perhaps when the ARC expands into the upper
stick of memory it hits the bad bytes and crashes.

Can you try switching power supplies?  Removing unnecessary add-on
cards?  Swapping mobos?

-- 
--khd


Re: [zfs-discuss] hardware going bad

2010-10-27 Thread Glenn Lagasse
* Harry Putnam (rea...@newsguy.com) wrote:
 Isn't there some semi-standard set of keywords to grep for that
 would give a clue as to the problem?

If it's a thermal problem, then no, there wouldn't be.  Thermal
shutdown is handled by the BIOS, IIRC, so there isn't any notification
to the host OS.  Certainly not on commodity PC hardware, at any rate.

-- 
Glenn


Re: [zfs-discuss] hardware going bad

2010-10-27 Thread Harry Putnam
Krunal Desai mov...@gmail.com writes:

 With an A64, I think a thermal shutdown would instantly halt CPU
 execution, removing any chance to write a log message.  memtest will
 report any errors in RAM; perhaps when the ARC expands into the upper
 stick of memory it hits the bad bytes and crashes.

 Can you try switching power supplies?  Removing unnecessary add-on
 cards?  Swapping mobos?

Well, yes, I can try any of those... but I'm now wondering if it
would be a good time to just upgrade to something more modern.

See the new thread `Jumping ship'



Re: [zfs-discuss] hardware going bad

2010-10-27 Thread Bob Friesenhahn

On Wed, 27 Oct 2010, Harry Putnam wrote:

What are some keywords I should be looking for in /var/adm/messages?


Use

  /usr/sbin/fmadm faulty

to see any existing fault reports.

Use

  /usr/sbin/fmdump

to dump error reports.  Use

  /usr/sbin/fmdump -f

to do a sort of 'tail' on error reports as they arrive.
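
Depending on the build, a couple of fmdump variants may also be worth
a look (see fmdump(1M)): -e reads the error log (the raw ereports)
rather than the fault log, and -eV adds full detail.

  /usr/sbin/fmdump -e
  /usr/sbin/fmdump -eV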

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/


Re: [zfs-discuss] hardware going bad

2010-10-27 Thread Peter Jeremy
On 2010-Oct-28 04:45:16 +0800, Harry Putnam rea...@newsguy.com wrote:
 Short of doing such a test, I already have evidence that the machine
 will predictably shut down after 15 to 20 minutes of uptime.

My initial guess is thermal issues.  Check that the fans are running
correctly and there's no dust/fluff buildup on the CPU heatsink.  The
BIOS might be able to report actual fan speeds.

It's also possible that you have RAM or PSU problems, so I'd also
recommend running some sort of offline stress test (e.g. memtest86 or
the Mersenne prime tester).

 It seems there ought to be some kind of evidence or clue in the logs,
 if only I knew how to look for it.

Serious hardware problems are unlikely to be in the logs because the
system will die before it can write the error to disk and sync the
disks.  You are more likely to see a problem on the console.
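
There is no standard keyword list that I know of, but a crude sweep of
the obvious words occasionally turns something up (a guess at useful
patterns, nothing canonical):

  grep -iE 'panic|fatal|fault|error|warn|temp' /var/adm/messages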

-- 
Peter Jeremy




Re: [zfs-discuss] hardware going bad

2010-10-27 Thread Harry Putnam
Peter Jeremy peter.jer...@alcatel-lucent.com writes:

 It seems there ought to be some kind of evidence or clue in the logs,
 if only I knew how to look for it.

 Serious hardware problems are unlikely to be in the logs because the
 system will die before it can write the error to disk and sync the
 disks.  You are more likely to see a problem on the console.

Here is another clue... I posted about it earlier, but on the general
list.

I use a KVM with 4-6 machines.  It's only a 4-port KVM, so if I want
to hook my laptop into it I must disconnect something else.  For
months now that something else has been the OpenSolaris machine.

I normally access it via ssh anyway, and don't use its own console
that often.

I've discovered that if the OSOL machine is restarted for any reason
while the KVM gear is disconnected... it causes quite a hubbub.

During boot the thing starts sounding like a Paris police siren, two
tones going high-low, high-low, high-low continuously, and it will
not complete booting.

If I shut it down and reconnect the KVM gear (VGA cable and two USB
cables), then it will boot OK.

Disconnecting the KVM gear is fine as long as the machine doesn't get
rebooted in that state.

It seems really odd, and it only started within the last few months.



Re: [zfs-discuss] hardware going bad

2010-10-27 Thread Mike Gerdts
On Wed, Oct 27, 2010 at 3:41 PM, Harry Putnam rea...@newsguy.com wrote:
 I'm guessing it was probably more like 60 to 62 C under load.  The
 temperature I posted was after something like 5 minutes of being
 totally shut down, and the case has been open for a long while
 (months if not years).

What happens if the case is closed (and all PCI slot, disk, etc.
covers are in place)?  Having the case open likely changes the way
that air flows across the various components.  Also, if there is
tobacco smoke near the machine, it will cause a sticky build-up that
likely contributes to heat dissipation problems.

Perhaps this belongs somewhere other than zfs-discuss - it has nothing
to do with zfs.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/


Re: [zfs-discuss] hardware going bad

2010-10-27 Thread Harry Putnam
Mike Gerdts mger...@gmail.com writes:

[...]
Thanks for the suggestions; I have closed it all up to see if it
makes a difference.

 Perhaps this belongs somewhere other than zfs-discuss - it has nothing
 to do with zfs.

Yes... it does; it started out much closer to belonging here.

I'm not sure now how to switch to `general' and still get the kind of
excellent input I've gotten here.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss