Re: [osol-discuss] diagnosing server hang

2009-01-30 Thread Matt Harrison
Matt Harrison wrote:
 I'll go ahead with the memtest when I can and report back.

Ok, sorry it's taken a while but I've had the server run memtest for 24 
hours and it hasn't found any errors whatsoever.

Does anyone have an idea as to what I could try next?

Thanks

-- 
Matt Harrison
___
opensolaris-discuss mailing list
opensolaris-discuss@opensolaris.org


Re: [osol-discuss] diagnosing server hang

2009-01-30 Thread m...@bruningsystems.com
Hi Matt,
Maybe this should be moved to mdb-discuss...
My comments are at the end.

Matt Harrison wrote:
 Ok, sorry it's taken a while but I've had the server run memtest for 24 
 hours and it hasn't found any errors whatsoever.

 Does anyone have an idea as to what I could try next?
   
I would try booting with kmdb (or, alternatively, load kmdb once the
machine is up but before it is hung).  You can do this from a command-line
console login (no graphics) by running:

# mdb -K  -- this will load kmdb and drop into it
:c  -- this will continue

If you must have a windowing system to reproduce the hang, you can still use
kmdb, but unless you can redirect console input/output to a serial port, you
won't be able to see what you are doing.  That is OK.
You type:

# mdb -K -F  -- again, loads kmdb and drops into it.  The machine will appear hung.

Now, carefully, with no typos:

:c   -- and enter (that's colon, c, enter: 3 keystrokes).  The machine
        should come back (unless you have a typo).

Now, do whatever you are doing that causes the machine to hang.
When the machine is hung, type F1-a (that is, function key F1 and a
together).  Unless the machine is hard hung, this will put you into kmdb.
Again, you won't be able to see what is happening if your console is on a
windowing system.  Then type (again, no typos):

$<systemdump  -- this will give you a panic dump and reboot.

If the above doesn't work, you either made typos, or your machine is hard
hung.  If it is hard hung, add this line to /etc/system (of course, you'll
have to bounce the machine to get it back up to do this):

set snooping=1

Then reboot.  This sets a deadman timer.  Again, do your thing to cause
the hang.  If the scheduling clock does not run for (by default) 50 seconds,
the machine will panic, giving you a dump.
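
As a rough sketch of what the /etc/system entries could look like: snooping
is the setting named above; snoop_interval is, as far as I know, the tunable
behind the 50-second default and is given in microseconds, so treat that
second entry as an assumption to check against your build before using it.

```
* /etc/system fragment (comment lines start with an asterisk)
* Enable the deadman timer.
set snooping=1
* Assumed tunable: timeout in microseconds (50000000 = the 50 s default).
* set snoop_interval=50000000
```

Changes to /etc/system only take effect after a reboot.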

If neither of these works, it implies that the real-time clock is blocked
out.  This is highly unlikely, but can occur.

Once you have the dump, report back...
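
Once the machine has panicked and rebooted, a session along these lines
should get you to the dump.  This is only a sketch: it assumes savecore is
enabled and writing to the default /var/crash/<hostname> directory, and that
the dump was saved as unix.0/vmcore.0 (the number goes up with each saved
dump).

```
# dumpadm                      -- shows the dump device and savecore directory
# cd /var/crash/`hostname`
# ls                           -- look for unix.N and vmcore.N
# mdb unix.0 vmcore.0
> ::status                     -- panic string and dump summary
> ::msgbuf                     -- last console/kernel messages
> ::cpuinfo -v                 -- what each CPU was running
> $C                           -- stack of the panicking thread
```

The output of ::status and $C is usually the most useful thing to post first.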

max




Re: [osol-discuss] diagnosing server hang

2009-01-30 Thread Matt Harrison
m...@bruningsystems.com wrote:
 Once you have the dump, report back...

Thanks Max, I can follow that and I will try it very shortly. I am happy 
to move list, I'll subscribe to mdb-discuss and post my next reply there.

I'll try what you've given me and post back in an hour or two.

Thanks again

-- 
Matt Harrison


Re: [osol-discuss] diagnosing server hang

2009-01-30 Thread Matt Harrison
Matt Harrison wrote:
 I'll try what you've given me and post back in an hour or two.

Well... not exactly good news, but it's something.

For whatever unfathomable reason, the server has stopped hanging when we 
copy large amounts of data from it. It had been hanging consistently for at 
least 2 weeks, and nothing has changed except that I ran the mdb commands as 
directed.

Now, instead of a hang, the network stops. dladm shows that the link is 
still there, ifconfig shows it's still running but nothing is going 
either way.
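
One way to confirm that traffic has really stopped while the link still 
shows as up is to compare the interface's byte counters a few seconds apart. 
A minimal sketch: link_state is a hypothetical helper, and the interface name 
and kstat statistic name in the comment are assumptions that depend on the 
driver and build.

```shell
#!/bin/sh
# Report whether a byte counter advanced between two samples.
# usage: link_state PREV_BYTES CUR_BYTES
link_state() {
    prev=$1
    cur=$2
    if [ "$cur" -gt "$prev" ]; then
        echo "moving"
    else
        echo "stalled"
    fi
}

# Hypothetical sampling (assumes an e1000g0 interface whose driver exports
# an obytes64 kstat; adjust the names for your system):
#
#   prev=$(kstat -p e1000g:0:mac:obytes64 | awk '{print $2}')
#   sleep 10
#   cur=$(kstat -p e1000g:0:mac:obytes64 | awk '{print $2}')
#   echo "e1000g0: $(link_state "$prev" "$cur")"
```

If the counter stays flat during a copy while ifconfig still reports the 
interface as running, that points at the driver or NIC rather than the link.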

If I unplumb and re-plumb the interface it works fine again for anything 
between 50 and 500 MB, then it drops again.

Previously, the machine would just hang but now it appears to be an 
interface problem.

I've tried rebooting to clear the mdb command, but that has made no 
difference at all.

I'm getting to my wits' end with this machine. I'm thinking about RMAing 
the entire box and trying something different, which is silly because I 
know others with exactly the same hardware that works perfectly.

Incidentally, I previously had problems with the onboard Realtek NICs, 
and only managed to get round it by replacing them with a PCI-X Intel 
PRO/1000. Now this new NIC, or something related to it, is giving problems.


Grateful for any insights and sorry that this is turning into a bit of a 
circus.

--
Matt Harrison


Re: [osol-discuss] diagnosing server hang

2009-01-27 Thread m...@bruningsystems.com
Hi Matt,

Matt Harrison wrote:
 Well I'm not sure if it's good news or not. I've got the machine running 
 memtest86+ with the standard tests and so far it's done 2 passes (3 
 hours runtime) without a single error.

 I'm going to leave it running overnight but does it seem there could be 
 another problem other than memory?
   
Have you gotten anywhere yet with this hang?  Have you tried set snooping=1
in /etc/system?  How about booting with kmdb and forcing a dump?
I'm not sure why this is necessarily hardware related...

max




Re: [osol-discuss] diagnosing server hang

2009-01-27 Thread Matt Harrison
m...@bruningsystems.com wrote:
 Have you gotten anywhere yet with this hang?  Have you tried set snooping=1
 in /etc/system?  How about booting with kmdb and forcing a dump?
 I'm not sure why this is necessarily hardware related...
 
 max
 
 

Unfortunately not yet...I was forced to bring the server back up to get 
some files from it, and I haven't had a chance to take it down again yet.

I still need to make sure it can survive 24h solid of memtest, but I am 
happy to try other things.

I'm not familiar with the snooping variable, nor with kmdb, although I 
have read about it being used here and there.

I'll go ahead with the memtest when I can and report back.

Thanks

Matt


Re: [osol-discuss] diagnosing server hang

2009-01-25 Thread Matt Harrison
Matt Harrison wrote:
 thanks Ian, I'll look into this tomorrow.
 

Well I'm not sure if it's good news or not. I've got the machine running 
memtest86+ with the standard tests and so far it's done 2 passes (3 
hours runtime) without a single error.

I'm going to leave it running overnight but does it seem there could be 
another problem other than memory?

Thanks

Matt


Re: [osol-discuss] diagnosing server hang

2009-01-24 Thread Ian Collins
Matt Harrison wrote:
 Hi all,

 We've got an SXCE snv_97 filer that has been working ok for a few
 months now.

 Recently, it has started falling over when large amounts of data are
 being copied from it. There is nothing printed to the console and I
 can't find anything related in the logs. The machine doesn't respond
 via the network or the console.

I'd guess a hardware fault.  Swap out the NIC and run a thorough memory
check (memtest86 on an x86 system).

-- 
Ian.



Re: [osol-discuss] diagnosing server hang

2009-01-24 Thread Matt Harrison
Ian Collins wrote:
 I'd guess a hardware fault.  Swap out the NIC and run a thorough memory
 check (memtest86 on an x86 system).
 

Thanks for the reply. The NIC is a brand new Intel PRO/1000 server card, 
which was fitted as a replacement after talking on the zfs-discuss list (we 
were having performance issues over CIFS).

The memory certainly could be a problem. Seeing as this is a 64-bit AMD 
X2 system, should I be looking for a 64-bit version of memtest?

Thanks

-- 
Matt Harrison


Re: [osol-discuss] diagnosing server hang

2009-01-24 Thread Ian Collins
Matt Harrison wrote:
 The memory certainly could be a problem. Seeing as this is a 64bit AMD
 x2 system, should I be looking for a 64bit version of memtest?

Just grab a copy of the ultimate boot CD, every home should have one!

-- 
Ian.



Re: [osol-discuss] diagnosing server hang

2009-01-24 Thread Matt Harrison
Ian Collins wrote:
 Just grab a copy of the ultimate boot CD, every home should have one!
 

thanks Ian, I'll look into this tomorrow.

-- 
Matt Harrison