Re: [Lustre-discuss] Lustre NOT HEALTHY

2009-01-14 Thread Brock Palen
Ok thanks,

It happened again last night, sooner than normal.  I will send a new  
message with the details.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985



On Jan 13, 2009, at 11:09 PM, Cliff White wrote:

 Brock Palen wrote:
 How common is it for servers to go NOT HEALTHY?  I feel it is
 happening much more often than it should be with us.  A few times
 a month.
 It should not happen at all, in the normal case. It indicates a  
 problem.

 If this happens, we reboot the servers.  Should we do something   
 else?  Maybe it depends on what the problem was?

 Well, determining the actual problem that caused the NOT
 HEALTHY would be quite useful, yes. I would not just reboot.

 - Examine consoles of _all_ servers for any error indications
 - Examine syslogs of _all_ servers for any LustreErrors or LBUG
 - Check network and hardware health. Are your disks happy?
   Is your network dropping packets?

 Try to figure out what was happening on the cluster. Does this relate
 to a specific user workload or system load condition? Can you reproduce
 the situation? Does it happen at a specific time of day, time of month?

 If we should not be getting NOT HEALTHY that often, what information
 should I collect to report to CFS?

 The lustre-diagnostics package is a good start for general system
 config.
 Beyond that, most of what we would need is listed above.
 cliffw

 Brock Palen
 www.umich.edu/~brockp
 Center for Advanced Computing
 bro...@umich.edu
 (734)936-1985




___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] Lustre NOT HEALTHY

2009-01-13 Thread Brock Palen
How common is it for servers to go NOT HEALTHY?  I feel it is  
happening much more often than it should be with us.  A few times a  
month.

If this happens, we reboot the servers.  Should we do something  
else?  Maybe it depends on what the problem was?

If we should not be getting NOT HEALTHY that often, what information  
should I collect to report to CFS?


Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985





Re: [Lustre-discuss] Lustre NOT HEALTHY

2009-01-13 Thread Cliff White
Brock Palen wrote:
 How common is it for servers to go NOT HEALTHY?  I feel it is  
 happening much more often than it should be with us.  A few times a  
 month.
 
It should not happen at all, in the normal case. It indicates a problem.

 If this happens, we reboot the servers.  Should we do something  
 else?  Maybe it depends on what the problem was?

Well, determining the actual problem that caused the NOT HEALTHY
would be quite useful, yes. I would not just reboot.

- Examine consoles of _all_ servers for any error indications
- Examine syslogs of _all_ servers for any LustreErrors or LBUG
- Check network and hardware health. Are your disks happy?
  Is your network dropping packets?
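
The syslog check in the list above could be scripted roughly as follows. This is only a sketch: it assumes the strings "LustreError" and "LBUG" appear literally in the log, and the log path in the usage comment (/var/log/messages) is an assumption that depends on the distribution. The function name is made up for illustration.

```python
import re
import sys

# The two strings named in the checklist; matching them as plain
# substrings is an assumption about how they show up in syslog.
PATTERN = re.compile(r"LustreError|LBUG")


def find_lustre_errors(lines):
    """Return (line_number, text) pairs for lines mentioning LustreError or LBUG."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if PATTERN.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (path is an assumption): python scan_lustre_log.py /var/log/messages
    with open(sys.argv[1], errors="replace") as log:
        for number, text in find_lustre_errors(log):
            print(f"{sys.argv[1]}:{number}: {text}")
```

Run it on every server, not only the one reporting NOT HEALTHY, since the checklist asks for the logs of _all_ servers.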

Try to figure out what was happening on the cluster. Does this relate to
a specific user workload or system load condition? Can you reproduce
the situation? Does it happen at a specific time of day, time of month?
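
One way to approach the time-of-day question is to count error lines per hour. The sketch below assumes the traditional syslog timestamp prefix ("Mon DD HH:MM:SS") and the same literal "LustreError" and "LBUG" strings; lines with any other timestamp format are skipped. The function name is hypothetical.

```python
import re
import sys
from collections import Counter

ERROR = re.compile(r"LustreError|LBUG")
# Traditional syslog prefix such as "Jan 13 23:09:41 host ..."
# (this timestamp format is an assumption; adjust for your syslog setup).
STAMP = re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2}\s+(\d{2}):\d{2}:\d{2}\s")


def errors_by_hour(lines):
    """Count LustreError/LBUG lines per hour of day (0-23)."""
    counts = Counter()
    for line in lines:
        if not ERROR.search(line):
            continue
        match = STAMP.match(line)
        if match:  # skip lines whose timestamp does not parse
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (path is an assumption): python errors_by_hour.py /var/log/messages
    with open(sys.argv[1], errors="replace") as log:
        for hour, count in sorted(errors_by_hour(log).items()):
            print(f"{hour:02d}:00  {count}")
```

A cluster of hits in one hour would suggest looking at cron jobs, backups, or a recurring user workload at that time.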
 
 If we should not be getting NOT HEALTHY that often, what information  
 should I collect to report to CFS?

The lustre-diagnostics package is a good start for general system config.
Beyond that, most of what we would need is listed above.
cliffw

 
 
 Brock Palen
 www.umich.edu/~brockp
 Center for Advanced Computing
 bro...@umich.edu
 (734)936-1985
 
 
 