Re: [Lustre-discuss] Lustre NOT HEALTHY
Ok, thanks. It happened again last night, sooner than normal. I will send a new message with the details.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985

On Jan 13, 2009, at 11:09 PM, Cliff White wrote:

> Brock Palen wrote:
>> How common is it for servers to go NOT HEALTHY? I feel it is
>> happening much more often than it should be with us. A few times a
>> month.
>
> It should not happen at all, in the normal case. It indicates a problem.
>
>> If this happens, we reboot the servers. Should we do something else?
>> Maybe it depends on what the problem was?
>
> Well, determining the actual problem that caused the NOT HEALTHY
> would be quite useful, yes. I would not just reboot.
>
> - Examine consoles of _all_ servers for any error indications
> - Examine syslogs of _all_ servers for any LustreErrors or LBUG
> - Check network and hardware health. Are your disks happy? Is your
>   network dropping packets?
>
> Try to figure out what was happening on the cluster. Does this relate
> to a specific user workload or system load condition? Can you
> reproduce the situation? Does it happen at a specific time of day,
> time of month?
>
>> If we should not be getting NOT HEALTHY that often, what information
>> should I collect to report to CFS?
>
> The lustre-diagnostics package is a good start for general system
> config. Beyond that, most of what we would need is listed above.
>
> cliffw
>
>> Brock Palen
>> www.umich.edu/~brockp
>> Center for Advanced Computing
>> bro...@umich.edu
>> (734)936-1985

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
[Lustre-discuss] Lustre NOT HEALTHY
How common is it for servers to go NOT HEALTHY? I feel it is happening much more often than it should be with us. A few times a month.

If this happens, we reboot the servers. Should we do something else? Maybe it depends on what the problem was?

If we should not be getting NOT HEALTHY that often, what information should I collect to report to CFS?

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985
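One way to know exactly when a server flips to NOT HEALTHY, rather than finding out later, is to poll its health flag and log the result with a timestamp. The sketch below is only an illustration: the path /proc/fs/lustre/health_check is an assumption about where the flag lives on the server (it may differ between Lustre versions), and the function names are made up for this example. It treats any content containing the string NOT HEALTHY as unhealthy.

```python
from datetime import datetime
from pathlib import Path

# Assumed location of the health flag; adjust for your Lustre version.
HEALTH_FILE = Path("/proc/fs/lustre/health_check")


def is_healthy(text):
    """Return False if the health text contains 'NOT HEALTHY', True otherwise."""
    return "NOT HEALTHY" not in text


def read_health(path=HEALTH_FILE):
    """Return the raw health text, or None if the file cannot be read."""
    try:
        return path.read_text()
    except OSError:
        return None


if __name__ == "__main__":
    text = read_health()
    stamp = datetime.now().isoformat(timespec="seconds")
    if text is None:
        print(f"{stamp} health file not readable: {HEALTH_FILE}")
    else:
        state = "healthy" if is_healthy(text) else "NOT HEALTHY"
        # Keep the raw text: it may name the device that reported the problem.
        print(f"{stamp} {state}: {text.strip()}")
```

Run from cron on each server and appended to a file, this would give a timeline to line up against syslog and job records.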
Re: [Lustre-discuss] Lustre NOT HEALTHY
Brock Palen wrote:
> How common is it for servers to go NOT HEALTHY? I feel it is
> happening much more often than it should be with us. A few times a
> month.

It should not happen at all, in the normal case. It indicates a problem.

> If this happens, we reboot the servers. Should we do something else?
> Maybe it depends on what the problem was?

Well, determining the actual problem that caused the NOT HEALTHY would be quite useful, yes. I would not just reboot.

- Examine consoles of _all_ servers for any error indications
- Examine syslogs of _all_ servers for any LustreErrors or LBUG
- Check network and hardware health. Are your disks happy? Is your network dropping packets?

Try to figure out what was happening on the cluster. Does this relate to a specific user workload or system load condition? Can you reproduce the situation? Does it happen at a specific time of day, time of month?

> If we should not be getting NOT HEALTHY that often, what information
> should I collect to report to CFS?

The lustre-diagnostics package is a good start for general system config. Beyond that, most of what we would need is listed above.

cliffw

> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> bro...@umich.edu
> (734)936-1985
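The syslog step in the checklist above can be partly automated. What follows is a minimal sketch, not an official Lustre tool: it scans log files for the strings LustreError and LBUG and counts hits per hour of day, which may help answer the time-of-day question. The function names are hypothetical, and the hour bucketing assumes the classic "Mon DD HH:MM:SS" syslog timestamp prefix; lines in other formats are still reported but not bucketed.

```python
import re
import sys
from collections import Counter

# The two markers named in the checklist above.
PATTERN = re.compile(r"LustreError|LBUG")
# Assumed classic syslog prefix, e.g. "Jan 13 02:15:01 host ...".
TIMESTAMP = re.compile(r"\w{3}\s+\d+\s+(\d{2}):\d{2}:\d{2}")


def scan_syslog(lines):
    """Return (line_number, text) pairs for lines mentioning LustreError or LBUG."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if PATTERN.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


def hits_by_hour(hits):
    """Count hits per hour of day; lines without the assumed prefix are skipped."""
    counts = Counter()
    for _, text in hits:
        match = TIMESTAMP.match(text)
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    # Pass one or more syslog files collected from the servers as arguments.
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            found = scan_syslog(handle)
        for number, text in found:
            print(f"{path}:{number}: {text}")
        for hour, count in sorted(hits_by_hour(found).items()):
            print(f"{path}: hour {hour:02d}: {count} hit(s)")
```

Running it over the logs of every server, not only the one that reported NOT HEALTHY, keeps with the advice to examine all of them.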