Re: Fw: [Freeipmi-devel] ibmx3650 reboots after ipmi-sel is unable to get SEL record
Hey Won,

On Wed, 2009-01-28 at 19:15 -0800, Won De Erick wrote:
> > From: Al Chu
> >
> > Hey Won,
> >
> > On Tue, 2009-01-27 at 17:43 -0800, Won De Erick wrote:
> > > #bmc-watchdog -d -u 4 -p 0 -n -i 300 -l 0
> >
> > what is output from bmc-watchdog --get? You don't define a BMC action
> > (i.e. power cycle, power down, do nothing), so it depends on what the
> > default action is on that system.
> >
>
> # bmc-watchdog --get
> Timer Use: SMS/OS
> Timer: Running
> Logging: Enabled
> Timeout Action: Hard Reset
> Pre-Timeout Interrupt: None
> Pre-Timeout Interval: 1 seconds
> Timer Use BIOS FRB2 Flag: Clear
> Timer Use BIOS POST Flag: Clear
> Timer Use BIOS OS Load Flag: Clear
> Timer Use BIOS SMS/OS Flag: Set
> Timer Use BIOS OEM Flag: Clear
> Initial Countdown: 240 seconds
> Current Countdown: 201 seconds
>
> i just checked my script and my complete implementation was
>
> bmc-watchdog -d -u 4 -p 0 -n -l 0 -a 1 -i 240
>
> and not the previous one, my typo.
>
> I think the SEL could justify that proper timeout action was invoked.
>
> 1:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00
> 2:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00
> 3:OEM defined = 02 00 00 FF 00 00 00 00 20 00 00 00 00
> 4:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00
> 5:24-Jan-2009 17:09:30:Watchdog 2 Watchdog:Hard Reset
> 6:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00
>

Well, it seems this is the case then. The watchdog is what triggered the hard reset. It's hard to say why the BMC card locked-up/was busy all of the time. It probably could have been anything really.

Al

--
Albert Chu
ch...@llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory

___
Freeipmi-devel mailing list
Freeipmi-devel@gnu.org
http://lists.gnu.org/mailman/listinfo/freeipmi-devel
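
For anyone who wants to check a SEL dump for the same evidence, here is a minimal Python sketch. It is not part of FreeIPMI. The names `parse_sel_line` and `watchdog_events` are hypothetical, and the two line shapes are assumed only from the ipmi-sel output quoted in this thread, so other versions or vendors may print something different.

```python
import re

# Line shapes assumed from the output quoted in this thread:
#   "5:24-Jan-2009 17:09:30:Watchdog 2 Watchdog:Hard Reset"
#   "1:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00"
OEM_RE = re.compile(r"^(\d+):OEM defined = ((?:[0-9A-Fa-f]{2} ?)+)$")
EVENT_RE = re.compile(
    r"^(\d+):(\d{2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2}:\d{2}):(.+):(.+)$"
)


def parse_sel_line(line):
    """Return a dict describing one SEL line, or None if it is not recognized."""
    line = line.strip()
    match = OEM_RE.match(line)
    if match:
        return {
            "id": int(match.group(1)),
            "kind": "oem",
            # Raw vendor bytes; their meaning is not public, so keep them as-is.
            "data": bytes.fromhex(match.group(2)),
        }
    match = EVENT_RE.match(line)
    if match:
        return {
            "id": int(match.group(1)),
            "kind": "event",
            "time": match.group(2),
            "sensor": match.group(3),
            "event": match.group(4),
        }
    return None


def watchdog_events(lines):
    """Keep only decoded events whose sensor name starts with 'Watchdog'."""
    records = (parse_sel_line(line) for line in lines)
    return [
        r for r in records
        if r and r["kind"] == "event" and r["sensor"].startswith("Watchdog")
    ]


sample = [
    "4:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00",
    "5:24-Jan-2009 17:09:30:Watchdog 2 Watchdog:Hard Reset",
    "6:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00",
]
print([r["id"] for r in watchdog_events(sample)])  # -> [5]
```

The OEM records are kept as raw bytes on purpose: without vendor documentation there is nothing reliable to decode them into.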
Re: Fw: [Freeipmi-devel] ibmx3650 reboots after ipmi-sel is unable to get SEL record
> From: Al Chu
>
> Hey Won,
>
> On Tue, 2009-01-27 at 17:43 -0800, Won De Erick wrote:
> > #bmc-watchdog -d -u 4 -p 0 -n -i 300 -l 0
>
> what is output from bmc-watchdog --get? You don't define a BMC action
> (i.e. power cycle, power down, do nothing), so it depends on what the
> default action is on that system.
>

# bmc-watchdog --get
Timer Use: SMS/OS
Timer: Running
Logging: Enabled
Timeout Action: Hard Reset
Pre-Timeout Interrupt: None
Pre-Timeout Interval: 1 seconds
Timer Use BIOS FRB2 Flag: Clear
Timer Use BIOS POST Flag: Clear
Timer Use BIOS OS Load Flag: Clear
Timer Use BIOS SMS/OS Flag: Set
Timer Use BIOS OEM Flag: Clear
Initial Countdown: 240 seconds
Current Countdown: 201 seconds

I just checked my script, and my complete invocation was

bmc-watchdog -d -u 4 -p 0 -n -l 0 -a 1 -i 240

and not the previous one; that was my typo.

I think the SEL confirms that the proper timeout action was invoked.

1:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00
2:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00
3:OEM defined = 02 00 00 FF 00 00 00 00 20 00 00 00 00
4:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00
5:24-Jan-2009 17:09:30:Watchdog 2 Watchdog:Hard Reset
6:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00
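
A small sketch for reading the `bmc-watchdog --get` report above into a dictionary, so the timeout action and the countdown values can be checked from a script. The helper names `parse_watchdog_get` and `seconds` are hypothetical, and the "Key: value" layout is assumed from the report quoted in this message rather than from any documented format.

```python
def parse_watchdog_get(text):
    """Turn 'Key: value' lines into a dict; lines without a colon are skipped."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def seconds(value):
    """Extract the integer from a value such as '240 seconds'."""
    return int(value.split()[0])


# A few lines copied from the report in this message.
report = """\
Timer Use: SMS/OS
Timer: Running
Timeout Action: Hard Reset
Pre-Timeout Interval:1 seconds
Initial Countdown: 240 seconds
Current Countdown: 201 seconds
"""

settings = parse_watchdog_get(report)
elapsed = seconds(settings["Initial Countdown"]) - seconds(settings["Current Countdown"])
print(settings["Timeout Action"], elapsed)  # -> Hard Reset 39
```

Splitting on the first colon only means the parser copes with both "Interval:1 seconds" and "Interval: 1 seconds".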
Re: Fw: [Freeipmi-devel] ibmx3650 reboots after ipmi-sel is unable to get SEL record
Hey Won,

On Tue, 2009-01-27 at 17:43 -0800, Won De Erick wrote:
> #bmc-watchdog -d -u 4 -p 0 -n -i 300 -l 0

What is the output from bmc-watchdog --get? You don't define a BMC action (i.e. power cycle, power down, do nothing), so it depends on what the default action is on that system.

Al

--
Albert Chu
ch...@llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory
Re: Fw: [Freeipmi-devel] ibmx3650 reboots after ipmi-sel is unable to get SEL record
- Original Message
> From: Al Chu
>
> Hey Won,
>
> On Mon, 2009-01-26 at 18:53 -0800, Won De Erick wrote:
> > - Original Message
> >
> > > From: Al Chu
> > >
> > > Hey Won,
> > >
> > > On Sun, 2009-01-25 at 23:00 -0800, Won De Erick wrote:
> > > > I am forwarding this to the FreeIPMI users mailing list. Hope, I can
> > > > get hints from you all.
> > > > Thank you.
> > > >
> > > > - Forwarded Message
> > > > From: Won De Erick
> > > > To: Albert Chu
> > > > Cc: freeipmi-devel@gnu.org
> > > > Sent: Saturday, January 24, 2009 11:55:24 AM
> > > > Subject: Re: [Freeipmi-devel] ibmx3650 reboots after ipmi-sel is unable
> > > > to get SEL record
> > > >
> > > > Pls disregard previous email. I forgot to attach the file. :)
> > >
> > > Did you send me the wrong debug file? I see debug output from
> > > ipmi-sensors??
> > >
> >
> > I'm sorry, attached is the correct one.
>
> Seems that this has a successful ipmi-sel execution in it. So not much
> I can debug with :-(
>
> > > > Hi Al,
> > > >
> > > > With IBM x3650, I noticed that ipmi-sel is unable to get the SEL
> > > > record.
> > > >
> > > > # ipmi-sel --version
> > > > IPMI Sensors [ipmi-sel-0.6.10]
> > > >
> > > > # ipmi-sel > ibm3650-dsc2075-sel.txt
> > > > ipmi_cmd_get_sel_entry: BMC busy
> > > > ipmi-sel: unable to get SEL record
> > > >
> > > > After the above, the box automatically rebooted. Is this normal?
> > >
> > > I have never seen this behavior before, and I wouldn't consider it
> > > "good" in any definition. This is likely a bug in the IBM
> > > implementation. The "BMC busy" means exactly what it says, the BMC is
> > > busy and cannot respond to IPMI requests. It by itself is not a
> > > problem. For example, some other IPMI tasks are hogging resources. But
> > > you should presumably be able to reach the card eventually. Is it
> > > possible you have other IPMI things running in the background?
> > >
> >
> > bmc-watchdog (as daemon) was the only thing running in the background.
>
> This shouldn't be enough to cause enough IPMI to be *that* busy. Here's
> a thought. Perhaps the ipmi-sel logs went full, the BMC card went busy,
> and thus the bmc-watchdog couldn't perform IPMI and timed out, thus
> leading to a reboot?? Obviously, it depends on how you setup the
> bmc-watchdog.
>

This is my setup:

#bmc-watchdog -d -u 4 -p 0 -n -i 300 -l 0

I forgot to tell you that I am using the in-band mechanism. The IBM x3650 should have an RSA II card installed to get the BMC card (I think this is the built-in LAN management port that comes with the box) working.

> > > > I then cleared the SEL records, thinking that the reboot might have
> > > > been triggered due to a full SEL.
> > >
> > > I think this is a reasonable guess. It could be anything really.
> > >
> > > > # ipmi-sel -c
> > > >
> > > > # reboot
> > > > # ipmi-sel
> > > > 1:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00
> > > > # ipmi-sel
> > > > 1:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00
> > > >
> > > > # reboot
> > > > # ipmi-sel
> > > > 1:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00
> > > > 2:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00
> > > > 3:OEM defined = 02 00 00 FF 00 00 00 00 20 00 00 00 00
> > > >
> > > > Then retried the previous command that caused an error.
> > > >
> > > > # ipmi-sel > ibm3650-dsc2075-sel.txt
> > > >
> > > > # cat ibm3650-dsc2075-sel.txt
> > > > 1:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00
> > > > 2:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00
> > > > 3:OEM defined = 02 00 00 FF 00 00 00 00 20 00 00 00 00
> > > >
> > > > Then the problem didn't occur anymore.
> > > > Besides, what is the meaning of this OEM defined? I can't see any log
> > > > that is more specific, or something like
> > >
> > > The system event log is allowed to store OEM defined information. Since
> > > the information is defined by (in this case) IBM, I have no way to
> > > convert the hex into something like what you're used to :-(
> > >
> >
> > I think this is cool. So, is it safe to assume that the system
> > rebooted if I see similar OEM defined info ( in this case OEM defined
> > = 00 00 00 00 00 E3 25 86 80 00 00 FF 00)? Is there any possibility to
> > integrate IBM's OEM defined info in the future too? :D
>
> I'd be willing to integrate any vendors OEM defined

This is nice to know. :)

> interpretation/parsing into FreeIPMI. The problem is, I do not know how
> to interpret/parse any of their information :-(
>
> As a customer, you should tell your vendor support about this. Each
> user that complains makes it more possible for them to release the
> information.
>
> Al
>
> > > > 220:19-Sep-2008 14:24:56:Power Unit Sys pwr monitor:Power Off/Power Down
> > > > 221:19-Sep-2008 14:25:16:Power Unit Sys pwr monitor:Power Off/Power Down
> > > >
> > > > I've attached here the ipmi-sel debug output.
> > > >
Re: Fw: [Freeipmi-devel] ibmx3650 reboots after ipmi-sel is unable to get SEL record
Hey Won,

On Mon, 2009-01-26 at 18:53 -0800, Won De Erick wrote:
> - Original Message
>
> > From: Al Chu
> >
> > Hey Won,
> >
> > On Sun, 2009-01-25 at 23:00 -0800, Won De Erick wrote:
> > > I am forwarding this to the FreeIPMI users mailing list. Hope, I can get
> > > hints from you all.
> > > Thank you.
> > >
> > > - Forwarded Message
> > > From: Won De Erick
> > > To: Albert Chu
> > > Cc: freeipmi-devel@gnu.org
> > > Sent: Saturday, January 24, 2009 11:55:24 AM
> > > Subject: Re: [Freeipmi-devel] ibmx3650 reboots after ipmi-sel is unable
> > > to get SEL record
> > >
> > > Pls disregard previous email. I forgot to attach the file. :)
> >
> > Did you send me the wrong debug file? I see debug output from
> > ipmi-sensors??
> >
>
> I'm sorry, attached is the correct one.

It seems that this one has a successful ipmi-sel execution in it, so there is not much I can debug with :-(

> > > Hi Al,
> > >
> > > With IBM x3650, I noticed that ipmi-sel is unable to get the SEL record.
> > >
> > > # ipmi-sel --version
> > > IPMI Sensors [ipmi-sel-0.6.10]
> > >
> > > # ipmi-sel > ibm3650-dsc2075-sel.txt
> > > ipmi_cmd_get_sel_entry: BMC busy
> > > ipmi-sel: unable to get SEL record
> > >
> > > After the above, the box automatically rebooted. Is this normal?
> >
> > I have never seen this behavior before, and I wouldn't consider it
> > "good" in any definition. This is likely a bug in the IBM
> > implementation. The "BMC busy" means exactly what it says, the BMC is
> > busy and cannot respond to IPMI requests. It by itself is not a
> > problem. For example, some other IPMI tasks are hogging resources. But
> > you should presumably be able to reach the card eventually. Is it
> > possible you have other IPMI things running in the background?
> >
>
> bmc-watchdog (as daemon) was the only thing running in the background.

This shouldn't be enough to make IPMI *that* busy. Here's a thought. Perhaps the SEL filled up, the BMC card went busy, and thus bmc-watchdog couldn't perform IPMI and timed out, leading to a reboot?? Obviously, it depends on how you set up bmc-watchdog.

> > > I then cleared the SEL records, thinking that the reboot might have been
> > > triggered due to a full SEL.
> >
> > I think this is a reasonable guess. It could be anything really.
> >
> > > # ipmi-sel -c
> > >
> > > # reboot
> > > # ipmi-sel
> > > 1:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00
> > > # ipmi-sel
> > > 1:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00
> > >
> > > # reboot
> > > # ipmi-sel
> > > 1:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00
> > > 2:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00
> > > 3:OEM defined = 02 00 00 FF 00 00 00 00 20 00 00 00 00
> > >
> > > Then retried the previous command that caused an error.
> > >
> > > # ipmi-sel > ibm3650-dsc2075-sel.txt
> > >
> > > # cat ibm3650-dsc2075-sel.txt
> > > 1:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00
> > > 2:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00
> > > 3:OEM defined = 02 00 00 FF 00 00 00 00 20 00 00 00 00
> > >
> > > Then the problem didn't occur anymore.
> > > Besides, what is the meaning of this OEM defined? I can't see any log
> > > that is more specific, or something like
> >
> > The system event log is allowed to store OEM defined information. Since
> > the information is defined by (in this case) IBM, I have no way to
> > convert the hex into something like what you're used to :-(
> >
>
> I think this is cool. So, is it safe to assume that the system
> rebooted if I see similar OEM defined info ( in this case OEM defined
> = 00 00 00 00 00 E3 25 86 80 00 00 FF 00)? Is there any possibility to
> integrate IBM's OEM defined info in the future too? :D

I'd be willing to integrate any vendor's OEM defined interpretation/parsing into FreeIPMI. The problem is, I do not know how to interpret/parse any of their information :-(

As a customer, you should tell your vendor support about this. Each user that complains makes it more likely that they will release the information.

Al

> > > 220:19-Sep-2008 14:24:56:Power Unit Sys pwr monitor:Power Off/Power Down
> > > 221:19-Sep-2008 14:25:16:Power Unit Sys pwr monitor:Power Off/Power Down
> > >
> > > I've attached here the ipmi-sel debug output.
> > >
> > > Then one side question, I want to ask the possible reasons of the ff
> > > log obtained prior to clearing. I didn't change any in the system.
> > > I just noticed that the system halted serving and went back after 4-5
> > > minutes, w/out any other records in SEL that says the box hang and
> > > rebooted.
> > >
> > > 54:23-Jan-2009 11:28:55:System Event #0:System Reconfigured
> >
> > I'm not quite sure what you're asking. Are you asking why the above log
> > message occurs? I'm not too sure. It could really be for one of many
> > reasons. Maybe a BIOS changed for a firmware changed? The IPMI spec
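
The hypothesis in this message (the BMC stays busy, bmc-watchdog cannot reset the timer, and the countdown expires) can be illustrated with a toy model. This is only a sketch: `first_timeout` is a hypothetical helper, the 60-second reset period is an assumption that nobody in the thread states, and a real BMC counts down in hardware rather than in one-second steps.

```python
def first_timeout(initial_countdown, reset_period, busy_from, busy_until, horizon):
    """Return the second at which the watchdog would expire, or None.

    The daemon tries to reset the timer every `reset_period` seconds.
    Attempts made while busy_from <= t < busy_until fail, as if the BMC
    had answered "BMC busy". The timer expires once `initial_countdown`
    seconds pass without a successful reset.
    """
    last_reset = 0
    for t in range(1, horizon + 1):
        if t - last_reset >= initial_countdown:
            return t
        if t % reset_period == 0 and not (busy_from <= t < busy_until):
            last_reset = t
    return None


# 240 s countdown as in the thread; 60 s reset period is an assumed value.
print(first_timeout(240, 60, busy_from=100, busy_until=500, horizon=1000))  # -> 300
print(first_timeout(240, 60, busy_from=100, busy_until=200, horizon=1000))  # -> None
```

In this model a short busy spell is harmless, while a busy window longer than the countdown ends in the timeout action, which matches the "Hard Reset" entry reported later in the thread.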
Re: Fw: [Freeipmi-devel] ibmx3650 reboots after ipmi-sel is unable to get SEL record
- Original Message
> From: Al Chu
>
> Hey Won,
>
> On Sun, 2009-01-25 at 23:00 -0800, Won De Erick wrote:
> > I am forwarding this to the FreeIPMI users mailing list. Hope, I can get
> > hints from you all.
> > Thank you.
> >
> > - Forwarded Message
> > From: Won De Erick
> > To: Albert Chu
> > Cc: freeipmi-devel@gnu.org
> > Sent: Saturday, January 24, 2009 11:55:24 AM
> > Subject: Re: [Freeipmi-devel] ibmx3650 reboots after ipmi-sel is unable to
> > get SEL record
> >
> > Pls disregard previous email. I forgot to attach the file. :)
>
> Did you send me the wrong debug file? I see debug output from
> ipmi-sensors??
>

I'm sorry, attached is the correct one.

> > Hi Al,
> >
> > With IBM x3650, I noticed that ipmi-sel is unable to get the SEL record.
> >
> > # ipmi-sel --version
> > IPMI Sensors [ipmi-sel-0.6.10]
> >
> > # ipmi-sel > ibm3650-dsc2075-sel.txt
> > ipmi_cmd_get_sel_entry: BMC busy
> > ipmi-sel: unable to get SEL record
> >
> > After the above, the box automatically rebooted. Is this normal?
>
> I have never seen this behavior before, and I wouldn't consider it
> "good" in any definition. This is likely a bug in the IBM
> implementation. The "BMC busy" means exactly what it says, the BMC is
> busy and cannot respond to IPMI requests. It by itself is not a
> problem. For example, some other IPMI tasks are hogging resources. But
> you should presumably be able to reach the card eventually. Is it
> possible you have other IPMI things running in the background?
>

bmc-watchdog (as a daemon) was the only thing running in the background.

> > I then cleared the SEL records, thinking that the reboot might have been
> > triggered due to a full SEL.
>
> I think this is a reasonable guess. It could be anything really.
>
> > # ipmi-sel -c
> >
> > # reboot
> > # ipmi-sel
> > 1:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00
> > # ipmi-sel
> > 1:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00
> >
> > # reboot
> > # ipmi-sel
> > 1:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00
> > 2:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00
> > 3:OEM defined = 02 00 00 FF 00 00 00 00 20 00 00 00 00
> >
> > Then retried the previous command that caused an error.
> >
> > # ipmi-sel > ibm3650-dsc2075-sel.txt
> >
> > # cat ibm3650-dsc2075-sel.txt
> > 1:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00
> > 2:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00
> > 3:OEM defined = 02 00 00 FF 00 00 00 00 20 00 00 00 00
> >
> > Then the problem didn't occur anymore.
> > Besides, what is the meaning of this OEM defined? I can't see any log that
> > is more specific, or something like
>
> The system event log is allowed to store OEM defined information. Since
> the information is defined by (in this case) IBM, I have no way to
> convert the hex into something like what you're used to :-(
>

I think this is cool. So, is it safe to assume that the system rebooted if I see similar OEM defined info (in this case OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00)? Is there any possibility of integrating IBM's OEM defined info in the future too? :D

> > 220:19-Sep-2008 14:24:56:Power Unit Sys pwr monitor:Power Off/Power Down
> > 221:19-Sep-2008 14:25:16:Power Unit Sys pwr monitor:Power Off/Power Down
> >
> > I've attached here the ipmi-sel debug output.
> >
> > Then one side question, I want to ask the possible reasons of the ff
> > log obtained prior to clearing. I didn't change any in the system.
> > I just noticed that the system halted serving and went back after 4-5
> > minutes, w/out any other records in SEL that says the box hang and
> > rebooted.
> >
> > 54:23-Jan-2009 11:28:55:System Event #0:System Reconfigured
>
> I'm not quite sure what you're asking. Are you asking why the above log
> message occurs? I'm not too sure. It could really be for one of many
> reasons. Maybe a BIOS changed for a firmware changed? The IPMI spec
> doesn't really define when a "System Reconfigured" event must be
> reported. It only defines that a "System Reconfigured" event can occur
> and that manufacturers are free to determine what events will make that
> information output to the event log.
>

You got exactly what I meant. But aside from changes to the BIOS or BMC firmware, I also want to know whether the event can be reported because of changes at the OS level. I just wondered why the "System Reconfigured" event was logged when in fact no changes were made to the BIOS firmware, the BMC firmware, or the OS. Sorry, this question may not be related to FreeIPMI anymore, but I just want to elicit some ideas from you.

> Hope I was helpful,
>
> Al
>
> > Thanks,
> >
> > Won
> >
>
> --
> Albert Chu
> ch...@llnl.gov
> Computer Scientist
> High Performance Systems Division
> Lawrence Livermore National Laboratory

I am receiving mail delivery error(s) when sending mails to fre
Re: Fw: [Freeipmi-devel] ibmx3650 reboots after ipmi-sel is unable to get SEL record
Hey Won,

On Sun, 2009-01-25 at 23:00 -0800, Won De Erick wrote:
> I am forwarding this to the FreeIPMI users mailing list. Hope, I can get
> hints from you all.
> Thank you.
>
> - Forwarded Message
> From: Won De Erick
> To: Albert Chu
> Cc: freeipmi-devel@gnu.org
> Sent: Saturday, January 24, 2009 11:55:24 AM
> Subject: Re: [Freeipmi-devel] ibmx3650 reboots after ipmi-sel is unable to
> get SEL record
>
> Pls disregard previous email. I forgot to attach the file. :)

Did you send me the wrong debug file? I see debug output from ipmi-sensors??

> Hi Al,
>
> With IBM x3650, I noticed that ipmi-sel is unable to get the SEL record.
>
> # ipmi-sel --version
> IPMI Sensors [ipmi-sel-0.6.10]
>
> # ipmi-sel > ibm3650-dsc2075-sel.txt
> ipmi_cmd_get_sel_entry: BMC busy
> ipmi-sel: unable to get SEL record
>
> After the above, the box automatically rebooted. Is this normal?

I have never seen this behavior before, and I wouldn't consider it "good" in any definition. This is likely a bug in the IBM implementation. The "BMC busy" means exactly what it says: the BMC is busy and cannot respond to IPMI requests. By itself it is not a problem. For example, some other IPMI tasks may be hogging resources. But you should presumably be able to reach the card eventually. Is it possible you have other IPMI things running in the background?

> I then cleared the SEL records, thinking that the reboot might have been
> triggered due to a full SEL.

I think this is a reasonable guess. It could be anything really.

> # ipmi-sel -c
>
> # reboot
> # ipmi-sel
> 1:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00
> # ipmi-sel
> 1:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00
>
> # reboot
> # ipmi-sel
> 1:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00
> 2:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00
> 3:OEM defined = 02 00 00 FF 00 00 00 00 20 00 00 00 00
>
> Then retried the previous command that caused an error.
>
> # ipmi-sel > ibm3650-dsc2075-sel.txt
>
> # cat ibm3650-dsc2075-sel.txt
> 1:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00
> 2:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00
> 3:OEM defined = 02 00 00 FF 00 00 00 00 20 00 00 00 00
>
> Then the problem didn't occur anymore.
> Besides, what is the meaning of this OEM defined? I can't see any log that is
> more specific, or something like

The system event log is allowed to store OEM defined information. Since the information is defined by (in this case) IBM, I have no way to convert the hex into something like what you're used to :-(

> 220:19-Sep-2008 14:24:56:Power Unit Sys pwr monitor:Power Off/Power Down
> 221:19-Sep-2008 14:25:16:Power Unit Sys pwr monitor:Power Off/Power Down
>
> I've attached here the ipmi-sel debug output.
>
> Then one side question, I want to ask the possible reasons of the ff
> log obtained prior to clearing. I didn't change any in the system.
> I just noticed that the system halted serving and went back after 4-5
> minutes, w/out any other records in SEL that says the box hang and
> rebooted.
>
> 54:23-Jan-2009 11:28:55:System Event #0:System Reconfigured

I'm not quite sure what you're asking. Are you asking why the above log message occurs? I'm not too sure. It could really be for one of many reasons. Maybe a BIOS change or a firmware change? The IPMI spec doesn't really define when a "System Reconfigured" event must be reported. It only defines that a "System Reconfigured" event can occur and that manufacturers are free to determine which events cause that information to be written to the event log.

Hope I was helpful,

Al

> Thanks,
>
> Won
>

--
Albert Chu
ch...@llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory
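
Since "BMC busy" is described in this message as a condition that should clear eventually, a caller can retry instead of giving up on the first failure. The sketch below is an illustration, not a FreeIPMI feature. `run_with_busy_retry` is a hypothetical helper, and it assumes the tool exits non-zero and prints the "BMC busy" text on stderr, which the thread does not confirm.

```python
import subprocess
import time


def run_with_busy_retry(cmd, attempts=5, delay=2.0,
                        run=subprocess.run, sleep=time.sleep):
    """Run `cmd`, retrying with a growing pause while the BMC reports busy.

    Assumption: a busy BMC shows up as a non-zero exit status together
    with "BMC busy" on stderr. Any other failure is raised immediately.
    `run` and `sleep` are injectable so the logic can be tested offline.
    """
    for attempt in range(1, attempts + 1):
        result = run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout
        if "BMC busy" not in result.stderr:
            raise RuntimeError(result.stderr.strip())
        if attempt < attempts:
            sleep(delay * attempt)  # wait 2 s, 4 s, 6 s, ... between tries
    raise TimeoutError("BMC still busy after %d attempts" % attempts)
```

A call such as `run_with_busy_retry(["ipmi-sel"])` would then return the SEL text once the BMC answers. Keep the total retry time well below the watchdog countdown, because retrying does nothing to stop a running watchdog timer.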