Re: [ovirt-users] problems with power management using idrac7 on r620
On 06/16/2015 09:37 AM, Eli Mesika wrote:
> CCing Marek Grac
>
> - Original Message -
> From: Jason Keltz jason.ke...@gmail.com
> To: users users@ovirt.org
> Cc: Eli Mesika emes...@redhat.com
> Sent: Monday, June 15, 2015 11:08:35 PM
> Subject: problems with power management using idrac7 on r620
>
> Hi.
>
> I've been having problems with power management using iDRAC 7 EXPRESS on a Dell R620. This uses a shared LOM as opposed to Enterprise that has a dedicated one. Every now and then, idrac simply stops responding to ping, so it can't respond to status commands from the proxy. If I send a reboot with the "ipmitool mc reset cold" command, the idrac reboots and comes back, but after the problem has occurred, even after a reboot, it responds to ping, but drops 80+% of packets. The only way I can solve the problem is to physically restart the server.
>
> This isn't just happening on one R620 - it's happening on all of my ovirt hosts. I highly suspect it has to do with a memory leak, and being monitored by engine causes the problem.
>
> I had applied a recent firmware upgrade that was supposed to solve this kind of problem, but it doesn't. In order to provide Dell with more details, can someone tell me how often each host is being queried for status? I can't seem to find that info. The idrac on my file server doesn't seem to exhibit the same problem, and I suspect that is because it isn't being queried.

Hi,

the fence agent for IPMI is based on ipmitool, so if ping/ipmitool is not working there is not much to do about it. I don't know enough about oVirt engine, but there is no real place where the fence agent can leak memory, because it does not run as a daemon.

m,

___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users
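For anyone who wants to check the BMC by hand, independently of the engine, here is a minimal sketch that runs a single power-status query through ipmitool and parses the reply. It is not the fence agent's code: the helper names are made up, the host, user and password are placeholders, and the exact options a given fence agent passes are not shown in this thread, so treat the command line as an assumption.

```python
import subprocess


def build_status_command(host, user, password, interface="lanplus"):
    """Build an ipmitool command line for one power-status query.

    Assumption: IPMI over LAN with the 'lanplus' interface; adjust to
    whatever the BMC is actually configured for.
    """
    return [
        "ipmitool", "-I", interface,
        "-H", host, "-U", user, "-P", password,
        "chassis", "power", "status",
    ]


def parse_power_status(output):
    """Turn 'Chassis Power is on' into 'on'; return None if unrecognised."""
    text = output.strip().lower()
    prefix = "chassis power is "
    if text.startswith(prefix):
        return text[len(prefix):]
    return None


def query_power_status(host, user, password, timeout_sec=20):
    """Run one status query; return 'on', 'off', or None on any failure."""
    cmd = build_status_command(host, user, password)
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_sec
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # BMC not answering in time, or ipmitool is not installed.
        return None
    if result.returncode != 0:
        return None
    return parse_power_status(result.stdout)
```

Because each query is a separate short-lived process, a failed or timed-out call simply returns None here; nothing stays resident between queries.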
Re: [ovirt-users] problems with power management using idrac7 on r620
Hi Marek.

Actually, it's the idrac that I believe has the memory leak. Dell wants to know how often ovirt is querying the idrac for status and whether the delay is configurable.

Jason.

On Jun 17, 2015 2:42 AM, Marek marx Grac mg...@redhat.com wrote:
> Hi,
>
> fence agent for IPMI is based on ipmitool. So if ping/ipmitool is not working there is not much to do about it. I don't know enough about oVirt engine but there is no real place where fence agent can memory leak because it does not run as daemon.
>
> m,
Re: [ovirt-users] problems with power management using idrac7 on r620
Hi Eli..

Thank you! I checked, and health check is not enabled. So the problem causing the idrac to go away is not status monitoring from ovirt after all... Hmm... Makes me wonder if actually enabling it will prevent the problem from happening.

Jas

On June 17, 2015 5:19:28 AM Eli Mesika emes...@redhat.com wrote:
> So, first please check if this is enabled in your environment
>
> engine-config -g PMHealthCheckEnabled
> engine-config -g PMHealthCheckIntervalInSec
Re: [ovirt-users] problems with power management using idrac7 on r620
- Original Message -
From: Jason Keltz jason.ke...@gmail.com
To: Marek marx Grac mg...@redhat.com
Cc: Eli Mesika emes...@redhat.com, users users@ovirt.org
Sent: Wednesday, June 17, 2015 12:02:48 PM
Subject: Re: problems with power management using idrac7 on r620

> Hi Marek.
>
> Actually, it's the idrac that I believe has the memory leak. Dell wants to know how often ovirt is querying the idrac for status and whether the delay is configurable.

Well, oVirt does not query the status automatically by default. There is a feature that enables that:

http://www.ovirt.org/Features/PMHealthCheck

Basically, this feature depends on 2 configuration values:

PMHealthCheckEnabled - should be true if the feature is enabled
PMHealthCheckIntervalInSec - defaults to 3600 sec, so in that case the status is checked once an hour

So, first please check whether this is enabled in your environment:

engine-config -g PMHealthCheckEnabled
engine-config -g PMHealthCheckIntervalInSec

The other scenario in which status is used is when a host becomes non-responsive. In that case:

After a grace period that depends on the host load and on whether it is SPM or not, a soft-fence attempt (vdsmd service restart) is issued.

If the soft-fence attempt fails, we do a real fencing (if power management is configured correctly on the host and a proxy host is found):

    We send a STOP command.
    We send by default 18 status commands, one every 10 sec, until we get 'off' status from the agent.
    We send a START command.
    We send by default 18 status commands, one every 10 sec, until we get 'on' status from the agent.

Those depend on the following configuration variables:

FenceStopStatusRetries - default 18
FenceStopStatusDelayBetweenRetriesInSec - default 10
FenceStartStatusRetries - default 18
FenceStartStatusDelayBetweenRetriesInSec - default 10

These can be changed using the engine-config tool (requires restart to take effect).
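To put rough numbers on the intervals described in this message, here is a minimal sketch that works out how many status queries a BMC would see under the quoted defaults. It is not part of oVirt: the function names are hypothetical, and only the default values (3600 sec health-check interval, 18 retries, 10 sec delay) come from the message. The fencing figure assumes the worst case, where every retry is used.

```python
"""Back-of-the-envelope count of power-management status queries.

Assumption: only the default values below come from the thread; the
function names are made up for illustration and are not oVirt APIs.
"""

SECONDS_PER_DAY = 24 * 60 * 60

# Defaults quoted in the thread.
PM_HEALTH_CHECK_INTERVAL_SEC = 3600
FENCE_STOP_STATUS_RETRIES = 18
FENCE_STOP_STATUS_DELAY_SEC = 10
FENCE_START_STATUS_RETRIES = 18
FENCE_START_STATUS_DELAY_SEC = 10


def health_checks_per_day(interval_sec=PM_HEALTH_CHECK_INTERVAL_SEC):
    """Status queries per host per day when PMHealthCheckEnabled is true."""
    return SECONDS_PER_DAY // interval_sec


def worst_case_fence_polling(
    stop_retries=FENCE_STOP_STATUS_RETRIES,
    stop_delay_sec=FENCE_STOP_STATUS_DELAY_SEC,
    start_retries=FENCE_START_STATUS_RETRIES,
    start_delay_sec=FENCE_START_STATUS_DELAY_SEC,
):
    """Return (max status queries, upper bound on seconds spent polling)
    for one stop/start fencing cycle, assuming every retry is used."""
    queries = stop_retries + start_retries
    seconds = stop_retries * stop_delay_sec + start_retries * start_delay_sec
    return queries, seconds


if __name__ == "__main__":
    print("health checks per day:", health_checks_per_day())
    queries, seconds = worst_case_fence_polling()
    print("worst-case fence cycle:", queries, "queries in", seconds, "sec")
```

With the defaults this comes to 24 health-check queries per host per day when the feature is enabled, and at most 36 status queries over roughly six minutes for a single fencing cycle.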
Re: [ovirt-users] problems with power management using idrac7 on r620
CCing Marek Grac

- Original Message -
From: Jason Keltz jason.ke...@gmail.com
To: users users@ovirt.org
Cc: Eli Mesika emes...@redhat.com
Sent: Monday, June 15, 2015 11:08:35 PM
Subject: problems with power management using idrac7 on r620
[ovirt-users] problems with power management using idrac7 on r620
Hi.

I've been having problems with power management using iDRAC 7 EXPRESS on a Dell R620. This uses a shared LOM as opposed to Enterprise that has a dedicated one. Every now and then, idrac simply stops responding to ping, so it can't respond to status commands from the proxy. If I send a reboot with the "ipmitool mc reset cold" command, the idrac reboots and comes back, but after the problem has occurred, even after a reboot, it responds to ping, but drops 80+% of packets. The only way I can solve the problem is to physically restart the server.

This isn't just happening on one R620 - it's happening on all of my ovirt hosts. I highly suspect it has to do with a memory leak, and being monitored by engine causes the problem.

I had applied a recent firmware upgrade that was supposed to solve this kind of problem, but it doesn't. In order to provide Dell with more details, can someone tell me how often each host is being queried for status? I can't seem to find that info. The idrac on my file server doesn't seem to exhibit the same problem, and I suspect that is because it isn't being queried.

Thanks,

Jason.