Re: [ovirt-users] problems with power management using idrac7 on r620

2015-06-17 Thread Marek marx Grac



On 06/16/2015 09:37 AM, Eli Mesika wrote:

CCing Marek Grac

- Original Message -

From: Jason Keltz jason.ke...@gmail.com
To: users users@ovirt.org
Cc: Eli Mesika emes...@redhat.com
Sent: Monday, June 15, 2015 11:08:35 PM
Subject: problems with power management using idrac7 on r620

Hi.

I've been having problems with power management using iDRAC 7 Express on
a Dell R620.  This uses a shared LOM, as opposed to Enterprise, which has a
dedicated one.  Every now and then, the iDRAC simply stops responding to
ping, so it can't respond to status commands from the proxy.  If I send
a reboot with the "ipmitool mc reset cold" command, the iDRAC reboots and
comes back, but once the problem has occurred, even after that reboot, it
responds to ping but drops 80+% of packets.  The only way I can solve
the problem is to physically restart the server.  This isn't just
happening on one R620 - it's happening on all of my oVirt hosts.  I
strongly suspect it has to do with a memory leak, and that being monitored
by the engine causes the problem.  I had applied a recent firmware upgrade
that was supposed to solve this kind of problem, but it doesn't.  In
order to provide Dell with more details, can someone tell me how often
each host is being queried for status?  I can't seem to find that info.
The iDRAC on my file server doesn't seem to exhibit the same problem,
and I suspect that is because it isn't being queried.

Hi,

The fence agent for IPMI is based on ipmitool, so if ping/ipmitool is not
working, there is not much we can do about it. I don't know enough about
the oVirt engine, but there is no real place where the fence agent can leak
memory, because it does not run as a daemon.
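
To illustrate the "not a daemon" point, here is a minimal sketch of a one-shot status query. It assumes the agent ends up running something like "ipmitool ... chassis power status"; the function names are hypothetical and not taken from any fence agent, and the host, user, and password are placeholders.

```python
import subprocess


def build_status_command(host, user, password, interface="lanplus"):
    """Build an ipmitool command line that asks the BMC for chassis power status."""
    return [
        "ipmitool",
        "-I", interface,   # IPMI interface, e.g. "lan" or "lanplus"
        "-H", host,        # address of the iDRAC/BMC (placeholder)
        "-U", user,
        "-P", password,
        "chassis", "power", "status",
    ]


def query_power_status(host, user, password, timeout=20):
    """Run a single status query; the process ends once ipmitool answers or times out."""
    result = subprocess.run(
        build_status_command(host, user, password),
        capture_output=True, text=True, timeout=timeout,
    )
    return result.returncode, result.stdout.strip()
```

Each call starts a fresh process and exits, so nothing on the agent side stays resident between queries.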


m,
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] problems with power management using idrac7 on r620

2015-06-17 Thread Jason Keltz
Hi Marek.

Actually, it's the iDRAC that I believe has the memory leak.  Dell wants to
know how often oVirt is querying the iDRAC for status and whether the delay
is configurable.

Jason.


Re: [ovirt-users] problems with power management using idrac7 on r620

2015-06-17 Thread Jason Keltz

Hi Eli.
Thank you!
I checked, and the health check is not enabled. So the problem causing the
iDRAC to go away is not status monitoring from oVirt after all... Hmm...
Makes me wonder if actually enabling it would prevent the problem from
happening.


Jas





Re: [ovirt-users] problems with power management using idrac7 on r620

2015-06-17 Thread Eli Mesika


- Original Message -
 From: Jason Keltz jason.ke...@gmail.com
 To: Marek marx Grac mg...@redhat.com
 Cc: Eli Mesika emes...@redhat.com, users users@ovirt.org
 Sent: Wednesday, June 17, 2015 12:02:48 PM
 Subject: Re: problems with power management using idrac7 on r620
 
 Hi Marek.
 
 Actually, it's the iDRAC that I believe has the memory leak.  Dell wants to
 know how often oVirt is querying the iDRAC for status and whether the delay
 is configurable.

Well, oVirt does not query the status automatically by default.
There is a feature that enables that:
http://www.ovirt.org/Features/PMHealthCheck
Basically, this feature depends on 2 configuration values:

PMHealthCheckEnabled, which should be true if the feature is enabled
PMHealthCheckIntervalInSec, which defaults to 3600 sec, so in that case the
status is checked once an hour

So, first please check whether this is enabled in your environment:

engine-config -g PMHealthCheckEnabled

engine-config -g PMHealthCheckIntervalInSec
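
As a rough aid for the numbers Dell asked about, the small sketch below turns the interval into a per-day query count. It assumes one status query per host per interval, evenly spaced, and only while PMHealthCheckEnabled is true; the function name is made up for illustration.

```python
def health_checks_per_day(interval_sec):
    """Number of PM health-check status queries one host would see in 24 hours."""
    seconds_per_day = 24 * 60 * 60
    return seconds_per_day // interval_sec


# Default quoted above: PMHealthCheckIntervalInSec = 3600
print(health_checks_per_day(3600))  # 24
```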

The other scenario in which status is used is when a host becomes
non-responsive.

When a host becomes non-responsive:

After a grace period that depends on the host load and on whether it is the
SPM, a soft-fence attempt (vdsmd service restart) is issued.
If the soft-fence attempt fails, we do a real fencing (if power management
is configured correctly on the host and a proxy host is found):
We send a STOP command.
We send, by default, up to 18 status commands, one every 10 sec, until we get
'off' status from the agent.
We send a START command.
We send, by default, up to 18 status commands, one every 10 sec, until we get
'on' status from the agent.

These depend on the following configuration variables:

FenceStopStatusRetries - default 18
FenceStopStatusDelayBetweenRetriesInSec - default 10 
FenceStartStatusRetries - default 18
FenceStartStatusDelayBetweenRetriesInSec - default 10 

These can be changed using the engine-config tool (requires a restart to
take effect).
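
The status polling described above can be sketched roughly as follows. This is not the engine's code: it is only an illustration under the assumption that each phase polls, waits the configured delay, and polls again until the expected status is seen or the retries run out. The function names are hypothetical, and get_status stands in for whatever asks the fence agent.

```python
import time


def wait_for_status(get_status, expected, retries=18, delay_sec=10, sleep=time.sleep):
    """Poll get_status() until it returns `expected` or `retries` polls are used up.

    Returns (reached, polls_sent).
    """
    for attempt in range(1, retries + 1):
        if get_status() == expected:
            return True, attempt
        if attempt < retries:
            sleep(delay_sec)  # wait before sending the next status command
    return False, retries


def worst_case_status_commands(stop_retries=18, start_retries=18):
    """Upper bound on status commands in one STOP + START fencing cycle."""
    return stop_retries + start_retries
```

With the defaults listed above, that is at most 36 status commands per fencing cycle, spread over roughly three minutes per phase.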



 


Re: [ovirt-users] problems with power management using idrac7 on r620

2015-06-16 Thread Eli Mesika
CCing Marek Grac 



[ovirt-users] problems with power management using idrac7 on r620

2015-06-15 Thread Jason Keltz

Hi.

I've been having problems with power management using iDRAC 7 Express on
a Dell R620.  This uses a shared LOM, as opposed to Enterprise, which has a
dedicated one.  Every now and then, the iDRAC simply stops responding to
ping, so it can't respond to status commands from the proxy.  If I send
a reboot with the "ipmitool mc reset cold" command, the iDRAC reboots and
comes back, but once the problem has occurred, even after that reboot, it
responds to ping but drops 80+% of packets.  The only way I can solve
the problem is to physically restart the server.  This isn't just
happening on one R620 - it's happening on all of my oVirt hosts.  I
strongly suspect it has to do with a memory leak, and that being monitored
by the engine causes the problem.  I had applied a recent firmware upgrade
that was supposed to solve this kind of problem, but it doesn't.  In
order to provide Dell with more details, can someone tell me how often
each host is being queried for status?  I can't seem to find that info.
The iDRAC on my file server doesn't seem to exhibit the same problem,
and I suspect that is because it isn't being queried.


Thanks,

Jason.