Re: [Linux-HA] MySQL unknown exec error

2010-11-08 Thread Dejan Muhamedagic
On Thu, Nov 04, 2010 at 02:54:59PM -0300, mike wrote:
 Thanks for the reply Dejan. I have the failcount threshold set to 3 on
 both nodes and if I understand it correctly, after a 3rd failure it
 should fail over to the backup node. Correct?

Yes.

 What do you mean by a 
 negative location constraint?

A location constraint with a negative score. For instance, such a
constraint is inserted by the crm resource move command.
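
To make the effect of a negative score concrete, here is a minimal Python sketch of how scores add up per node. It is a simplified model for illustration, not Pacemaker's actual code: the function names, the node names (node1, node2) and the example scores are all assumptions. The one rule it relies on is that Pacemaker treats 1000000 as INFINITY and that -INFINITY wins over any other score, including +INFINITY.

```python
INFINITY = 1_000_000  # Pacemaker represents INFINITY as 1000000


def add_scores(a, b):
    """Add two scores using infinity arithmetic (simplified model)."""
    # -INFINITY dominates everything, even +INFINITY.
    if a <= -INFINITY or b <= -INFINITY:
        return -INFINITY
    if a >= INFINITY or b >= INFINITY:
        return INFINITY
    # Finite sums are clamped to the +/-INFINITY range.
    return max(-INFINITY, min(INFINITY, a + b))


def node_score(constraint_scores):
    """Fold all location-constraint scores that apply to one node."""
    total = 0
    for score in constraint_scores:
        total = add_scores(total, score)
    return total


# Hypothetical cluster: node2 still carries a -INFINITY constraint left
# behind by an earlier move of the resource away from it.
scores = {"node1": [100], "node2": [50, -INFINITY]}
totals = {node: node_score(s) for node, s in scores.items()}

for node, total in totals.items():
    state = "banned" if total == -INFINITY else "eligible"
    print(node, total, state)
```

In this model node2 ends up at -INFINITY no matter what other preferences it has, so a failure on node1 leaves the resource with nowhere to go. A leftover constraint of that kind would have to be removed (in the crm shell, typically with crm resource unmove) before failover to that node becomes possible again.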

Thanks,

Dejan

 Mike
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] MySQL unknown exec error

2010-11-04 Thread mike
Looking for a more experienced person who can explain this issue we had 
last night.

Our backups kicked in during the night at 1AM. At 1:01AM, our mysql 
cluster had issues. Specifically I can see in crm_mon where the cluster 
has it as failed due to an unknown exec error. Looking at the 
performance of the node, I can see where wait on I/O went through the 
roof at 1AM when the tsm backups kicked in. I can see where this caused 
heartbeat issues because mysql was late checking its instances - it 
generally takes a few seconds but in this case it took 3 minutes. Of 
course this is all due to the extremely high wait on I/O but I am 
curious - why didn't the cluster fail over? Why put MySQL in an 
unmanaged state and simply say there was an unknown exec error?

Thanks for any comments


Re: [Linux-HA] MySQL unknown exec error

2010-11-04 Thread Dejan Muhamedagic
Hi,

Can't say without looking at the logs and the PE files. One
possible explanation is that a resource was for whatever reason
not allowed to run on the other node: a failure in the past
which didn't expire or a negative location constraint. Or the
fail count reached migration threshold (if defined).
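
As a rough illustration of the fail count case, the sketch below models the rule that a node stops being a candidate for a resource once its fail count reaches migration-threshold. It is a toy model under stated assumptions, not the policy engine's real logic: the function names and node names are hypothetical, and an undefined threshold is represented here as None.

```python
from typing import Dict, List, Optional


def node_excluded(fail_count: int, migration_threshold: Optional[int]) -> bool:
    """True if past failures rule this node out (simplified model)."""
    if migration_threshold is None:
        # No threshold defined: failures alone never exclude the node here.
        return False
    return fail_count >= migration_threshold


def candidates(fail_counts: Dict[str, int],
               migration_threshold: Optional[int]) -> List[str]:
    """Nodes that are still allowed to run the resource."""
    return [node for node, count in fail_counts.items()
            if not node_excluded(count, migration_threshold)]


# Hypothetical fail counts with migration-threshold=3.
print(candidates({"node1": 3, "node2": 0}, 3))  # node1 is out, node2 remains
print(candidates({"node1": 3, "node2": 3}, 3))  # old failures on node2 too
```

The second call returns an empty list: if the other node still holds an unexpired fail count from an earlier incident, there is no node left to fail over to, which matches the first explanation above.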

Thanks,

Dejan




Re: [Linux-HA] MySQL unknown exec error

2010-11-04 Thread mike
Thanks for the reply Dejan. I have the failcount threshold set to 3 on
both nodes and if I understand it correctly, after a 3rd failure it
should fail over to the backup node. Correct? What do you mean by a
negative location constraint?

Mike