Re: [ClusterLabs] Issues with DB2 HADR Resource Agent

2018-03-04 Thread Ondrej Famera
On 02/19/2018 11:25 PM, Dileep V Nair wrote:
> Hello Ondrej,
> 
> I am still having issues with my DB2 HADR on Pacemaker. When I do a
> db2_kill on Primary for testing, initially it does a restart of DB2 on
> the same node. But if I let it run for some days and then try the same
> test, it goes into fencing and then reboots the Primary Node.
> 
> I am not sure how exactly it should behave in case my DB2 crashes on
> Primary.
> 
> Also if I crash the Node 1 (the node itself, not only DB2), it promotes
> Node 2 to Primary, but once the Pacemaker is started again on Node 1,
> the DB on Node 1 is also promoted to Primary. Is that expected behaviour ?
> Regards,
> 
> *Dileep V Nair*
> Senior AIX Administrator
> Cloud Managed Services Delivery (MSD), India
> IBM Cloud
> 
> 
> *E-mail:* dilen...@in.ibm.com
> Outer Ring Road, Embassy Manya
> Bangalore, KA 560045
> India

Hello Dileep,

Sorry for the late reply (my email filters sometimes misbehave).

Seeing fencing after db2_kill is interesting, but the question is what
triggered the fencing. Was it a failure of DB2 to stop, or some other
resource failure?

When DB2 was successfully promoted on one node while the previous
Primary crashed, the crashed one should detect on restart that it is an
'outdated Primary' in DB2. When this happens the cluster will not
attempt to promote it to Master and will leave it as Slave. If that is
not what you observe, investigation on the DB2 side may be needed to
determine why this detection didn't happen.

If you have a procedure that reproduces this behavior consistently, I
can try it on my testing machine to see if I can reproduce it - that may
give a hint whether it is more a cluster issue or a DB2 issue that needs
to be addressed.
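
To help pin down the 'outdated Primary' case, here is a rough sketch of
checking the HADR role/state on the rejoining node before Pacemaker is
started. This is an assumption-laden example: `hadr_role_state` is a
hypothetical helper, the `HADR_ROLE = ...` / `HADR_STATE = ...` line
format is the one newer DB2 releases print for `db2pd -hadr`, and SAMPLE
is a placeholder database name.

```shell
# Hypothetical helper: condense `db2pd -hadr` output into a ROLE/STATE
# pair, e.g. "PRIMARY/PEER" or "STANDBY/DISCONNECTED".
hadr_role_state() {
    awk -F'= *' '
        $1 ~ /HADR_ROLE/  { role  = $2 }
        $1 ~ /HADR_STATE/ { state = $2 }
        END { print role "/" state }
    '
}

# On a live node you would feed it the real output (not run here):
#   db2pd -db SAMPLE -hadr | hadr_role_state
```

If the rejoining node still reports a PRIMARY role while the other node
has already taken over, that is the 'outdated Primary' situation, and
the cluster should refuse to promote it.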

-- 
Ondrej Faměra
@Red Hat
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Issues with DB2 HADR Resource Agent

2018-02-19 Thread Dileep V Nair

Hello Ondrej,

I am still having issues with my DB2 HADR on Pacemaker. When I do a
db2_kill on Primary for testing, initially it does a restart of DB2 on the
same node. But if I let it run for some days and then try the same test, it
goes into fencing and then reboots the Primary Node.

I am not sure how exactly it should behave in case my DB2 crashes on
Primary.

> Also if I crash the Node 1 (the node itself, not only DB2), it
> promotes Node 2 to Primary, but once the Pacemaker is started again on
> Node 1, the DB on Node 1 is also promoted to Primary. Is that expected
> behaviour?

   
Regards,

Dileep V Nair
Senior AIX Administrator
Cloud Managed Services Delivery (MSD), India
IBM Cloud
E-mail: dilen...@in.ibm.com
Outer Ring Road, Embassy Manya
Bangalore, KA 560045
India














Re: [ClusterLabs] Issues with DB2 HADR Resource Agent

2018-02-12 Thread Dileep V Nair

Thanks Ondrej for the response. I also figured out the same and reduced the
HADR_TIMEOUT and increased the promote timeout which helped in resolving
the issue.



   
Regards,

Dileep V Nair
Senior AIX Administrator
Cloud Managed Services Delivery (MSD), India
IBM Cloud
E-mail: dilen...@in.ibm.com
Outer Ring Road, Embassy Manya
Bangalore, KA 560045
India














Re: [ClusterLabs] Issues with DB2 HADR Resource Agent

2018-02-11 Thread Ondrej Famera
On 02/01/2018 07:24 PM, Dileep V Nair wrote:
> Thanks Ondrej for the response. I have set the PEER_WINDOW to 1000 which
> I guess is a reasonable value. What I am noticing is it does not wait
> for the PEER_WINDOW. Before that itself the DB goes into a
> REMOTE_CATCHUP_PENDING state and Pacemaker give an Error saying a DB in
> STANDBY/REMOTE_CATCHUP_PENDING/DISCONNECTED can never be promoted.
> 
> 
> Regards,
> 
> *Dileep V Nair*

Hi Dileep,

Sorry for the late response. DB2 should not get into the
'REMOTE_CATCHUP' phase, or the DB2 resource agent will indeed not
promote it. In my experience it usually gets into that state when the
DB2 on the standby was restarted during or after the PEER_WINDOW
timeout.

When the primary DB2 fails, the standby should end up in a state
matching the pattern on line 770 of the DB2 resource agent, and the
promote operation is then attempted.

  770  STANDBY/*PEER/DISCONNECTED|Standby/DisconnectedPeer)

https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/db2#L770

The DB2 on the standby can get restarted when the 'promote' operation
times out, so if that was the case you can try increasing the 'promote'
timeout.

So if you see that DB2 was restarted after the Primary failed, increase
the promote timeout. If DB2 was not restarted, then the question is why
DB2 decided to change the state in this way.
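
As a sketch of the promote-timeout change with pcs (the resource name
`db2_db2inst1_SAMPLE` is hypothetical, and the exact syntax varies
between pcs versions):

```shell
# Raise the promote operation timeout of the DB2 resource to 900 seconds
pcs resource update db2_db2inst1_SAMPLE op promote timeout=900s

# Check the resulting operation settings
pcs resource show db2_db2inst1_SAMPLE
```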

Let me know if the above helped.

-- 
Ondrej Faměra
@Red Hat


Re: [ClusterLabs] Issues with DB2 HADR Resource Agent

2018-02-01 Thread Dileep V Nair

Thanks Ondrej for the response. I have set the PEER_WINDOW to 1000,
which I guess is a reasonable value. What I am noticing is that it does
not wait for the PEER_WINDOW. Before that, the DB goes into a
REMOTE_CATCHUP_PENDING state and Pacemaker gives an error saying a DB in
STANDBY/REMOTE_CATCHUP_PENDING/DISCONNECTED can never be promoted.



   
Regards,

Dileep V Nair
Senior AIX Administrator
Cloud Managed Services Delivery (MSD), India
IBM Cloud
E-mail: dilen...@in.ibm.com
Outer Ring Road, Embassy Manya
Bangalore, KA 560045
India














Re: [ClusterLabs] Issues with DB2 HADR Resource Agent

2018-02-01 Thread Ondrej Famera
On 02/01/2018 05:57 PM, Dileep V Nair wrote:
> Now the second issue I am facing is that when I crash the node where DB
> is primary, the STANDBY DB is not getting promoted to PRIMARY. I could
> fix that by adding below lines in db2_promote()
> 
> 773 *)
> 774 # must take over forced
> 775 force="by force"
> 776
> 777 ;;
> 
> But I am not sure of the implications that this can cause.
> 
> Can someone suggest whether what I am doing is correct OR will this lead
> to any Data loss.


Hi Dileep,

As for the implications of 'by force', you can check the documentation
on what it entails. In short: the data can get corrupted.

https://www.ibm.com/support/knowledgecenter/SSEPGG_11.1.0/com.ibm.db2.luw.admin.cmd.doc/doc/r0011553.html#r0011553__byforce

The original 'by force peer window only' limits the takeover to the
period when DB2 is within the PEER_WINDOW, which gives a bit more
safety. (The table in the link above also explains how much safer it
is.)

Instead of changing the resource agent, I would rather suggest checking
the PEER_WINDOW and HADR_TIMEOUT variables in DB2. They determine for
how long a takeover 'by force peer window only' is possible.
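
As an illustrative sketch (SAMPLE is a placeholder database name, and
the values are examples, not recommendations), these settings can be
inspected and adjusted with the DB2 command line:

```shell
# Show the current HADR-related database configuration
db2 get db cfg for SAMPLE | grep -i hadr

# Extend the window during which 'by force peer window only' can succeed
db2 update db cfg for SAMPLE using HADR_PEER_WINDOW 300
db2 update db cfg for SAMPLE using HADR_TIMEOUT 120
```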

-- 
Ondrej Faměra
@Red Hat