Re: [Pacemaker] [Problem] lrmd detects monitor time-out by revision of the system time.
On 09/07/2014 08:12 PM, Andrew Beekhof wrote:
> On 5 Sep 2014, at 2:22 pm, renayama19661...@ybb.ne.jp wrote:
>
>> Hi All,
>>
>> We confirmed that lrmd reports a monitor time-out when the system time is changed.
>> Considering that the system time may be revised when ntpd is used, this is a serious problem.
>>
>> The problem can be reproduced with the following procedure.
>>
>> Step1) Start Pacemaker on a single node.
>>
>> [root@snmp1 ~]# start pacemaker.combined
>> pacemaker.combined start/running, process 11382
>>
>> Step2) Load a simple crm configuration.
>>
>> trac2915-3.crm
>> primitive prmDummyA ocf:pacemaker:Dummy1 \
>>         op start interval=0s timeout=60s on-fail=restart \
>>         op monitor interval=10s timeout=30s on-fail=restart \
>>         op stop interval=0s timeout=60s on-fail=block
>> group grpA prmDummyA
>> location rsc_location-grpA-1 grpA \
>>         rule $id=rsc_location-grpA-1-rule 200: #uname eq snmp1 \
>>         rule $id=rsc_location-grpA-1-rule-0 100: #uname eq snmp2
>> property $id=cib-bootstrap-options \
>>         no-quorum-policy=ignore \
>>         stonith-enabled=false \
>>         crmd-transition-delay=2s
>> rsc_defaults $id=rsc-options \
>>         resource-stickiness=INFINITY \
>>         migration-threshold=1
>> --
>>
>> [root@snmp1 ~]# crm configure load update trac2915-3.crm
>> WARNING: rsc_location-grpA-1: referenced node snmp2 does not exist
>>
>> [root@snmp1 ~]# crm_mon -1 -Af
>> Last updated: Fri Sep 5 13:09:45 2014
>> Last change: Fri Sep 5 13:09:13 2014
>> Stack: corosync
>> Current DC: snmp1 (3232238180) - partition WITHOUT quorum
>> Version: 1.1.12-561c4cf
>> 1 Nodes configured
>> 1 Resources configured
>>
>> Online: [ snmp1 ]
>>
>>  Resource Group: grpA
>>      prmDummyA (ocf::pacemaker:Dummy1): Started snmp1
>>
>> Node Attributes:
>> * Node snmp1:
>>
>> Migration summary:
>> * Node snmp1:
>>
>> Step3) Just after the monitor of the resource has started, set the clock forward by more than the monitor timeout (timeout=30s).
>>
>> [root@snmp1 ~]# date -s +40sec
>> Fri Sep 5 13:11:04 JST 2014
>>
>> Step4) The monitor times out.
>>
>> [root@snmp1 ~]# crm_mon -1 -Af
>> Last updated: Fri Sep 5 13:11:24 2014
>> Last change: Fri Sep 5 13:09:13 2014
>> Stack: corosync
>> Current DC: snmp1 (3232238180) - partition WITHOUT quorum
>> Version: 1.1.12-561c4cf
>> 1 Nodes configured
>> 1 Resources configured
>>
>> Online: [ snmp1 ]
>>
>> Node Attributes:
>> * Node snmp1:
>>
>> Migration summary:
>> * Node snmp1:
>>    prmDummyA: migration-threshold=1 fail-count=1 last-failure='Fri Sep 5 13:11:04 2014'
>>
>> Failed actions:
>>     prmDummyA_monitor_1 on snmp1 'unknown error' (1): call=7, status=Timed Out, last-rc-change='Fri Sep 5 13:11:04 2014', queued=0ms, exec=0ms
>>
>> I investigated, and the cause seems to be that, for some reason, an event fires in the g_main_loop of lrmd after a period shorter than the monitor timeout.
>
> So if you create a trivial program with g_main_loop and a timer, and then change the system time, does the timer expire early?
>
>> This problem does not seem to occur with the lrmd of PM1.0.
>
> cluster-glue was probably using custom timeout code.

Yes it was. Exactly because of this well-known problem. I think recent versions of the glib code have fixed that.

I can't tell you how many different bugs we ran into that related to timing - like this - or time wraparound. There were probably a dozen time-related bugs. Most of them weren't in the LRM code, but in the rest of the universe - like this one.

I filed the bug against glib probably 8-10 years ago. It takes a while for things to get fixed, then it takes even longer for them to get fixed in the various distros. Some of them are 5+ years behind current code.

FreeBSD had a similar problem - even with our custom code - because they weren't following POSIX. But I filed the bug against them, and they fixed it (eventually).

--
Alan Robertson
al...@unix.sh

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
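
The "time wraparound" class of bug mentioned above can be illustrated with a small sketch. It is not taken from cluster-glue or the LRM code. It assumes a free-running 32-bit tick counter (a stand-in for a value such as the one times() returns), and the helpers ticks_elapsed() and ticks_timed_out() are hypothetical names introduced only for this illustration. The point is that elapsed time is computed by unsigned subtraction instead of by comparing absolute counter values.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: ticks elapsed between two readings of a free-running
 * 32-bit tick counter. Unsigned subtraction is performed modulo 2^32, so the
 * result stays correct even if the counter wrapped once between readings. */
static uint32_t ticks_elapsed(uint32_t start, uint32_t now)
{
    return (uint32_t)(now - start);
}

/* Hypothetical helper: have at least `timeout` ticks passed since `start`? */
static int ticks_timed_out(uint32_t start, uint32_t now, uint32_t timeout)
{
    /* Assumption of this sketch: timeouts are shorter than half the counter
     * range, so an elapsed value is never ambiguous. */
    assert(timeout < UINT32_C(0x80000000));
    return ticks_elapsed(start, now) >= timeout;
}
```

Comparing the raw values instead (for example now >= start + timeout) can report an expiry too early when start + timeout wraps past the end of the counter range; subtracting first avoids that case.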
Re: [Pacemaker] [Problem] lrmd detects monitor time-out by revision of the system time.
Hi Andrew,

I checked this in various ways.
The behaviour differs depending on the glib version.

 * The problem occurs in RHEL6.x.
 * The problem does not occur in RHEL7.0.

The problem is solved in newer versions of glib.
The following glib change seems to be what solves it.

 * https://github.com/GNOME/glib/commit/91113a8aeea40cc2d7dda65b09537980bb602a06#diff-fc9b4bb280a13f8e51c51b434e7d26fd

Many users expect correct behaviour with the old glib, too.
 * Until they move to RHEL7...

Would you consider making modifications in Pacemaker to support the old version?
 * Model it on old G_() function.

Best Regards,
Hideo Yamauchi.

----- Original Message -----
> From: Andrew Beekhof and...@beekhof.net
> To: renayama19661...@ybb.ne.jp
> Cc: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
> Date: 2014/9/8, Mon 19:55
> Subject: Re: [Pacemaker] [Problem] lrmd detects monitor time-out by revision of the system time.
>
> On 8 Sep 2014, at 7:12 pm, renayama19661...@ybb.ne.jp wrote:
>
>> Hi Andrew,
>>
>>>>>> I investigated, and the cause seems to be that, for some reason, an event fires in the g_main_loop of lrmd after a period shorter than the monitor timeout.
>>>>> So if you create a trivial program with g_main_loop and a timer, and then change the system time, does the timer expire early?
>>>> Yes.
>>> That sounds like a glib bug. Ideally we'd get it fixed there rather than work-around it in pacemaker.
>>> Have you spoken to them at all?
>> No. I will investigate the glib library a little more, and I will talk with the glib community.
>> I may come back to this afterwards.
>
> Cool. I somewhat expect them to say "working as designed".
> Which would be unfortunate, but it shouldn't be too hard to work around.
Re: [Pacemaker] [Problem] lrmd detects monitor time-out by revision of the system time.
On 10 Sep 2014, at 2:48 pm, renayama19661...@ybb.ne.jp wrote:

> Hi Andrew,
>
> I checked this in various ways.
> The behaviour differs depending on the glib version.
>
>  * The problem occurs in RHEL6.x.
>  * The problem does not occur in RHEL7.0.
>
> The problem is solved in newer versions of glib.
> The following glib change seems to be what solves it.
>
>  * https://github.com/GNOME/glib/commit/91113a8aeea40cc2d7dda65b09537980bb602a06#diff-fc9b4bb280a13f8e51c51b434e7d26fd
>
> Many users expect correct behaviour with the old glib, too.
>  * Until they move to RHEL7...
>
> Would you consider making modifications in Pacemaker to support the old version?
>  * Model it on old G_() function.

I'll file a bug against glib on RHEL6 so that it gets fixed there.
Can you send me your simple reproducer program?

> Best Regards,
> Hideo Yamauchi.
Re: [Pacemaker] [Problem] lrmd detects monitor time-out by revision of the system time.
Hi Andrew,

Thank you for comments.

> I'll file a bug against glib on RHEL6 so that it gets fixed there.
> Can you send me your simple reproducer program?

I change the system time while timer_func2() is firing periodically.
The next time timer_func2() runs, the time-out of timer_func() completes before its scheduled time.

----------
#include <stdio.h>
#include <stdlib.h>
#include <glib.h>
#include <sys/times.h>

/* One-shot timer: should fire only once the long time-out has really elapsed. */
gboolean timer_func(gpointer data){
        printf("TIMER EXPIRE!\n");
        fflush(stdout);
        exit(1);
//      return FALSE;
}

/* Periodic 5-second timer: prints the current tick count to show progress. */
gboolean timer_func2(gpointer data){
        clock_t ret;
        struct tms buff;

        ret = times(&buff);     /* times() takes a pointer to struct tms */
        printf("TIMER2 EXPIRE! %ld\n", (long)ret);
        fflush(stdout);
        return TRUE;
}

int main(int argc, char** argv){
        GMainLoop *m;
        clock_t ret;
        struct tms buff;

        m = g_main_loop_new(NULL, FALSE);
        g_timeout_add(5000, timer_func2, NULL);
        /* Assumption: the archived post shows "6" here, which looks truncated.
         * 60000 is used because this time-out must be longer than the
         * 5000 ms period of timer_func2() for an early expiry to be visible. */
        g_timeout_add(60000, timer_func, NULL);
        ret = times(&buff);
        printf("START! %ld\n", (long)ret);
        g_main_loop_run(m);
        return 0;
}
----------

Many Thanks,
Hideo Yamauchi.

----- Original Message -----
> From: Andrew Beekhof and...@beekhof.net
> To: renayama19661...@ybb.ne.jp
> Cc: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
> Date: 2014/9/10, Wed 13:56
> Subject: Re: [Pacemaker] [Problem] lrmd detects monitor time-out by revision of the system time.
>
> On 10 Sep 2014, at 2:48 pm, renayama19661...@ybb.ne.jp wrote:
>
>> Hi Andrew,
>>
>> I checked this in various ways.
>> The behaviour differs depending on the glib version.
>>
>>  * The problem occurs in RHEL6.x.
>>  * The problem does not occur in RHEL7.0.
>>
>> The problem is solved in newer versions of glib.
>> The following glib change seems to be what solves it.
>>
>>  * https://github.com/GNOME/glib/commit/91113a8aeea40cc2d7dda65b09537980bb602a06#diff-fc9b4bb280a13f8e51c51b434e7d26fd
>>
>> Many users expect correct behaviour with the old glib, too.
>>  * Until they move to RHEL7...
>>
>> Would you consider making modifications in Pacemaker to support the old version?
>>  * Model it on old G_() function.
>
> I'll file a bug against glib on RHEL6 so that it gets fixed there.
> Can you send me your simple reproducer program?
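
The effect the reproducer is meant to show can also be modelled without touching the real clock. The sketch below is a simulation, not glib code: struct sim_clocks, sim_advance(), sim_step_date() and expired() are hypothetical names introduced for illustration. It assumes that the early expiry comes from measuring a timer deadline against wall-clock time, which matches what is observed in this thread but is not verified here against the glib source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a machine's two clocks, in milliseconds. */
struct sim_clocks {
    int64_t wall_ms;      /* wall-clock time: jumps when the date is set */
    int64_t monotonic_ms; /* monotonic time: advances only as real time passes */
};

/* Real time passes: both clocks advance by the same amount. */
static void sim_advance(struct sim_clocks *c, int64_t ms)
{
    assert(ms >= 0);
    c->wall_ms += ms;
    c->monotonic_ms += ms;
}

/* Someone runs `date -s` (or a time daemon steps the clock):
 * only the wall clock moves, the monotonic clock is unaffected. */
static void sim_step_date(struct sim_clocks *c, int64_t ms)
{
    c->wall_ms += ms;
}

/* A timer counts as expired once the clock it is measured against
 * has reached the deadline. */
static int expired(int64_t now_ms, int64_t deadline_ms)
{
    return now_ms >= deadline_ms;
}
```

With a 30-second time-out armed on both clocks, advancing one second with sim_advance() and then stepping the date by 40 seconds with sim_step_date() makes the wall-clock deadline report as expired, while the monotonic deadline does not expire until 30 real seconds have passed.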
Re: [Pacemaker] [Problem] lrmd detects monitor time-out by revision of the system time.
Hi Andrew,

>>>> I investigated, and the cause seems to be that, for some reason, an event fires in the g_main_loop of lrmd after a period shorter than the monitor timeout.
>>> So if you create a trivial program with g_main_loop and a timer, and then change the system time, does the timer expire early?
>> Yes.
> That sounds like a glib bug. Ideally we'd get it fixed there rather than work-around it in pacemaker.
> Have you spoken to them at all?

No. I will investigate the glib library a little more, and I will talk with the glib community.
I may come back to this afterwards.

Many Thanks,
Hideo Yamauchi.
Re: [Pacemaker] [Problem] lrmd detects monitor time-out by revision of the system time.
On 8 Sep 2014, at 7:12 pm, renayama19661...@ybb.ne.jp wrote:

> Hi Andrew,
>
>>>>> I investigated, and the cause seems to be that, for some reason, an event fires in the g_main_loop of lrmd after a period shorter than the monitor timeout.
>>>> So if you create a trivial program with g_main_loop and a timer, and then change the system time, does the timer expire early?
>>> Yes.
>> That sounds like a glib bug. Ideally we'd get it fixed there rather than work-around it in pacemaker.
>> Have you spoken to them at all?
> No. I will investigate the glib library a little more, and I will talk with the glib community.
> I may come back to this afterwards.

Cool. I somewhat expect them to say "working as designed".
Which would be unfortunate, but it shouldn't be too hard to work around.
Re: [Pacemaker] [Problem] lrmd detects monitor time-out by revision of the system time.
On 5 Sep 2014, at 2:22 pm, renayama19661...@ybb.ne.jp wrote:

> Hi All,
>
> We confirmed that lrmd reports a monitor time-out when the system time is changed.
> Considering that the system time may be revised when ntpd is used, this is a serious problem.
>
> [...]
>
> I investigated, and the cause seems to be that, for some reason, an event fires in the g_main_loop of lrmd after a period shorter than the monitor timeout.

So if you create a trivial program with g_main_loop and a timer, and then change the system time, does the timer expire early?

> This problem does not seem to occur with the lrmd of PM1.0.

cluster-glue was probably using custom timeout code.
Re: [Pacemaker] [Problem] lrmd detects monitor time-out by revision of the system time.
Hi Andrew,

Thank you for comments.

>> I investigated, and the cause seems to be that, for some reason, an event fires in the g_main_loop of lrmd after a period shorter than the monitor timeout.
> So if you create a trivial program with g_main_loop and a timer, and then change the system time, does the timer expire early?

Yes.

>> This problem does not seem to occur with the lrmd of PM1.0.
> cluster-glue was probably using custom timeout code.

I looked at the implementation of glue, too.
It seems that the time-out handling of the new lrmd needs an implementation similar to glue's.

Best Regards,
Hideo Yamauchi.
Re: [Pacemaker] [Problem] lrmd detects monitor time-out by revision of the system time.
On 8 Sep 2014, at 12:46 pm, renayama19661...@ybb.ne.jp wrote:

> Hi Andrew,
>
> Thank you for comments.
>
>>> I investigated, and the cause seems to be that, for some reason, an event fires in the g_main_loop of lrmd after a period shorter than the monitor timeout.
>> So if you create a trivial program with g_main_loop and a timer, and then change the system time, does the timer expire early?
> Yes.

That sounds like a glib bug. Ideally we'd get it fixed there rather than work-around it in pacemaker.
Have you spoken to them at all?

>>> This problem does not seem to occur with the lrmd of PM1.0.
>> cluster-glue was probably using custom timeout code.
> I looked at the implementation of glue, too.
> It seems that the time-out handling of the new lrmd needs an implementation similar to glue's.
>
> Best Regards,
> Hideo Yamauchi.
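
One way to make time-out handling independent of date changes, in the spirit of the custom timeout code discussed above, is to measure deadlines against a monotonic clock. The following is a minimal sketch under stated assumptions: a POSIX system that provides clock_gettime() with CLOCK_MONOTONIC, and hypothetical names (mono_now_ms(), struct op_deadline and its helpers) that do not come from lrmd, cluster-glue or glib. It is not how any of those projects implement their timers.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Current CLOCK_MONOTONIC reading in milliseconds. This clock is not
 * changed by setting the date, so it is safe for measuring intervals. */
static int64_t mono_now_ms(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

    assert(rc == 0);
    (void)rc; /* avoid an unused-variable warning when NDEBUG is defined */
    return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Hypothetical per-operation deadline record. */
struct op_deadline {
    int64_t deadline_ms; /* absolute deadline on the monotonic clock */
};

/* Arm the deadline `timeout_ms` milliseconds from now. */
static void op_deadline_start(struct op_deadline *d, int64_t timeout_ms)
{
    assert(timeout_ms >= 0);
    d->deadline_ms = mono_now_ms() + timeout_ms;
}

/* Has the deadline passed, measured on the monotonic clock? */
static int op_deadline_expired(const struct op_deadline *d)
{
    return mono_now_ms() >= d->deadline_ms;
}
```

A caller would arm the deadline when an operation such as a monitor starts (for example with 30000 for timeout=30s) and poll op_deadline_expired() from its event loop. On older glibc versions, clock_gettime() may require linking with -lrt.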
[Pacemaker] [Problem] lrmd detects monitor time-out by revision of the system time.
Hi All,

We confirmed that lrmd reports a monitor time-out when the system time is changed.
Considering that the system time may be revised when ntpd is used, this is a serious problem.

The problem can be reproduced with the following procedure.

Step1) Start Pacemaker on a single node.

[root@snmp1 ~]# start pacemaker.combined
pacemaker.combined start/running, process 11382

Step2) Load a simple crm configuration.

trac2915-3.crm
primitive prmDummyA ocf:pacemaker:Dummy1 \
        op start interval=0s timeout=60s on-fail=restart \
        op monitor interval=10s timeout=30s on-fail=restart \
        op stop interval=0s timeout=60s on-fail=block
group grpA prmDummyA
location rsc_location-grpA-1 grpA \
        rule $id=rsc_location-grpA-1-rule 200: #uname eq snmp1 \
        rule $id=rsc_location-grpA-1-rule-0 100: #uname eq snmp2
property $id=cib-bootstrap-options \
        no-quorum-policy=ignore \
        stonith-enabled=false \
        crmd-transition-delay=2s
rsc_defaults $id=rsc-options \
        resource-stickiness=INFINITY \
        migration-threshold=1
--

[root@snmp1 ~]# crm configure load update trac2915-3.crm
WARNING: rsc_location-grpA-1: referenced node snmp2 does not exist

[root@snmp1 ~]# crm_mon -1 -Af
Last updated: Fri Sep 5 13:09:45 2014
Last change: Fri Sep 5 13:09:13 2014
Stack: corosync
Current DC: snmp1 (3232238180) - partition WITHOUT quorum
Version: 1.1.12-561c4cf
1 Nodes configured
1 Resources configured

Online: [ snmp1 ]

 Resource Group: grpA
     prmDummyA (ocf::pacemaker:Dummy1): Started snmp1

Node Attributes:
* Node snmp1:

Migration summary:
* Node snmp1:

Step3) Just after the monitor of the resource has started, set the clock forward by more than the monitor timeout (timeout=30s).

[root@snmp1 ~]# date -s +40sec
Fri Sep 5 13:11:04 JST 2014

Step4) The monitor times out.

[root@snmp1 ~]# crm_mon -1 -Af
Last updated: Fri Sep 5 13:11:24 2014
Last change: Fri Sep 5 13:09:13 2014
Stack: corosync
Current DC: snmp1 (3232238180) - partition WITHOUT quorum
Version: 1.1.12-561c4cf
1 Nodes configured
1 Resources configured

Online: [ snmp1 ]

Node Attributes:
* Node snmp1:

Migration summary:
* Node snmp1:
   prmDummyA: migration-threshold=1 fail-count=1 last-failure='Fri Sep 5 13:11:04 2014'

Failed actions:
    prmDummyA_monitor_1 on snmp1 'unknown error' (1): call=7, status=Timed Out, last-rc-change='Fri Sep 5 13:11:04 2014', queued=0ms, exec=0ms

I investigated, and the cause seems to be that, for some reason, an event fires in the g_main_loop of lrmd after a period shorter than the monitor timeout.

This problem does not seem to occur with the lrmd of PM1.0.

Best Regards,
Hideo Yamauchi.