[tickets] [opensaf:tickets] #1526 imm: exit the 1PBE when pbeBeginTrans sees db as locked
Question: how can this happen in the 1PBE case, where only one user thread uses the sqlite instance? A related question: why and when are you observing this now? The test case or test setup must be special somehow.

With only one thread this case should be impossible, which suggests heap corruption could be the cause.

Some years ago we did see problems, although not exactly of this kind, in conjunction with repeated failovers, where the new PBE managed to start while the old PBE (on the other SC) was still executing (slow to terminate). But the distributed file-level protection uses file system locking, so the symptoms should be different.

---

** [tickets:#1526] imm: exit the 1PBE when pbeBeginTrans sees db as locked**

**Status:** review
**Milestone:** 4.5.2
**Created:** Wed Oct 07, 2015 09:43 AM UTC by Neelakanta Reddy
**Last Updated:** Thu Oct 08, 2015 07:21 AM UTC
**Owner:** Neelakanta Reddy

When the disk is full, sqlite returns an error:

Sep 18 13:42:02 SC-2 osafimmpbed: ER SQL statement ('COMMIT TRANSACTION') failed because: disk I/O error
Sep 18 13:42:02 SC-2 osafimmnd[13067]: NO Invalid error reported implementer 'OpenSafImmPBE', Ccb 321 will be aborted
Sep 18 13:42:02 SC-2 osafimmnd[13067]: NO Ccb 321 ABORTED (TraceC)
Sep 18 13:42:02 SC-2 osafimmpbed: WA Failed to find CCB object for 141/321

Because CCB operations continue (even though the disk is full), the 1PBE logs the following messages for more than 3 hours:

messages:Sep 18 17:58:46 SC-2 osafimmpbed: WA Sqlite db locked by other thread.
messages:Sep 18 17:58:46 SC-2 osafimmpbed: WA Sqlite db locked by other thread.
messages:Sep 18 17:58:47 SC-2 osafimmpbed: WA Sqlite db locked by other thread.
messages:Sep 18 17:58:47 SC-2 osafimmpbed: WA Sqlite db locked by other thread.
messages.7:Sep 18 14:22:22 SC-2 osafimmpbed: WA Sqlite db locked by other thread.
messages.7:Sep 18 14:22:23 SC-2 osafimmpbed: WA Sqlite db locked by other thread.
messages.7:Sep 18 14:22:23 SC-2 osafimmpbed: WA Sqlite db locked by other thread.
messages.7:Sep 18 14:22:24 SC-2 osafimmpbed: WA Sqlite db locked by other thread

After space is freed, the PBE is still stuck in "Sqlite db locked by other thread", which prevents any further operations. Once the PBE is killed, imm.db is regenerated and the CCB operations are applied.

Solution (1PBE):
For the 1PBE case, which is not multi-threaded, if the "sqlite db locked" case is reached, abort the PBE and let it be regenerated (instead of blocking the PBE process).

---

Sent from sourceforge.net because opensaf-tickets@lists.sourceforge.net is subscribed to https://sourceforge.net/p/opensaf/tickets/

To unsubscribe from further messages, a project admin can change settings at https://sourceforge.net/p/opensaf/admin/tickets/options. Or, if this is a mailing list, you can unsubscribe from the mailing list.

Opensaf-tickets mailing list
Opensaf-tickets@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets
[tickets] [opensaf:tickets] #1526 imm: exit the 1PBE when pbeBeginTrans sees db as locked
I looked at the code. The error message is correct, but the "lock" is the PBE "spin lock" created for handling 2PBE, not a sqlite lock. The fact that the 1PBE finds it locked means there is a logical bug somewhere in the 1PBE path, most likely an error case where commit processing bails out without correct cleanup.

---

** [tickets:#1526] imm: exit the 1PBE when pbeBeginTrans sees db as locked**

**Status:** review
**Milestone:** 4.5.2
**Created:** Wed Oct 07, 2015 09:43 AM UTC by Neelakanta Reddy
**Last Updated:** Thu Oct 08, 2015 07:43 AM UTC
**Owner:** Neelakanta Reddy
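The suspected failure mode in the comment above (a bailout from commit processing that leaves the "spin lock" set) can be sketched as follows. This is only an illustration: `struct pbe_ctx`, `pbe_begin()`, `pbe_commit()` and the stub callbacks are hypothetical names, not the actual OpenSAF PBE code. The point is that the flag is released on the failure path as well as on the success path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the PBE transaction state; not the real structure. */
struct pbe_ctx {
    bool trans_locked;   /* plays the role of the "spin lock" flag */
};

enum pbe_rc { PBE_OK, PBE_BUSY, PBE_FAILED };

/* Take the flag; report BUSY if an earlier transaction never released it. */
enum pbe_rc pbe_begin(struct pbe_ctx *ctx)
{
    assert(ctx != NULL);
    if (ctx->trans_locked)
        return PBE_BUSY;
    ctx->trans_locked = true;
    return PBE_OK;
}

/*
 * commit_fn is a hypothetical callback returning 0 on success and non-zero
 * on error (for example a disk I/O error). The flag is cleared whatever the
 * outcome, so a failed commit cannot leave the next begin seeing BUSY.
 */
enum pbe_rc pbe_commit(struct pbe_ctx *ctx, int (*commit_fn)(void))
{
    enum pbe_rc rc;

    assert(ctx != NULL && commit_fn != NULL);
    rc = (commit_fn() == 0) ? PBE_OK : PBE_FAILED;
    ctx->trans_locked = false;   /* cleanup on success and on failure */
    return rc;
}

/* Stub callbacks standing in for the database outcome. */
int commit_ok(void)        { return 0; }
int commit_disk_full(void) { return 1; }
```

With this shape, a `pbe_begin()` that follows a failed `pbe_commit()` returns `PBE_OK` again instead of reporting the database as locked indefinitely.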
[tickets] [opensaf:tickets] #1526 imm: exit the 1PBE when pbeBeginTrans sees db as locked
- **summary**: imm: abort the 1PBE when pbeBeginTrans sees db as locked --> imm: exit the 1PBE when pbeBeginTrans sees db as locked
- **status**: accepted --> review

---

** [tickets:#1526] imm: exit the 1PBE when pbeBeginTrans sees db as locked**

**Status:** review
**Milestone:** 4.5.2
**Created:** Wed Oct 07, 2015 09:43 AM UTC by Neelakanta Reddy
**Last Updated:** Wed Oct 07, 2015 09:49 AM UTC
**Owner:** Neelakanta Reddy
[tickets] [opensaf:tickets] #1529 Opensaf cluster went for reset wihle invoking failover
- **Milestone**: 4.7.RC1 --> future

---

** [tickets:#1529] Opensaf cluster went for reset wihle invoking failover**

**Status:** unassigned
**Milestone:** future
**Created:** Thu Oct 08, 2015 07:53 AM UTC by Chani Srivastava
**Last Updated:** Thu Oct 08, 2015 07:53 AM UTC
**Owner:** nobody
**Attachments:**

- [SC1_syslog.txt](https://sourceforge.net/p/opensaf/tickets/1529/attachment/SC1_syslog.txt) (436.4 kB; text/plain)
- [SC2_syslog.txt](https://sourceforge.net/p/opensaf/tickets/1529/attachment/SC2_syslog.txt) (425.6 kB; text/plain)

Setup:
Changeset-6901
Invoked continuous failovers on a 4-node cluster with 2 controllers and 2 payloads. All nodes have 64-bit architecture.
2PBE enabled with 25K objects

Issue Observed:
Cluster reset occurred on invoking continuous failovers

Attachments:
Attaching syslogs for SC-1 and SC-2.
Traces for immnd and immd can be shared separately if required.

Steps:

* Initially SC-1 is active and SC-2 standby
* A test script invoked failover by killing osafclmd on SC-1
* SC-2 became active

Oct 7 18:23:32 OSAF-SC1 root: killing osafclmd from invoke_failover.sh
Oct 7 19:25:20 OSAF-SC2 osafamfd[2191]: NO FAILOVER StandBy --> Active

* On the new active controller, saImmOiInitialize_2 failed

Oct 7 19:25:22 OSAF-SC2 osafntfimcnd[2735]: ER ntfimcn_imm_init saImmOiInitialize_2 failed SA_AIS_ERR_TIMEOUT (5)
Oct 7 19:25:22 OSAF-SC2 osafntfimcnd[2735]: ER ntfimcn_imm_init() Fail
Oct 7 19:25:22 OSAF-SC2 osafimmnd[2131]: NO Implementer connected: 333 (safLckService) <299, 2020f>
Oct 7 19:25:22 OSAF-SC2 osafimmnd[2131]: NO Implementer connected: 334 (safEvtService) <298, 2020f>
Oct 7 19:25:23 OSAF-SC2 osafntfimcnd[2738]: ER ntfimcn_imm_init saImmOiInitialize_2 failed SA_AIS_ERR_TIMEOUT (5)
Oct 7 19:25:23 OSAF-SC2 osafntfimcnd[2738]: ER ntfimcn_imm_init() Fail
Oct 7 19:25:23 OSAF-SC2 osafimmnd[2131]: WA MDS Send Failed
Oct 7 19:25:23 OSAF-SC2 osafimmnd[2131]: WA Error code 2 returned for message type 4 - ignoring

* Other services also fail to initialize with IMM on the new active controller, i.e. SC-2
* And finally SMF had a CSI set timeout
* SC-2 went for reboot and hence the entire cluster reset, as SC-2 was the only active controller at the time

Oct 7 19:25:51 OSAF-SC2 osafamfnd[2205]: NO 'safComp=SMF,safSu=SC-2,safSg=2N,safApp=OpenSAF' faulted due to 'csiSetcallbackTimeout' : Recovery is 'nodeFailfast'
Oct 7 19:25:51 OSAF-SC2 osafamfnd[2205]: ER safComp=SMF,safSu=SC-2,safSg=2N,safApp=OpenSAF Faulted due to:csiSetcallbackTimeout Recovery is:nodeFailfast
Oct 7 19:25:51 OSAF-SC2 osafamfnd[2205]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131599, SupervisionTime = 60
Oct 7 19:25:51 OSAF-SC2 opensaf_reboot: Rebooting local node; timeout=60
[tickets] [opensaf:tickets] #1526 imm: exit the 1PBE when pbeBeginTrans sees db as locked
I guess it could be that the PBE-level message "Sqlite db locked by other thread" is plain wrong, i.e. misleading.

---

** [tickets:#1526] imm: exit the 1PBE when pbeBeginTrans sees db as locked**

**Status:** review
**Milestone:** 4.5.2
**Created:** Wed Oct 07, 2015 09:43 AM UTC by Neelakanta Reddy
**Last Updated:** Thu Oct 08, 2015 07:35 AM UTC
**Owner:** Neelakanta Reddy
[tickets] [opensaf:tickets] #1529 Opensaf cluster went for reset wihle invoking failover
---

** [tickets:#1529] Opensaf cluster went for reset wihle invoking failover**

**Status:** unassigned
**Milestone:** 4.7.RC1
**Created:** Thu Oct 08, 2015 07:53 AM UTC by Chani Srivastava
**Last Updated:** Thu Oct 08, 2015 07:53 AM UTC
**Owner:** nobody
**Attachments:**

- [SC1_syslog.txt](https://sourceforge.net/p/opensaf/tickets/1529/attachment/SC1_syslog.txt) (436.4 kB; text/plain)
- [SC2_syslog.txt](https://sourceforge.net/p/opensaf/tickets/1529/attachment/SC2_syslog.txt) (425.6 kB; text/plain)
[tickets] [opensaf:tickets] #1526 imm: 1PBE can see db as locked
- **summary**: imm: exit the 1PBE when pbeBeginTrans sees db as locked --> imm: 1PBE can see db as locked
- **Comment**:

Changed ticket slogan to describe the problem.

---

** [tickets:#1526] imm: 1PBE can see db as locked**

**Status:** review
**Milestone:** 4.5.2
**Created:** Wed Oct 07, 2015 09:43 AM UTC by Neelakanta Reddy
**Last Updated:** Thu Oct 08, 2015 07:55 AM UTC
**Owner:** Neelakanta Reddy
[tickets] [opensaf:tickets] #1519 log: The log notice message should be informative
- **Comment**:

Also, it would be better to use PRIx64 when printing MDS addresses. The output would be more readable, I guess.

---

** [tickets:#1519] log: The log notice message should be informative**

**Status:** unassigned
**Milestone:** 5.0
**Created:** Tue Oct 06, 2015 09:18 AM UTC by Neelakanta Reddy
**Last Updated:** Thu Oct 08, 2015 11:07 AM UTC
**Owner:** nobody

During an application upgrade the following is observed. Here the NO (notice) level could be moved to IN or trace.

Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683243x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683243x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683243x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683243x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683240x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683240x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683240x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683240x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683240x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683240x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683240x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683240x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683240x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683240x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683240x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683240x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683240x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683237x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683237x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683237x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683237x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683237x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683237x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683237x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683237x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683237x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683237x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683237x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683237x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683237x FAILED: 2
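As a small illustration of the PRIx64 suggestion in the comment on #1519: the log lines above appear to print the destination as a decimal number followed by `x`. The sketch below formats a 64-bit value as hexadecimal with the standard `PRIx64` macro from `<inttypes.h>`. The function name `format_mds_dest()` is hypothetical, and treating the MDS address as a plain `uint64_t` is an assumption made for the example.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Write dest into buf as lowercase hex without a prefix.
 * Returns what snprintf returns: the number of characters that the full
 * output needs, excluding the terminating NUL.
 */
int format_mds_dest(char *buf, size_t len, uint64_t dest)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%" PRIx64, dest);
}
```

A log call would then pass the formatted buffer, or use `"%" PRIx64` directly in its own format string. Using the macro rather than a hand-written length modifier keeps the format correct on both 32-bit and 64-bit builds.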
[tickets] [opensaf:tickets] #1518 AMF: SU presence state transition is not correct during admin op restart
- **Milestone**: 4.5.2 --> 4.7.RC1

---

** [tickets:#1518] AMF: SU presence state transition is not correct during admin op restart**

**Status:** unassigned
**Milestone:** 4.7.RC1
**Created:** Tue Oct 06, 2015 07:41 AM UTC by Quyen Dao
**Last Updated:** Thu Oct 08, 2015 10:16 AM UTC
**Owner:** nobody

From my observation, the SU and component presence state transitions during an admin restart of an SU (with all restartable components) are as below:

su: INSTANTIATED => RESTARTING => INSTANTIATING => INSTANTIATED
component: INSTANTIATED => RESTARTING => INSTANTIATED

In my opinion, the presence state transition of the SU should be the same as that of the component in this case: INSTANTIATED => RESTARTING => INSTANTIATED

According to AIS-AMF-B.04.01, Table 5 "Presence State of Components of a Service Unit" (page 74), if all components are RESTARTING, then the SU should be RESTARTING (not INSTANTIATING) as well.

**syslog for su presence state transition**

Oct 6 12:24:50 PL-3 osafamfnd[418]: NO Admin Restart request for 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Oct 6 12:24:50 PL-3 osafamfnd[418]: NO 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State INSTANTIATED => RESTARTING
Oct 6 12:24:50 PL-3 amf_demo[757]: Terminating
Oct 6 12:24:50 PL-3 osafamfnd[418]: NO 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State RESTARTING => INSTANTIATING
Oct 6 12:24:50 PL-3 amf_demo[985]: 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' started
Oct 6 12:24:50 PL-3 osafamfnd[418]: NO 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State INSTANTIATING => INSTANTIATED
Oct 6 12:24:50 PL-3 osafamfnd[418]: NO Assigning 'safSi=AmfDemo,safApp=AmfDemo1' ACTIVE to 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Oct 6 12:24:50 PL-3 amf_demo[985]: Registered with AMF and HC started
Oct 6 12:24:50 PL-3 amf_demo[985]: CSI Set - add 'safCsi=AmfDemo,safSi=AmfDemo,safApp=AmfDemo1' HAState Active
Oct 6 12:24:50 PL-3 osafamfnd[418]: NO Assigned 'safSi=AmfDemo,safApp=AmfDemo1' ACTIVE to 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
[tickets] [opensaf:tickets] #1517 AMF : si-si deps not followed during SU restart
- **Milestone**: 4.5.2 --> 4.7.RC1

---

** [tickets:#1517] AMF : si-si deps not followed during SU restart**

**Status:** unassigned
**Milestone:** 4.7.RC1
**Created:** Tue Oct 06, 2015 06:55 AM UTC by Srikanth R
**Last Updated:** Tue Oct 06, 2015 07:12 AM UTC
**Owner:** nobody

Changeset : 6901
Application : 2n ( 5 SIs with SI1 as sponsor for the remaining SIs )

During admin restart of an SU (hosting restartable components), si-si deps are not followed. Below is the syslog from the node hosting the active SU. The component hosting the active SI delayed the CSI active assignment by 5 seconds, but the dependent SIs were assigned while the sponsor was not yet completely assigned.

Oct 5 03:37:10 SYSTEST-PLD-1 osafamfnd[25306]: NO Admin Restart request for 'safSu=TestApp_SU1,safSg=TestApp_SG1,safApp=TestApp_TwoN'
Oct 5 03:37:10 SYSTEST-PLD-1 osafamfnd[25306]: NO 'safSu=TestApp_SU1,safSg=TestApp_SG1,safApp=TestApp_TwoN' Presence State INSTANTIATED => RESTARTING
Oct 5 03:37:10 SYSTEST-PLD-1 osafamfnd[25306]: NO 'safSu=TestApp_SU1,safSg=TestApp_SG1,safApp=TestApp_TwoN' Presence State RESTARTING => INSTANTIATING
Oct 5 03:37:10 SYSTEST-PLD-1 osafamfnd[25306]: NO 'safSu=TestApp_SU1,safSg=TestApp_SG1,safApp=TestApp_TwoN' Presence State INSTANTIATING => INSTANTIATED
Oct 5 03:37:10 SYSTEST-PLD-1 osafamfnd[25306]: NO Assigning 'safSi=TestApp_SI1,safApp=TestApp_TwoN' ACTIVE to 'safSu=TestApp_SU1,safSg=TestApp_SG1,safApp=TestApp_TwoN'
Oct 5 03:37:10 SYSTEST-PLD-1 osafamfnd[25306]: NO Assigning 'safSi=TestApp_SI2,safApp=TestApp_TwoN' ACTIVE to 'safSu=TestApp_SU1,safSg=TestApp_SG1,safApp=TestApp_TwoN'
Oct 5 03:37:10 SYSTEST-PLD-1 osafamfnd[25306]: NO Assigning 'safSi=TestApp_SI3,safApp=TestApp_TwoN' ACTIVE to 'safSu=TestApp_SU1,safSg=TestApp_SG1,safApp=TestApp_TwoN'
Oct 5 03:37:10 SYSTEST-PLD-1 osafamfnd[25306]: NO Assigning 'safSi=TestApp_SI4,safApp=TestApp_TwoN' ACTIVE to 'safSu=TestApp_SU1,safSg=TestApp_SG1,safApp=TestApp_TwoN'
Oct 5 03:37:10 SYSTEST-PLD-1 osafamfnd[25306]: NO Assigning 'safSi=TestApp_SI5,safApp=TestApp_TwoN' ACTIVE to 'safSu=TestApp_SU1,safSg=TestApp_SG1,safApp=TestApp_TwoN'
Oct 5 03:37:10 SYSTEST-PLD-1 osafamfnd[25306]: NO Assigned 'safSi=TestApp_SI3,safApp=TestApp_TwoN' ACTIVE to 'safSu=TestApp_SU1,safSg=TestApp_SG1,safApp=TestApp_TwoN'
Oct 5 03:37:10 SYSTEST-PLD-1 osafamfnd[25306]: NO Assigned 'safSi=TestApp_SI2,safApp=TestApp_TwoN' ACTIVE to 'safSu=TestApp_SU1,safSg=TestApp_SG1,safApp=TestApp_TwoN'
Oct 5 03:37:10 SYSTEST-PLD-1 osafamfnd[25306]: NO Assigned 'safSi=TestApp_SI4,safApp=TestApp_TwoN' ACTIVE to 'safSu=TestApp_SU1,safSg=TestApp_SG1,safApp=TestApp_TwoN'
Oct 5 03:37:10 SYSTEST-PLD-1 osafamfnd[25306]: NO Assigned 'safSi=TestApp_SI5,safApp=TestApp_TwoN' ACTIVE to 'safSu=TestApp_SU1,safSg=TestApp_SG1,safApp=TestApp_TwoN'
Oct 5 03:37:15 SYSTEST-PLD-1 osafamfnd[25306]: NO Assigned 'safSi=TestApp_SI1,safApp=TestApp_TwoN' ACTIVE to 'safSu=TestApp_SU1,safSg=TestApp_SG1,safApp=TestApp_TwoN'
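The ordering that ticket #1517 expects can be stated as a small predicate: a dependent SI may start its assignment only after its sponsor is fully assigned. The sketch below is a simplified model with hypothetical names (`struct si`, `si_may_start_assignment()`) and a single sponsor per SI; it does not reflect how AMF actually stores dependencies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum si_assign { SI_UNASSIGNED, SI_ASSIGNING, SI_ASSIGNED };

/* Simplified SI with at most one sponsor (an assumption for this example). */
struct si {
    enum si_assign state;
    const struct si *sponsor;   /* NULL when the SI has no sponsor */
};

/* True when the SI has no sponsor, or its sponsor has completed assignment. */
bool si_may_start_assignment(const struct si *s)
{
    assert(s != NULL);
    return s->sponsor == NULL || s->sponsor->state == SI_ASSIGNED;
}
```

In the syslog above, SI2 to SI5 would fail this check at 03:37:10, because SI1 only reaches the assigned state at 03:37:15.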
[tickets] [opensaf:tickets] #1531 LOG: Improve handling of OI implementer create and restore
---

** [tickets:#1531] LOG: Improve handling of OI implementer create and restore**

**Status:** assigned
**Milestone:** 5.0
**Created:** Thu Oct 08, 2015 11:14 AM UTC by elunlen
**Last Updated:** Thu Oct 08, 2015 11:14 AM UTC
**Owner:** elunlen

Setting up the IMM Object Implementer (OI) is not handled in a correct and consistent way. There is also a lot of redundant code. In some places (but not everywhere) a thread is used to prevent the main thread from ‘hanging’ for a long time, but this is done in an incorrect way. See also ticket [#1527].

Create a ‘handler’ for setting up the OI. The ‘handler’ must handle a very long setup time (up to 1 min). This means that at least part of this must be done in a separate thread so that the log server main thread never ‘hangs’ for this long.

An OI is set up when the active log server starts, at HA state change, and if the OI handle is lost (IMM error recovery).

Setting up an OI includes:
saImmOiInitialize_2()
saImmOiSelectionObjectGet()

The following may take a long time:
saImmOiImplementerSet()
saImmOiClassImplementerSet() for "OpenSafLogConfig" class
saImmOiClassImplementerSet() for "SaLogStreamConfig" class

and:
saImmOiRtObjectCreate_2() of "OpenSafLogCurrentConfig" object
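One possible shape for the ‘handler’ described in #1531 is a step table that is run in order, retrying a step while it reports a temporary condition. The sketch below is an assumption-laden illustration: `oi_setup_run()`, `enum step_rc` and the stub steps are hypothetical, the real IMM calls are not invoked, and the worker thread that the ticket asks for is not shown. In the log server, a function like this would be called from that separate thread so the main thread is never blocked.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical step result; a real handler would map IMM return codes onto it. */
enum step_rc { STEP_OK, STEP_TRY_AGAIN, STEP_FAIL };

typedef enum step_rc (*oi_step_fn)(void *arg);

/*
 * Run the steps in order. A step that reports STEP_TRY_AGAIN is retried until
 * it has been attempted max_attempts times; pause_fn (may be NULL) is called
 * between attempts. Returns the number of completed steps, so a return value
 * of n means the whole setup succeeded.
 */
size_t oi_setup_run(const oi_step_fn *steps, size_t n, void *arg,
                    unsigned max_attempts, void (*pause_fn)(void))
{
    assert(steps != NULL && max_attempts > 0);
    for (size_t i = 0; i < n; i++) {
        enum step_rc rc = STEP_TRY_AGAIN;
        for (unsigned a = 0; a < max_attempts && rc == STEP_TRY_AGAIN; a++) {
            if (a > 0 && pause_fn != NULL)
                pause_fn();
            rc = steps[i](arg);
        }
        if (rc != STEP_OK)
            return i;   /* index of the step that did not complete */
    }
    return n;
}

/* Stub steps for illustration only. */
enum step_rc step_always_ok(void *arg) { (void)arg; return STEP_OK; }

/* Succeeds on the third call; arg points to an unsigned call counter. */
enum step_rc step_ok_on_third(void *arg)
{
    unsigned *calls = arg;
    return ++*calls >= 3 ? STEP_OK : STEP_TRY_AGAIN;
}
```

Keeping the steps in one table is a way to address the redundancy the ticket mentions: startup, HA state change and handle recovery could all call the same routine.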
[tickets] [opensaf:tickets] #1518 AMF: SU presence state transition is not correct during admin op restart
On page 74 for Prescence state of Component spec says, when a restartable component is instantiated, it will move from restarting to Instantiated state. Thus for a restartable SU, when last comp is started terminating, SU will move to RESTARTING state. At this point of time all the component are in RESTARTING state and SU is also in RESTARTING state. Now AMF will instantiate first component and after successful instantiation this component will move to INSTANTIATED state. So now for marking the presence state of SU. AMF has following information: -one component is in INSTANTIATED state. -All the remaining component are in RESTARTING state waiting for instantiation. If we refer to Page74 table and draw the state, AMF must mark the su INSTANTIATED because only in this state components can be in RESTARTING state with atleast one component in INSTANTIATED state. Now two problem comes: 1) AMF has to send state change notification for presence state as SU is in INSTANTIATED state. If AMF sends it now then a user application must not assume that all components are restarted when some of them are still in restarting state. If AMF delays it then there will be time diffenrence when SU was marked instantiated and the notification was sent. 2) same is the problem for repyling to IMM for the status of admin operation. I think it is ok to delay both, notification and replying to IMM client, as presence state of component and SU are not directly exposed to the application. 
---

** [tickets:#1518] AMF: SU presence state transition is not correct during admin op restart**

**Status:** unassigned
**Milestone:** 4.5.2
**Created:** Tue Oct 06, 2015 07:41 AM UTC by Quyen Dao
**Last Updated:** Tue Oct 06, 2015 07:41 AM UTC
**Owner:** nobody

From my observation, the SU and component presence state transitions during an admin restart of an SU (with all restartable components) are as below:

su: INSTANTIATED => RESTARTING => INSTANTIATING => INSTANTIATED
component: INSTANTIATED => RESTARTING => INSTANTIATED

In my opinion, the presence state transition of the SU should be the same as that of the component in this case:
INSTANTIATED => RESTARTING => INSTANTIATED

According to AIS-AMF-B.04.01, Table 5 "Presence State of Components of a Service Unit", page 74, if all components are RESTARTING, then the SU should be RESTARTING (not INSTANTIATING) as well.

**syslog for su presence state transition**

Oct 6 12:24:50 PL-3 osafamfnd[418]: NO Admin Restart request for 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Oct 6 12:24:50 PL-3 osafamfnd[418]: NO 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State INSTANTIATED => RESTARTING
Oct 6 12:24:50 PL-3 amf_demo[757]: Terminating
Oct 6 12:24:50 PL-3 osafamfnd[418]: NO 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State RESTARTING => INSTANTIATING
Oct 6 12:24:50 PL-3 amf_demo[985]: 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' started
Oct 6 12:24:50 PL-3 osafamfnd[418]: NO 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State INSTANTIATING => INSTANTIATED
Oct 6 12:24:50 PL-3 osafamfnd[418]: NO Assigning 'safSi=AmfDemo,safApp=AmfDemo1' ACTIVE to 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Oct 6 12:24:50 PL-3 amf_demo[985]: Registered with AMF and HC started
Oct 6 12:24:50 PL-3 amf_demo[985]: CSI Set - add 'safCsi=AmfDemo,safSi=AmfDemo,safApp=AmfDemo1' HAState Active
Oct 6 12:24:50 PL-3 osafamfnd[418]: NO Assigned 'safSi=AmfDemo,safApp=AmfDemo1' ACTIVE to 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
[tickets] [opensaf:tickets] #1530 AMF : TwoN, SU struck in terminating during sponsor si lock ( csi quiescd timeout)
The configuration to create the AMF application is attached.

Attachments:
- [1530_twon.sh](https://sourceforge.net/p/opensaf/tickets/_discuss/thread/022eb6bb/24a1/attachment/1530_twon.sh) (9.7 kB; application/x-shellscript)

---

** [tickets:#1530] AMF : TwoN, SU struck in terminating during sponsor si lock ( csi quiescd timeout)**

**Status:** unassigned
**Milestone:** 4.5.2
**Created:** Thu Oct 08, 2015 09:52 AM UTC by Srikanth R
**Last Updated:** Thu Oct 08, 2015 09:52 AM UTC
**Owner:** nobody

Changeset : 6901
Application : 2N, two SUs, 4 SIs (1 sponsor SI for the remaining 3 SIs)

Issue : SU stuck in terminating during sponsor SI lock, when the sponsor rejected the quiesced assignment.

Steps :
* Initially all the SIs are in fully assigned state.
* Invoked the lock of the sponsor SI, i.e. SI1.
* When the quiesced callback was invoked, the component did not respond to the callback.

Oct 8 15:10:30 SYSTEST-PLD-1 osafamfnd[2645]: NO Assigning 'safSi=TestApp_SI1,safApp=TestApp_TwoN' QUIESCED to 'safSu=TestApp_SU1,safSg=TestApp_SG1,safApp=TestApp_TwoN'
Oct 8 15:10:40 SYSTEST-PLD-1 osafamfnd[2645]: NO Performing failover of 'safSu=TestApp_SU1,safSg=TestApp_SG1,safApp=TestApp_TwoN' (SU failover count: 1)
Oct 8 15:10:40 SYSTEST-PLD-1 osafamfnd[2645]: NO 'safComp=COMP1,safSu=TestApp_SU1,safSg=TestApp_SG1,safApp=TestApp_TwoN' faulted due to 'csiSetcallbackTimeout' : Recovery is 'componentFailover'
Oct 8 15:10:40 SYSTEST-PLD-1 osafamfnd[2645]: NO 'safSu=TestApp_SU1,safSg=TestApp_SG1,safApp=TestApp_TwoN' Presence State INSTANTIATED => TERMINATING

* All the assignments were removed as part of the SI lock operation, but the SU was not repaired. It got stuck in the terminating and disabled state.
* A further unlock of the sponsor SI resulted in the other SU getting the active assignment.
[tickets] [opensaf:tickets] #1519 log: The log notice message should be informative
It also looks like there is a superfluous "x" character printed after the MDS destination. The "x" character after PRIu64 should be removed:

LOG_NO("Send of WRITE ack to %"PRIu64"x FAILED: %u", mds_dest, rc);

---

** [tickets:#1519] log: The log notice message should be informative**

**Status:** unassigned
**Milestone:** 4.7.RC1
**Created:** Tue Oct 06, 2015 09:18 AM UTC by Neelakanta Reddy
**Last Updated:** Tue Oct 06, 2015 09:18 AM UTC
**Owner:** nobody

During an application upgrade the following is observed. Here the NO (notice) level can be moved to IN or trace.

Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683243x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683243x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683243x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683243x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683240x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683240x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683240x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683240x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683240x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683240x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683240x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683240x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683240x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683240x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683240x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683240x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683240x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683237x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683237x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683237x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683237x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683237x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683237x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683237x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683237x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683237x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683237x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683237x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683237x FAILED: 2
Oct 6 14:22:36 SLES1 osaflogd[30996]: NO Send of WRITE ack to 567414949683237x FAILED: 2
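For the stray "x" mentioned in the comment on this ticket: PRIu64 already expands to a complete decimal conversion, so the "x" that follows it is printed as a literal character. A small sketch of the corrected format string is below. It uses plain snprintf() instead of LOG_NO so that it stands alone, and the helper names are hypothetical. If hexadecimal output was the original intent, PRIx64 would be the conversion to use; that is only a guess at the intent, not something stated in the ticket.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Decimal MDS destination, no trailing "x" (the fix proposed in the comment). */
static int format_ack_failure(char *buf, size_t len, uint64_t mds_dest, unsigned rc)
{
    return snprintf(buf, len, "Send of WRITE ack to %" PRIu64 " FAILED: %u",
                    mds_dest, rc);
}

/* Alternative, assuming hexadecimal output was what the "x" was meant to give. */
static int format_ack_failure_hex(char *buf, size_t len, uint64_t mds_dest, unsigned rc)
{
    return snprintf(buf, len, "Send of WRITE ack to %" PRIx64 " FAILED: %u",
                    mds_dest, rc);
}
```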
[tickets] [opensaf:tickets] #1531 LOG: Improve handling of OI implementer create and restore
Will fix [#585]

---

** [tickets:#1531] LOG: Improve handling of OI implementer create and restore**

**Status:** assigned
**Milestone:** 5.0
**Created:** Thu Oct 08, 2015 11:14 AM UTC by elunlen
**Last Updated:** Thu Oct 08, 2015 11:14 AM UTC
**Owner:** elunlen

Setting up the IMM Object Implementer (OI) is not handled in a correct and consistent way. There is also a lot of redundant code. In some places (but not everywhere) a thread is used to prevent the main thread from ‘hanging’ for a long time, but this is done in an incorrect way. See also ticket [#1527].

Create a ‘handler’ for setting up the OI. The ‘handler’ must handle a very long setup time (up to 1 min). This means that at least part of the setup must be done in a separate thread so that the log server main thread never ‘hangs’ for this long.

An OI is set up when the active log server starts, at HA state change, and if the OI handle is lost (IMM error recovery).

Setting up an OI includes:
saImmOiInitialize_2()
saImmOiSelectionObjectGet()

The following may take a long time:
saImmOiImplementerSet()
saImmOiClassImplementerSet() for "OpenSafLogConfig" class
saImmOiClassImplementerSet() for "SaLogStreamConfig" class
and:
saImmOiRtObjectCreate_2() of "OpenSafLogCurrentConfig" object
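As an illustration of the 'handler' idea in the ticket description, here is a rough sketch of running the slow part of the OI setup in a separate thread so that the caller returns immediately. It is not the logsv implementation: `slow_oi_setup()` is a hypothetical stand-in for the IMM calls listed in the ticket (no real IMM API is called here), and the `oi_handler_t` type and function names are invented for this example.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the slow IMM OI calls listed in the ticket. */
static bool slow_oi_setup(void)
{
    /* A real handler would call the saImmOi...() functions here. */
    return true;
}

/* State shared between the main thread and the setup thread. */
typedef struct {
    pthread_mutex_t lock;
    bool in_progress; /* a setup thread is currently running */
    bool ready;       /* the last setup attempt succeeded */
} oi_handler_t;

static oi_handler_t g_oi = { PTHREAD_MUTEX_INITIALIZER, false, false };

static void *oi_setup_thread(void *arg)
{
    oi_handler_t *h = arg;
    bool ok = slow_oi_setup(); /* may take up to a minute in the real case */

    pthread_mutex_lock(&h->lock);
    h->ready = ok;
    h->in_progress = false;
    pthread_mutex_unlock(&h->lock);
    return NULL;
}

/*
 * Called from the main thread. Never blocks on the OI setup itself.
 * Returns 1 if a setup thread was started, 0 if one is already running,
 * and -1 if the thread could not be created.
 */
static int oi_handler_start(oi_handler_t *h, pthread_t *tid)
{
    pthread_mutex_lock(&h->lock);
    if (h->in_progress) {
        pthread_mutex_unlock(&h->lock);
        return 0;
    }
    h->in_progress = true;
    h->ready = false;
    pthread_mutex_unlock(&h->lock);

    if (pthread_create(tid, NULL, oi_setup_thread, h) != 0) {
        pthread_mutex_lock(&h->lock);
        h->in_progress = false;
        pthread_mutex_unlock(&h->lock);
        return -1;
    }
    return 1;
}

/* Non-blocking check that the main thread can use from its event loop. */
static bool oi_handler_is_ready(oi_handler_t *h)
{
    pthread_mutex_lock(&h->lock);
    bool ready = h->ready;
    pthread_mutex_unlock(&h->lock);
    return ready;
}
```

The same entry point could be used for all three triggers named in the ticket (server start, HA state change, lost OI handle), which is one way to get rid of the redundant code. How the thread reports completion to the main poll loop is left open in this sketch.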
[tickets] [opensaf:tickets] #1516 AMF : si-swap operation in nway should honor saAmfSGMaxStandbySIsperSU
- **Milestone**: 4.5.2 --> 4.7.RC1

---

** [tickets:#1516] AMF : si-swap operation in nway should honor saAmfSGMaxStandbySIsperSU **

**Status:** unassigned
**Milestone:** 4.7.RC1
**Created:** Mon Oct 05, 2015 12:59 PM UTC by Srikanth R
**Last Updated:** Wed Oct 07, 2015 04:29 AM UTC
**Owner:** nobody

While processing an si-swap operation in the N-way model, AMF should also take standby assignments into account. If saAmfSGMaxStandbySIsperSU cannot be met for the new SU, the si-swap operation should be rejected. In the current case, the si-swap operation proceeds and the SI is degraded from the fully assigned to the partially assigned state.

Initial state of SG :

| | TestApp_SI1 | TestApp_SI2 | TestApp_SI3 | TestApp_SI4 | TestApp_SI5 |
|---|---|---|---|---|---|
| TestApp_SU1 | ACTIVE | ACTIVE | ACTIVE | STANDBY | |
| TestApp_SU2 | STANDBY | | | ACTIVE | ACTIVE |
| TestApp_SU3 | | STANDBY | | | |

After invoking the si-swap operation on SI2, SI2 moved to the partially assigned state.

| | TestApp_SI1 | TestApp_SI2 | TestApp_SI3 | TestApp_SI4 | TestApp_SI5 |
|---|---|---|---|---|---|
| TestApp_SU1 | ACTIVE | | ACTIVE | STANDBY | |
| TestApp_SU2 | STANDBY | | | ACTIVE | ACTIVE |
| TestApp_SU3 | | ACTIVE | STANDBY | | |

Configuration of SG is

saAmfSGNumPrefStandbySUs SA_UINT32_T 1 (0x1)
saAmfSGNumPrefInserviceSUs SA_UINT32_T 3 (0x3)
saAmfSGNumPrefAssignedSUs SA_UINT32_T 3 (0x3)
saAmfSGNumPrefActiveSUs SA_UINT32_T 3 (0x3)
saAmfSGNumCurrNonInstantiatedSpareSUs SA_UINT32_T 0 (0x0)
saAmfSGNumCurrInstantiatedSpareSUs SA_UINT32_T 0 (0x0)
saAmfSGNumCurrAssignedSUs SA_UINT32_T 3 (0x3)
saAmfSGMaxStandbySIsperSU SA_UINT32_T 1 (0x1)
saAmfSGMaxActiveSIsperSU SA_UINT32_T 3 (0x3)
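The check asked for in this ticket could look roughly like the sketch below. It is not AMF code: the types and the function `si_swap_allowed` are hypothetical, and it assumes that on si-swap the SU currently holding the active assignment has to take over the standby assignment, which is what the reported state suggests (SU1 already has its one allowed standby, so SI2 ends up without a standby).

```c
#include <stdbool.h>
#include <stddef.h>

#define NUM_SI 5

/* Hypothetical HA state of one SI on one SU. */
typedef enum { HA_NONE, HA_ACTIVE, HA_STANDBY } ha_t;

/* Number of standby assignments an SU currently holds. */
static unsigned count_standby(const ha_t su[NUM_SI])
{
    unsigned n = 0;
    for (size_t i = 0; i < NUM_SI; i++) {
        if (su[i] == HA_STANDBY)
            n++;
    }
    return n;
}

/*
 * Assumption: after the swap, the SU that is currently active for SI
 * number 'si' becomes standby for it. The swap is allowed only if that
 * SU stays within max_standby_per_su (saAmfSGMaxStandbySIsperSU).
 */
static bool si_swap_allowed(const ha_t active_su[NUM_SI], size_t si,
                            unsigned max_standby_per_su)
{
    if (si >= NUM_SI || active_su[si] != HA_ACTIVE)
        return false;
    return count_standby(active_su) + 1 <= max_standby_per_su;
}
```

Applied to the initial state in the ticket, TestApp_SU1 holds ACTIVE, ACTIVE, ACTIVE, STANDBY for SI1 to SI4, so a swap of SI2 with saAmfSGMaxStandbySIsperSU = 1 would be rejected instead of leaving SI2 partially assigned.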
[tickets] [opensaf:tickets] #1532 AMF : SI is in assigned state, after shutdown operation of SI ( quiescing reject scenario)
---

** [tickets:#1532] AMF : SI is in assigned state, after shutdown operation of SI ( quiescing reject scenario)**

**Status:** unassigned
**Milestone:** 4.5.2
**Created:** Thu Oct 08, 2015 11:20 AM UTC by Srikanth R
**Last Updated:** Thu Oct 08, 2015 11:20 AM UTC
**Owner:** nobody

Changeset : 6901
Application : 2N (two SUs and 4 SIs, with SI1 as sponsor for the remaining SIs)

Steps :
* Initially all the SIs are in assigned state.
* Invoked the shutdown operation on one of the dependent SIs, i.e. SI2.
* For the quiescing callback, the component responded with FAILED_OP.

Oct 8 16:27:20 SYSTEST-PLD-1 osafamfnd[4535]: NO Assigning 'safSi=TestApp_SI2,safApp=TestApp_TwoN' QUIESCING to 'safSu=TestApp_SU1,safSg=TestApp_SG1,safApp=TestApp_TwoN'
Oct 8 16:27:30 SYSTEST-PLD-1 osafamfnd[4535]: NO Performing failover of 'safSu=TestApp_SU1,safSg=TestApp_SG1,safApp=TestApp_TwoN' (SU failover count: 2)
Oct 8 16:27:30 SYSTEST-PLD-1 osafamfnd[4535]: NO 'safComp=COMP2,safSu=TestApp_SU1,safSg=TestApp_SG1,safApp=TestApp_TwoN' faulted due to 'csiSetcallbackTimeout' : Recovery is 'componentFailover'

* After recovery of SU1, the SI2 assignments are also done, which is not expected.

Oct 8 16:27:30 SYSTEST-PLD-1 osafamfnd[4535]: NO 'safSu=TestApp_SU1,safSg=TestApp_SG1,safApp=TestApp_TwoN' Presence State TERMINATING => INSTANTIATED
Oct 8 16:27:30 SYSTEST-PLD-1 osafamfnd[4535]: NO Assigning 'safSi=TestApp_SI1,safApp=TestApp_TwoN' STANDBY to 'safSu=TestApp_SU1,safSg=TestApp_SG1,safApp=TestApp_TwoN'
Oct 8 16:27:30 SYSTEST-PLD-1 osafamfnd[4535]: NO Assigning 'safSi=TestApp_SI2,safApp=TestApp_TwoN' STANDBY to 'safSu=TestApp_SU1,safSg=TestApp_SG1,safApp=TestApp_TwoN'
Oct 8 16:27:30 SYSTEST-PLD-1 osafamfnd[4535]: NO Assigning 'safSi=TestApp_SI3,safApp=TestApp_TwoN' STANDBY to 'safSu=TestApp_SU1,safSg=TestApp_SG1,safApp=TestApp_TwoN'

* Below is the SI state after the shutdown operation:

safSi=TestApp_SI2,safApp=TestApp_TwoN
saAmfSIAdminState=LOCKED(2)
saAmfSIAssignmentState=FULLY_ASSIGNED(2)

* A further unlock operation on the SI returned TIMEOUT.
[tickets] [opensaf:tickets] #1420 log: crashed when performing admin op on configurable obj class
- **Type**: enhancement --> defect

---

** [tickets:#1420] log: crashed when performing admin op on configurable obj class**

**Status:** review
**Milestone:** 4.5.2
**Created:** Thu Jul 16, 2015 09:28 AM UTC by Vu Minh Nguyen
**Last Updated:** Tue Sep 29, 2015 07:56 AM UTC
**Owner:** Vu Minh Nguyen

Logsv supports 2 kinds of object class for app streams: runtime and configurable. If an admin operation is performed on a configurable object class, the logsv on the active node crashes.

> Steps to reproduce:

1. Create a configurable object class for an app stream.
2. Perform an admin operation on the severity filter attribute with a valid value. E.g.:
immadm -o 1 -p saLogStreamSeverityFilter:SA_UINT32_T:100 safLgStrCfg=str7,safApp=safLogService
3. The active node will reboot.

Logsv should check and only allow the action on the runtime object class.
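The check suggested in the last sentence of the ticket could be as simple as the guard sketched below. This is only an illustration of the idea and not the change that was made in logsv: the enums and `severity_filter_admin_op()` are hypothetical names, and the real code would return an AIS error code to the IMM client rather than the made-up result type used here.

```c
#include <stddef.h>

/* Hypothetical classification of the stream object targeted by the admin op. */
typedef enum { STREAM_OBJ_RUNTIME, STREAM_OBJ_CONFIG } stream_obj_kind_t;

/* Hypothetical result codes, standing in for the real AIS return values. */
typedef enum { ADMIN_OP_OK, ADMIN_OP_REJECTED } admin_op_result_t;

/*
 * Apply a severity filter change only when the target is a runtime
 * object; reject the operation for configurable objects instead of
 * letting it reach code that does not expect them.
 */
static admin_op_result_t severity_filter_admin_op(stream_obj_kind_t kind,
                                                  unsigned *filter,
                                                  unsigned new_value)
{
    if (filter == NULL || kind != STREAM_OBJ_RUNTIME)
        return ADMIN_OP_REJECTED;

    *filter = new_value;
    return ADMIN_OP_OK;
}
```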
[tickets] [opensaf:tickets] #1420 log: crashed when performing admin op on configurable obj class
- **status**: review --> fixed
- **assigned_to**: Vu Minh Nguyen --> nobody
- **Comment**:

changeset: 6985:f9f8e549f071
tag: tip
parent: 6980:f133f1195bc2
user: Vu Minh Nguyen
date: Thu Oct 08 15:46:12 2015 +0200
summary: log: crashed when performing admin op on configurable obj class [#1420]

rev: f9f8e549f0719d8f79e7ba898c6b442351c924e9

changeset: 6984:99821ea1c0cc
branch: opensaf-4.7.x
parent: 6981:76085f91c60e
user: Vu Minh Nguyen
date: Thu Oct 08 15:46:12 2015 +0200
summary: log: crashed when performing admin op on configurable obj class [#1420]

rev: 99821ea1c0cc26bfb3a440f147f9d4747d0ac034

changeset: 6983:b93950e02e3b
branch: opensaf-4.6.x
parent: 6975:20542a10d9cb
user: Vu Minh Nguyen
date: Thu Oct 08 15:47:44 2015 +0200
summary: log: crashed when performing admin op on configurable obj class [#1420]

rev: b93950e02e3b73de23e5ac28235301798531f328

changeset: 6982:3cc375475384
branch: opensaf-4.5.x
parent: 6974:79afb96a84e6
user: Vu Minh Nguyen
date: Thu Oct 08 15:47:44 2015 +0200
summary: log: crashed when performing admin op on configurable obj class [#1420]

rev: 3cc375475384013ef3d4663d7fce94568dbf4c8e

---

** [tickets:#1420] log: crashed when performing admin op on configurable obj class**

**Status:** fixed
**Milestone:** 4.5.2
**Created:** Thu Jul 16, 2015 09:28 AM UTC by Vu Minh Nguyen
**Last Updated:** Thu Oct 08, 2015 01:05 PM UTC
**Owner:** nobody

Logsv supports 2 kinds of object class for app streams: runtime and configurable. If an admin operation is performed on a configurable object class, the logsv on the active node crashes.

> Steps to reproduce:

1. Create a configurable object class for an app stream.
2. Perform an admin operation on the severity filter attribute with a valid value. E.g.:
immadm -o 1 -p saLogStreamSeverityFilter:SA_UINT32_T:100 safLgStrCfg=str7,safApp=safLogService
3. The active node will reboot.

Logsv should check and only allow the action on the runtime object class.
[tickets] [opensaf:tickets] #509 reource leak when cpnd restarts and the application finalizes the checkpoint handles
The problem doesn't happen on the latest changeset.

According to the log, the problem happened because ckpt_lcl_ref_cnt was still 1 after the application on PL4 finalized:

Jul 17 17:18:07.958259 osafckptnd [13885:cpnd_evt.c:0540] T1 cpnd client ckpt close success ckpt_app_hdl:2,ckpt_id:1,ckpt_lcl_ref_cnt:1

From checking the code, ckpt_lcl_ref_cnt is updated 2 times when the cpnd restarts.

1. From the shared memory. According to the log, it looks like this update is correct. 2 clients were added.

Jul 17 17:18:06.971105 osafckptnd [13885:cpnd_res.c:0501] T1 cpnd client handle extracted
Jul 17 17:18:06.971122 osafckptnd [13885:cpnd_res.c:0501] T1 cpnd client handle extracted

2. From the message CPND_EVT_A2ND_CKPT_REFCNTSET. There is not much information about the message data in the log.

Jul 17 17:18:06.978908 osafckptnd [13885:cpnd_evt.c:4202] >> cpnd_evt_proc_ckpt_refcntset
Jul 17 17:18:06.978921 osafckptnd [13885:cpnd_evt.c:4214] << cpnd_evt_proc_ckpt_refcntset

More information about this is needed for further investigation.

Suggestion: Improve the tracing of ckpt_lcl_ref_cnt in cpnd to make troubleshooting this kind of problem easier.

---

** [tickets:#509] reource leak when cpnd restarts and the application finalizes the checkpoint handles**

**Status:** accepted
**Milestone:** future
**Created:** Thu Jul 18, 2013 06:14 AM UTC by Sirisha Alla
**Last Updated:** Fri Oct 02, 2015 09:15 AM UTC
**Owner:** Pham Hoang Nhat
**Attachments:**

- [logs.tar.gz](https://sourceforge.net/p/opensaf/tickets/509/attachment/logs.tar.gz) (78.0 kB; application/x-gzip)

The issue is observed on changeset 4325 on a 4-node SLES cluster of VMs.

The issue is reproducible with the following steps:

Checkpoint applications are running on PL-3 and PL-4.

1) On PL-3, an asynchronous collocated checkpoint is created and the same checkpoint is opened for writing on the same node.
2) On PL-4, the checkpoint is opened twice with the write option.
3) The active replica for the checkpoint is set on PL-3.
4) A section is created in the checkpoint from PL-4.
5) CPND is restarted on both payloads.
6) The checkpoint is unlinked and closed on PL-3.
7) The active replica is set on PL-4 and a section is created in the checkpoint.
8) Now the checkpoint handles being used by the application on PL-4 are finalized.

The replicas are expected to be deleted on both PL-3 and PL-4. The IMM database does not have any references to the checkpoint table or replica table, but a stale checkpoint is found on PL-4 in the shared memory. The replica on PL-3 is deleted.

However, there seems to be no functional impact from this stale resource. When a checkpoint with the same name is opened, a new replica is created in the shared memory.

SLES-64BIT-SLOT4:/opt/goahead/tetware/opensaffire/bin64 # immfind | grep -i ckpt
safApp=safCkptService
SLES-64BIT-SLOT4:/opt/goahead/tetware/opensaffire/bin64 # ls -lrt /dev/shm/opensaf/
total 844
-rw-r--r-- 1 root root 132200 Jul 17 17:17 NCS_MQND_QUEUE_CKPT_INFO
-rw-r--r-- 1 root root 328000 Jul 17 17:17 NCS_GLND_RES_CKPT_INFO
-rw-r--r-- 1 root root 16 Jul 17 17:17 NCS_GLND_LCK_CKPT_INFO
-rw-r--r-- 1 root root 88000 Jul 17 17:17 NCS_GLND_EVT_CKPT_INFO
-rw-r--r-- 1 root root 1688 Jul 17 17:18 safCkpt=collocated_ckpt_name_101_13_132111_1
-rw-r--r-- 1 root root 704008 Jul 17 17:18 CPND_CHECKPOINT_INFO_132111

ckptd and ckptnd traces are attached. The time of the test is 17 July 17:17:59.