[tickets] [opensaf:tickets] #2134 AMF: Update RTA saAmfSISUHAState to IMM
- **status**: unassigned --> wontfix

---

** [tickets:#2134] AMF: Update RTA saAmfSISUHAState to IMM**

**Status:** wontfix
**Milestone:** 5.2.FC
**Created:** Thu Oct 20, 2016 07:58 PM UTC by Minh Hon Chau
**Last Updated:** Thu Nov 10, 2016 06:47 AM UTC
**Owner:** nobody

In a 2N SI-swap scenario, when AMFD sends a su_si assignment message (QUIESCED, for example) to AMFND that changes the HA state of a SUSI assignment, AMFD updates its local state AVD_SU_SI_REL::state and checkpoints this change to the standby AMFD. However, AMFD does not update saAmfSISUHAState until it receives the su_si assignment response.

Question:

(1) Should AMFD update the runtime attribute saAmfSISUHAState in IMM as soon as the local @state is updated in the implementer, so that IMM, the active AMFD, and the standby AMFD are all in sync?

(2) Or should AMFD update saAmfSISUHAState in IMM only when it receives the su_si assignment response from AMFND, as currently implemented for some reason (perhaps to avoid exposing the change of saAmfSISUHAState to the user too early)?

Grepping for "avd_susi_update", which updates saAmfSISUHAState in IMM, also shows an inconsistency in usage: avd_susi_mod_send() sends the su_si message and updates saAmfSISUHAState immediately, while avd_sg_su_si_mod_snd() does not.

Headless recovery relies on IMM to restore the state. If saAmfSISUHAState is not updated promptly and the node is rebooted during the headless stage, then after headless the saAmfSISUHAState read from IMM does not match many other states (SG fsm, SUSI fsm, saAmfSISUHAState of the other SUSIs).

My question is whether doing (1) will cause any problem for a normal cluster. The pending patches for #1725 part 2 currently implement (1).

--- Sent from sourceforge.net because opensaf-tickets@lists.sourceforge.net is subscribed to https://sourceforge.net/p/opensaf/tickets/ To unsubscribe from further messages, a project admin can change settings at https://sourceforge.net/p/opensaf/admin/tickets/options.
Or, if this is a mailing list, you can unsubscribe from the mailing list.-- Developer Access Program for Intel Xeon Phi Processors Access to Intel Xeon Phi processor-based developer platforms. With one year of Intel Parallel Studio XE. Training and support from Colfax. Order your platform today. http://sdm.link/xeonphi___ Opensaf-tickets mailing list Opensaf-tickets@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-tickets
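The two options in the question above can be compared with a small model. This is only a sketch under stated assumptions: `Director`, `ImmUpdatePolicy`, `send_modify`, and `on_response` are hypothetical names made up for illustration and are not OpenSAF functions; the two maps merely stand in for the local AVD_SU_SI_REL::state and for the saAmfSISUHAState value stored in IMM.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-ins for illustration only; none of these are OpenSAF APIs.
enum class HaState { Active, Standby, Quiesced, Quiescing };
enum class ImmUpdatePolicy { OnSend, OnResponse };  // option (1) vs option (2)

class Director {
 public:
  explicit Director(ImmUpdatePolicy policy) : policy_(policy) {}

  // Register a SUSI; both the local state and the IMM copy start as Active.
  void add(const std::string& dn) {
    local_[dn] = HaState::Active;
    imm_[dn] = HaState::Active;
  }

  // The director sends a modify message to the node director.
  void send_modify(const std::string& dn, HaState target) {
    local_.at(dn) = target;  // local state changes at send time
    if (policy_ == ImmUpdatePolicy::OnSend) {
      imm_.at(dn) = target;  // option (1): publish immediately
    }
  }

  // The node director acknowledges the assignment.
  void on_response(const std::string& dn) {
    imm_.at(dn) = local_.at(dn);  // option (2): publish now (no change under option 1)
  }

  // True if a recovery that reads IMM would see the same state as the director.
  bool imm_in_sync(const std::string& dn) const {
    return imm_.at(dn) == local_.at(dn);
  }

 private:
  ImmUpdatePolicy policy_;
  std::map<std::string, HaState> local_;  // plays the role of AVD_SU_SI_REL::state
  std::map<std::string, HaState> imm_;    // plays the role of saAmfSISUHAState in IMM
};
```

In this model, a director using `OnResponse` reports `imm_in_sync()` as false between `send_modify()` and `on_response()`. That window corresponds to the one the ticket worries about if the cluster goes headless before the response arrives.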
[tickets] [opensaf:tickets] #2134 AMF: Update RTA saAmfSISUHAState to IMM
Hi Praveen,

I have updated #2133 and #1354. Since part 2 of #1725 has been pushed, I am closing this ticket. There is a very small probability that an application may still require this attribute to be updated after the assignment sequence, as before; that is the avd_sg_su_si_mod_snd() case. But it is unlikely to happen, since the other one, avd_susi_mod_send(), already works the opposite way.

Thanks, Minh

---

** [tickets:#2134] AMF: Update RTA saAmfSISUHAState to IMM**

**Status:** unassigned
**Milestone:** 5.2.FC
**Created:** Thu Oct 20, 2016 07:58 PM UTC by Minh Hon Chau
**Last Updated:** Thu Nov 10, 2016 05:30 AM UTC
**Owner:** nobody
[tickets] [opensaf:tickets] #2133 AMF: Rollback admin shutdown/lock SI operation if node failover
- **summary**: AMF: Rollback admin shutdown SI operation if node failover --> AMF: Rollback admin shutdown/lock SI operation if node failover
- **Type**: discussion --> defect
- **Milestone**: 5.2.FC --> future

---

** [tickets:#2133] AMF: Rollback admin shutdown/lock SI operation if node failover**

**Status:** unassigned
**Milestone:** future
**Created:** Thu Oct 20, 2016 06:49 PM UTC by Minh Hon Chau
**Last Updated:** Thu Nov 10, 2016 06:36 AM UTC
**Owner:** nobody

Scenario: shut down an SI, delay the QUIESCING csi callback, then reboot the node hosting the SU that has this csi callback pending. The result of this operation differs between SGs:

- For 2N: the SI admin state is rolled back to UNLOCKED
- For Nway: the SI admin state moves to LOCKED
- For NpM: not tested; from browsing SG_NPM::node_fail_si_oper, it looks like the SI admin state is rolled back to UNLOCKED

My question is whether the result of this scenario should be consistent, and what the expected outcome is.

Also, the handling of node_fail_si_oper for admin lock is not consistent: for 2N the admin state remains LOCKED, while NpM rolls back to UNLOCKED.
[tickets] [opensaf:tickets] #2133 AMF: Rollback admin shutdown SI operation if node failover
The lock operation does not behave consistently between SGs when a failover happens during the lock command. Marking this ticket as a defect for future.

---

** [tickets:#2133] AMF: Rollback admin shutdown SI operation if node failover**

**Status:** unassigned
**Milestone:** 5.2.FC
**Created:** Thu Oct 20, 2016 06:49 PM UTC by Minh Hon Chau
**Last Updated:** Mon Oct 24, 2016 01:38 PM UTC
**Owner:** nobody
[tickets] [opensaf:tickets] #1354 amf: sync amfd and amfnd for assignment related logging, imm updates.
https://sourceforge.net/p/opensaf/tickets/2134/

The discussion in #2134 has information relevant to the solution of #1354.

---

** [tickets:#1354] amf: sync amfd and amfnd for assignment related logging, imm updates.**

**Status:** assigned
**Milestone:** 5.0.2
**Created:** Wed Apr 29, 2015 04:22 AM UTC by Praveen
**Last Updated:** Tue Sep 20, 2016 06:04 PM UTC
**Owner:** Praveen

This ticket is based on a user list query, which goes like this:

"When AMF begins "attempting" to assign CSI assignments, it updates the runtime attributes and associations to show assigned and active BEFORE the active CSI has been accepted by the Component. So if I query the CSI using amf-state csiass before the CSI has been accepted by the Component, amf-state will show it assigned and active. The same is true for the SI. The SI will be assigned to the SU as active BEFORE all the CSI assignments have been accepted by the components in that SU.

I checked the runtime state with both immlist and amf-state. Both were consistent in being incorrect.

Interestingly enough, the opensaf log entries show the correct behavior. The log entry indicating the SI was assigned to the SU is not logged until all CSI assignments have been accepted."
[tickets] [opensaf:tickets] #2175 amfd: null SU during CCB modify apply on standby director
- **status**: review --> fixed
- **Milestone**: 5.1.1 --> 5.0.2

---

** [tickets:#2175] amfd: null SU during CCB modify apply on standby director**

**Status:** fixed
**Milestone:** 5.0.2
**Created:** Tue Nov 08, 2016 06:37 AM UTC by Gary Lee
**Last Updated:** Wed Nov 09, 2016 03:17 AM UTC
**Owner:** Gary Lee

su is NULL and subsequently causes a segfault on standby director

line 1833 corresponds to su->saAmfSUMaintenanceCampaign = "";

Cause is probably the same as #1932?

~~~
Full backtrace:
#0 0x7f5ee5adc036 in std::string::assign(char const*, unsigned long) () from /usr/lib64/libstdc++.so.6
No symbol table info available.
#1 0x00495ab9 in assign (__s=0x4c8b33 "", this=0x28) at /usr/include/c++/4.8/bits/basic_string.h:1131
No locals.
#2 operator= (__s=0x4c8b33 "", this=0x28) at /usr/include/c++/4.8/bits/basic_string.h:555
No locals.
#3 su_ccb_apply_modify_hdlr (opdata=opdata@entry=0x2580cf4) at ../../../../../../../opensaf/osaf/services/saf/amf/amfd/su.cc:1833
attr_mod = 0x2580f48
i =
su = 0x0
value_is_deleted = true
_FUNCTION_ = "su_ccb_apply_modify_hdlr"
#4 0x00498d78 in su_ccb_apply_cb (opdata=0x2580cf4) at ../../../../../../../opensaf/osaf/services/saf/amf/amfd/su.cc:1985
su =
_FUNCTION_ = "su_ccb_apply_cb"
#5 0x00439fa6 in ccb_apply_cb (immoi_handle=, ccb_id=218) at ../../../../../../../opensaf/osaf/services/saf/amf/amfd/imm.cc:1226
ccb_util_ccb_data =
type =
temp =
_FUNCTION_ = "ccb_apply_cb"
opdata = 0x0
next = 0x2517540
#6 0x7f5ee6818329 in imma_process_callback_info (cb=cb@entry=0x7f5ee6a373a0 , cl_node=0x2517cc0, callback=callback@entry=0x7f5ed8004b60, immHandle=901943263503) at ../../../../../../../opensaf/osaf/libs/agents/saf/imma/imma_proc.c:2245
ccbid = 218
privateAugOmHandle = 0
_FUNCTION_ = "imma_process_callback_info"
clientCapable = true
isPbeOp = false
isExtendedNameValid = false
isAttrExtendedName = false
#7 0x7f5ee681aec9 in imma_hdl_callbk_dispatch_all (cb=0x7f5ee6a373a0 , immHandle=901943263503) at ../../../../../../../opensaf/osaf/libs/agents/saf/imma/imma_proc.c:1732
callback = 0x7f5ed8004b60
cl_node = 0x2517cc0
#8 0x7f5ee680efc4 in saImmOiDispatch (immOiHandle=901943263503, dispatchFlags=SA_DISPATCH_ALL) at ../../../../../../../opensaf/osaf/libs/agents/saf/imma/imma_oi_api.c:609
rc = SA_AIS_OK
cl_node = 0x0
locked = false
pend_fin = 0
pend_dis = 0
_FUNCTION_ = "saImmOiDispatch"
#9 0x00407b90 in main_loop () at ../../../../../../../opensaf/osaf/services/saf/amf/amfd/main.cc:722
pollretval =
evt =
polltmo =
term_fd = 14
cb = 0x6e8900 <_control_block>
error =
#10 main (argc=, argv=) at ../../../../../../../opensaf/osaf/services/saf/amf/amfd/main.cc:848
~~~
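In the backtrace, `su = 0x0` in frame #3 and `this=0x28` in the string assignment are consistent with writing to a `std::string` member through a null SU pointer. The sketch below shows the general shape of a guard against that. It is an assumption-laden illustration, not the fix pushed for this ticket: `Su`, `SuDb`, `find_su`, and `clear_maintenance_campaign` are hypothetical names, and `maintenance_campaign` merely stands in for saAmfSUMaintenanceCampaign.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical, simplified SU record; not the amfd class.
struct Su {
  std::string name;
  std::string maintenance_campaign;  // stands in for saAmfSUMaintenanceCampaign
};

// Hypothetical SU database keyed by DN.
using SuDb = std::map<std::string, Su>;

// Returns a pointer to the SU, or nullptr if the DN is unknown
// (for example, because the object was never created on this director).
Su* find_su(SuDb& db, const std::string& dn) {
  auto it = db.find(dn);
  return it == db.end() ? nullptr : &it->second;
}

// Clears the campaign attribute. Returns false instead of dereferencing
// a missing SU, so the caller can log the problem and skip the modification.
bool clear_maintenance_campaign(SuDb& db, const std::string& dn) {
  Su* su = find_su(db, dn);
  if (su == nullptr) {
    return false;
  }
  su->maintenance_campaign = "";
  return true;
}
```

Whether skipping the update is the right reaction on a standby director depends on why the SU is missing there, which the ticket only hints at ("probably the same as #1932").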
[tickets] [opensaf:tickets] #2134 AMF: Update RTA saAmfSISUHAState to IMM
Hi Minh,

Any update on this?

Thanks, Praveen

---

** [tickets:#2134] AMF: Update RTA saAmfSISUHAState to IMM**

**Status:** unassigned
**Milestone:** 5.2.FC
**Created:** Thu Oct 20, 2016 07:58 PM UTC by Minh Hon Chau
**Last Updated:** Wed Nov 09, 2016 06:52 AM UTC
**Owner:** nobody
[tickets] [opensaf:tickets] #1902 AMF: Extend escalation support during headless
Hi Praveen,

I am going through the component failover test cases again, which were not stable before. There is no pending implementation other than this; I just want to test again.

Thanks, Minh

---

** [tickets:#1902] AMF: Extend escalation support during headless**

**Status:** assigned
**Milestone:** 5.2.FC
**Created:** Wed Jun 29, 2016 12:02 PM UTC by Minh Hon Chau
**Last Updated:** Thu Nov 10, 2016 04:39 AM UTC
**Owner:** Minh Hon Chau

If a comp/su failover occurs during headless, amfnd will escalate to reboot. This unexpectedly impacts other comp/su that are up and running if no node failover escalation is configured on the faulty comp/su.

2016-06-29 21:30:07 PL-4 osafamfnd[429]: NO 'safComp=AmfDemo2,safSu=SU4,safSg=AmfDemoTwon,safApp=AmfDemoTwon' faulted due to 'avaDown' : Recovery is 'suFailover'
2016-06-29 21:30:07 PL-4 osafamfnd[429]: NO Terminating components of 'safSu=SU4,safSg=AmfDemoTwon,safApp=AmfDemoTwon'(abruptly & unordered)
2016-06-29 21:30:07 PL-4 osafamfnd[429]: NO 'safSu=SU4,safSg=AmfDemoTwon,safApp=AmfDemoTwon' Presence State INSTANTIATED => TERMINATING
2016-06-29 21:30:07 PL-4 osafamfnd[429]: NO 'safSu=SU4,safSg=AmfDemoTwon,safApp=AmfDemoTwon' Presence State TERMINATING => TERMINATING
2016-06-29 21:30:07 PL-4 osafamfnd[429]: NO 'safSu=SU4,safSg=AmfDemoTwon,safApp=AmfDemoTwon' Presence State TERMINATING => TERMINATING
2016-06-29 21:30:07 PL-4 osafamfnd[429]: Rebooting OpenSAF NodeId = 132111 EE Name = , Reason: Can't perform recovery while controllers are down. Recovery is node failfast., OwnNodeId = 132111, SupervisionTime = 60
2016-06-29 21:30:07 PL-4 opensaf_reboot: Rebooting local node; timeout=60

This ticket will remove the unexpected reboot due to failover during headless, which is mentioned as a limitation in the AMF OpenSAF documentation.
[tickets] [opensaf:tickets] #1902 AMF: Extend escalation support during headless
Hi Minh, Is there anything pending for this ticket besides documentation? There is a separate documentation ticket #2179. Thanks, Praveen --- ** [tickets:#1902] AMF: Extend escalation support during headless** **Status:** assigned **Milestone:** 5.2.FC **Created:** Wed Jun 29, 2016 12:02 PM UTC by Minh Hon Chau **Last Updated:** Thu Nov 10, 2016 04:37 AM UTC **Owner:** Minh Hon Chau
[tickets] [opensaf:tickets] #1902 AMF: Extend escalation support during headless
changeset: 8278:043297f42a74
user: minh-chau
date: Wed Nov 02 15:40:13 2016 +1100
summary: AMFD: Do not add su operation list if su has no pending susi assignment [#1902]

changeset: 8277:09a006b409ba
user: minh-chau
date: Wed Nov 02 15:40:10 2016 +1100
summary: AMFD: Do not recover if no pending susi assignment after headless [#1902]

---

** [tickets:#1902] AMF: Extend escalation support during headless**

**Status:** assigned
**Milestone:** 5.2.FC
**Created:** Wed Jun 29, 2016 12:02 PM UTC by Minh Hon Chau
**Last Updated:** Mon Nov 07, 2016 07:30 PM UTC
**Owner:** Minh Hon Chau
[tickets] [opensaf:tickets] #2112 amfd: multiple SUs incorrectly assigned to single node
- **status**: review --> fixed

---

** [tickets:#2112] amfd: multiple SUs incorrectly assigned to single node**

**Status:** fixed
**Milestone:** 5.0.2
**Created:** Tue Oct 11, 2016 11:56 PM UTC by Gary Lee
**Last Updated:** Thu Nov 10, 2016 02:32 AM UTC
**Owner:** Minh Hon Chau

Multiple SUs are assigned to a single node after SC absence.

To reproduce:

0) load nwayactive demo
1) stop SCs
2) restart SCs

The following is observed:

root@SC-1:~# immlist safSu=SU4,safSg=AmfDemo,safApp=AmfDemo2
...
saAmfSUHostedByNode SA_NAME_T safAmfNode=PL-4,safAmfCluster=myAmfCluster (42)

root@SC-1:~# immlist safSu=SU2,safSg=AmfDemo,safApp=AmfDemo2
...
saAmfSUHostedByNode SA_NAME_T safAmfNode=PL-4,safAmfCluster=myAmfCluster (42)

SU2 is indeed assigned to PL-4, but SU4 was assigned to one of the SCs and is not assigned to PL-4. Operations on SU4 will lead to a crash of amfnd on PL-4.
[tickets] [opensaf:tickets] #2112 amfd: multiple SUs incorrectly assigned to single node
Pushed to the 5.0 branch:

changeset: 8302:6557805ec604
branch: opensaf-5.0.x

---

** [tickets:#2112] amfd: multiple SUs incorrectly assigned to single node**

**Status:** review
**Milestone:** 5.0.2
**Created:** Tue Oct 11, 2016 11:56 PM UTC by Gary Lee
**Last Updated:** Thu Nov 10, 2016 02:24 AM UTC
**Owner:** Minh Hon Chau
[tickets] [opensaf:tickets] #2112 amfd: multiple SUs incorrectly assigned to single node
- **Milestone**: 5.1.1 --> 5.0.2 --- ** [tickets:#2112] amfd: multiple SUs incorrectly assigned to single node** **Status:** review **Milestone:** 5.0.2 **Created:** Tue Oct 11, 2016 11:56 PM UTC by Gary Lee **Last Updated:** Thu Nov 10, 2016 01:07 AM UTC **Owner:** Minh Hon Chau
[tickets] [opensaf:tickets] #2141 AMF: AMFD fails to stop clm track during role transition from active to quiesced
Attached a patch with the solution described in the ticket.

Attachments:

- [2141.diff](https://sourceforge.net/p/opensaf/tickets/_discuss/thread/cdd5c818/5bc0/attachment/2141.diff) (1.8 kB; text/x-patch)

---

** [tickets:#2141] AMF: AMFD fails to stop clm track during role transition from active to quiesced**

**Status:** assigned
**Milestone:** 5.0.2
**Created:** Wed Oct 26, 2016 03:57 AM UTC by Minh Hon Chau
**Last Updated:** Wed Oct 26, 2016 03:57 AM UTC
**Owner:** Minh Hon Chau

When swapping the 2N OpenSAF SI (switchover), the active AMFD moves to quiesced and fails to stop the CLM track callback because the stop call returns SA_AIS_ERR_TIMEOUT. Currently AMFD only logs an error, and the track is not actually stopped. As a result, the new standby AMFD (previously quiesced) still receives the CLM track callback when another node leaves the cluster. Eventually clm_node_exit_complete() is triggered at the standby AMFD, which should not happen. The consequence is that the standby AMFD fails to resolve a checkpoint update if another node reboots afterward.

In SC2 (standby AMFD)

Oct 26 11:59:26.061288 osafamfd [468:clm.cc:0216] >> clm_track_cb: '0' '4' '1'
Oct 26 11:59:26.061294 osafamfd [468:clm.cc:0281] TR Node Left: rootCauseEntity safNode=PL-3,safCluster=myClmCluster for node 131855
Oct 26 11:59:26.061298 osafamfd [468:clm.cc:0185] >> clm_node_exit_complete: 2030f
Oct 26 11:59:26.061301 osafamfd [468:ndproc.cc:1139] >> avd_node_failover: 'safAmfNode=PL-3,safAmfCluster=myAmfCluster'
...
Oct 26 11:59:26.070895 osafamfd [468:sg_nored_fsm.cc:0770] >> node_fail: safSu=PL-3,safSg=NoRed,safApp=OpenSAF, TEST sg_fsm_state=0
...
Oct 26 11:59:26.071007 osafamfd [468:siass.cc:0496] : >> avd_susi_delete: safSu=PL-3,safSg=NoRed,safApp=OpenSAF safSi=NoRed4,safApp=OpenSAF

In SC1 (active AMFD)

Oct 26 11:59:26.057724 osafamfd [488:ndfsm.cc:0671] >> avd_mds_avnd_down_evh: 2030f, 0x7da3d0
Oct 26 11:59:26.057732 osafamfd [488:ndproc.cc:1139] >> avd_node_failover: 'safAmfNode=PL-3,safAmfCluster=myAmfCluster'
Oct 26 11:59:26.057739 osafamfd [488:ndfsm.cc:0999] >> avd_node_mark_absent
...
Oct 26 11:59:26.066576 osafamfd [488:sg_nored_fsm.cc:0770] >> node_fail: safSu=PL-3,safSg=NoRed,safApp=OpenSAF, TEST sg_fsm_state=0
...
Oct 26 11:59:26.066783 osafamfd [488:siass.cc:0496] : >> avd_susi_delete: safSu=PL-3,safSg=NoRed,safApp=OpenSAF safSi=NoRed4,safApp=OpenSAF

When AMFD-SC1 deletes the susi, it checkpoints the deletion to the standby AMFD. The standby AMFD fails to resolve this update because it has already deleted the susi itself.

Oct 26 11:59:26.073665 osafamfd [468:ckpt_dec.cc:0659] >> dec_siass: i_action '2'
...
Oct 26 11:59:26.073700 osafamfd [468:ckpt_updt.cc:0405] >> avd_ckpt_siass: 'safSi=NoRed4,safApp=OpenSAF' 'safSu=PL-3,safSg=NoRed,safApp=OpenSAF'
Oct 26 11:59:26.073704 osafamfd [468:si.cc:0395] >> avd_si_get: safSi=NoRed4,safApp=OpenSAF
Oct 26 11:59:26.073706 osafamfd [468:si.cc:0396] << avd_si_get
...
Oct 26 11:59:26.073722 osafamfd [468:ckpt_updt.cc:0508] ER avd_ckpt_siass: safSu=PL-3,safSg=NoRed,safApp=OpenSAF safSi=NoRed4,safApp=OpenSAF does not exist
Oct 26 11:59:26.073725 osafamfd [468:ckpt_dec.cc:0690] << dec_siass

This error can be reproduced by commenting out the avd_clm_track_stop() call in amfd_switch_actv_qsd() to simulate the SA_AIS_ERR_TIMEOUT error code.

A simple solution could be: when the standby AMFD receives clm_track_cb(), it retries stopping the track and returns from the callback immediately.
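The proposed retry could look roughly like the sketch below. It is not the attached 2141.diff; every type and function name here (`amfd_cb_sketch_t`, `clm_track_cb_sketch`, the `track_stop` function pointer, the two fake stop functions) is a hypothetical stand-in for the real AMFD code, and only the values 1 (OK) and 5 (TIMEOUT) follow the AIS return codes seen in the logs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the AIS return codes used in this sketch. */
typedef enum { SKETCH_OK = 1, SKETCH_ERR_TIMEOUT = 5 } sketch_err_t;

/* Hypothetical role and control block, not the real AMFD types. */
typedef enum { ROLE_ACTIVE, ROLE_STANDBY } sketch_role_t;

typedef struct {
  sketch_role_t role;
  bool clm_track_active;            /* true until a stop call succeeds */
  sketch_err_t (*track_stop)(void); /* stands in for the CLM track stop call */
  int node_exits_processed;         /* counts events handled as active */
} amfd_cb_sketch_t;

/* Fake stop calls so the sketch can be exercised without CLM. */
sketch_err_t fake_stop_timeout(void) { return SKETCH_ERR_TIMEOUT; }
sketch_err_t fake_stop_ok(void) { return SKETCH_OK; }

/* Returns true if the event was processed, false if it was ignored. */
bool clm_track_cb_sketch(amfd_cb_sketch_t *cb) {
  assert(cb != NULL && cb->track_stop != NULL);
  if (cb->role != ROLE_ACTIVE) {
    /* Not active: the track should already be stopped, so retry the stop
       and return without touching node state. */
    if (cb->clm_track_active && cb->track_stop() == SKETCH_OK) {
      cb->clm_track_active = false;
    }
    return false;
  }
  cb->node_exits_processed++; /* placeholder for the normal active path */
  return true;
}
```

The point of returning early is that the standby never reaches the node-exit handling, so it does not delete a susi that the active AMFD is about to checkpoint.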
[tickets] [opensaf:tickets] #2112 amfd: multiple SUs incorrectly assigned to single node
Pushed for 5.1 and default:

changeset: 8301:a25d5d50b01a
changeset: 8300:773643625dc6

Could it happen with 5.0?

---

** [tickets:#2112] amfd: multiple SUs incorrectly assigned to single node**

**Status:** review
**Milestone:** 5.1.1
**Created:** Tue Oct 11, 2016 11:56 PM UTC by Gary Lee
**Last Updated:** Mon Oct 24, 2016 04:57 AM UTC
**Owner:** Minh Hon Chau

Multiple SUs are assigned to a single node after SC absence.

To reproduce:

0) load nwayactive demo
1) stop SCs
2) restart SCs

The following is observed:

root@SC-1:~# immlist safSu=SU4,safSg=AmfDemo,safApp=AmfDemo2
...
saAmfSUHostedByNode SA_NAME_T safAmfNode=PL-4,safAmfCluster=myAmfCluster (42)

root@SC-1:~# immlist safSu=SU2,safSg=AmfDemo,safApp=AmfDemo2
...
saAmfSUHostedByNode SA_NAME_T safAmfNode=PL-4,safAmfCluster=myAmfCluster (42)

SU2 is indeed assigned to PL-4, but SU4 was assigned to one of the SCs and is not assigned to PL-4. Operations on SU4 will lead to a crash of amfnd on PL-4.
[tickets] [opensaf:tickets] #2182 mds: MDS receive thread may hang when a signal is caught
- **status**: accepted --> review

---

** [tickets:#2182] mds: MDS receive thread may hang when a signal is caught**

**Status:** review
**Milestone:** 5.0.2
**Created:** Wed Nov 09, 2016 03:42 PM UTC by Anders Widell
**Last Updated:** Wed Nov 09, 2016 03:42 PM UTC
**Owner:** Anders Widell

The result from poll() is incorrectly stored in an unsigned integer, which means that if poll() returns -1 we will interpret the result as a very large number. Subsequently, we read the possibly undefined values of pollfd.revents, and may perform a blocking read.
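For illustration, here is a minimal sketch of the signed-result pattern. It is not the MDS receive loop or the actual fix; `wait_readable` is a hypothetical helper, and retrying on EINTR is an assumption about how one might handle a caught signal.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Returns 1 if fd is readable, 0 on timeout, and -1 on a poll() error or
   when fd reports only an error/hangup condition. */
int wait_readable(int fd, int timeout_ms) {
  assert(fd >= 0);
  struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
  int rc; /* must be signed: poll() returns -1 on error */
  do {
    rc = poll(&pfd, 1, timeout_ms);
    /* A caught signal yields -1 with errno == EINTR; retry in that case.
       Note that the retry restarts the full timeout. */
  } while (rc == -1 && errno == EINTR);
  if (rc == -1) return -1; /* revents is not meaningful here */
  if (rc == 0) return 0;   /* timeout, nothing to read */
  /* Only after a positive return is revents valid. */
  return (pfd.revents & POLLIN) ? 1 : -1;
}
```

Had `rc` been declared `unsigned int`, the `rc == -1` comparisons would still compile, but a check such as `rc > 0` would be true for the error case, and the caller would go on to inspect `revents` and possibly block in read().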
[tickets] [opensaf:tickets] #2182 mds: MDS receive thread may hang when a signal is caught
---

** [tickets:#2182] mds: MDS receive thread may hang when a signal is caught**

**Status:** accepted
**Milestone:** 5.0.2
**Created:** Wed Nov 09, 2016 03:42 PM UTC by Anders Widell
**Last Updated:** Wed Nov 09, 2016 03:42 PM UTC
**Owner:** Anders Widell

The result from poll() is incorrectly stored in an unsigned integer, which means that if poll() returns -1 we will interpret the result as a very large number. Subsequently, we read the possibly undefined values of pollfd.revents, and may perform a blocking read.
[tickets] [opensaf:tickets] #2181 mds: Use SOCK_CLOEXEC when creating sockets
---

** [tickets:#2181] mds: Use SOCK_CLOEXEC when creating sockets**

**Status:** accepted
**Milestone:** 5.2.FC
**Created:** Wed Nov 09, 2016 01:41 PM UTC by Anders Widell
**Last Updated:** Wed Nov 09, 2016 01:41 PM UTC
**Owner:** Anders Widell

To avoid a potential race between fcntl(FD_CLOEXEC) in one thread and exec() in another thread, use the SOCK_CLOEXEC flag when creating sockets.
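A small Linux-oriented sketch of the idea follows. The helper names `open_cloexec_socket` and `has_cloexec` are hypothetical and are not taken from MDS; the sketch only shows that SOCK_CLOEXEC sets the close-on-exec flag at creation, so there is no window between socket() and a later fcntl() in which another thread could exec().

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Creates a socket with close-on-exec set atomically at creation time. */
int open_cloexec_socket(int domain, int type) {
  return socket(domain, type | SOCK_CLOEXEC, 0);
}

/* Returns 1 if FD_CLOEXEC is set on fd, 0 if not, -1 on error. */
int has_cloexec(int fd) {
  assert(fd >= 0);
  int flags = fcntl(fd, F_GETFD);
  if (flags == -1) return -1;
  return (flags & FD_CLOEXEC) ? 1 : 0;
}
```

A socket created with plain `socket(AF_UNIX, SOCK_STREAM, 0)` would report 0 from `has_cloexec` until a separate `fcntl(fd, F_SETFD, FD_CLOEXEC)` call is made, and that gap is where the race lives.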
[tickets] [opensaf:tickets] #2153 smf: Fails to create a node group, admin owner/handle is lost
- **status**: accepted --> review

---

** [tickets:#2153] smf: Fails to create a node group, admin owner/handle is lost**

**Status:** review
**Milestone:** 5.1.1
**Created:** Mon Oct 31, 2016 02:54 PM UTC by elunlen
**Last Updated:** Fri Nov 04, 2016 12:22 PM UTC
**Owner:** elunlen

Even though the handles and admin owner needed to create a node group are created just before the node group itself, the creation may still fail because of a bad handle or a missing admin owner. To increase robustness, a mechanism for recreating the handles and admin owner, similar to the handling used when deleting a node group, should be implemented.
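One possible shape for such a mechanism is sketched below. Nothing here is SMF or IMM code: the return codes, `create_with_recreate`, and the fake create/reinit functions are all hypothetical and exist only to show the retry-after-recreate loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes for this sketch only. */
typedef enum { RC_OK, RC_BAD_HANDLE, RC_OTHER } rc_t;

typedef rc_t (*op_fn)(void *ctx);

/* Fake IMM state used to exercise the loop. */
typedef struct { int handle_valid; int reinit_calls; } fake_imm_t;

rc_t fake_create(void *ctx) {
  fake_imm_t *f = ctx;
  return f->handle_valid ? RC_OK : RC_BAD_HANDLE;
}

rc_t fake_reinit(void *ctx) {
  fake_imm_t *f = ctx;
  f->handle_valid = 1;
  f->reinit_calls++;
  return RC_OK;
}

/* Tries create() up to max_attempts times. After each bad-handle failure
   the handles and admin owner are recreated via reinit(). Any other
   result, success or error, is returned at once. */
rc_t create_with_recreate(void *ctx, op_fn create, op_fn reinit,
                          int max_attempts) {
  assert(create != NULL && reinit != NULL);
  for (int attempt = 0; attempt < max_attempts; attempt++) {
    rc_t rc = create(ctx);
    if (rc != RC_BAD_HANDLE) return rc;
    rc = reinit(ctx);
    if (rc != RC_OK) return rc;
  }
  return RC_BAD_HANDLE; /* attempts exhausted */
}
```

Bounding the number of attempts keeps a persistently broken IMM connection from turning into an endless loop.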
[tickets] [opensaf:tickets] #2180 mds: Ensure topology events are seen before data messages
---

** [tickets:#2180] mds: Ensure topology events are seen before data messages**

**Status:** assigned
**Milestone:** 5.2.FC
**Created:** Wed Nov 09, 2016 12:07 PM UTC by Anders Widell
**Last Updated:** Wed Nov 09, 2016 12:07 PM UTC
**Owner:** Anders Widell

TIPC does not guarantee that topology events are delivered before related data messages, which means that you can receive a message before you see the name subscription event which tells you that the sender is up. MDS needs to re-order the events by buffering incoming messages in these cases until the corresponding topology events have been received, so that the MDS user always sees the topology event before the data message.
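The buffering described above can be sketched as follows. This is not the MDS implementation: the per-sender state, the fixed-size buffers, and the names `on_data` and `on_topology_up` are hypothetical, and the "user" is just an array recording delivery order (0 for the topology event, a positive id for each data message).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 8
#define MAX_EVENTS 32
#define EV_TOPOLOGY_UP 0 /* data messages use positive ids */

/* Hypothetical per-sender state. */
typedef struct {
  bool up;                  /* topology "up" event already seen */
  int pending[MAX_PENDING]; /* messages that arrived too early */
  int n_pending;
} sender_t;

/* Hypothetical MDS user: records what it saw, in order. */
typedef struct {
  int events[MAX_EVENTS];
  int n_events;
} user_t;

static void deliver(user_t *u, int ev) {
  assert(u->n_events < MAX_EVENTS);
  u->events[u->n_events++] = ev;
}

/* Data message arrives. Returns false if it had to be dropped. */
bool on_data(sender_t *s, user_t *u, int msg_id) {
  assert(s != NULL && u != NULL && msg_id > 0);
  if (s->up) {
    deliver(u, msg_id); /* normal case: sender already known */
    return true;
  }
  if (s->n_pending == MAX_PENDING) return false; /* buffer full */
  s->pending[s->n_pending++] = msg_id; /* hold until topology event */
  return true;
}

/* Topology event arrives: deliver it first, then flush held messages. */
void on_topology_up(sender_t *s, user_t *u) {
  assert(s != NULL && u != NULL);
  s->up = true;
  deliver(u, EV_TOPOLOGY_UP);
  for (int i = 0; i < s->n_pending; i++) deliver(u, s->pending[i]);
  s->n_pending = 0;
}
```

A real implementation would also need a policy for messages whose topology event never arrives (for example a timeout); the sketch simply bounds the buffer.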
[tickets] [opensaf:tickets] #2158 AMF: IMMND dies at Opensaf start up phase causes AMFD heartbeat timeout
Hi Praveen,

This issue happened with the SC absence feature enabled but with no spare SC. It did not happen in the headless context. In the context of #1334, the standby SC reboots because cold sync is in progress? In this ticket, after a cluster reboot, both SCs were coming up. SC1 (active) rebooted before SC2 was (supposedly) assigned standby. In the amfnd trace, as you said, immnd had died before amfnd-SC2 initiated the middleware SUs. So amfnd-SC2 was not yet monitoring the immnd process and could not restart immnd. I think this situation could also happen in the first active SC startup sequence (not only failover), where immnd has already responded to NID and then dies before amfnd can monitor immnd's process. I think I should change *component* to osaf? Any ideas for a solution?

Thanks,
Minh

---

** [tickets:#2158] AMF: IMMND dies at Opensaf start up phase causes AMFD heartbeat timeout**

**Status:** unassigned
**Milestone:** 5.0.2
**Created:** Wed Nov 02, 2016 05:20 AM UTC by Minh Hon Chau
**Last Updated:** Tue Nov 08, 2016 07:09 AM UTC
**Owner:** nobody
**Attachments:**

- [osafamfnd_sc2](https://sourceforge.net/p/opensaf/tickets/2158/attachment/osafamfnd_sc2) (264.2 kB; application/octet-stream)

If IMMND dies during the OpenSAF startup phase, it is not restarted by AMF. The issue has been observed in the following situation:

- Restart the cluster.

- While the active controller is starting up, a critical component dies, which causes a node failfast:

Oct 25 12:51:21 SC-1 osafamfnd[7642]: ER safComp=ABC,safSu=1,safSg=2N,safApp=ABC Faulted due to:csiSetcallbackTimeout Recovery is:nodeFailfast
Oct 25 12:51:21 SC-1 osafamfnd[7642]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, SupervisionTime = 60

- In the meantime, the standby controller is requested to become active:

Oct 25 12:51:27 SC-2 tipclog[16221]: Lost link <1.1.2:eth0-1.1.1:eth0> on network plane A
Oct 25 12:51:27 SC-2 osafclmna[4336]: NO Starting to promote this node to a system controller
Oct 25 12:51:27 SC-2 osafrded[4387]: NO Requesting ACTIVE role

- IMMND also dies a bit later:

Oct 25 12:51:29 SC-2 osafimmnd[4536]: ER MESSAGE:44816 OUT OF ORDER my highest processed:44814 - exiting
Oct 25 12:51:29 SC-2 osafamfnd[7414]: NO saClmDispatch BAD_HANDLE

- Other services cannot initialize because IMMND is dead:

Oct 25 12:51:39 SC-2 osafamfd[7400]: WA saClmInitialize_4 returned 5
Oct 25 12:51:39 SC-2 osafamfd[7400]: WA saNtfInitialize returned 5
Oct 25 12:51:39 SC-2 osafntfimcnd[7501]: WA ntfimcn_ntf_init saNtfInitialize( returned SA_AIS_ERR_TIMEOUT (5)
Oct 25 12:51:39 SC-2 osafclmd[7386]: WA saImmOiImplementerSet returned 9
Oct 25 12:51:39 SC-2 osafntfd[7372]: WA saLogInitialize returns try again, retries...
Oct 25 12:51:39 SC-2 osaflogd[7358]: WA saImmOiImplementerSet returned SA_AIS_ERR_BAD_HANDLE (9)
Oct 25 12:51:39 SC-2 osafamfnd[7414]: WA saClmInitialize_4 returned 5
Oct 25 12:51:49 SC-2 osafamfd[7400]: WA saClmInitialize_4 returned 5
Oct 25 12:51:50 SC-2 osafamfd[7400]: WA saNtfInitialize returned 5
Oct 25 12:51:50 SC-2 osafamfnd[7414]: WA saClmInitialize_4 returned 5
Oct 25 12:52:00 SC-2 osafamfd[7400]: WA saClmInitialize_4 returned 5
Oct 25 12:52:00 SC-2 osafamfd[7400]: WA saNtfInitialize returned 5
Oct 25 12:52:00 SC-2 osafamfnd[7414]: WA saClmInitialize_4 returned 5
Oct 25 12:52:20 SC-2 osafamfnd[7414]: WA saClmInitialize_4 returned 5
Oct 25 12:52:20 SC-2 osafamfd[7400]: WA saNtfInitialize returned 5
Oct 25 12:52:20 SC-2 osafimmd[4489]: NO Extended intro from node 2210f

- In the end, the AMFD heartbeat times out:

Oct 25 12:53:57 SC-2 osafntfimcnd[7501]: WA ntfimcn_ntf_init saNtfInitialize( returned SA_AIS_ERR_TIMEOUT (5)
Oct 25 12:54:01 SC-2 osafamfnd[7414]: WA saClmInitialize_4 returned 5
Oct 25 12:54:01 SC-2 osafamfd[7400]: WA saNtfInitialize returned 5
Oct 25 12:54:01 SC-2 osafamfd[7400]: WA saClmInitialize_4 returned 5
Oct 25 12:54:07 SC-2 osafntfimcnd[7501]: WA ntfimcn_ntf_init saNtfInitialize( returned SA_AIS_ERR_TIMEOUT (5)
Oct 25 12:54:11 SC-2 osafamfnd[7414]: WA saClmInitialize_4 returned 5
Oct 25 12:54:11 SC-2 osafamfd[7400]: WA saClmInitialize_4 returned 5
Oct 25 12:54:11 SC-2 osafamfd[7400]: WA saNtfInitialize returned 5
Oct 25 12:54:15 SC-2 osafamfnd[7414]: ER AMF director heart beat timeout, generating core for amfd

In the AMFND trace on SC2, AMFND did not receive su_pres from AMFD, so it could not initiate the middleware components (including IMMND). AMFND was therefore not aware of IMMND's death and could not restart IMMND. The problem here is slightly different from #1828, which happened on a newly promoted SC (with the roamingSC feature) where AMFND had IMMND registered.
[tickets] [opensaf:tickets] #2015 mds: Use a separate process for writing MDS logs
- **status**: review --> fixed
- **Comment**:

changeset: 8297:9148b88c808f
user: Anders Widell
date: Wed Nov 09 10:11:17 2016 +0100
summary: mds: Convert the mds_log.c file to C++ [#2015]

changeset: 8298:b611bd543da4
user: Anders Widell
date: Wed Nov 09 10:11:32 2016 +0100
summary: mds: Use osaftransportd for writing MDS log messages [#2015]

changeset: 8299:9f1e6d7ea08c
user: Anders Widell
date: Wed Nov 09 10:11:32 2016 +0100
summary: dtm: Implement an MDS log server [#2015]

[staging:9148b8]
[staging:b611bd]
[staging:9f1e6d]

---

** [tickets:#2015] mds: Use a separate process for writing MDS logs**

**Status:** fixed
**Milestone:** 5.2.FC
**Created:** Fri Sep 09, 2016 09:59 AM UTC by Anders Widell
**Last Updated:** Wed Oct 19, 2016 11:49 AM UTC
**Owner:** Anders Widell

Currently, the MDS log entries are written to disk from within the MDS code. This file I/O is done while holding the MDS mutex, and can potentially block for a long time if file I/O is slow. In the best case, it will result in longer latency for MDS messages. In the worst case, it will result in an overflow of the TIPC receive buffer and loss of incoming MDS messages.

To avoid this problem, the idea is to let a separate process (one per node) do all the MDS logging file I/O. The MDS library will send log messages to this logger process using a UNIX socket. Incidentally, there is already a separate process osaf-transport-monitor which is responsible for rotating the MDS logs. This process can get the added responsibility to not only rotate the log files, but also write the log entries to the disk.
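As a rough illustration of the hand-off, the sketch below sends one log line over a UNIX datagram socket without blocking the sender. It does not reflect the protocol used by the committed changesets; the datagram socket type, the non-blocking send, and the helper names `send_log_line` and `recv_log_line` are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Library side: hand one log line to the logger process. The send never
   blocks, so a slow logger cannot stall the caller while it holds a
   mutex. Returns 0 on success, -1 if the line could not be sent. */
int send_log_line(int sock, const char *line) {
  assert(line != NULL);
  size_t len = strlen(line);
  ssize_t n = send(sock, line, len, MSG_DONTWAIT | MSG_NOSIGNAL);
  return (n == (ssize_t)len) ? 0 : -1;
}

/* Logger side: receive one datagram and NUL-terminate it. Returns the
   number of bytes received, or -1 on error. */
ssize_t recv_log_line(int sock, char *buf, size_t cap) {
  assert(buf != NULL && cap > 0);
  ssize_t n = recv(sock, buf, cap - 1, 0);
  if (n >= 0) buf[n] = '\0';
  return n;
}
```

The two ends can be tried in a single process with `socketpair(AF_UNIX, SOCK_DGRAM, 0, sv)`. The trade-off in this sketch is that a log line is dropped when the socket buffer is full, which is preferable to delaying MDS traffic.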