date:20161109

[tickets] [opensaf:tickets] #2134 AMF: Update RTA saAmfSISUHAState to IMM

2016-11-09 Thread Minh Hon Chau

- **status**: unassigned --> wontfix



---

** [tickets:#2134] AMF: Update RTA saAmfSISUHAState to IMM**

**Status:** wontfix
**Milestone:** 5.2.FC
**Created:** Thu Oct 20, 2016 07:58 PM UTC by Minh Hon Chau
**Last Updated:** Thu Nov 10, 2016 06:47 AM UTC
**Owner:** nobody


In a 2N SI-swap scenario, when AMFD sends a su_si assignment msg (QUIESCED, for 
example) to AMFND that changes the HA state of a SUSI assignment, AMFD updates 
its local state AVD_SU_SI_REL::state and checkpoints this change to the standby 
AMFD. However, AMFD does not update saAmfSISUHAState until it receives the 
su_si assignment response. Question:
(1). Should AMFD update the runtime attribute saAmfSISUHAState in IMM as soon 
as the local @state is updated in the implementer, so that IMM, the active 
AMFD, and the standby AMFD all stay in sync?
(2). Or should AMFD update saAmfSISUHAState in IMM only when it receives the 
su_si assignment response from AMFND, as currently implemented for some reason 
(to avoid exposing the change of saAmfSISUHAState to the user too early?)

Grepping for "avd_susi_update", which updates saAmfSISUHAState in IMM, also 
shows an inconsistency in usage: avd_susi_mod_send() sends the su_si msg and 
updates saAmfSISUHAState immediately, while avd_sg_su_si_mod_snd() does not.

Headless recovery relies on IMM to restore the state. If saAmfSISUHAState is 
not updated promptly and the node is rebooted during the headless stage, then 
after headless the saAmfSISUHAState read from IMM does not fit with many other 
states (SG fsm, SUSI fsm, saAmfSISUHAState of the other SUSIs).

My question is whether doing (1) will cause any problem for a normal cluster. 
The pending patches for #1725 part 2 currently implement (1).
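
The difference between options (1) and (2) can be sketched with a toy model. This is only an illustration under stated assumptions: `SuSi`, `ImmUpdatePolicy`, `send_assignment`, `on_response` and `in_sync` are hypothetical names, not amfd code; the three fields stand in for AVD_SU_SI_REL::state, the checkpointed copy on the standby AMFD, and saAmfSISUHAState in IMM.

```cpp
// Hypothetical, simplified stand-ins; these are not the real amfd types.
enum class HaState { Active, Standby, Quiesced, Quiescing };

enum class ImmUpdatePolicy {
  OnSend,      // option (1): update IMM as soon as the local state changes
  OnResponse,  // option (2): update IMM only when AMFND responds
};

struct SuSi {
  HaState local_state = HaState::Active;  // models AVD_SU_SI_REL::state
  HaState ckpt_state = HaState::Active;   // copy held by the standby AMFD
  HaState imm_state = HaState::Active;    // models saAmfSISUHAState in IMM
};

// AMFD sends a su_si assignment msg that changes the HA state.
void send_assignment(SuSi& susi, HaState target, ImmUpdatePolicy policy) {
  susi.local_state = target;
  susi.ckpt_state = target;  // checkpointed to the standby right away
  if (policy == ImmUpdatePolicy::OnSend) susi.imm_state = target;
}

// AMFD receives the su_si assignment response from AMFND.
void on_response(SuSi& susi, ImmUpdatePolicy policy) {
  if (policy == ImmUpdatePolicy::OnResponse) susi.imm_state = susi.local_state;
}

// True when IMM, the active AMFD and the standby AMFD all agree.
bool in_sync(const SuSi& susi) {
  return susi.local_state == susi.ckpt_state &&
         susi.local_state == susi.imm_state;
}
```

With `OnResponse`, `in_sync()` is false between the send and the response. That window is the one in which a headless reboot would leave IMM holding the old HA state.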



---

Sent from sourceforge.net because opensaf-tickets@lists.sourceforge.net is 
subscribed to https://sourceforge.net/p/opensaf/tickets/

To unsubscribe from further messages, a project admin can change settings at 
https://sourceforge.net/p/opensaf/admin/tickets/options.  Or, if this is a 
mailing list, you can unsubscribe from the mailing list.--
Developer Access Program for Intel Xeon Phi Processors
Access to Intel Xeon Phi processor-based developer platforms.
With one year of Intel Parallel Studio XE.
Training and support from Colfax.
Order your platform today. http://sdm.link/xeonphi___
Opensaf-tickets mailing list
Opensaf-tickets@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets

[tickets] [opensaf:tickets] #2134 AMF: Update RTA saAmfSISUHAState to IMM

2016-11-09 Thread Minh Hon Chau

Hi Praveen,

I have updated #2133 and #1354. Since part 2 of #1725 has been pushed, I am 
closing this ticket. 
There is a very small probability that an application may still require this 
attribute to be updated after the assignment sequence, as before; this applies 
to the case of avd_sg_su_si_mod_snd(). But it is unlikely to matter, since the 
other one, avd_susi_mod_send(), has already been done the opposite way. 

Thanks,
Minh


---

** [tickets:#2134] AMF: Update RTA saAmfSISUHAState to IMM**

**Status:** unassigned
**Milestone:** 5.2.FC
**Created:** Thu Oct 20, 2016 07:58 PM UTC by Minh Hon Chau
**Last Updated:** Thu Nov 10, 2016 05:30 AM UTC
**Owner:** nobody






[tickets] [opensaf:tickets] #2133 AMF: Rollback admin shutdown/lock SI operation if node failover

2016-11-09 Thread Minh Hon Chau

- **summary**: AMF: Rollback admin shutdown SI operation if node failover --> 
AMF: Rollback admin shutdown/lock SI operation if node failover
- **Type**: discussion --> defect
- **Milestone**: 5.2.FC --> future



---

** [tickets:#2133] AMF: Rollback admin shutdown/lock SI operation if node 
failover**

**Status:** unassigned
**Milestone:** future
**Created:** Thu Oct 20, 2016 06:49 PM UTC by Minh Hon Chau
**Last Updated:** Thu Nov 10, 2016 06:36 AM UTC
**Owner:** nobody


In a scenario where an SI is shut down, the QUIESCING csi callback is delayed, 
and the node hosting the SU with this pending csi callback is then rebooted, 
the result of the operation differs between SGs:
- For 2N: the SI admin state is rolled back to UNLOCKED
- For Nway: the SI admin state moves to LOCKED
- For NpM: not tested; from browsing SG_NPM::node_fail_si_oper, it looks like 
the SI admin state rolls back to UNLOCKED

My question is whether the result of these scenarios should be consistent, and 
what the expected outcome is.
Also, the handling of node_fail_si_oper for admin lock is not consistent: for 
2N the admin state remains LOCKED, while for NpM it rolls back to UNLOCKED.
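
The outcomes listed above can be written down as data and checked for agreement. This is a rough sketch only: `AdminState`, `Outcomes`, `shutdown_outcomes`, `lock_outcomes` and `consistent` are hypothetical names, and the NpM entries reflect code reading rather than a test run, as noted in the ticket.

```cpp
#include <map>
#include <set>
#include <string>

enum class AdminState { Unlocked, Locked };

// Final SI admin state after node failover, keyed by redundancy model.
using Outcomes = std::map<std::string, AdminState>;

// Outcomes reported in this ticket for admin shutdown of an SI.
Outcomes shutdown_outcomes() {
  return {{"2N", AdminState::Unlocked},
          {"Nway", AdminState::Locked},
          {"NpM", AdminState::Unlocked}};  // from code reading, not tested
}

// Outcomes reported in this ticket for admin lock of an SI.
Outcomes lock_outcomes() {
  return {{"2N", AdminState::Locked}, {"NpM", AdminState::Unlocked}};
}

// True when every redundancy model ends in the same admin state.
bool consistent(const Outcomes& outcomes) {
  std::set<AdminState> seen;
  for (const auto& entry : outcomes) seen.insert(entry.second);
  return seen.size() <= 1;
}
```

Both tables fail the `consistent()` check, which is the inconsistency this ticket asks about. Which single outcome is the expected one remains an open question here.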



[tickets] [opensaf:tickets] #2133 AMF: Rollback admin shutdown SI operation if node failover

2016-11-09 Thread Minh Hon Chau

The lock operation does not behave consistently between SGs when a failover 
occurs during the lock command. Marking this ticket as a defect for the future.


---

** [tickets:#2133] AMF: Rollback admin shutdown SI operation if node failover**

**Status:** unassigned
**Milestone:** 5.2.FC
**Created:** Thu Oct 20, 2016 06:49 PM UTC by Minh Hon Chau
**Last Updated:** Mon Oct 24, 2016 01:38 PM UTC
**Owner:** nobody





[tickets] [opensaf:tickets] #1354 amf: sync amfd and amfnd for assignment related logging, imm updates.

2016-11-09 Thread Minh Hon Chau

https://sourceforge.net/p/opensaf/tickets/2134/
Discussion in #2134 has information for solution of #1354


---

** [tickets:#1354] amf: sync amfd and amfnd for assignment related logging, imm 
updates.**

**Status:** assigned
**Milestone:** 5.0.2
**Created:** Wed Apr 29, 2015 04:22 AM UTC by Praveen
**Last Updated:** Tue Sep 20, 2016 06:04 PM UTC
**Owner:** Praveen


This ticket is based on a user list query which goes like this:
"
When AMF begins "attempting" to assign CSI assignments it updates the runtime 
attributes and associations to show assigned and active BEFORE the active CSI 
has been accepted by the Component.

So if I query the CSI using amf-state csiass before the CSI has been accepted 
by the Component, amf-state will show it assigned and active.

The same holds for the SI. The SI will be assigned to the SU as active BEFORE 
all the CSI assignments have been accepted by the components in that SU. I 
checked the runtime state with both immlist and amf-state. Both were consistent 
in being incorrect.

Interestingly enough the opensaf log entries show the correct behavior. The log 
entry indicating the SI was assigned to the SU is not logged until all CSI 
assignments have been accepted.

"





[tickets] [opensaf:tickets] #2175 amfd: null SU during CCB modify apply on standby director

2016-11-09 Thread Gary Lee

- **status**: review --> fixed
- **Milestone**: 5.1.1 --> 5.0.2



---

** [tickets:#2175] amfd: null SU during CCB modify apply on standby director**

**Status:** fixed
**Milestone:** 5.0.2
**Created:** Tue Nov 08, 2016 06:37 AM UTC by Gary Lee
**Last Updated:** Wed Nov 09, 2016 03:17 AM UTC
**Owner:** Gary Lee


su is NULL and subsequently causes a segfault on standby director

line 1833 corresponds to su->saAmfSUMaintenanceCampaign = "";

Cause is probably the same as #1932?

~~~
Full backtrace:
#0 0x7f5ee5adc036 in std::string::assign(char const*, unsigned long) () 
from /usr/lib64/libstdc++.so.6
No symbol table info available.
#1 0x00495ab9 in assign (__s=0x4c8b33 "", this=0x28) at 
/usr/include/c++/4.8/bits/basic_string.h:1131
No locals.
#2 operator= (__s=0x4c8b33 "", this=0x28) at 
/usr/include/c++/4.8/bits/basic_string.h:555
No locals.
#3 su_ccb_apply_modify_hdlr (opdata=opdata@entry=0x2580cf4) at 
../../../../../../../opensaf/osaf/services/saf/amf/amfd/su.cc:1833
attr_mod = 0x2580f48
i = 
su = 0x0
value_is_deleted = true
_FUNCTION_ = "su_ccb_apply_modify_hdlr"
#4 0x00498d78 in su_ccb_apply_cb (opdata=0x2580cf4) at 
../../../../../../../opensaf/osaf/services/saf/amf/amfd/su.cc:1985
su = 
_FUNCTION_ = "su_ccb_apply_cb"
#5 0x00439fa6 in ccb_apply_cb (immoi_handle=, 
ccb_id=218) at 
../../../../../../../opensaf/osaf/services/saf/amf/amfd/imm.cc:1226
ccb_util_ccb_data = 
type = 
temp = 
_FUNCTION_ = "ccb_apply_cb"
opdata = 0x0
next = 0x2517540
#6 0x7f5ee6818329 in imma_process_callback_info (cb=cb@entry=0x7f5ee6a373a0 
, cl_node=0x2517cc0, callback=callback@entry=0x7f5ed8004b60, 
immHandle=901943263503) at 
../../../../../../../opensaf/osaf/libs/agents/saf/imma/imma_proc.c:2245
ccbid = 218
privateAugOmHandle = 0
_FUNCTION_ = "imma_process_callback_info"
clientCapable = true
isPbeOp = false
isExtendedNameValid = false
isAttrExtendedName = false
#7 0x7f5ee681aec9 in imma_hdl_callbk_dispatch_all (cb=0x7f5ee6a373a0 
, immHandle=901943263503) at 
../../../../../../../opensaf/osaf/libs/agents/saf/imma/imma_proc.c:1732
callback = 0x7f5ed8004b60
cl_node = 0x2517cc0
#8 0x7f5ee680efc4 in saImmOiDispatch (immOiHandle=901943263503, 
dispatchFlags=SA_DISPATCH_ALL) at 
../../../../../../../opensaf/osaf/libs/agents/saf/imma/imma_oi_api.c:609
rc = SA_AIS_OK
cl_node = 0x0
locked = false
pend_fin = 0
pend_dis = 0
_FUNCTION_ = "saImmOiDispatch"
#9 0x00407b90 in main_loop () at 
../../../../../../../opensaf/osaf/services/saf/amf/amfd/main.cc:722
pollretval = 
evt = 
polltmo = 
term_fd = 14
cb = 0x6e8900 <_control_block>
error = 
#10 main (argc=, argv=) at 
../../../../../../../opensaf/osaf/services/saf/amf/amfd/main.cc:848
~~~
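
The backtrace shows `su = 0x0` in frame #3 and `this=0x28` in the string assignment, which is what a write to a member at a small offset from a null object pointer would look like. A defensive lookup can be sketched as below. This is a sketch only: `SuDb`, `find_su` and `clear_maintenance_campaign` are hypothetical names, and this is not necessarily the fix that was pushed for this ticket.

```cpp
#include <map>
#include <string>

// Reduced stand-in for the SU object; only the field from the crash site.
struct Su {
  std::string saAmfSUMaintenanceCampaign;
};

// Hypothetical name-to-object database standing in for amfd's SU lookup.
using SuDb = std::map<std::string, Su>;

// Returns nullptr when no SU with this DN is known to this director.
Su* find_su(SuDb& db, const std::string& dn) {
  auto it = db.find(dn);
  return it == db.end() ? nullptr : &it->second;
}

// Clears the maintenance campaign. Returns false instead of dereferencing
// a null SU, so the caller can log the failure and carry on.
bool clear_maintenance_campaign(SuDb& db, const std::string& dn) {
  Su* su = find_su(db, dn);
  if (su == nullptr) return false;
  su->saAmfSUMaintenanceCampaign = "";
  return true;
}
```

A guard like this only avoids the segfault. Why the SU is missing on the standby director (possibly the same cause as #1932, as asked above) still has to be answered separately.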



[tickets] [opensaf:tickets] #2134 AMF: Update RTA saAmfSISUHAState to IMM

2016-11-09 Thread Praveen

Hi Minh,

Any update on this?

Thanks,
Praveen


---

** [tickets:#2134] AMF: Update RTA saAmfSISUHAState to IMM**

**Status:** unassigned
**Milestone:** 5.2.FC
**Created:** Thu Oct 20, 2016 07:58 PM UTC by Minh Hon Chau
**Last Updated:** Wed Nov 09, 2016 06:52 AM UTC
**Owner:** nobody






[tickets] [opensaf:tickets] #1902 AMF: Extend escalation support during headless

2016-11-09 Thread Minh Hon Chau

Hi Praveen,

I am going through the component failover test cases again, which were not 
stable before. There is no other pending implementation besides this; I just 
want to test again.

Thanks,
Minh


---

** [tickets:#1902] AMF: Extend escalation support during headless**

**Status:** assigned
**Milestone:** 5.2.FC
**Created:** Wed Jun 29, 2016 12:02 PM UTC by Minh Hon Chau
**Last Updated:** Thu Nov 10, 2016 04:39 AM UTC
**Owner:** Minh Hon Chau


If a comp/su failover occurs during headless, amfnd will escalate to reboot. 
This unexpectedly impacts other comp/su that are up and running if there is no 
node failover escalation configured on the faulty comp/su. 

2016-06-29 21:30:07 PL-4 osafamfnd[429]: NO 
'safComp=AmfDemo2,safSu=SU4,safSg=AmfDemoTwon,safApp=AmfDemoTwon' faulted due 
to 'avaDown' : Recovery is 'suFailover'
2016-06-29 21:30:07 PL-4 osafamfnd[429]: NO Terminating components of 
'safSu=SU4,safSg=AmfDemoTwon,safApp=AmfDemoTwon'(abruptly & unordered)
2016-06-29 21:30:07 PL-4 osafamfnd[429]: NO 
'safSu=SU4,safSg=AmfDemoTwon,safApp=AmfDemoTwon' Presence State INSTANTIATED => 
TERMINATING
2016-06-29 21:30:07 PL-4 osafamfnd[429]: NO 
'safSu=SU4,safSg=AmfDemoTwon,safApp=AmfDemoTwon' Presence State TERMINATING => 
TERMINATING
2016-06-29 21:30:07 PL-4 osafamfnd[429]: NO 
'safSu=SU4,safSg=AmfDemoTwon,safApp=AmfDemoTwon' Presence State TERMINATING => 
TERMINATING
2016-06-29 21:30:07 PL-4 osafamfnd[429]: Rebooting OpenSAF NodeId = 132111 EE 
Name = , Reason: Can't perform recovery while controllers are down. Recovery is 
node failfast., OwnNodeId = 132111, SupervisionTime = 60
2016-06-29 21:30:07 PL-4 opensaf_reboot: Rebooting local node; timeout=60

This ticket will remove the unexpected reboot due to failover during headless, 
which is mentioned as a limitation in the AMF OpenSAF documentation.
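
The escalation described above can be sketched as a small decision function. This is a sketch under assumptions: `Recovery`, `effective_recovery_before` and `effective_recovery_after` are hypothetical names, and the "after" variant is only a guess at the intent of this ticket, not the logic of the pushed changesets.

```cpp
// Reduced set of recovery actions, for illustration only.
enum class Recovery {
  ComponentRestart,
  ComponentFailover,
  SuFailover,
  NodeFailover,
  NodeFailfast,
};

// Behaviour described in the ticket: while headless, a comp/su failover
// cannot be carried out, so amfnd escalates to node failfast (reboot).
Recovery effective_recovery_before(Recovery configured, bool headless) {
  if (!headless) return configured;
  switch (configured) {
    case Recovery::ComponentFailover:
    case Recovery::SuFailover:
      return Recovery::NodeFailfast;
    default:
      return configured;
  }
}

// Assumed intent of the ticket: keep the configured recovery while headless,
// so that healthy comp/su on the same node are not taken down by a reboot.
Recovery effective_recovery_after(Recovery configured, bool /*headless*/) {
  return configured;
}
```

In the log above, the 'suFailover' recovery for SU4 corresponds to `effective_recovery_before(Recovery::SuFailover, true)`, which yields node failfast.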




[tickets] [opensaf:tickets] #1902 AMF: Extend escalation support during headless

2016-11-09 Thread Praveen

Hi Minh,
Is there anything pending for this ticket besides documentation?
There is a separate documentation ticket #2179.

Thanks,
Praveen


---

** [tickets:#1902] AMF: Extend escalation support during headless**

**Status:** assigned
**Milestone:** 5.2.FC
**Created:** Wed Jun 29, 2016 12:02 PM UTC by Minh Hon Chau
**Last Updated:** Thu Nov 10, 2016 04:37 AM UTC
**Owner:** Minh Hon Chau






[tickets] [opensaf:tickets] #1902 AMF: Extend escalation support during headless

2016-11-09 Thread Praveen

changeset:   8278:043297f42a74
user:        minh-chau
date:        Wed Nov 02 15:40:13 2016 +1100
summary:     AMFD: Do not add su operation list if su has no pending susi assignment [#1902]

changeset:   8277:09a006b409ba
user:        minh-chau
date:        Wed Nov 02 15:40:10 2016 +1100
summary:     AMFD: Do not recover if no pending susi assignment after headless [#1902]



---

** [tickets:#1902] AMF: Extend escalation support during headless**

**Status:** assigned
**Milestone:** 5.2.FC
**Created:** Wed Jun 29, 2016 12:02 PM UTC by Minh Hon Chau
**Last Updated:** Mon Nov 07, 2016 07:30 PM UTC
**Owner:** Minh Hon Chau






[tickets] [opensaf:tickets] #2112 amfd: multiple SUs incorrectly assigned to single node

2016-11-09 Thread Minh Hon Chau

- **status**: review --> fixed



---

** [tickets:#2112] amfd: multiple SUs incorrectly assigned to single node**

**Status:** fixed
**Milestone:** 5.0.2
**Created:** Tue Oct 11, 2016 11:56 PM UTC by Gary Lee
**Last Updated:** Thu Nov 10, 2016 02:32 AM UTC
**Owner:** Minh Hon Chau


Multiple SUs are assigned to a single node after SC absence.

To reproduce:

0) load nwayactive demo
1) stop SCs
2) restart SCs

The following is observed:

root@SC-1:~# immlist safSu=SU4,safSg=AmfDemo,safApp=AmfDemo2
...
saAmfSUHostedByNode    SA_NAME_T    safAmfNode=PL-4,safAmfCluster=myAmfCluster (42)

root@SC-1:~# immlist safSu=SU2,safSg=AmfDemo,safApp=AmfDemo2
...
saAmfSUHostedByNode    SA_NAME_T    safAmfNode=PL-4,safAmfCluster=myAmfCluster (42)

SU2 is indeed assigned to PL-4, but SU4 was assigned to one of the SCs and is 
not assigned to PL-4.

Operations on SU4 will lead to a crash of amfnd on PL-4.
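
A mismatch like the one above could be flagged by comparing the saAmfSUHostedByNode values read from IMM with the placement amfd actually holds. This is a minimal sketch, assuming both views are available as plain maps; `SuToNode` and `hosted_by_mismatches` are hypothetical names, not part of amfd or of the fix for this ticket.

```cpp
#include <map>
#include <string>
#include <vector>

// SU DN -> node DN.
using SuToNode = std::map<std::string, std::string>;

// Returns the SUs whose hosting node as read from IMM differs from the
// node recorded in the actual placement, or which have no placement at all.
std::vector<std::string> hosted_by_mismatches(const SuToNode& imm,
                                              const SuToNode& actual) {
  std::vector<std::string> out;
  for (const auto& entry : imm) {
    auto it = actual.find(entry.first);
    if (it == actual.end() || it->second != entry.second) {
      out.push_back(entry.first);
    }
  }
  return out;
}
```

For the output above, such a check would report SU4, since IMM shows PL-4 while the SU was assigned to one of the SCs.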




[tickets] [opensaf:tickets] #2112 amfd: multiple SUs incorrectly assigned to single node

2016-11-09 Thread Minh Hon Chau

Pushed to the 5.0 branch

changeset:   8302:6557805ec604
branch:      opensaf-5.0.x



---

** [tickets:#2112] amfd: multiple SUs incorrectly assigned to single node**

**Status:** review
**Milestone:** 5.0.2
**Created:** Tue Oct 11, 2016 11:56 PM UTC by Gary Lee
**Last Updated:** Thu Nov 10, 2016 02:24 AM UTC
**Owner:** Minh Hon Chau






[tickets] [opensaf:tickets] #2112 amfd: multiple SUs incorrectly assigned to single node

2016-11-09 Thread Gary Lee

- **Milestone**: 5.1.1 --> 5.0.2



---

** [tickets:#2112] amfd: multiple SUs incorrectly assigned to single node**

**Status:** review
**Milestone:** 5.0.2
**Created:** Tue Oct 11, 2016 11:56 PM UTC by Gary Lee
**Last Updated:** Thu Nov 10, 2016 01:07 AM UTC
**Owner:** Minh Hon Chau






[tickets] [opensaf:tickets] #2141 AMF: AMFD fails to stop clm track during role transition from active to quiesced

2016-11-09 Thread Minh Hon Chau

Attached a patch with the solution described in the ticket.


Attachments:

- [2141.diff](https://sourceforge.net/p/opensaf/tickets/_discuss/thread/cdd5c818/5bc0/attachment/2141.diff) (1.8 kB; text/x-patch)


---

** [tickets:#2141] AMF: AMFD fails to stop clm track during role transition 
from active to quiesced**

**Status:** assigned
**Milestone:** 5.0.2
**Created:** Wed Oct 26, 2016 03:57 AM UTC by Minh Hon Chau
**Last Updated:** Wed Oct 26, 2016 03:57 AM UTC
**Owner:** Minh Hon Chau


In a scenario of swapping the 2N OpenSAF SI (switchover), when the active AMFD 
is moving to quiesced, AMFD fails to stop the clm track callback because of 
return code SA_AIS_ERR_TIMEOUT. Currently AMFD only logs an error, and the 
track is not actually stopped. As a result, the new standby AMFD (previously 
quiesced) receives the clm track callback when another node leaves the cluster. 
Eventually, clm_node_exit_complete() is triggered at the standby AMFD, which 
should not happen.

The consequence is that the standby AMFD fails to resolve a checkpoint update 
if another node reboots afterward.

In SC2 (standby AMFD)
Oct 26 11:59:26.061288 osafamfd [468:clm.cc:0216] >> clm_track_cb: '0' '4' '1'
Oct 26 11:59:26.061294 osafamfd [468:clm.cc:0281] TR  Node Left: 
rootCauseEntity safNode=PL-3,safCluster=myClmCluster for node 131855
Oct 26 11:59:26.061298 osafamfd [468:clm.cc:0185] >> clm_node_exit_complete: 
2030f
Oct 26 11:59:26.061301 osafamfd [468:ndproc.cc:1139] >> avd_node_failover: 
'safAmfNode=PL-3,safAmfCluster=myAmfCluster'
...
Oct 26 11:59:26.070895 osafamfd [468:sg_nored_fsm.cc:0770] >> node_fail: 
safSu=PL-3,safSg=NoRed,safApp=OpenSAF, TEST sg_fsm_state=0
...
Oct 26 11:59:26.071007 osafamfd [468:siass.cc:0496] : >> avd_susi_delete: 
safSu=PL-3,safSg=NoRed,safApp=OpenSAF safSi=NoRed4,safApp=OpenSAF

In SC1 (active AMFD)
Oct 26 11:59:26.057724 osafamfd [488:ndfsm.cc:0671] >> avd_mds_avnd_down_evh: 
2030f, 0x7da3d0
Oct 26 11:59:26.057732 osafamfd [488:ndproc.cc:1139] >> avd_node_failover: 
'safAmfNode=PL-3,safAmfCluster=myAmfCluster'
Oct 26 11:59:26.057739 osafamfd [488:ndfsm.cc:0999] >> avd_node_mark_absent 
...
Oct 26 11:59:26.066576 osafamfd [488:sg_nored_fsm.cc:0770] >> node_fail: 
safSu=PL-3,safSg=NoRed,safApp=OpenSAF, TEST sg_fsm_state=0
...
Oct 26 11:59:26.066783 osafamfd [488:siass.cc:0496] : >> avd_susi_delete: 
safSu=PL-3,safSg=NoRed,safApp=OpenSAF safSi=NoRed4,safApp=OpenSAF

When AMFD on SC1 deletes the SUSI, it checkpoints the deletion to the standby
AMFD. The standby AMFD fails to resolve this update because it has already
deleted the SUSI itself:

Oct 26 11:59:26.073665 osafamfd [468:ckpt_dec.cc:0659] >> dec_siass: i_action 
'2'
...
Oct 26 11:59:26.073700 osafamfd [468:ckpt_updt.cc:0405] >> avd_ckpt_siass: 
'safSi=NoRed4,safApp=OpenSAF' 'safSu=PL-3,safSg=NoRed,safApp=OpenSAF'
Oct 26 11:59:26.073704 osafamfd [468:si.cc:0395] >> avd_si_get: 
safSi=NoRed4,safApp=OpenSAF
Oct 26 11:59:26.073706 osafamfd [468:si.cc:0396] << avd_si_get 
...
Oct 26 11:59:26.073722 osafamfd [468:ckpt_updt.cc:0508] ER avd_ckpt_siass: 
safSu=PL-3,safSg=NoRed,safApp=OpenSAF safSi=NoRed4,safApp=OpenSAF does not exist
Oct 26 11:59:26.073725 osafamfd [468:ckpt_dec.cc:0690] << dec_siass 

This error can be reproduced by commenting out avd_clm_track_stop() in
amfd_switch_actv_qsd() to simulate the SA_AIS_ERR_TIMEOUT error code.

A simple solution could be: when the standby AMFD receives clm_track_cb(), it
retries stopping the track and returns from the callback immediately.
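
The proposed guard can be sketched as follows. This is only an illustration of the idea, not the attached patch: `HaRole`, `should_process_track_cb` and the injected `stop_track` callable are hypothetical stand-ins for the AMFD role state and for the call that stops CLM tracking.

```cpp
#include <functional>

// Hypothetical stand-in for the director's HA role.
enum class HaRole { kActive, kStandby };

// Decides whether a CLM track callback should be processed.
// On a non-active director the notification is ignored and, if tracking is
// still believed to be running, one more attempt is made to stop it.
// `stop_track` is an assumed callable returning true when tracking stopped.
bool should_process_track_cb(HaRole role, bool& tracking,
                             const std::function<bool()>& stop_track) {
  if (role == HaRole::kActive) {
    return true;  // the active director handles the notification as usual
  }
  if (tracking && stop_track()) {
    tracking = false;  // retry succeeded, no further callbacks expected
  }
  return false;  // return from the callback without touching cluster state
}
```

With this shape, a failed retry leaves `tracking` set, so the next unwanted callback triggers another attempt while still being ignored.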



[tickets] [opensaf:tickets] #2112 amfd: multiple SUs incorrectly assigned to single node

2016-11-09 Thread Minh Hon Chau

Pushed for 5.1 and default:

changeset:   8301:a25d5d50b01a
changeset:   8300:773643625dc6

Could it happen with 5.0?


---

** [tickets:#2112] amfd: multiple SUs incorrectly assigned to single node**

**Status:** review
**Milestone:** 5.1.1
**Created:** Tue Oct 11, 2016 11:56 PM UTC by Gary Lee
**Last Updated:** Mon Oct 24, 2016 04:57 AM UTC
**Owner:** Minh Hon Chau


Multiple SUs are assigned to a single node after SC absence.

To reproduce:

0) load nwayactive demo
1) stop SCs
2) restart SCs

The following is observed:

root@SC-1:~# immlist safSu=SU4,safSg=AmfDemo,safApp=AmfDemo2
...
saAmfSUHostedByNode  SA_NAME_T  safAmfNode=PL-4,safAmfCluster=myAmfCluster (42)

root@SC-1:~# immlist safSu=SU2,safSg=AmfDemo,safApp=AmfDemo2
...
saAmfSUHostedByNode  SA_NAME_T  safAmfNode=PL-4,safAmfCluster=myAmfCluster (42)

SU2 is indeed assigned to PL-4, but SU4 was assigned to one of the SCs and is 
not assigned to PL-4.

Operations on SU4 will lead to a crash of amfnd on PL-4.




[tickets] [opensaf:tickets] #2182 mds: MDS receive thread may hang when a signal is caught

2016-11-09 Thread Anders Widell

- **status**: accepted --> review



---

** [tickets:#2182] mds: MDS receive thread may hang when a signal is caught**

**Status:** review
**Milestone:** 5.0.2
**Created:** Wed Nov 09, 2016 03:42 PM UTC by Anders Widell
**Last Updated:** Wed Nov 09, 2016 03:42 PM UTC
**Owner:** Anders Widell


The result from poll() is incorrectly stored in an unsigned integer, which 
means that if poll() returns -1 we will interpret the result as a very large 
number. Subsequently, we read the possibly undefined values of pollfd.revents, 
and may perform a blocking read.
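
To illustrate the failure mode described above, here is a minimal sketch. It is not the MDS receive loop; `wait_readable` and `unsigned_result_looks_ready` are hypothetical helpers. The first keeps the poll() result in a signed int and checks for failure before touching revents; the second shows why an unsigned variable hides the -1.

```cpp
#include <poll.h>

#include <cassert>
#include <cerrno>

// Hypothetical helper (not the actual MDS code): waits until fd is readable.
// Returns 1 if readable, 0 on timeout or an interrupted call (or when only
// other events are reported), and -1 on any other poll() error.
int wait_readable(int fd, int timeout_ms) {
  assert(timeout_ms >= -1);
  struct pollfd pfd;
  pfd.fd = fd;
  pfd.events = POLLIN;
  pfd.revents = 0;

  // poll() returns int; keeping the result signed lets -1 be detected.
  int rc = poll(&pfd, 1, timeout_ms);
  if (rc < 0) {
    // Do not inspect revents here: its contents are unspecified on failure.
    return (errno == EINTR) ? 0 : -1;
  }
  if (rc == 0) return 0;  // timeout, nothing to read
  return (pfd.revents & POLLIN) ? 1 : 0;
}

// Shows the pitfall: -1 stored in an unsigned variable compares as > 0,
// so the caller would go on to read revents and possibly block in read().
bool unsigned_result_looks_ready(int poll_result) {
  unsigned int stored = static_cast<unsigned int>(poll_result);
  return stored > 0;
}
```

Treating EINTR as "nothing ready" lets the caller simply loop and call poll() again after a signal has been handled.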



[tickets] [opensaf:tickets] #2182 mds: MDS receive thread may hang when a signal is caught

2016-11-09 Thread Anders Widell




---

** [tickets:#2182] mds: MDS receive thread may hang when a signal is caught**

**Status:** accepted
**Milestone:** 5.0.2
**Created:** Wed Nov 09, 2016 03:42 PM UTC by Anders Widell
**Last Updated:** Wed Nov 09, 2016 03:42 PM UTC
**Owner:** Anders Widell


The result from poll() is incorrectly stored in an unsigned integer, which 
means that if poll() returns -1 we will interpret the result as a very large 
number. Subsequently, we read the possibly undefined values of pollfd.revents, 
and may perform a blocking read.



[tickets] [opensaf:tickets] #2181 mds: Use SOCK_CLOEXEC when creating sockets

2016-11-09 Thread Anders Widell




---

** [tickets:#2181] mds: Use SOCK_CLOEXEC when creating sockets**

**Status:** accepted
**Milestone:** 5.2.FC
**Created:** Wed Nov 09, 2016 01:41 PM UTC by Anders Widell
**Last Updated:** Wed Nov 09, 2016 01:41 PM UTC
**Owner:** Anders Widell


To avoid a potential race between fcntl(FD_CLOEXEC) in one thread and exec() in
another thread, use the SOCK_CLOEXEC flag when creating sockets.
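
A minimal sketch of the idea, assuming a Linux system where socket() accepts SOCK_CLOEXEC in the type argument. The helper names and the choice of AF_UNIX/SOCK_STREAM are illustrative only and are not taken from the MDS code.

```cpp
#include <fcntl.h>
#include <sys/socket.h>

// Hypothetical helper: creates a socket that already has close-on-exec set.
// Because the flag is applied inside socket(), there is no window in which
// another thread could fork() and exec() while the descriptor is inheritable.
int open_cloexec_socket() {
  return socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
}

// Reports whether FD_CLOEXEC is set on the given descriptor.
bool has_cloexec(int fd) {
  int flags = fcntl(fd, F_GETFD);
  return flags != -1 && (flags & FD_CLOEXEC) != 0;
}
```

The racy alternative is a plain socket() call followed by fcntl(fd, F_SETFD, FD_CLOEXEC), which leaves the descriptor inheritable between the two calls.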



[tickets] [opensaf:tickets] #2153 smf: Fails to create a node group, admin owner/handle is lost

2016-11-09 Thread elunlen

- **status**: accepted --> review



---

** [tickets:#2153] smf: Fails to create a node group, admin owner/handle is 
lost**

**Status:** review
**Milestone:** 5.1.1
**Created:** Mon Oct 31, 2016 02:54 PM UTC by elunlen
**Last Updated:** Fri Nov 04, 2016 12:22 PM UTC
**Owner:** elunlen


Even though the handles and admin owner needed to create a node group are
created just before the node group is created, the creation may still fail
because of a bad handle or a missing admin owner. To increase robustness, a
mechanism for recreating the handles and admin owner, similar to the handling
used when deleting a node group, should be implemented.



[tickets] [opensaf:tickets] #2180 mds: Ensure topology events are seen before data messages

2016-11-09 Thread Anders Widell




---

** [tickets:#2180] mds: Ensure topology events are seen before data messages**

**Status:** assigned
**Milestone:** 5.2.FC
**Created:** Wed Nov 09, 2016 12:07 PM UTC by Anders Widell
**Last Updated:** Wed Nov 09, 2016 12:07 PM UTC
**Owner:** Anders Widell


TIPC does not guarantee that topology events are delivered before related data 
messages, which means that you can receive a message before you see the name 
subscription event which tells you that the sender is up.

MDS needs to re-order the events by buffering incoming messages in these cases 
until the corresponding topology events have been received, so that the MDS 
user always sees the topology event before the data message.
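
The re-ordering described above could look roughly like this. It is a sketch under assumptions, not the MDS design: `ReorderBuffer`, the integer sender id and the string-based delivery list are all hypothetical and only show the buffering rule.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical re-ordering buffer. Each call returns the items that may be
// delivered to the user now, in order.
class ReorderBuffer {
 public:
  using Delivery = std::vector<std::string>;

  // A data message arrived from `sender`.
  Delivery OnMessage(uint32_t sender, const std::string& msg) {
    if (up_.count(sender) != 0) {
      return {"MSG " + msg};  // sender already known to be up: deliver now
    }
    pending_[sender].push_back(msg);  // hold until the topology event arrives
    return {};
  }

  // The topology (name subscription) event for `sender` arrived.
  Delivery OnTopologyUp(uint32_t sender) {
    Delivery out{"UP"};
    up_.insert(sender);
    auto it = pending_.find(sender);
    if (it != pending_.end()) {
      for (const std::string& m : it->second) out.push_back("MSG " + m);
      pending_.erase(it);
    }
    return out;
  }

 private:
  std::set<uint32_t> up_;                                // senders seen as up
  std::map<uint32_t, std::vector<std::string>> pending_;  // early messages
};
```

A real implementation would also need a bound on the number of buffered messages and handling for senders whose topology event never arrives; both are left out here.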



[tickets] [opensaf:tickets] #2158 AMF: IMMND dies at Opensaf start up phase causes AMFD heartbeat timeout

2016-11-09 Thread Minh Hon Chau

Hi Praveen,

This issue happened with the SC absence feature enabled but with no spare SC.
It did not happen in the context of headless. In the context of #1334, does the
standby SC reboot because cold sync is in progress?

In this ticket, after the cluster reboot, both SCs were coming up. SC1 (active)
rebooted before SC2 was (supposedly) assigned standby. In the amfnd trace, as
you said, immnd had died before amfnd on SC2 initiated the middleware SUs. So
amfnd on SC2 was not aware of the immnd process and could not restart it.

I think this situation could also happen in the first active SC startup
sequence (not only failover), where immnd has already responded to NID and then
dies before amfnd can monitor the immnd process.

I think I should change *component* to osaf? Any ideas for a solution?

Thanks,
Minh





---

** [tickets:#2158] AMF: IMMND dies at Opensaf start up phase causes AMFD 
heartbeat timeout**

**Status:** unassigned
**Milestone:** 5.0.2
**Created:** Wed Nov 02, 2016 05:20 AM UTC by Minh Hon Chau
**Last Updated:** Tue Nov 08, 2016 07:09 AM UTC
**Owner:** nobody
**Attachments:**

- 
[osafamfnd_sc2](https://sourceforge.net/p/opensaf/tickets/2158/attachment/osafamfnd_sc2)
 (264.2 kB; application/octet-stream)


If IMMND dies during the OpenSAF startup phase, it is not restarted by AMF. The
issue has been observed in the following situation:
- Restart the cluster
- While the active controller starts up, a critical component dies, which
causes a node failfast
Oct 25 12:51:21 SC-1 osafamfnd[7642]: ER 
safComp=ABC,safSu=1,safSg=2N,safApp=ABC Faulted due to:csiSetcallbackTimeout 
Recovery is:nodeFailfast
Oct 25 12:51:21 SC-1 osafamfnd[7642]: Rebooting OpenSAF NodeId = 131343 EE Name 
= , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, 
SupervisionTime = 60
- In the meantime, the standby controller is requested to become active
Oct 25 12:51:27 SC-2 tipclog[16221]: Lost link <1.1.2:eth0-1.1.1:eth0> on 
network plane A
Oct 25 12:51:27 SC-2 osafclmna[4336]: NO Starting to promote this node to a 
system controller
Oct 25 12:51:27 SC-2 osafrded[4387]: NO Requesting ACTIVE role
- IMMND also dies a bit later
Oct 25 12:51:29 SC-2 osafimmnd[4536]: ER MESSAGE:44816 OUT OF ORDER my highest 
processed:44814 - exiting
Oct 25 12:51:29 SC-2 osafamfnd[7414]: NO saClmDispatch BAD_HANDLE
- Other services cannot initialize because IMMND is dead
Oct 25 12:51:39 SC-2 osafamfd[7400]: WA saClmInitialize_4 returned 5
Oct 25 12:51:39 SC-2 osafamfd[7400]: WA saNtfInitialize returned 5
Oct 25 12:51:39 SC-2 osafntfimcnd[7501]: WA ntfimcn_ntf_init saNtfInitialize( 
returned SA_AIS_ERR_TIMEOUT (5)
Oct 25 12:51:39 SC-2 osafclmd[7386]: WA saImmOiImplementerSet returned 9
Oct 25 12:51:39 SC-2 osafntfd[7372]: WA saLogInitialize returns try again, 
retries...
Oct 25 12:51:39 SC-2 osaflogd[7358]: WA saImmOiImplementerSet returned 
SA_AIS_ERR_BAD_HANDLE (9)
Oct 25 12:51:39 SC-2 osafamfnd[7414]: WA saClmInitialize_4 returned 5

Oct 25 12:51:49 SC-2 osafamfd[7400]: WA saClmInitialize_4 returned 5
Oct 25 12:51:50 SC-2 osafamfd[7400]: WA saNtfInitialize returned 5
Oct 25 12:51:50 SC-2 osafamfnd[7414]: WA saClmInitialize_4 returned 5

Oct 25 12:52:00 SC-2 osafamfd[7400]: WA saClmInitialize_4 returned 5
Oct 25 12:52:00 SC-2 osafamfd[7400]: WA saNtfInitialize returned 5
Oct 25 12:52:00 SC-2 osafamfnd[7414]: WA saClmInitialize_4 returned 5

Oct 25 12:52:20 SC-2 osafamfnd[7414]: WA saClmInitialize_4 returned 5
Oct 25 12:52:20 SC-2 osafamfd[7400]: WA saNtfInitialize returned 5
Oct 25 12:52:20 SC-2 osafimmd[4489]: NO Extended intro from node 2210f

- In the end, the AMFD heartbeat times out
Oct 25 12:53:57 SC-2 osafntfimcnd[7501]: WA ntfimcn_ntf_init saNtfInitialize( 
returned SA_AIS_ERR_TIMEOUT (5)
Oct 25 12:54:01 SC-2 osafamfnd[7414]: WA saClmInitialize_4 returned 5
Oct 25 12:54:01 SC-2 osafamfd[7400]: WA saNtfInitialize returned 5
Oct 25 12:54:01 SC-2 osafamfd[7400]: WA saClmInitialize_4 returned 5
Oct 25 12:54:07 SC-2 osafntfimcnd[7501]: WA ntfimcn_ntf_init saNtfInitialize( 
returned SA_AIS_ERR_TIMEOUT (5)
Oct 25 12:54:11 SC-2 osafamfnd[7414]: WA saClmInitialize_4 returned 5
Oct 25 12:54:11 SC-2 osafamfd[7400]: WA saClmInitialize_4 returned 5
Oct 25 12:54:11 SC-2 osafamfd[7400]: WA saNtfInitialize returned 5
Oct 25 12:54:15 SC-2 osafamfnd[7414]: ER AMF director heart beat timeout, 
generating core for amfd

In the AMFND trace on SC2, AMFND did not receive su_pres from AMFD, so AMFND
could not initiate the middleware components (including IMMND). AMFND was
therefore not aware of IMMND's death and could not restart IMMND. The problem
here is slightly different from #1828, which happened on a newly promoted SC
(with the roamingSC feature) where AMFND had IMMND registered.




[tickets] [opensaf:tickets] #2015 mds: Use a separate process for writing MDS logs

2016-11-09 Thread Anders Widell

- **status**: review --> fixed
- **Comment**:

changeset:   8297:9148b88c808f
user:        Anders Widell
date:        Wed Nov 09 10:11:17 2016 +0100
summary:     mds: Convert the mds_log.c file to C++ [#2015]

changeset:   8298:b611bd543da4
user:        Anders Widell
date:        Wed Nov 09 10:11:32 2016 +0100
summary:     mds: Use osaftransportd for writing MDS log messages [#2015]

changeset:   8299:9f1e6d7ea08c
user:        Anders Widell
date:        Wed Nov 09 10:11:32 2016 +0100
summary:     dtm: Implement an MDS log server [#2015]

[staging:9148b8]
[staging:b611bd]
[staging:9f1e6d]




---

** [tickets:#2015] mds: Use a separate process for writing MDS logs**

**Status:** fixed
**Milestone:** 5.2.FC
**Created:** Fri Sep 09, 2016 09:59 AM UTC by Anders Widell
**Last Updated:** Wed Oct 19, 2016 11:49 AM UTC
**Owner:** Anders Widell


Currently, the MDS log entries are written to disk from within the MDS code. 
This file I/O is done while holding the MDS mutex, and can potentially block 
for a long time if file I/O is slow. In the best case, it will result in longer 
latency for MDS messages. In the worst case, it will result in an overflow of 
the TIPC receive buffer and loss of incoming MDS messages.

To avoid this problem, the idea is to let a separate process (one per node) do 
all the MDS logging file I/O. The MDS library will send log messages to this 
logger process using a UNIX socket. Incidentally, there is already a separate 
process osaf-transport-monitor which is responsible for rotating the MDS logs. 
This process can get the added responsibility to not only rotate the log files, 
but also write the log entries to the disk.
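
As a rough illustration of the hand-off described above, the sketch below sends a log record over a UNIX socket without blocking the caller. It is not the code from the changesets: the function names are hypothetical, and the use of a datagram socket and of the Linux flags MSG_DONTWAIT and MSG_NOSIGNAL are assumptions made for the example.

```cpp
#include <sys/socket.h>
#include <sys/types.h>

#include <string>

// Hypothetical client side: hands one log line to the logger process over an
// already connected UNIX datagram socket. The call never blocks; if the
// socket buffer is full the record is dropped and false is returned.
bool send_log_record(int sock, const std::string& line) {
  ssize_t n = send(sock, line.data(), line.size(),
                   MSG_DONTWAIT | MSG_NOSIGNAL);
  return n == static_cast<ssize_t>(line.size());
}

// Hypothetical logger side: fetches one pending record, or an empty string
// if nothing is queued. The real logger would write the record to disk.
std::string receive_log_record(int sock) {
  char buf[1024];
  ssize_t n = recv(sock, buf, sizeof(buf), MSG_DONTWAIT);
  return n > 0 ? std::string(buf, static_cast<size_t>(n)) : std::string();
}
```

Dropping a record when the buffer is full trades completeness of the log for bounded latency in the caller, which matches the motivation of keeping file I/O out of the path that holds the MDS mutex.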


