[tickets] [opensaf:tickets] #2485 amfnd: missing susi response if component is restarted

2017-06-07 Thread Minh Hon Chau via Opensaf-tickets
I just wondered whether it is ok for the admin restart op on a component to be 
rejected, because the component is local to amfnd, as you also mentioned. If the 
component is ready to handle TRY_AGAIN for this case, then I think it is ok that 
AMFD returns TRY_AGAIN on an unstable SG. The improvement is addressed in #1873.
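
For a caller that is prepared to handle TRY_AGAIN, the retry side could look roughly like the sketch below. The helper `invoke_with_retry`, its retry budget and the callable are hypothetical and not part of OpenSAF; only the numeric values of the error codes follow `SaAisErrorT` from the SA Forum AIS headers.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Values mirror SaAisErrorT from the SA Forum AIS headers.
enum AisError { kAisOk = 1, kAisErrTimeout = 5, kAisErrTryAgain = 6 };

// Hypothetical helper: re-issues an admin operation while it reports
// TRY_AGAIN, making at most max_attempts calls in total.
inline AisError invoke_with_retry(const std::function<AisError()>& op,
                                  int max_attempts,
                                  std::chrono::milliseconds delay) {
  AisError rc = kAisErrTryAgain;
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    rc = op();
    if (rc != kAisErrTryAgain) break;  // success or a hard error
    if (attempt + 1 < max_attempts) std::this_thread::sleep_for(delay);
  }
  return rc;
}
```

The caller passes the admin operation as a callable and gets back either the first non-TRY_AGAIN result or TRY_AGAIN once the budget is used up.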


---

** [tickets:#2485] amfnd: missing susi response if component is restarted**

**Status:** accepted
**Milestone:** 5.17.06
**Created:** Wed Jun 07, 2017 12:57 AM UTC by Gary Lee
**Last Updated:** Wed Jun 07, 2017 09:24 AM UTC
**Owner:** Praveen


An SI contains multiple CSIs. If a restart component admin operation arrives at 
amfnd before all CSIs are assigned,
the SUSI response is not sent to AMFD.

This code in avnd_comp_csi_assign_done() appears to be the problem area.

  /* while restarting, we wont use assign all, so csi will not be null */
  if (csi && m_AVND_COMP_CSI_CURR_ASSIGN_STATE_IS_RESTARTING(csi)) {
    m_AVND_COMP_CSI_CURR_ASSIGN_STATE_SET(csi,
                                          AVND_COMP_CSI_ASSIGN_STATE_ASSIGNED);
    goto done;
  }

Perhaps we should not initiate a restart in avnd_evt_comp_admin_op_req(), if
a component is still in AVND_COMP_CSI_ASSIGN_STATE_ASSIGNING state.
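
The suggested guard can be sketched as follows. This is only an illustration under assumptions: the types `CsiAssignState`, `Csi`, `Component` and the function `can_start_admin_restart` are simplified stand-ins invented here, not the amfnd data structures or the eventual fix.

```cpp
#include <vector>

// Simplified stand-ins for the amfnd CSI assignment states; the real
// AVND_COMP_CSI_ASSIGN_STATE_* values live in the OpenSAF sources.
enum class CsiAssignState { kUnassigned, kAssigning, kAssigned, kRestarting };

struct Csi {
  CsiAssignState state;
};

struct Component {
  std::vector<Csi> csis;
};

// Hypothetical predicate: a restart is only started when no CSI of the
// component is still being assigned, so a pending SUSI response is not
// lost by switching the CSI to the restarting state too early.
inline bool can_start_admin_restart(const Component& comp) {
  for (const Csi& csi : comp.csis) {
    if (csi.state == CsiAssignState::kAssigning) return false;
  }
  return true;
}
```

A request handler could call this predicate first and reject or defer the restart when it returns false.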


---

Sent from sourceforge.net because opensaf-tickets@lists.sourceforge.net is 
subscribed to https://sourceforge.net/p/opensaf/tickets/

To unsubscribe from further messages, a project admin can change settings at 
https://sourceforge.net/p/opensaf/admin/tickets/options.  Or, if this is a 
mailing list, you can unsubscribe from the mailing list.
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Opensaf-tickets mailing list
Opensaf-tickets@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets


[tickets] [opensaf:tickets] #2488 rde: Avoid sending messages to nodes that are down

2017-06-07 Thread Anders Widell via Opensaf-tickets



---

** [tickets:#2488] rde: Avoid sending messages to nodes that are down**

**Status:** assigned
**Milestone:** 5.17.08
**Created:** Wed Jun 07, 2017 11:59 AM UTC by Anders Widell
**Last Updated:** Wed Jun 07, 2017 11:59 AM UTC
**Owner:** Anders Widell


RDE sometimes fails to send MDS messages because the receiving RDE service is 
already down. RDE should be optimised to keep track of service down events, and 
avoid sending messages to peers that are not up.
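
One way to picture the proposed bookkeeping is the sketch below. `PeerTracker` and its methods are hypothetical names chosen for illustration; the real RDE code and its MDS callbacks are not shown here.

```cpp
#include <cstdint>
#include <functional>
#include <set>

// Hypothetical tracker: records which peer nodes currently have their
// RDE service up, driven by service up/down notifications.
class PeerTracker {
 public:
  void on_service_up(std::uint32_t node_id) { up_.insert(node_id); }
  void on_service_down(std::uint32_t node_id) { up_.erase(node_id); }
  bool is_up(std::uint32_t node_id) const { return up_.count(node_id) != 0; }

  // Calls send(node_id) only if the peer is known to be up; returns
  // false when the message was skipped.
  bool send_if_up(std::uint32_t node_id,
                  const std::function<void(std::uint32_t)>& send) const {
    if (!is_up(node_id)) return false;
    send(node_id);
    return true;
  }

 private:
  std::set<std::uint32_t> up_;
};
```

With this shape, a send to a peer whose service has gone down is skipped up front instead of failing inside the transport layer.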




[tickets] [opensaf:tickets] #2487 imm: IMMND crashes in immnd_proc_discard_other_nodes

2017-06-07 Thread Hung Nguyen via Opensaf-tickets



---

** [tickets:#2487] imm: IMMND crashes in immnd_proc_discard_other_nodes**

**Status:** accepted
**Milestone:** 5.17.06
**Created:** Wed Jun 07, 2017 10:58 AM UTC by Hung Nguyen
**Last Updated:** Wed Jun 07, 2017 10:58 AM UTC
**Owner:** Hung Nguyen
**Attachments:**

- 
[logs_n_traces.7z](https://sourceforge.net/p/opensaf/tickets/2487/attachment/logs_n_traces.7z)
 (13.5 MB; application/octet-stream)


IMMD was down when an IMMA connection was being discarded. That caused a 
failure, and the client was marked as stale.

~~~css
12:20:03.331159 osafimmnd [206:206:src/imm/immnd/immnd_evt.c:12127] T2 IMMA DOWN EVENT
...
12:20:03.332028 osafimmnd [206:206:src/imm/immnd/immnd_proc.c:0091] >> immnd_proc_imma_discard_connection
12:20:03.332031 osafimmnd [206:206:src/imm/immnd/immnd_proc.c:0096] T5 Attempting discard connection id:610002020f
12:20:03.332035 osafimmnd [206:206:src/imm/immnd/ImmModel.cc:14042] >> discardContinuations
12:20:03.332038 osafimmnd [206:206:src/imm/immnd/ImmModel.cc:14095] << discardContinuations
12:20:03.332042 osafimmnd [206:206:src/imm/immnd/immnd_proc.c:0138] T5 Discarding implementer id:35 for connection: 97
12:20:03.332046 osafimmnd [206:206:src/imm/immnd/immnd_mds.c:0781] T2 Director Service Is Down
12:20:03.332062 osafimmnd [206:206:src/imm/immnd/immnd_proc.c:0156] WA Discard implementer failed for implId:35 (immd_down)- will retry later
12:20:03.332073 osafimmnd [206:206:src/imm/immnd/ImmModel.cc:13961] >> discardImplementer
12:20:03.332083 osafimmnd [206:206:src/imm/immnd/ImmModel.cc:14012] NO Implementer locally disconnected. Marking it as doomed 35 <97, 2020f> (safLogService)
12:20:03.332087 osafimmnd [206:206:src/imm/immnd/ImmModel.cc:14038] << discardImplementer
12:20:03.332090 osafimmnd [206:206:src/imm/immnd/immnd_proc.c:0169] << immnd_proc_imma_discard_connection
12:20:03.332093 osafimmnd [206:206:src/imm/immnd/immnd_proc.c:0320] T5 Stale marked client id:610002020f sv_id:27
~~~


Later, when discarding other nodes, immnd_proc_imma_discard_connection() 
returned false because the client had previously been marked as stale:
~~~
        immModel_discardImplementer(cb, implId, scAbsence, NULL, NULL);
    }

    if (cl_node->mIsStale) {
        TRACE_LEAVE();
        return false;
    }
~~~

~~~css
12:20:03.332133 osafimmnd [206:206:src/imm/immnd/immnd_evt.c:12219] NO IMMD SERVICE IS DOWN, HYDRA IS CONFIGURED => UNREGISTERING IMMND form MDS
12:20:03.332201 osafimmnd [206:206:src/imm/immnd/immnd_proc.c:2819] >> immnd_proc_discard_other_nodes
...
12:20:03.332406 osafimmnd [206:206:src/imm/immnd/immnd_proc.c:0091] >> immnd_proc_imma_discard_connection
12:20:03.332410 osafimmnd [206:206:src/imm/immnd/immnd_proc.c:0096] T5 Attempting discard connection id:610002020f
12:20:03.332413 osafimmnd [206:206:src/imm/immnd/ImmModel.cc:14042] >> discardContinuations
12:20:03.332416 osafimmnd [206:206:src/imm/immnd/ImmModel.cc:14095] << discardContinuations
12:20:03.332419 osafimmnd [206:206:src/imm/immnd/immnd_proc.c:0138] T5 Discarding implementer id:35 for connection: 97
12:20:03.332423 osafimmnd [206:206:src/imm/immnd/ImmModel.cc:13961] >> discardImplementer
12:20:03.332431 osafimmnd [206:206:src/imm/immnd/ImmModel.cc:13967] NO Implementer disconnected 35 <97, 2020f> (safLogService)
12:20:03.332435 osafimmnd [206:206:src/imm/immnd/ImmModel.cc:14038] << discardImplementer
12:20:03.332438 osafimmnd [206:206:src/imm/immnd/immnd_proc.c:0169] << immnd_proc_imma_discard_connection
~~~


IMMND then crashed due to an assertion failure:
~~~css
12:20:03 SC-2 osafimmnd[206]: NO Implementer disconnected 35 <97, 2020f> (safLogService)
12:20:03 SC-2 osafimmnd[206]: src/imm/immnd/immnd_proc.c:2828: immnd_proc_discard_other_nodes: Assertion 'immnd_proc_imma_discard_connection(cb, cl_node, true)' failed.
~~~


Logs and traces are attached.
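
The interaction described in this ticket can be modelled in a few lines. This is a simplified model under assumptions, not the IMMND code and not the fix: `Client`, `discard_connection` and `discard_all` are invented for illustration, and the tolerant caller is just one possible way to avoid asserting on a stale client.

```cpp
#include <vector>

struct Client {
  bool is_stale = false;   // set when an earlier discard could not finish
  bool discarded = false;  // local cleanup has been done
};

// Mirrors the reported behaviour: local cleanup is done, but a client
// already marked stale makes the function report failure.
inline bool discard_connection(Client& cl) {
  cl.discarded = true;
  if (cl.is_stale) return false;
  return true;
}

// Hypothetical tolerant caller: counts stale clients instead of
// asserting that every discard returns true.
inline int discard_all(std::vector<Client>& clients) {
  int stale = 0;
  for (Client& cl : clients) {
    if (!discard_connection(cl)) ++stale;
  }
  return stale;
}
```

In this model a caller that asserted on the return value of `discard_connection()` would abort on the second client, which matches the shape of the crash in the syslog above.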




[tickets] [opensaf:tickets] #2445 log: reorganize headers and implementation

2017-06-07 Thread Vu Minh Nguyen via Opensaf-tickets
- Description has changed:

Diff:



--- old
+++ new
@@ -1,5 +1,7 @@
 This ticket intends to do some enhancements:
-1) Put headers and implementations in proper places
+1) Put public interfaces and its implementations in proper places (e.g: 
lgs_amf_init() interface could be put into lgs_amf.h file)
 2) Fix all cppcheck messages
 3) Fix all cpplint messages
 4) Replace NULL by nullptr
+
+Will divide the fix to several increaments. The purpose is to make the review 
patch small and avoid code rebasing. The work could be fixing all above items 
for few files, then send it out for review. 






---

** [tickets:#2445] log: reorganize headers and implementation**

**Status:** accepted
**Milestone:** 5.17.08
**Created:** Fri Apr 28, 2017 08:28 AM UTC by Vu Minh Nguyen
**Last Updated:** Tue May 30, 2017 01:25 PM UTC
**Owner:** Canh Truong


This ticket intends to make some enhancements:
1) Put public interfaces and their implementations in proper places (e.g. the 
lgs_amf_init() interface could be put into the lgs_amf.h file)
2) Fix all cppcheck messages
3) Fix all cpplint messages
4) Replace NULL with nullptr

The fix will be divided into several increments. The purpose is to keep each 
review patch small and avoid code rebasing. The work could be to fix all of the 
above items for a few files, then send that out for review.
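
As a small illustration of why item 4 matters in C++ code, consider overload resolution. The `describe` overloads below are hypothetical and unrelated to the log service sources.

```cpp
#include <cstddef>
#include <string>

// Two overloads that a null argument could plausibly select.
inline std::string describe(int) { return "int"; }
inline std::string describe(const char*) { return "pointer"; }

// nullptr has type std::nullptr_t, so it can only pick the pointer
// overload. With NULL the call may select describe(int) or fail to
// compile as ambiguous, depending on how the platform defines NULL.
inline std::string describe_null() { return describe(nullptr); }
```

Calling `describe_null()` always reaches the pointer overload, which is the behaviour a reader of the call site would expect.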




[tickets] [opensaf:tickets] #2149 log: refactor handling IMM in log service

2017-06-07 Thread Vu Minh Nguyen via Opensaf-tickets
- **status**: review --> accepted
- **Blocker**:  --> False



---

** [tickets:#2149] log: refactor handling IMM in log service **

**Status:** accepted
**Milestone:** 5.17.08
**Created:** Fri Oct 28, 2016 12:25 PM UTC by elunlen
**Last Updated:** Fri Apr 21, 2017 07:11 AM UTC
**Owner:** Vu Minh Nguyen


This ticket intends to refactor the code related to IMM handling in the log 
service.

More details will come later.

At first look, we see many places that use IMM APIs and refer to the IMM OI 
handler. It would be very good if we could move this into one or several files 
that are purely dedicated to IMM handling, with a well-defined API.




[tickets] [opensaf:tickets] #2477 amfd: Cyclic reboot after SC absence period (in large cluster)

2017-06-07 Thread Nagendra Kumar via Opensaf-tickets
Hi Minh,
I agree that we need to avoid rebooting the controllers, but I am not sure 
whether avoiding the assert is the way to do it. Let me check.

Thanks
-Nagu


---

** [tickets:#2477] amfd: Cyclic reboot after SC absence period (in large 
cluster)**

**Status:** review
**Milestone:** 5.17.06
**Labels:** assignment failover during stop of both SC 2416 
**Created:** Fri Jun 02, 2017 06:17 AM UTC by Minh Hon Chau
**Last Updated:** Wed Jun 07, 2017 09:00 AM UTC
**Owner:** Minh Hon Chau


The problem in this ticket happens in the same scenario as reported in #2416.

After the SC absence period, amfd hits osafassert(), which causes a coredump, 
and the problem happens repeatedly.

One of the patches for #2416 tried to call IMM sync as soon as possible, and it 
works fine with a small cluster (5 nodes). But in a large cluster of about 75 
nodes, the change to the IMM sync calls has almost no effect.

In #2416, a problem was seen, under the assumption of unreliable IMM sync 
calls, in which after the SC absence period amfd had 3 assignments for a 2N SG: 
2 STANDBY SUSIs and 1 ACTIVE SUSI. It was fixed by the commit "amfd: Add 
iteration to failover all absent assignments [#2416]" (refer to: 
https://sourceforge.net/p/opensaf/tickets/2416/#f83b).

Another variant of the problem of unreliable IMM calls before both SCs go down 
is that amfd can have both SUs with ACTIVE assignments, which leads to the 
assert. This problem has only been seen in a large cluster so far.
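
The two inconsistent states mentioned above (two STANDBY plus one ACTIVE, and two ACTIVE) can be expressed as a simple check. This is a sketch under assumptions: `HaState`, `Assignment` and `is_consistent_2n` are invented for illustration, and it is neither the amfd data model nor the patch under review.

```cpp
#include <vector>

enum class HaState { kActive, kStandby };

struct Assignment {
  int su_id;
  HaState state;
};

// Hypothetical consistency check for one SI in a 2N SG: returns true
// only when the restored assignments contain at most one ACTIVE and at
// most one STANDBY.
inline bool is_consistent_2n(const std::vector<Assignment>& susis) {
  int active = 0;
  int standby = 0;
  for (const Assignment& a : susis) {
    if (a.state == HaState::kActive) {
      ++active;
    } else {
      ++standby;
    }
  }
  return active <= 1 && standby <= 1;
}
```

A check of this kind returns a result the caller can act on, whereas an assertion on the same condition aborts the process.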


Details of coredump:
 
~~~
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/lib64/opensaf/osafamfd'.
Program terminated with signal SIGABRT, Aborted.
#0  0x7f784279b0c7 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: zypper install 
opensaf-amf-director-debuginfo-5.2.0-469.0.6128a2d.sle12.x86_64
(gdb) bt full
#0  0x7f784279b0c7 in raise () from /lib64/libc.so.6
No symbol table info available.
#1  0x7f784279c478 in abort () from /lib64/libc.so.6
No symbol table info available.
#2  0x7f78435fdf4e in __osafassert_fail (__file=, 
__line=, __func=, 
__assertion=) at ../../opensaf/src/base/sysf_def.c:286
No locals.
#3  0x7f78445671e8 in avd_sg_2n_act_susi (sg=, 
stby_susi=stby_susi@entry=0x7ffeef034998, cb=0x7f78447f2e80 <_control_block>)
at ../../opensaf/src/amf/amfd/sg_2n_fsm.cc:596
susi = 
a_susi_2 = 0x7f7845e0d0c0
s_susi_1 = 0x7f7845e0d0c0
su_2 = 
t_ = {trace_leave_called = false, file_ = 0x0, function_ = 0x0}
s_susi_2 = 0x7f7845e2a030
a_susi = 0x0
a_susi_1 = 0x7f7845e2a030
s_susi = 0x0
su_1 = 0x7f7845d69e60
#4  0x7f784456d5d6 in SG_2N::node_fail (this=0x7f7845d5f4f0, 
cb=0x7f78447f2e80 <_control_block>, su=0x7f7845d69e60)
at ../../opensaf/src/amf/amfd/sg_2n_fsm.cc:3402
a_susi = 
s_susi = 0x7f7845d69a68
o_su = 
flag = 
__FUNCTION__ = "node_fail"
su_ha_state = 
t_ = {trace_leave_called = false, file_ = 0x0, function_ = 0x0}
#5  0x7f784455de1a in AVD_SG::failover_absent_assignment 
(this=0x7f7845d5f4f0) at ../../opensaf/src/amf/amfd/sg.cc:2307
t_ = {trace_leave_called = false, file_ = 0x0, function_ = 0x0}
__FUNCTION__ = "failover_absent_assignment"
failed_su = 0x7f7845d69e60
#6  0x7f7844514125 in avd_cluster_tmr_init_evh (cb=0x7f78447f2e80 
<_control_block>, evt=)
at ../../opensaf/src/amf/amfd/cluster.cc:103
i_sg = 0x7f7845d5f4f0
__for_range = @0x7f7845ca2a90: {db = {_M_t = {
  _M_impl = 
{ const, AVD_SG*> > >> = 
{<__gnu_cxx::new_allocator const, AVD_SG*> > >> = {}, }, 
_M_key_compare = {, std::basic_string, bool>> = {}, 
}, _M_header = {_M_color = std::_S_red, 
  _M_parent = 0x7f7845d515e0, _M_left = 0x7f7845d03ed0, 
_M_right = 0x7f7845d81580}, _M_node_count = 28
t_ = {trace_leave_called = false, file_ = 0x0, function_ = 0x0}
__FUNCTION__ = "avd_cluster_tmr_init_evh"
su = 0x0
node = 
#7  0x7f784453ca2c in process_event (cb_now=0x7f78447f2e80 
<_control_block>, evt=0x7f78340013d0) at ../../opensaf/src/amf/amfd/main.cc:775
t_ = {trace_leave_called = false, file_ = 0x0, function_ = 0x0}
__FUNCTION__ = "process_event"
#8  0x7f78444f6abe in main_loop () at ../../opensaf/src/amf/amfd/main.cc:691
pollretval = 
evt = 0x7f78340013d0
polltmo = 0
term_fd = 24
cb = 0x7f78447f2e80 <_control_block>
error = 
old_sync_state = AVD_STBY_OUT_OF_SYNC
#9  main (argc=, argv=) at 
../../opensaf/src/amf/amfd/main.cc:848
No locals.
~~~



---


[tickets] [opensaf:tickets] #2485 amfnd: missing susi response if component is restarted

2017-06-07 Thread Praveen via Opensaf-tickets

Currently AMFD returns TRY_AGAIN for most entities when the SG is unstable and 
an admin op request arrives. For admin restart of an SU, AMFD currently returns 
TRY_AGAIN. I proposed a solution along those lines. However, there is a general 
enhancement ticket to allow admin ops in SG unstable cases:
\#1873 amf: Avoid rejecting user requests due to internal "unstable" state.

The spec is not very clear in general for all cases. Only in one place (9.4.7 
SA_AMF_ADMIN_RESTART, page 384), to avoid a restart admin op running in parallel 
with another admin op on the same entity, it states that:
"The Availability Management Framework must not proceed with this operation if 
another administrative operation or an error recovery initiated by the 
Availability Management Framework is already engaged on the logical entity. In 
such case, the SA_AIS_ERR_TRY_AGAIN error value shall be returned to indicate 
that the action is feasible but not at this instant."

For the case of locking a standby SU and restarting a component in the active 
SU: for a restartable component it can be allowed, as it will be local to AMFND. 
But for a non-restartable component the assignments need to be switched over and 
AMFD will have to find standby SUs. It will increase the complexity for 
redundancy models like Nway and NplusM.





---

** [tickets:#2485] amfnd: missing susi response if component is restarted**

**Status:** accepted
**Milestone:** 5.17.06
**Created:** Wed Jun 07, 2017 12:57 AM UTC by Gary Lee
**Last Updated:** Wed Jun 07, 2017 08:37 AM UTC
**Owner:** Praveen






[tickets] [opensaf:tickets] #2477 amfd: Cyclic reboot after SC absence period (in large cluster)

2017-06-07 Thread Nagendra Kumar via Opensaf-tickets
Also, to note, this is documented as a limitation in the AMF PR doc as below, so 
this ticket qualifies as an enhancement (#2416 could have been one as well):
2.2.11.3 Limitations
• Possible loss of RTA updates and SI assignment messages
If both SCs go down abruptly (SCs are immediately powered-off for instance), 
AMFD could fail to update RTA to IMM, the SI assignment messages sent from 
AMFND could not reach to AMFD, or vice versa. In such cases, recovery could be 
impossible, applications may have inappropriate assignment states.



---

** [tickets:#2477] amfd: Cyclic reboot after SC absence period (in large 
cluster)**

**Status:** review
**Milestone:** 5.17.06
**Labels:** assignment failover during stop of both SC 2416 
**Created:** Fri Jun 02, 2017 06:17 AM UTC by Minh Hon Chau
**Last Updated:** Mon Jun 05, 2017 10:18 AM UTC
**Owner:** Minh Hon Chau



[tickets] [opensaf:tickets] Re: #2485 amfnd: missing susi response if component is restarted

2017-06-07 Thread Minh Hon Chau via Opensaf-tickets
So does that mean, for example, that if locking of a standby SU is ongoing, the 
restart of a component that belongs to the active SU is deferred? This is given 
that the standby and active SUs are hosted on different nodes.


---

** [tickets:#2485] amfnd: missing susi response if component is restarted**

**Status:** accepted
**Milestone:** 5.17.06
**Created:** Wed Jun 07, 2017 12:57 AM UTC by Gary Lee
**Last Updated:** Wed Jun 07, 2017 08:37 AM UTC
**Owner:** Praveen






[tickets] [opensaf:tickets] #2485 amfnd: missing susi response if component is restarted

2017-06-07 Thread Praveen via Opensaf-tickets
- **status**: assigned --> accepted
- **Comment**:

Since the SG is not stable, AMFD should return TRY_AGAIN to the IMM client. This 
check is missing in comp_admin_op_cb() in amfd/comp.cc. I guess the assignment 
in the component is happening because of cluster startup timer expiry (not due 
to any other admin operation).
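
The missing check could take roughly the following shape. This is a sketch under assumptions: `SgFsmState` and `check_comp_admin_restart` are simplified stand-ins invented here and do not reproduce the code in amfd/comp.cc; the numeric error values follow `SaAisErrorT` from the SA Forum AIS headers.

```cpp
// Values mirror SaAisErrorT from the SA Forum AIS headers.
enum AisError { kAisOk = 1, kAisErrTryAgain = 6 };

// Simplified stand-in for the SG FSM state kept by amfd.
enum class SgFsmState { kStable, kSgRealign, kSuOper, kSiOper };

// Hypothetical pre-check for a component admin restart: the request
// is accepted only while the owning SG is stable.
inline AisError check_comp_admin_restart(SgFsmState sg_state) {
  return sg_state == SgFsmState::kStable ? kAisOk : kAisErrTryAgain;
}
```

With such a pre-check, a restart request that arrives while assignments are still in progress is answered with TRY_AGAIN and never reaches amfnd.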



---

** [tickets:#2485] amfnd: missing susi response if component is restarted**

**Status:** accepted
**Milestone:** 5.17.06
**Created:** Wed Jun 07, 2017 12:57 AM UTC by Gary Lee
**Last Updated:** Wed Jun 07, 2017 08:16 AM UTC
**Owner:** Praveen






[tickets] [opensaf:tickets] #2485 amfnd: missing susi response if component is restarted

2017-06-07 Thread Praveen via Opensaf-tickets
- **status**: unassigned --> assigned
- **assigned_to**: Praveen



---

** [tickets:#2485] amfnd: missing susi response if component is restarted**

**Status:** assigned
**Milestone:** 5.17.06
**Created:** Wed Jun 07, 2017 12:57 AM UTC by Gary Lee
**Last Updated:** Wed Jun 07, 2017 12:57 AM UTC
**Owner:** Praveen






[tickets] [opensaf:tickets] #2486 test: improve fail report for test cases that use multiple test_validate

2017-06-07 Thread Vo Minh Hoang via Opensaf-tickets
- **status**: accepted --> review



---

** [tickets:#2486] test: improve fail report for test cases that use multiple 
test_validate**

**Status:** review
**Milestone:** 5.17.08
**Created:** Wed Jun 07, 2017 06:38 AM UTC by Vo Minh Hoang
**Last Updated:** Wed Jun 07, 2017 06:40 AM UTC
**Owner:** Vo Minh Hoang


Experience with CLM test cases that use test_validate() multiple times shows 
that the test report becomes hard to interpret.
This ticket proposes a way to improve it.
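
The ticket does not describe the proposed approach, so the following is only one possible direction, sketched under assumptions: `validate_at` and the `VALIDATE` macro are hypothetical and are not the existing test_validate() helper.

```cpp
#include <sstream>
#include <string>

// Hypothetical variant of a validate helper that records which call
// site failed, so several checks in one test case can be told apart.
// Returns an empty string on success and a diagnostic otherwise.
inline std::string validate_at(int actual, int expected,
                               const char* file, int line) {
  if (actual == expected) return "";
  std::ostringstream msg;
  msg << file << ":" << line << ": expected " << expected
      << ", got " << actual;
  return msg.str();
}

// Captures the call site automatically.
#define VALIDATE(actual, expected) \
  validate_at((actual), (expected), __FILE__, __LINE__)
```

Each failing check then carries its own file and line, which makes a report with several validations per test case easier to read.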




[tickets] [opensaf:tickets] #2486 test: improve fail report for test cases that use multiple test_validate

2017-06-07 Thread Vo Minh Hoang via Opensaf-tickets
- Description has changed:

Diff:



--- old
+++ new
@@ -0,0 +1,2 @@
+Experience CLM test case that use test_validate() multiple times made the test 
report hard to recognize.
+This proposes a way to enhance it.



- **Blocker**: True --> False



---

** [tickets:#2486] test: improve fail report for test cases that use multiple 
test_validate**

**Status:** accepted
**Milestone:** 5.17.08
**Created:** Wed Jun 07, 2017 06:38 AM UTC by Vo Minh Hoang
**Last Updated:** Wed Jun 07, 2017 06:38 AM UTC
**Owner:** Vo Minh Hoang






[tickets] [opensaf:tickets] #2486 test: improve fail report for test cases that use multiple test_validate

2017-06-07 Thread Vo Minh Hoang via Opensaf-tickets



---

** [tickets:#2486] test: improve fail report for test cases that use multiple 
test_validate**

**Status:** accepted
**Milestone:** 5.17.08
**Created:** Wed Jun 07, 2017 06:38 AM UTC by Vo Minh Hoang
**Last Updated:** Wed Jun 07, 2017 06:38 AM UTC
**Owner:** Vo Minh Hoang







[tickets] [opensaf:tickets] #2469 clm: Stop tracking api returns NOT_EXIST

2017-06-07 Thread Praveen via Opensaf-tickets
It seems the TrackStop() request reached CLMS, which executed it, but the 
client (AMFD) received ERR_TIMEOUT:
2017-05-26 10:19:12 SC-1 osafamfd[268]: WA Failed to stop cluster tracking 5

The same AMFD, after becoming standby, then calls TrackStop() again. It will 
certainly get ERR_NOT_EXIST because the tracking was already stopped.
AMFD should finalize the handle when it gets ERR_TIMEOUT.
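
The sequence described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not AMFD or CLMS code: the types and functions (`sketch_server_t`, `sketch_client_t`, `sketch_track_stop`, `sketch_client_track_stop`) are hypothetical stand-ins, and only the numeric codes 5 and 12 are taken from the syslog in this ticket.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Numeric values mirror the codes seen in the syslog (5 and 12). */
typedef enum {
  SKETCH_OK = 1,
  SKETCH_ERR_TIMEOUT = 5,
  SKETCH_ERR_NOT_EXIST = 12
} sketch_err_t;

/* Hypothetical stand-in for the server's per-client tracking state. */
typedef struct {
  bool tracking;
} sketch_server_t;

/* Hypothetical stand-in for the client's view of its handle. */
typedef struct {
  bool handle_valid;
} sketch_client_t;

/* Simulated TrackStop: the server always processes the request, but the
 * reply may be lost, in which case the caller only sees a timeout. */
sketch_err_t sketch_track_stop(sketch_server_t *srv, bool reply_lost) {
  assert(srv != NULL);
  if (!srv->tracking) {
    return SKETCH_ERR_NOT_EXIST; /* nothing left to stop */
  }
  srv->tracking = false; /* the stop takes effect on the server side */
  return reply_lost ? SKETCH_ERR_TIMEOUT : SKETCH_OK;
}

/* Client-side policy suggested in the comment: on a timeout, give up the
 * handle so that no second TrackStop is attempted on it. */
sketch_err_t sketch_client_track_stop(sketch_client_t *cl,
                                      sketch_server_t *srv,
                                      bool reply_lost) {
  assert(cl != NULL);
  if (!cl->handle_valid) {
    return SKETCH_OK; /* handle already finalized, nothing to stop */
  }
  sketch_err_t rc = sketch_track_stop(srv, reply_lost);
  if (rc == SKETCH_ERR_TIMEOUT) {
    cl->handle_valid = false; /* stands in for finalizing the handle */
  }
  return rc;
}
```

In this model, calling `sketch_track_stop()` twice with a lost first reply returns `SKETCH_ERR_TIMEOUT` and then `SKETCH_ERR_NOT_EXIST`, which matches the two codes in the syslog. Going through `sketch_client_track_stop()` avoids the second error because the handle is dropped after the timeout.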



---

** [tickets:#2469] clm: Stop tracking api returns NOT_EXIST**

**Status:** assigned
**Milestone:** 5.17.06
**Created:** Mon May 29, 2017 12:19 AM UTC by Minh Hon Chau
**Last Updated:** Mon May 29, 2017 09:21 AM UTC
**Owner:** Praveen


When performing a switchover, AMFD fails to stop the CLM track callback with 
error code 12 (NOT_EXIST).

**syslog:**
2017-05-26 10:19:02 SC-1 osafamfd[268]: NO Controller switch over initiated
2017-05-26 10:19:02 SC-1 osafamfd[268]: NO ROLE SWITCH Active --> Quiesced
2017-05-26 10:19:02 SC-1 osafimmnd[205]: NO Implementer (applier) connected: 40 (@OpenSafImmReplicatorB) <343, 2010f>
2017-05-26 10:19:02 SC-1 osafntfimcnd[626]: NO Started
2017-05-26 10:19:12 SC-1 osafamfd[268]: WA Failed to stop cluster tracking 5
2017-05-26 10:19:12 SC-1 osafimmnd[205]: NO Implementer disconnected 32 <27, 2010f> (safAmfService)
2017-05-26 10:19:12 SC-1 osafimmnd[205]: NO Implementer (applier) connected: 41 (@safAmfService2010f) <27, 2010f>
2017-05-26 10:19:12 SC-1 osafamfnd[283]: NO AVD NEW_ACTIVE, adest:1
2017-05-26 10:19:12 SC-1 osafimmnd[205]: NO Implementer disconnected 31 <0, 2020f> (@safAmfService2020f)
2017-05-26 10:19:12 SC-1 osafimmnd[205]: NO Implementer connected: 42 (safAmfService) <0, 2020f>
2017-05-26 10:19:12 SC-1 osafamfd[268]: NO Switching Quiesced --> StandBy
2017-05-26 10:19:13 SC-1 osafamfd[268]: ER Failed to stop cluster tracking 12
2017-05-26 10:19:13 SC-1 osafamfd[268]: ER Failed to stop cluster tracking after switch over
2017-05-26 10:19:13 SC-1 osafamfd[268]: NO Controller switch over done

**CLM trace:**
May 26 10:19:13.173369 osafclmd [240:240:src/clm/clmd/clms_evt.c:1347] >> proc_track_stop_msg
May 26 10:19:13.173374 osafclmd [240:240:src/clm/clmd/clms_util.c:0126] >> clms_node_get_by_id
May 26 10:19:13.173379 osafclmd [240:240:src/clm/clmd/clms_util.c:0137] TR Node found 131343
May 26 10:19:13.173383 osafclmd [240:240:src/clm/clmd/clms_util.c:0140] << clms_node_get_by_id
May 26 10:19:13.173388 osafclmd [240:240:src/clm/clmd/clms_evt.c:1350] TR Node id = 131343
May 26 10:19:13.173393 osafclmd [240:240:src/clm/clmd/clms_mds.c:1553] >> clms_mds_msg_send
May 26 10:19:13.173448 osafclmd [240:240:src/clm/clmd/clms_mds.c:1587] << clms_mds_msg_send
May 26 10:19:13.173457 osafclmd [240:240:src/clm/clmd/clms_mbcsv.c:0810] >> clms_send_async_update
May 26 10:19:13.173462 osafclmd [240:240:src/mbc/mbcsv_api.c:0798] >> mbcsv_process_snd_ckpt_request: Sending checkpoint data to all STANDBY peers, as per the send-type specified
May 26 10:19:13.173504 osafclmd [240:240:src/mbc/mbcsv_api.c:0830] TR svc_id:48, pwe_hdl:65552
May 26 10:19:13.173509 osafclmd [240:240:src/mbc/mbcsv_util.c:0363] >> mbcsv_send_ckpt_data_to_all_peers
May 26 10:19:13.173593 osafclmd [240:240:src/mbc/mbcsv_util.c:0411] TR dispatching FSM for NCSMBCSV_SEND_ASYNC_UPDATE
May 26 10:19:13.173599 osafclmd [240:240:src/mbc/mbcsv_act.c:0103] TR ASYNC update to be sent. role: 1, svc_id: 48, pwe_hdl: 65552
May 26 10:19:13.173604 osafclmd [240:240:src/mbc/mbcsv_util.c:0424] TR calling encode callback
May 26 10:19:13.173610 osafclmd [240:240:src/clm/clmd/clms_mbcsv.c:0740] >> mbcsv_callback
May 26 10:19:13.173615 osafclmd [240:240:src/clm/clmd/clms_mbcsv.c:0856] >> ckpt_encode_cbk_handler
May 26 10:19:13.173626 osafclmd [240:240:src/clm/clmd/clms_mbcsv.c:0867] TR cbk_arg->info.encode.io_msg_type type 1
May 26 10:19:13.173632 osafclmd [240:240:src/clm/clmd/clms_mbcsv.c:1307] >> ckpt_encode_async_update
May 26 10:19:13.173637 osafclmd [240:240:src/clm/clmd/clms_mbcsv.c:1324] TR data->header.type 3
May 26 10:19:13.173641 osafclmd [240:240:src/clm/clmd/clms_mbcsv.c:1362] TR Async update CLMS_CKPT_TRACK_START
May 26 10:19:13.173646 osafclmd [240:240:src/clm/clmd/clms_mbcsv.c:1701] >> enc_mbcsv_track_changes_msg
May 26 10:19:13.173650 osafclmd [240:240:src/clm/clmd/clms_mbcsv.c:1714] << enc_mbcsv_track_changes_msg
May 26 10:19:13.173654 osafclmd [240:240:src/clm/clmd/clms_mbcsv.c:1515] << ckpt_encode_async_update
May 26 10:19:13.173658 osafclmd [240:240:src/clm/clmd/clms_mbcsv.c:0910] << ckpt_encode_cbk_handler
May 26 10:19:13.173663 osafclmd [240:240:src/clm/clmd/clms_mbcsv.c:0780] << mbcsv_callback
May 26 10:19:13.173667 osafclmd [240:240:src/mbc/mbcsv_util.c:0469] TR send the encoded message to any other peer with same s/w version
May 26 10:19:13.173671 osafclmd [240:240:src/mbc/mbcsv_util.c:0472] TR dispatching FSM for NCSMBCSV_SEND_ASYNC_UPDATE
May 26 10:19:13.173675 osafclmd [240:240:src/mbc/mbcsv_act.c:0103] TR ASYNC update to be sent. role: 1, svc_id: 48, pwe_hdl: 65552
May 26