Re: [users] Payload card reboot due to a short time network break

2018-04-10 Thread Jianfeng Dong
Anders, you are right: we do need to take the other nodes in the system into
account. We have to keep IMMA_SYNCR_TIMEOUT larger than the TIPC link tolerance,
which was the fix for another issue we had earlier.
Regarding multiple hops: fortunately, in our system every PLD connects directly
to every SC, so we probably don't need to worry about that.

I will run some tests with the change in our system, and I will also reread the
description of the parameter in the OpenSAF documentation in case I missed
something there.

Much appreciated!

Regards,
Jianfeng

From: Anders Widell <anders.wid...@ericsson.com>
Sent: Tuesday, April 10, 2018 2:19 AM
To: Jianfeng Dong <jd...@juniper.net>
Cc: opensaf-users@lists.sourceforge.net
Subject: Re: [users] Payload card reboot due to a short time network break


The only way to be sure if it is appropriate is to test under realistic 
conditions. I agree that it makes sense to increase it so that it is larger 
than the TIPC link tolerance. It should be noted that the IMM agent always 
communicates directly with the IMM node director running on the same node, and 
for this communication I don't think the TIPC link tolerance is relevant (you 
will immediately detect if the IMM node director process goes away). However, 
the IMM node director may in turn have to communicate with IMM processes 
running on other nodes in the cluster in order to fulfill your request, and for 
that communication the TIPC link tolerance comes into play. If it needs to 
communicate in several hops it may even make sense to have a time-out which is 
several times the TIPC link tolerance (compare with the default values for 
these time-outs: link tolerance=1.5 seconds and IMMA time-out=10 seconds).
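
As a rough illustration of the relationship described above, the following
Python sketch checks a proposed time-out against a TIPC link tolerance and a
hop count. The helper name `min_sync_timeout`, the one-second margin and the
"hops times tolerance" rule are assumptions made for this example only; they
are not taken from OpenSAF or TIPC, and all values are expressed in seconds
regardless of the unit the real parameter is configured in.

```python
def min_sync_timeout(link_tolerance_s, hops=1, margin_s=1.0):
    """Hypothetical helper: smallest time-out (seconds) that outlasts
    `hops` consecutive TIPC link tolerances plus a safety margin."""
    if link_tolerance_s <= 0 or hops < 1 or margin_s < 0:
        raise ValueError("tolerance must be > 0, hops >= 1, margin >= 0")
    return hops * link_tolerance_s + margin_s


PROPOSED_TIMEOUT_S = 20  # value proposed in this thread

for tolerance_s in (12, 15):  # candidate link tolerances from this thread
    for hops in (1, 2):
        needed = min_sync_timeout(tolerance_s, hops)
        verdict = "ok" if PROPOSED_TIMEOUT_S >= needed else "too small"
        print(f"tolerance={tolerance_s}s hops={hops}: "
              f"need >= {needed:.1f}s, {PROPOSED_TIMEOUT_S}s is {verdict}")
```

Under these assumptions a 20-second time-out covers a single hop for either
tolerance, but not two hops. Only testing under realistic conditions can
confirm a value.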

regards,

Anders Widell


Re: [users] Payload card reboot due to a short time network break

2018-04-09 Thread Jianfeng Dong
Hi Anders,

We now want to increase the TIPC link tolerance from the current 10 seconds to
12 or 15 seconds, so we also need to increase the OpenSAF parameter
‘IMMA_SYNCR_TIMEOUT’ from the current 12 seconds to a larger value (maybe 20).
Do you think 20 seconds is appropriate for this parameter?
Thanks.

Regards,
Jianfeng


Re: [users] Payload card reboot due to a short time network break

2018-03-13 Thread Jianfeng Dong
Anders,

As you can see in those logs, we had already set the TIPC link tolerance to 10
seconds; I’m just not sure what value is appropriate, especially for this case.
I can at least try running TIPC directly on the Ethernet interfaces instead.
Thanks for your comment on the CLM design idea. I understand that such a change
would definitely not be easy to make.

Thanks,
Jianfeng

From: Anders Widell [mailto:anders.wid...@ericsson.com]
Sent: Monday, March 12, 2018 7:52 PM
To: Mathi N P <mathi.np@gmail.com>; Jianfeng Dong <jd...@juniper.net>
Cc: opensaf-users@lists.sourceforge.net
Subject: Re: [users] Payload card reboot due to a short time network break


We also tried running TIPC on a bonded interface but ended up having to change 
it since it never worked well. When you have two redundant Ethernet interfaces, 
TIPC will tolerate failures in one of them seamlessly without losing 
connectivity. But when you run TIPC on a bonded interface it doesn't work, as 
you can see in your case. I guess the reason is that you have two separate 
mechanisms on top of each other, trying to achieve the same thing. One possible 
workaround is to increase the TIPC link tolerance.

When we lose connectivity with a node in the cluster, we are expecting that it 
happened because the other node went down (rebooted or permanently died). We 
don't expect to re-establish connectivity with the same node unless it has 
rebooted in between. It would be possible to introduce a grace time to allow a 
node to stay in the CLM cluster for a while after the connectivity with it has 
been lost, and allow it to continue as a cluster member if connectivity is 
re-established before this grace time has expired. However, this is not so easy 
and it is much easier to increase the TIPC link tolerance and let TIPC handle 
this for us.
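
To make the grace-time idea above concrete, here is a minimal Python sketch of
the bookkeeping it would require. It is purely hypothetical: the class
`GraceTimeTracker` and its methods are invented for this illustration and do
not correspond to any existing CLM or OpenSAF code, and real membership
handling would involve far more state than a timestamp per node.

```python
class GraceTimeTracker:
    """Hypothetical sketch: remember when contact with a node was lost and
    decide, on reconnect, whether it may continue as a cluster member."""

    def __init__(self, grace_s):
        self.grace_s = grace_s
        self._lost_at = {}  # node name -> time (seconds) contact was lost

    def contact_lost(self, node, now):
        # Keep the earliest loss time if the loss is reported twice.
        self._lost_at.setdefault(node, now)

    def contact_restored(self, node, now):
        """Return True if the node may stay a member without rebooting."""
        lost_at = self._lost_at.pop(node, None)
        if lost_at is None:
            return True  # contact was never reported lost
        return (now - lost_at) <= self.grace_s

    def expired(self, now):
        """Nodes whose grace time has run out and should be treated as gone."""
        return sorted(n for n, t in self._lost_at.items()
                      if now - t > self.grace_s)


tracker = GraceTimeTracker(grace_s=5.0)  # 5 s is an arbitrary example value
tracker.contact_lost("PLD0114", now=0.0)
print(tracker.contact_restored("PLD0114", now=1.0))
```

With a link that comes back after about one second, as in the reported case,
such a tracker would let the node continue. The difficulty lies in everything
this sketch leaves out, which is why increasing the TIPC link tolerance is the
simpler route.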

regards,

Anders Widell

On 03/09/2018 12:42 PM, Mathi N P wrote:
This is an interesting case (and 'rare' :-))

2018-02-16T17:56:41.172791+00:00 scm2 osafamfd[3312]: WA Sending node reboot 
order to node:safAmfNode=PLD0114,safAmfCluster=myAmfCluster, due to late 
node_up_msg after cluster startup timeout
2018-02-16T17:56:11 to 2018-02-16T17:56:41 except an error, then it got the 
reboot command from SC and thus it reboot itself.
Given that the node has not 'instantiated' completely, and that a reboot order
can be treated as a 'failed start up', AMF could, based on the current AMF
state, decide whether or not to reboot by reading the
'saAmfNodeFailfastOnInstantiationFailure' (or perhaps 'saAmfNodeAutoRepair')
attribute, and report a node instantiation failure (back to the rc script and
other associated events for that state).

Thanks,
Mathi.




Re: [users] Payload card reboot due to a short time network break

2018-03-09 Thread Jianfeng Dong
Thanks Anders, much appreciated.

And yes, on the PLD we run TIPC on a bonded interface that comprises two
Ethernet interfaces.
I'm wondering why a bonded interface can't provide the same kind of protection
that TIPC does. Is it because TIPC is more robust, or is there another reason?
I'm not sure whether it is right to change the low-level design of our product
at this point, so I will discuss this change with my colleagues and look for
more details in the TIPC manual.

Regarding the OpenSAF part, do you think it would be possible for the SC not to
force a reboot of the PLD in this case? After all, the connection recovered
quickly.

Regards,
Jianfeng

-----Original Message-----
From: Anders Widell [mailto:anders.wid...@ericsson.com] 
Sent: Thursday, March 8, 2018 8:38 PM
To: Jianfeng Dong <jd...@juniper.net>; opensaf-users@lists.sourceforge.net
Subject: Re: [users] Payload card reboot due to a short time network break

Hi!

Are you running TIPC on a bonded interface? I wouldn't recommend this. 
Instead, you should run TIPC on the raw Ethernet interfaces and let TIPC handle 
the link fail-over in case of a failure in one of them. TIPC should be able to 
do this without ever losing the connectivity between the nodes.

regards,

Anders Widell



[users] Payload card reboot due to a short time network break

2018-03-08 Thread Jianfeng Dong
Hi,

Several days ago we hit a payload card reboot issue at a customer site: a PLD
lost its connection with the SC for a short while (about 10 seconds), and the SC
then forced the PLD to reboot even though the PLD was going into “SC Absent
Mode”.

System summary:
Our product is a system with 2 SC boards and at most 14 PLD cards, running
OpenSAF 5.1.0 with the “SC Absent Mode” feature enabled. The SCs connect to the
PLDs via Ethernet and TIPC.

Course of the issue:
1. The PLD’s internal network went down because of a hardware/driver problem,
but it recovered within about 2 seconds.

2018-02-16T17:55:58.343287+00:00 pld0114 kernel: bonding: bond0: link status 
definitely down for interface eth0, disabling it
2018-02-16T17:56:00.743201+00:00 pld0114 kernel: bonding: bond0: link status up 
for interface eth0, enabling it in 6 ms.

2. About 10 seconds later the TIPC link still broke, even though the network had
already recovered.

2018-02-16T17:56:10.050386+00:00 pld0114 kernel: tipc: Resetting link 
<1.1.14:bond0-1.1.16:eth2>, peer not responding
2018-02-16T17:56:10.050428+00:00 pld0114 kernel: tipc: Lost link 
<1.1.14:bond0-1.1.16:eth2> on network plane A
2018-02-16T17:56:10.050440+00:00 pld0114 kernel: tipc: Lost contact with 
<1.1.16>

3. The SC detected that the PLD had left the cluster.

2018-02-16T17:56:10.050704+00:00 scm2 osafimmd[3095]: NO MDS event from svc_id 
25 (change:4, dest:296935520731140)
2018-02-16T17:56:10.052770+00:00 scm2 osafclmd[3302]: NO Node 69135 went down. 
Not sending track callback for agents on that node
2018-02-16T17:56:10.054411+00:00 scm2 osafimmnd[3106]: NO Global discard node 
received for nodeId:10e0f pid:3516
2018-02-16T17:56:10.054505+00:00 scm2 osafimmnd[3106]: NO Implementer 
disconnected 15 <0, 10e0f(down)> (MsgQueueService69135)
2018-02-16T17:56:10.055158+00:00 scm2 osafamfd[3312]: NO Node 'PLD0114' left 
the cluster

4. One second later, the TIPC link recovered as well.

2018-02-16T17:56:11.054553+00:00 pld0114 kernel: tipc: Established link 
<1.1.14:bond0-1.1.16:eth2> on network plane A

5. However, the PLD was still affected by the network issue and was trying to
go into ‘SC Absent Mode’.

2018-02-16T17:56:11.057260+00:00 pld0114 osafamfnd[3626]: NO AVD NEW_ACTIVE, 
adest:1
2018-02-16T17:56:11.057407+00:00 pld0114 osafamfnd[3626]: NO Sending node up 
due to NCSMDS_NEW_ACTIVE
2018-02-16T17:56:11.057684+00:00 pld0114 osafamfnd[3626]: NO 19 SISU states sent
2018-02-16T17:56:11.057715+00:00 pld0114 osafamfnd[3626]: NO 22 SU states sent
2018-02-16T17:56:11.057775+00:00 pld0114 osafimmnd[3516]: NO Sleep done 
registering IMMND with MDS
2018-02-16T17:56:11.058243+00:00 pld0114 osafmsgnd[3665]: ER saClmDispatch 
Failed with error 9
2018-02-16T17:56:11.058283+00:00 pld0114 osafckptnd[3697]: NO Bad CLM handle. 
Reinitializing.
2018-02-16T17:56:11.059054+00:00 pld0114 osafimmnd[3516]: NO SUCCESS IN 
REGISTERING IMMND WITH MDS
2018-02-16T17:56:11.059116+00:00 pld0114 osafimmnd[3516]: NO Re-introduce-me 
highestProcessed:26209 highestReceived:26209
2018-02-16T17:56:11.059699+00:00 pld0114 osafimmnd[3516]: NO IMMD service is UP 
... ScAbsenseAllowed?:31536 introduced?:2
2018-02-16T17:56:11.059932+00:00 pld0114 osafimmnd[3516]: NO MDS: 
mds_register_callback: dest 10e0fb03c0010 already exist
2018-02-16T17:56:11.060297+00:00 pld0114 osafimmnd[3516]: NO Re-introduce-me 
highestProcessed:26209 highestReceived:26209
2018-02-16T17:56:11.062053+00:00 pld0114 osafamfnd[3626]: NO 25 CSICOMP states 
synced
2018-02-16T17:56:11.062102+00:00 pld0114 osafamfnd[3626]: NO 28 SU states sent
2018-02-16T17:56:11.064418+00:00 pld0114 osafimmnd[3516]: ER MESSAGE:26438 OUT 
OF ORDER my highest processed:26209 - exiting
2018-02-16T17:56:11.160121+00:00 pld0114 osafckptnd[3697]: NO CLM selection 
object was updated. (12)
2018-02-16T17:56:11.166764+00:00 pld0114 osafamfnd[3626]: NO saClmDispatch 
BAD_HANDLE
2018-02-16T17:56:11.167030+00:00 pld0114 osafamfnd[3626]: NO 
'safSu=PLD0114,safSg=NoRed,safApp=OpenSAF' component restart probation timer 
started (timeout: 600 ns)
2018-02-16T17:56:11.167102+00:00 pld0114 osafamfnd[3626]: NO Restarting a 
component of 'safSu=PLD0114,safSg=NoRed,safApp=OpenSAF' (comp restart count: 1)
2018-02-16T17:56:11.167135+00:00 pld0114 osafamfnd[3626]: NO 
'safComp=IMMND,safSu=PLD0114,safSg=NoRed,safApp=OpenSAF' faulted due to 
'avaDown' : Recovery is 'componentRestart'

6. The SC received messages from the PLD and then forced the PLD to reboot (due
to the node sync timeout?).

2018-02-16T17:56:11.058121+00:00 scm2 osafimmd[3095]: NO MDS event from svc_id 
25 (change:3, dest:296935520731140)
2018-02-16T17:56:11.058515+00:00 scm2 osafsmfd[3391]: ER saClmClusterNodeGet 
failed, rc=SA_AIS_ERR_NOT_EXIST (12)
2018-02-16T17:56:11.059607+00:00 scm2 osafimmd[3095]: ncs_sel_obj_ind: write 
failed - Bad file descriptor
2018-02-16T17:56:11.060307+00:00 scm2 osafimmd[3095]: ncs_sel_obj_ind: write 
failed - Bad file descriptor
2018-02-16T17:56:11.060811+00:00 scm2 osafimmd[3095]: NO ACT: New Epoch for 
IMMND process at node 10e0f old epoch: 0  new 
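
The intervals between the kernel events in steps 1, 2 and 4 above can be read
directly from the log timestamps. The short Python sketch below does that
arithmetic; the timestamps are copied from the pld0114 kernel lines quoted in
this message, while the event labels and the helper name `offsets_from_first`
are made up for this illustration.

```python
from datetime import datetime

# Timestamps copied from the pld0114 kernel log lines quoted above.
EVENTS = [
    ("bond0: eth0 link down", "2018-02-16T17:55:58.343287+00:00"),
    ("bond0: eth0 link up", "2018-02-16T17:56:00.743201+00:00"),
    ("tipc: resetting link, peer not responding",
     "2018-02-16T17:56:10.050386+00:00"),
    ("tipc: established link", "2018-02-16T17:56:11.054553+00:00"),
]


def offsets_from_first(events):
    """Return (label, seconds since the first event) for each event."""
    start = datetime.fromisoformat(events[0][1])
    return [
        (label, (datetime.fromisoformat(stamp) - start).total_seconds())
        for label, stamp in events
    ]


for label, seconds in offsets_from_first(EVENTS):
    print(f"{seconds:8.3f} s  {label}")
```

By these timestamps the bonding driver reports eth0 down for about 2.4 seconds,
TIPC resets the link about 11.7 seconds after the first event, and the link is
re-established roughly one second after that.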

Re: [users] osafamfd coredump issue

2017-05-25 Thread Jianfeng Dong
Thank you, Praveen. I will upload those files as soon as possible.

Many thanks for the help!

Thanks,
Jianfeng



-----Original Message-----
From: praveen malviya [mailto:praveen.malv...@oracle.com]
Sent: Thursday, May 25, 2017 4:48 PM
To: Jianfeng Dong <jd...@juniper.net>
Cc: opensaf-users@lists.sourceforge.net
Subject: Re: [users] osafamfd coredump issue



Hi Jianfeng,

I have raised ticket #2468 for this issue.
Please attach the backtrace (bt), logs and traces to the ticket.

Thanks,
Praveen




Re: [users] osafimmnd coredump issue

2017-05-24 Thread Jianfeng Dong
Resending because of the email size limit.

From: Jianfeng Dong
Sent: Wednesday, May 24, 2017 6:08 PM
To: 'Zoran Milinkovic' <zoran.milinko...@ericsson.com>; 
'opensaf-users@lists.sourceforge.net' <opensaf-users@lists.sourceforge.net>
Subject: RE: osafimmnd coredump issue


Hi Zoran,

The issue seems hard to reproduce. I checked the syslog and found that the SC
board "scm2" was doing nothing special at that time.
The other SC board, "scm1", was stuck in a reboot loop because of a firmware
fault, which has nothing to do with OpenSAF; OpenSAF was not started there at
all.

Here is the syslog from when the issue occurred:



2017-04-25T05:30:00.482306-04:00 user.info scm2 osafimmloadd: IN Synced 7032 
objects in total

2017-04-25T05:30:00.482749-04:00 local0.notice scm2 osafimmnd[2793]: NO NODE 
STATE-> IMM_NODE_FULLY_AVAILABLE 18455

2017-04-25T05:30:00.489351-04:00 user.notice scm2 osafimmloadd: NO Sync ending 
normally

2017-04-25T05:30:01.395342-04:00 local0.notice pld0206 osafimmnd[3154]: NO NODE 
STATE-> IMM_NODE_FULLY_AVAILABLE 19026

2017-04-25T05:30:01.394642-04:00 local0.notice pld0106 osafimmnd[2996]: NO NODE 
STATE-> IMM_NODE_FULLY_AVAILABLE 19026

2017-04-25T05:30:01.395272-04:00 local0.notice cmm02b osafimmnd[5129]: NO NODE 
STATE-> IMM_NODE_FULLY_AVAILABLE 19026

2017-04-25T05:30:01.395862-04:00 local0.notice cmm02a osafimmnd[5102]: NO NODE 
STATE-> IMM_NODE_FULLY_AVAILABLE 19026

2017-04-25T05:30:01.399220-04:00 local0.notice scm2 osafimmnd[2793]: NO Epoch 
set to 147 in ImmModel

2017-04-25T05:30:01.396470-04:00 local0.notice cmm02a osafimmnd[5102]: NO Epoch 
set to 147 in ImmModel

2017-04-25T05:30:01.395932-04:00 local0.notice pld0206 osafimmnd[3154]: NO 
Epoch set to 147 in ImmModel

2017-04-25T05:30:01.396513-04:00 local0.notice pld0210 osafimmnd[4345]: NO NODE 
STATE-> IMM_NODE_FULLY_AVAILABLE 19026

2017-04-25T05:30:01.397065-04:00 local0.notice pld0210 osafimmnd[4345]: NO 
Epoch set to 147 in ImmModel

2017-04-25T05:30:01.395883-04:00 local0.notice cmm02b osafimmnd[5129]: NO Epoch 
set to 147 in ImmModel

2017-04-25T05:30:01.395214-04:00 local0.notice pld0106 osafimmnd[2996]: NO 
Epoch set to 147 in ImmModel

2017-04-25T05:30:01.396031-04:00 local0.notice pld0205 osafimmnd[3150]: NO NODE 
STATE-> IMM_NODE_FULLY_AVAILABLE 19026

2017-04-25T05:30:01.396647-04:00 local0.notice pld0205 osafimmnd[3150]: NO 
Epoch set to 147 in ImmModel

2017-04-25T05:30:01.400229-04:00 local0.notice scm2 osafimmd[2782]: NO ACT: New 
Epoch for IMMND process at node 1100f old epoch: 146  new epoch:147

2017-04-25T05:30:01.400321-04:00 local0.notice scm2 osafimmd[2782]: NO ACT: New 
Epoch for IMMND process at node 11a0f old epoch: 146  new epoch:147

2017-04-25T05:30:01.400380-04:00 local0.notice scm2 osafimmd[2782]: NO ACT: New 
Epoch for IMMND process at node 1200f old epoch: 146  new epoch:147

2017-04-25T05:30:01.400435-04:00 local0.notice scm2 osafimmd[2782]: NO ACT: New 
Epoch for IMMND process at node 11f0f old epoch: 146  new epoch:147

2017-04-25T05:30:01.400540-04:00 local0.notice scm2 osafimmd[2782]: NO ACT: New 
Epoch for IMMND process at node 1160f old epoch: 146  new epoch:147

2017-04-25T05:30:01.400619-04:00 local0.notice scm2 osafimmd[2782]: NO ACT: New 
Epoch for IMMND process at node 1150f old epoch: 146  new epoch:147

2017-04-25T05:30:01.428760-04:00 local0.notice pld0104 osafimmnd[7230]: NO NODE 
STATE-> IMM_NODE_FULLY_AVAILABLE 2901

2017-04-25T05:30:01.428818-04:00 local0.notice pld0104 osafimmnd[7230]: NO 
RepositoryInitModeT is SA_IMM_INIT_FROM_FILE

2017-04-25T05:30:01.428854-04:00 local0.warning pld0104 osafimmnd[7230]: WA IMM 
Access Control mode is DISABLED!

2017-04-25T05:30:01.448820-04:00 local0.notice scm2 osafimmd[2782]: NO ACT: New 
Epoch for IMMND process at node 10a0f old epoch: 0  new epoch:147

2017-04-25T05:30:01.492622-04:00 local0.notice scm2 osafimmnd[2793]: NO SERVER 
STATE: IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY

2017-04-25T05:30:01.429155-04:00 local0.notice pld0104 osafimmnd[7230]: NO 
Epoch set to 147 in ImmModel

2017-04-25T05:30:01.497081-04:00 local0.notice pld0104 osafimmnd[7230]: NO 
SERVER STATE: IMM_SERVER_SYNC_CLIENT --> IMM_SERVER_READY

2017-04-25T05:30:01.497438-04:00 local0.notice pld0104 osafimmnd[7230]: NO 
ImmModel received scAbsenceAllowed 31536

2017-04-25T05:30:01.527159-04:00 local0.notice pld0104 osafclmna[11250]: Started

2017-04-25T05:30:01.562054-04:00 local0.notice pld0104 osafamfnd[11259]: Started

2017-04-25T05:30:01.446033-04:00 local0.notice pld0110 osafimmnd[22477]: NO 
NODE STATE-> IMM_NODE_FULLY_AVAILABLE 2901

2017-04-25T05:30:01.446346-04:00 local0.notice pld0110 osafimmnd[22477]: NO 
RepositoryInitModeT is SA_IMM_INIT_FROM_FILE

2017-04-25T05:30:01.446575-04:00 local0.warning pld0110 osafimmnd[22477]: WA 
IMM Access Control mode is DISABLED!

2017-04-25T05:30:01.446797-04:00 local0.notice pld0110 osafimmnd[22477]: NO 
Epoch set to 147 in ImmModel

2017-04-25T05:30:0

Re: [users] osafamfd coredump issue

2017-05-24 Thread Jianfeng Dong
Thanks Praveen. We tried but could not reproduce the issue; it appears to be
hard to reproduce.

According to the people who found the issue, all boards in the chassis were
rebooting at the time, as requested by a user command.

Here is the syslog from when the issue occurred:

2017-05-01T07:52:57.714906-04:00 scm2 kernel: tipc: Resetting link 
<1.1.16:eth2-1.1.5:bond0>, peer not responding

2017-05-01T07:52:57.714935-04:00 scm2 kernel: tipc: Lost link 
<1.1.16:eth2-1.1.5:bond0> on network plane A

2017-05-01T07:52:57.714939-04:00 scm2 kernel: tipc: Lost contact with <1.1.5>

2017-05-01T07:52:57.716788-04:00 scm2 osafimmd[3009]: NO MDS event from svc_id 
25 (change:4, dest:287038266327043)

2017-05-01T07:52:57.717304-04:00 scm2 osafclmd[4259]: NO Node 66831 went down. 
Not sending track callback for agents on that node

2017-05-01T07:52:57.719178-04:00 scm2 osafimmnd[3020]: NO Global discard node 
received for nodeId:1050f pid:15395

2017-05-01T07:52:57.719233-04:00 scm2 osafimmnd[3020]: NO Implementer 
disconnected 104 <0, 1050f(down)> (MsgQueueService66831)

2017-05-01T07:52:57.721345-04:00 scm2 osafamfd[4277]: NO Node 'PLD0105' left 
the cluster

2017-05-01T07:52:57.722778-04:00 scm2 log_demo[6160]: [0.I.Proc]: FYI state 
change notification from NTF, entity PLD0105 now has new state DISABLED (Oper 
state safAmfNode=PLD0105,safAmfCluster=myAmfCluster changed)

2017-05-01T07:52:57.732796-04:00 scm2 osafamfd[4277]: su.cc:2006: 
dec_curr_act_si: Assertion 'saAmfSUNumCurrActiveSIs > 0' failed.

2017-05-01T07:52:57.778777-04:00 scm2 kernel: tipc: Resetting link 
<1.1.16:eth2-1.1.6:bond0>, peer not responding

2017-05-01T07:52:57.778827-04:00 scm2 kernel: tipc: Lost link 
<1.1.16:eth2-1.1.6:bond0> on network plane A

2017-05-01T07:52:57.778833-04:00 scm2 kernel: tipc: Lost contact with <1.1.6>

2017-05-01T07:52:57.777979-04:00 scm2 osafimmd[3009]: NO MDS event from svc_id 
25 (change:4, dest:288139774320643)

2017-05-01T07:52:57.717343-04:00 scm2 osafclmd[4259]: NO Node 66831 went down. 
Not sending track callback for agents on that node

2017-05-01T07:52:57.779373-04:00 scm2 osafclmd[4259]: NO Node 67087 went down. 
Not sending track callback for agents on that node

2017-05-01T07:52:57.780552-04:00 scm2 osafimmnd[3020]: NO Global discard node 
received for nodeId:1060f pid:17439

2017-05-01T07:52:57.780607-04:00 scm2 osafimmnd[3020]: NO Implementer 
disconnected 106 <0, 1060f(down)> (MsgQueueService67087)

2017-05-01T07:52:57.810785-04:00 scm2 osafamfnd[5281]: WA AMF director 
unexpectedly crashed

2017-05-01T07:52:57.810839-04:00 scm2 osafamfnd[5281]: Rebooting OpenSAF NodeId 
= 69647 EE Name = , Reason: local AVD down(Adest) or both AVD down(Vdest) 
received, OwnNodeId = 69647, SupervisionTime = 0

2017-05-01T07:52:57.810978-04:00 scm2 osafimmnd[3020]: NO Implementer locally 
disconnected. Marking it as doomed 105 <29, 1100f> (safAmfService)

2017-05-01T07:52:57.812582-04:00 scm2 osafimmnd[3020]: NO Implementer 
disconnected 105 <29, 1100f> (safAmfService)

2017-05-01T07:52:57.950567-04:00 scm2 opensaf_reboot: Rebooting local node; 
timeout=0

2017-05-01T07:52:58.084968-04:00 scm2 atwdog[28335]: rebooting (-f) local node
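
A reading aid for the syslog above (not part of the original report): osafclmd, osafamfnd and the MsgQueueService names print the node ID in decimal, while osafimmnd prints it in hexadecimal, so lines about the same node are easier to correlate after converting. A minimal Python sketch; the helper name node_id_hex is made up for illustration:

```python
def node_id_hex(node_id: int) -> str:
    """Format a decimal node ID the way osafimmnd prints it (lowercase hex, no 0x)."""
    return format(node_id, "x")


# Decimal node IDs taken from the osafclmd and osafamfnd lines above.
for decimal_id in (66831, 67087, 69647):
    print(decimal_id, "->", node_id_hex(decimal_id))
```

This gives 1050f, 1060f and 1100f, which match the nodeId values in the osafimmnd "Global discard node" lines and the <29, 1100f> of the local node.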





And could you please do me a favor and open a ticket for the issue? I just tried 
to register on SourceForge but failed; the registration page always complains 
about something like “Form security missing”.



Thanks,

Jianfeng



-Original Message-
From: praveen malviya [mailto:praveen.malv...@oracle.com]
Sent: Wednesday, May 24, 2017 1:40 PM
To: Jianfeng Dong <jd...@juniper.net>; opensaf-users@lists.sourceforge.net
Subject: Re: [users] osafamfd coredump issue



Hi Jianfeng,



Any steps to reproduce it?

While AMFD was performing failover, it found a mismatch in the assignment 
counters and asserted.

Please share amfd traces if available, and also raise a ticket with the same.
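
To illustrate what a mismatch in assignment counters means in general, here is a toy Python sketch. It is not OpenSAF code: the class ServiceUnit and its fields are hypothetical, and only the method name dec_curr_act_si and the asserted condition are borrowed from the syslog line. It shows an unbalanced decrement tripping an invariant check; it says nothing about the actual root cause.

```python
class ServiceUnit:
    """Toy stand-in for an AMF service unit; not OpenSAF's AVD_SU class."""

    def __init__(self, active_assignments: int = 0) -> None:
        self.num_curr_active_sis = active_assignments

    def inc_curr_act_si(self) -> None:
        self.num_curr_active_sis += 1

    def dec_curr_act_si(self) -> None:
        # Invariant check comparable in spirit to the failed assertion
        # 'saAmfSUNumCurrActiveSIs > 0' reported in the syslog.
        if self.num_curr_active_sis <= 0:
            raise AssertionError("active SI counter would go negative")
        self.num_curr_active_sis -= 1


su = ServiceUnit()
su.inc_curr_act_si()
su.dec_curr_act_si()      # balanced: counter is back to 0
try:
    su.dec_curr_act_si()  # unbalanced: one removal too many
except AssertionError as err:
    print("assert:", err)
```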



Thanks

Praveen





On 23-May-17 3:41 PM, Jianfeng Dong wrote:

> Hi,

>

>

>

> We also got an 'osafamfd' coredump on our controller board, could someone 
> please take a look at the issue? Thanks in advance.

>

>

>

> I have listed the backtrace info here but have not attached the coredump file 
> (due to the email size limit), so please let me know if you need more information.

>

>

>

>

>

> root@scm1:/coredumps/# gdb /usr/lib64/opensaf/osafamfd

> core.image\=26115.proc\=osafamfd.pid\=4277.signal\=6.time\=1493639577

>

> GNU gdb (Wind River Linux Sourcery CodeBench 4.8-28) 7.6

>

> Copyright (C) 2013 Free Software Foundation, Inc.

>

> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

>

>

Re: [users] osafimmnd coredump issue

2017-05-23 Thread Jianfeng Dong
Hi,



We got an 'osafimmnd' core dump in our chassis, could someone please take a look 
at the issue? Thanks.

I can't attach the coredump file because it would make this email too big and 
get it blocked by the mail server, so I just list the backtrace info here. 
Please let me know if you need more information on the issue.





root@scm1:/coredumps# gdb /usr/lib64/opensaf/osafimmnd 
core.image\=26115.proc\=osafimmnd.pid\=2793.signal\=6.time\=1493112624

GNU gdb (Wind River Linux Sourcery CodeBench 4.8-28) 7.6

Copyright (C) 2013 Free Software Foundation, Inc.

License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

This is free software: you are free to change and redistribute it.

There is NO WARRANTY, to the extent permitted by law.  Type "show copying"

and "show warranty" for details.

This GDB was configured as "x86_64-wrs-linux-gnu".

For bug reporting instructions, please see:

<supp...@windriver.com>...

Reading symbols from /usr/lib64/opensaf/osafimmnd...Reading symbols from 
/usr/lib64/opensaf/.debug/osafimmnd...done.

done.

[New LWP 2793]

[New LWP 2797]

[New LWP 2795]

[New LWP 2796]



warning: Could not load shared library symbols for linux-vdso.so.1.

Do you need "set solib-search-path" or "set sysroot"?

[Thread debugging using libthread_db enabled]

Using host libthread_db library "/lib64/libthread_db.so.1".

Core was generated by `/usr/lib64/opensaf/osafimmnd osafimmnd'.

Program terminated with signal 6, Aborted.

#0  0x003d84a353e9 in __GI_raise (sig=sig@entry=6) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:56

56  ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.

(gdb) bt full

#0  0x003d84a353e9 in __GI_raise (sig=sig@entry=6) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:56

resultvar = 0

pid = 2793

selftid = 2793

#1  0x003d84a38508 in __GI_abort () at abort.c:89

save_stage = 2

act = {__sigaction_handler = {sa_handler = 0x2020202020202020, 
sa_sigaction = 0x2020202020202020}, sa_mask = {__val = {2314885530818453536, 
7239073644580708384, 7378697627939729267, 3474076752553600614, 
7378697383761162288,

  3919933115663279718, 3274715270390755632, 3472328296226648184, 
3475143045726351408, 2314885530819502128, 2314885530818453536, 
8319937555149627424, 746872325959545721, 3486968516907053670, 
233741787981676, 0}},

  sa_flags = 93, sa_restorer = 0x7fff53288a40}

sigs = {__val = {32, 0 }}

#2  0x003d84a6e964 in __libc_message (do_abort=do_abort@entry=2, 
fmt=fmt@entry=0x3d84b65f88 "*** Error in `%s': %s: 0x%s ***\n") at 
../sysdeps/posix/libc_fatal.c:175

ap = {{gp_offset = 40, fp_offset = 0, overflow_arg_area = 
0x7fff53288a50, reg_save_area = 0x7fff532889e0}}

fd = 2

on_2 = 

list = 

nlist = 

cp = 

written = 

#3  0x003d84a786be in malloc_printerr (action=3, str=0x3d84b62052 "free(): 
invalid pointer", ptr=) at malloc.c:4895

buf = "00f8c100"

cp = 

#4  0x003d84a79397 in _int_free (av=, p=0xf8c0f0, 
have_lock=0) at malloc.c:3751

size = 

fb = 

nextchunk = 

nextsize = 

nextinuse = 

prevsize = 

bck = 

fwd = 

errstr = 

locked = 

__func__ = "_int_free"

#5  0x004088af in freeSearchNext (rsp=0xbe5d60, freeTop=SA_TRUE) at 
immnd_evt.c:1378

al = 0x0

__FUNCTION__ = "freeSearchNext"

#6  0x00424602 in immnd_proc_imma_discard_connection (cb=0x6eee60 
<_immnd_cb>, cl_node=0x956b60, scAbsence=false) at immnd_proc.c:108

rsp = 0xbe5d60

client_id = 532842

node_id = 69647

sn = 0xbe5d40

implId = 0

send_evt = {next = 0x4a, type = 4910773, info = {imma = {type = 
IMMA_EVT_ND2A_SEARCHNEXT_RSP, info = {initRsp = {immHandle = 4910761, error = 
4941167}, errRsp = {error = 4910761,

  errStrings = 0x4b656f <__FUNCTION__.12396>}, admInitRsp = 
{error = 4910761, ownerId = 0}, ccbInitRsp = {error = 4910761, ccbId = 0}, 
searchInitRsp = {error = 4910761, searchId = 0}, searchNextRsp = 0x4aeea9,

searchBundleNextRsp = 0x4aeea9, searchRemote = {client_hdl = 
4910761, requestNodeId = 4941167, remoteNodeId = 0, searchId = 2244561952, 
objectName = {size = 1395166368,

buf = 0x423fe6  
"\203\275|\377\377\377\001t\"H\215\r\257\070\t"}, attributeNames = 
0x7fff53288c60}, admOpReq = {adminOwnerId = 4910761, invocation = 0, 
operationId = 4941167,

  continuationId = 264237567008, timeout = 140734588554400, 
objectName = {size = 4341734, buf = 0x7fff53288c60 ""}, params = 
0x7fff53288d90}, admOpRsp = {oi_client_hdl = 4910761, invocation = 4941167,

  result = 2244561952, error = 61, parms = 0x7fff53288ca0}, 
objCreate = {ccbId = 4910761, adminOwnerId = 0, className = {size = 

Re: [users] osafimmnd coredump issue

2017-05-23 Thread Jianfeng Dong

Hi Zoran,

Thanks for the comment. I'm not sure if we can reproduce the issue, but I will 
try tomorrow and check the syslog to see how we hit it.

Thanks,
Jianfeng

> On May 23, 2017, at 7:44 PM, Zoran Milinkovic <zoran.milinko...@ericsson.com> wrote:
> 
> Hi Jianfeng,
> 
> We have the same issue. Seems that the problem was introduced by ticket #1848.
> I'm working on this now.
> 
> Can you reproduce the problem ?
> If you can, can you provide the steps ?
> 
> Thanks,
> Zoran
> 
> -Original Message-
> From: Jianfeng Dong [mailto:jd...@juniper.net] 
> Sent: 23 May 2017 12:39
> To: opensaf-users@lists.sourceforge.net
> Subject: [users] osafimmnd coredump issue
> 
> Hi,
> 
> 
> 
> We got an 'osafimmnd' core dump in our chassis, could someone please take a 
> look at the issue? Thanks.
> 
> I can't attach the coredump file here due to OpenSAF email size limit policy, 
> so please let me know if you need more information on the issue.
> 
> 
> 
> 
> 
> atlas@scm1:/coredumps/ $ gdb /usr/lib64/opensaf/osafimmnd 
> core.image\=26115.proc\=osafimmnd.pid\=2793.signal\=6.time\=1493112624
> 
> GNU gdb (Wind River Linux Sourcery CodeBench 4.8-28) 7.6
> 
> Copyright (C) 2013 Free Software Foundation, Inc.
> 
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> 
> This is free software: you are free to change and redistribute it.
> 
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> 
> and "show warranty" for details.
> 
> This GDB was configured as "x86_64-wrs-linux-gnu".
> 
> For bug reporting instructions, please see:
> 
> <supp...@windriver.com>...
> 
> Reading symbols from /usr/lib64/opensaf/osafimmnd...Reading symbols from 
> /usr/lib64/opensaf/.debug/osafimmnd...done.
> 
> done.
> 
> [New LWP 2793]
> 
> [New LWP 2797]
> 
> [New LWP 2795]
> 
> [New LWP 2796]
> 
> 
> 
> warning: Could not load shared library symbols for linux-vdso.so.1.
> 
> Do you need "set solib-search-path" or "set sysroot"?
> 
> [Thread debugging using libthread_db enabled]
> 
> Using host libthread_db library "/lib64/libthread_db.so.1".
> 
> Core was generated by `/usr/lib64/opensaf/osafimmnd osafimmnd'.
> 
> Program terminated with signal 6, Aborted.
> 
> #0  0x003d84a353e9 in __GI_raise (sig=sig@entry=6) at 
> ../nptl/sysdeps/unix/sysv/linux/raise.c:56
> 
> 56  ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
> 
> (gdb) bt
> 
> #0  0x003d84a353e9 in __GI_raise (sig=sig@entry=6) at 
> ../nptl/sysdeps/unix/sysv/linux/raise.c:56
> 
> #1  0x003d84a38508 in __GI_abort () at abort.c:89
> 
> #2  0x003d84a6e964 in __libc_message (do_abort=do_abort@entry=2, 
> fmt=fmt@entry=0x3d84b65f88 "*** Error in `%s': %s: 0x%s ***\n") at 
> ../sysdeps/posix/libc_fatal.c:175
> 
> #3  0x003d84a786be in malloc_printerr (action=3, str=0x3d84b62052 
> "free(): invalid pointer", ptr=) at malloc.c:4895
> 
> #4  0x003d84a79397 in _int_free (av=, p=0xf8c0f0, 
> have_lock=0) at malloc.c:3751
> 
> #5  0x004088af in freeSearchNext (rsp=0xbe5d60, freeTop=SA_TRUE) at 
> immnd_evt.c:1378
> 
> #6  0x00424602 in immnd_proc_imma_discard_connection (cb=0x6eee60 
> <_immnd_cb>, cl_node=0x956b60, scAbsence=false) at immnd_proc.c:108
> 
> #7  0x0040a657 in immnd_evt_proc_imm_finalize (cb=0x6eee60 
> <_immnd_cb>, evt=0x7ff2640029c0, sinfo=0x7ff264002b00, isOm=SA_TRUE) at 
> immnd_evt.c:2071
> 
> #8  0x0040614c in immnd_process_evt () at immnd_evt.c:535
> 
> #9  0x00422e14 in main (argc=2, argv=0x7fff532890f8) at 
> immnd_main.c:370
> 
> (gdb)
> 
> 
> 
> 
> 
> Thanks,
> 
> Jianfeng
> 
> 
> --
> Check out the vibrant tech community on one of the world's most engaging tech 
> sites, Slashdot.org! http://sdm.link/slashdot 
> ___
> Opensaf-users mailing list
> Opensaf-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/opensaf-users


[users] osafamfd coredump issue

2017-05-23 Thread Jianfeng Dong
Hi,



We also got an 'osafamfd' coredump on our controller board, could someone please 
take a look at the issue? Thanks in advance.



I have listed the backtrace info here but have not attached the coredump file 
(due to the email size limit), so please let me know if you need more information.





root@scm1:/coredumps/# gdb /usr/lib64/opensaf/osafamfd 
core.image\=26115.proc\=osafamfd.pid\=4277.signal\=6.time\=1493639577

GNU gdb (Wind River Linux Sourcery CodeBench 4.8-28) 7.6

Copyright (C) 2013 Free Software Foundation, Inc.

License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

This is free software: you are free to change and redistribute it.

There is NO WARRANTY, to the extent permitted by law.  Type "show copying"

and "show warranty" for details.

This GDB was configured as "x86_64-wrs-linux-gnu".

For bug reporting instructions, please see:

<supp...@windriver.com>...

Reading symbols from /usr/lib64/opensaf/osafamfd...Reading symbols from 
/usr/lib64/opensaf/.debug/osafamfd...done.

done.

[New LWP 4277]

[New LWP 4279]

[New LWP 4280]

[New LWP 4282]



warning: Could not load shared library symbols for linux-vdso.so.1.

Do you need "set solib-search-path" or "set sysroot"?

[Thread debugging using libthread_db enabled]

Using host libthread_db library "/lib64/libthread_db.so.1".

Core was generated by `/usr/lib64/opensaf/osafamfd osafamfd'.

Program terminated with signal 6, Aborted.

#0  0x003d84a353e9 in __GI_raise (sig=sig@entry=6) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:56

56  ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.

(gdb) bt full

#0  0x003d84a353e9 in __GI_raise (sig=sig@entry=6) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:56

resultvar = 0

pid = 4277

selftid = 4277

#1  0x003d84a38508 in __GI_abort () at abort.c:89

save_stage = 2

act = {__sigaction_handler = {sa_handler = 0x51560d, sa_sigaction = 
0x51560d}, sa_mask = {__val = {2006, 5336880, 5335460, 2130303778826, 5320117, 
9977552, 264237561592, 140737298378800, 264235064979, 17179869185,

  18442240615826079272, 4294967296, 5873756416, 5321392, 14559416, 
140737298378864}}, sa_flags = -2052873586, sa_restorer = 0x0}

sigs = {__val = {32, 0 }}

#2  0x003d85a2110a in __osafassert_fail (__file=0x51560d "su.cc", 
__line=2006, __func=0x516f30  
"dec_curr_act_si", __assertion=0x5169a4 "saAmfSUNumCurrActiveSIs > 0") at 
sysf_def.c:281

No locals.

#3  0x004d907d in AVD_SU::dec_curr_act_si (this=0xde8390) at su.cc:2006

__FUNCTION__ = "dec_curr_act_si"

#4  0x004c0301 in avd_susi_delete (cb=0x75a2e0 <_control_block>, 
susi=0xd38320, ckpt=false) at siass.cc:554

i_su_si = 0xd38320

su = 0xde8390

__FUNCTION__ = "avd_susi_delete"

p_su_si = 0x0

p_si_su = 0x0

#5  0x004964e1 in SG_NORED::node_fail (this=0xd7e9a0, cb=0x75a2e0 
<_control_block>, su=0xde8390) at sg_nored_fsm.cc:781

l_si = 0x74ad31a0

old_state = SA_AMF_HA_QUIESCED

su_node_ptr = 0x0

__FUNCTION__ = "node_fail"

#6  0x004b8c78 in avd_node_down_mw_susi_failover (cb=0x75a2e0 
<_control_block>, avnd=0x9e3bf0) at sgproc.cc:1983

i_su = @0xde84a0: 0xde8390

__for_range = @0x9e3eb8: { >> = {_M_impl = {> = 
{<__gnu_cxx::new_allocator> = {}, }, 
_M_start = 0xde84a0,

  _M_finish = 0xde84a8, _M_end_of_storage = 0xde84a8}}, }

__for_begin = {_M_current = 0xde84a0}

__for_end = {_M_current = 0xde84a8}

__FUNCTION__ = "avd_node_down_mw_susi_failover"

#7  0x0045eb75 in avd_node_failover (node=0x9e3bf0) at ndproc.cc:1142

__FUNCTION__ = "avd_node_failover"

#8  0x00456fea in avd_mds_avnd_down_evh (cb=0x75a2e0 <_control_block>, 
evt=0x7f5f78000ec0) at ndfsm.cc:684

node = 0x9e3bf0

__FUNCTION__ = "avd_mds_avnd_down_evh"

#9  0x004514f5 in process_event (cb_now=0x75a2e0 <_control_block>, 
evt=0x7f5f78000ec0) at main.cc:775

__FUNCTION__ = "process_event"

#10 0x00451211 in main_loop () at main.cc:696

pollretval = 1

evt = 0x7f5f78000ec0

mbx_fd = {raise_obj = 10, rmv_obj = 11}

polltmo = -1

term_fd = 22

__FUNCTION__ = "main_loop"

cb = 0x75a2e0 <_control_block>

error = SA_AIS_OK

#11 0x0045178f in main (argc=2, argv=0x74ad33e8) at main.cc:848

No locals.

(gdb)





Regards,

Jianfeng



Re: [users] Question about ticket 1617

2017-03-22 Thread Jianfeng Dong
Thanks, I will give it a try this way. Have a good day! :-)

Regards,
Jianfeng

-Original Message-
From: Zoran Milinkovic [mailto:zoran.milinko...@ericsson.com] 
Sent: Wednesday, March 22, 2017 12:04 AM
To: Jianfeng Dong <jd...@juniper.net>; opensaf-users@lists.sourceforge.net
Subject: RE: Question about ticket 1617

Hi,

Those were the steps to reproduce the problem in OpenSAF 4.5 without changing 
the code.

BR,
Zoran

-Original Message-
From: Jianfeng Dong [mailto:jd...@juniper.net] 
Sent: 21 March 2017 16:58
To: Zoran Milinkovic <zoran.milinko...@ericsson.com>; 
opensaf-users@lists.sourceforge.net
Subject: RE: Question about ticket 1617

Thank you Zoran, I much appreciate your detailed description!

Also, do you know any way to reproduce the issue without changing any OpenSAF 
code? I'm not familiar with OpenSAF's source code, and it would take a long 
time (probably more than one day) to build the modified OpenSAF in our build 
environment.

Thanks,
Jianfeng

-Original Message-
From: Zoran Milinkovic [mailto:zoran.milinko...@ericsson.com] 
Sent: Monday, March 20, 2017 11:15 PM
To: Jianfeng Dong <jd...@juniper.net>; opensaf-users@lists.sourceforge.net
Subject: RE: Question about ticket 1617

Hi Jianfeng,

Ticket #47 (https://sourceforge.net/p/opensaf/tickets/47/) implemented a 
search handle timeout of 10 minutes.
The ticket was pushed to OpenSAF 4.5.

To reproduce the problem you describe below in OpenSAF 4.5, first call the 
searchInitialize function, then move the system time forward by more than 10 
minutes, and after that start calling searchNext.
Be aware that a searchNext request to IMMND can pull several search results at 
once, after which searchNext works locally on the client. That is done for 
performance reasons. Because of that, some searchNext calls will still succeed 
after the 10 minutes, until the next searchNext request is sent to IMMND.
If you change the time right after searchInitialize, you will be able to see 
the timeout immediately.
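
The prefetching behaviour described above can be sketched with a toy model in Python. Nothing here is OpenSAF code: ToyServer, ToyClient, the batch size of 2 and the injected fake clock are all assumptions made for illustration. Only the 600-second timeout comes from the thread.

```python
SEARCH_TIMEOUT_SEC = 600  # the 10-minute search handle timeout from ticket #47


class ToyServer:
    """Stand-in for IMMND: drops a search not touched within the timeout."""

    def __init__(self, results, clock):
        self.results = list(results)
        self.clock = clock              # callable returning "now" in seconds
        self.last_access = clock()
        self.pos = 0

    def fetch_batch(self, size):
        if self.clock() - self.last_access > SEARCH_TIMEOUT_SEC:
            raise LookupError("search handle timed out")
        self.last_access = self.clock()
        batch = self.results[self.pos:self.pos + size]
        self.pos += len(batch)
        return batch


class ToyClient:
    """Stand-in for the IMM agent: serves results from a local batch first."""

    def __init__(self, server, batch_size=2):
        self.server = server
        self.batch_size = batch_size
        self.local = []

    def search_next(self):
        if not self.local:              # only now is the server contacted
            self.local = self.server.fetch_batch(self.batch_size)
        return self.local.pop(0)


now = [0.0]                             # fake wall clock we can move by hand
client = ToyClient(ToyServer(["obj1", "obj2", "obj3"], lambda: now[0]))
print(client.search_next())             # fetches a batch of two, returns obj1
now[0] += 601                           # wall clock jumps forward > 10 minutes
print(client.search_next())             # still served from the local batch
try:
    client.search_next()                # needs the server again
except LookupError as err:
    print("failed:", err)
```

In this model the failure only shows up on the call that has to go back to the server, which is why moving the clock before any results have been fetched exposes the timeout soonest.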

Ticket #1617 (https://sourceforge.net/p/opensaf/tickets/1617/) replaced the 
use of system time with monotonic time, so that the case above cannot happen.
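
Why a monotonic clock avoids the problem can be shown with a small Python sketch. The Deadline class and the two simulated clocks are illustrative assumptions, not the code from ticket #1617.

```python
import time


class Deadline:
    """Expire after `timeout` seconds as measured by the given clock."""

    def __init__(self, timeout, clock=time.monotonic):
        self.clock = clock
        self.expires_at = clock() + timeout

    def expired(self):
        return self.clock() >= self.expires_at


# Simulated clocks: the wall clock is stepped by a time sync,
# the monotonic clock only advances with real elapsed time.
wall = [1000.0]
mono = [50.0]

by_wall = Deadline(600, clock=lambda: wall[0])
by_mono = Deadline(600, clock=lambda: mono[0])

wall[0] += 3600   # a time sync moves the system time one hour forward
mono[0] += 1      # only one second has really passed

print(by_wall.expired())  # True: spurious timeout
print(by_mono.expired())  # False
```

In C the corresponding clock is CLOCK_MONOTONIC, read with clock_gettime(); whether the OpenSAF fix uses exactly that call is not stated in this thread.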

Thanks,
Zoran

-Original Message-
From: Jianfeng Dong [mailto:jd...@juniper.net] 
Sent: 20 March 2017 15:46
To: opensaf-users@lists.sourceforge.net
Subject: [users] Question about ticket 1617

Hi,



We have a bug that seems related to ticket 1617 ( 
https://sourceforge.net/p/opensaf/tickets/1617/ ). We just found that OpenSAF 
5.0 has resolved the issue, but we don't know how to reproduce it in our 
release with OpenSAF 4.5.2, so that we can confirm it is indeed caused by 
ticket 1617 and is gone in our new release with OpenSAF 5.0.



So, could you please tell us how to reproduce the issue of ticket 1617 in 
release 4.5.2 in an easy or reliable way? Thanks in advance!





Here is the log from when the issue happened. It seems it was caused by the 
system time changing while OpenSAF was waiting for something:



2015-11-18T14:12:12.624742-05:00 pld0113 logger: Synced time with server 
100.100.0.1

2015-11-18T14:12:12.627585-05:00 pld0113 logger: Set hardware clock

2015-11-18T14:12:12.635408-05:00 pld0113 osafimmnd[3427]: NO Clear 1 search 
result(s) for OM handle 3500010d0f. Search timeout 600sec

2015-11-18T14:12:12.635563-05:00 pld0113 osafimmnd[3427]: ER Could not find 
search node for search-ID:54

2015-11-18T14:12:12.635853-05:00 pld0113 osafamfnd[3536]: saImmOmSearchNext 
FAILED, rc = 9

2015-11-18T14:12:12.649634-05:00 pld0113 osafimmnd[3427]: NO Implementer 
connected: 23 (MsgQueueService68879) <0, 10f0f>

2015-11-18T14:12:12.651858-05:00 pld0113 osafimmnd[3427]: NO Implementer 
disconnected 23 <0, 10f0f> (MsgQueueService68879)

2015-11-18T14:12:13.525888-05:00 pld0113 ntpd[3557]: ntpd 
4.2.6p5@1.2349-o Fri Nov  6 19:29:25 UTC 2015 (2)



2015-11-06T14:44:07.488664-05:00 pld0107 osafclmna[3531]: Started

2015-11-06T14:44:07.492894-05:00 pld0107 osafclmna[3531]: NO 
safNode=pld0107,safCluster=myClmCluster Joined cluster, nodeid=1070f

2015-11-06T14:44:07.523654-05:00 pld0107 osafamfnd[3540]: Started

2015-11-18T15:14:40.889384-05:00 pld0107 osafimmnd[3431]: NO Clear 1 search 
result(s) for OM handle d0001070f. Search timeout 600sec

2015-11-18T15:14:40.893124-05:00 pld0107 logger: Synced time with server 
100.100.0.1

2015-11-18T15:14:40.894797-05:00 pld0107 osafimmnd[3431]: ER Could not find 
search node for search-ID:14

2015-11-18T15:14:40.895280-05:00 pld0107 osafamfnd[3540]: saImmOmSearchNext 
FAILED, rc = 9

2015-11-18T15:14:40.896828-05:00 pld0107 logger: Set hardware clock

2015-11-18T15:14:40.915801-05:00 pld0107 osafimmnd[3431]: NO Implementer 
connected: 28 (MsgQueueService67343) <0, 10f0f>

2015-11-18T15:14:40.920772-05:00 pld0107 osafimmnd[3431]: NO Implementer 
disconnected 28 <0, 10f0f> (MsgQueueService67343)
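
As a quick check on the pld0107 excerpt above, the step between two consecutive syslog timestamps can be compared with the 600-second search timeout. This is a minimal Python sketch; reading the gap as a clock step caused by the time sync is an interpretation of the log, not something the log states.

```python
from datetime import datetime

SEARCH_TIMEOUT_SEC = 600  # "Search timeout 600sec" in the osafimmnd line

# Timestamps of two consecutive pld0107 syslog lines quoted above.
before = datetime.fromisoformat("2015-11-06T14:44:07.523654-05:00")
after = datetime.fromisoformat("2015-11-18T15:14:40.889384-05:00")

jump = (after - before).total_seconds()
print(jump > SEARCH_TIMEOUT_SEC)  # True: the step is far larger than the timeout
```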





Thanks,

Jianfeng



Re: [users] Question about ticket 1617

2017-03-21 Thread Jianfeng Dong
Thank you Zoran, I much appreciate your detailed description!

Also, do you know any way to reproduce the issue without changing any OpenSAF 
code? I'm not familiar with OpenSAF's source code, and it would take a long 
time (probably more than one day) to build the modified OpenSAF in our build 
environment.

Thanks,
Jianfeng



Re: [users] How to detect if PLD is being in "SC Absence" mode?

2017-03-10 Thread Jianfeng Dong
Thank you Praveen, much appreciated!



IMO it would be great if the user could get the status of all nodes, as in the 
output below, from a tool application (just like AmfUtil/immadm).



Id  Name   State

2   SC-1   up
3   SC-2   down
4   PL-3   up



Id  Name   State

2   SC-1   down
3   SC-2   down
4   PL-3   SAM (SC Absent Mode)



Regards,

Jianfeng



-Original Message-
From: praveen malviya [mailto:praveen.malv...@oracle.com]
Sent: Friday, March 10, 2017 1:59 PM
To: Jianfeng Dong <jd...@juniper.net>; opensaf-users@lists.sourceforge.net
Subject: Re: [users] How to detect if PLD is being in "SC Absence" mode?







On 01-Mar-17 2:46 PM, Jianfeng Dong wrote:

> Hi,

>

> We have enabled the new feature "SC Absence" of OpenSAF 5.x in our product, 
> it works well so far.

>

> Now we need to take some actions when a PLD enters or leaves "SC Absence" mode, 
> so we have to find a way on the PLD to detect whether it is in "SC Absence" mode or not.

> So, does anyone know how to do this with a utility/tool and from C code (i.e. 
> an OpenSAF API) as well?

>

There is no API to query OpenSAF to find out whether a payload is running in 
the presence or absence of SCs. However, the behavior of the SAF APIs for the 
SAF services is documented in the respective PR docs.

As of now I have raised a discussion ticket "#2354 osaf: How to detect if 
payload is being in "SC Absence" mode." The ticket will be updated with any 
known or proposed solution.



Thanks,

Praveen





> Thanks,

> Jianfeng


>


[users] How to detect if PLD is being in "SC Absence" mode?

2017-03-01 Thread Jianfeng Dong
Hi,

We have enabled the new feature "SC Absence" of OpenSAF 5.x in our product, it 
works well so far.

Now we need to take some actions when a PLD enters or leaves "SC Absence" mode, 
so we have to find a way on the PLD to detect whether it is in "SC Absence" 
mode or not.
So, does anyone know how to do this with a utility/tool and from C code (i.e. 
an OpenSAF API) as well?

Thanks,
Jianfeng


Re: [users] Max timeout limit problem in the "headless cluster" feature

2016-10-25 Thread Jianfeng Dong
Got it, probably we will upgrade to the next release.
Thanks a lot!

Regards,
Jianfeng

-Original Message-
From: Zoran Milinkovic [mailto:zoran.milinko...@ericsson.com] 
Sent: Tuesday, October 25, 2016 2:48 PM
To: Jianfeng Dong <jd...@juniper.net>; Hung Duc Nguyen 
<hung.d.ngu...@dektech.com.au>; opensaf-users@lists.sourceforge.net
Subject: RE: [users] Max timeout limit problem in the "headless cluster" feature

Hi Jianfeng,

The patch cannot be pushed to earlier OpenSAF versions due to the increased MBC 
version.
The patch will be pushed only to the development branch and will be included in 
the next OpenSAF release.
If you need the patch for an earlier OpenSAF version, you can take the patch 
and backport it to your version.

Thanks,
Zoran

-Original Message-
From: Jianfeng Dong [mailto:jd...@juniper.net] 
Sent: 21 October 2016 09:36
To: Hung Duc Nguyen <hung.d.ngu...@dektech.com.au>; 
opensaf-users@lists.sourceforge.net
Subject: Re: [users] Max timeout limit problem in the "headless cluster" feature

Thanks!

Regards,
Jianfeng

From: Hung Nguyen [mailto:hung.d.ngu...@dektech.com.au]
Sent: Friday, October 21, 2016 11:11 AM
To: Jianfeng Dong <jd...@juniper.net>; opensaf-users@lists.sourceforge.net
Subject: Re: [users] Max timeout limit problem in the "headless cluster" feature


Yes, it will be pushed to all branches, including 5.1.



Regards,


Hung Nguyen - DEK Technologies


----

From: Jianfeng Dong <jd...@juniper.net>
Sent: Thursday, October 20, 2016 6:17PM
To: Hung Nguyen <hung.d.ngu...@dektech.com.au>; Opensaf-users <opensaf-users@lists.sourceforge.net>
Subject: RE: [users] Max timeout limit problem in the "headless cluster" feature


Much appreciated!

I notice the milestone is 5.0.2, but the fix for this ticket will go into 
5.1.0 as well, right?

Regards,
Jianfeng

From: Hung Nguyen [mailto:hung.d.ngu...@dektech.com.au]
Sent: Thursday, October 20, 2016 6:48 PM
To: Jianfeng Dong <jd...@juniper.net>; opensaf-users@lists.sourceforge.net
Subject: Re: [users] Max timeout limit problem in the "headless cluster" feature


Hi,



I opened a defect ticket for this.

https://sourceforge.net/p/opensaf/tickets/2130/



Regards,

Hung Nguyen - DEK Technologies




From: Jianfeng Dong <jd...@juniper.net>
Sent: Tuesday, October 18, 2016 11:47PM
To: Opensaf-users <opensaf-users@lists.sourceforge.net>
Cc: Jianfeng Dong <jd...@juniper.net>
Subject: [users] Max timeout limit problem in the "headless cluster" feature





Hi,



We are now trying to use OpenSAF's new "headless cluster" feature in our 
product, but we found that "IMMSV_SC_ABSENCE_ALLOWED" (stored in a 16-bit 
unsigned variable) cannot be set higher than 65535 seconds, which is only about 
18 hours. We think this is too short for our product in some special cases, 
where we want the payload cards to keep working as long as possible in 
"headless" mode. We checked the OpenSAF source code and found that this 16-bit 
unsigned variable could be changed to a 32-bit unsigned type. With this change 
we could keep the payloads from auto-rebooting for long enough until the 
controllers come back, and the change should not be very risky.
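
To put numbers on the limit described above, here is a small worked calculation. It is only a sketch: it assumes the value is counted in seconds (as the "65535 means about 18 hours" figure implies) and does not use any OpenSAF code; the helper name max_unsigned is made up for illustration.

```python
# Compare the largest timeout (in seconds) that fits in an unsigned
# 16-bit variable with the largest that fits in an unsigned 32-bit one.
SECONDS_PER_HOUR = 3600
SECONDS_PER_YEAR = 365 * 24 * SECONDS_PER_HOUR


def max_unsigned(bits: int) -> int:
    """Largest value an unsigned integer of the given width can hold."""
    return (1 << bits) - 1


max_u16 = max_unsigned(16)  # 65535
max_u32 = max_unsigned(32)  # 4294967295

hours_u16 = max_u16 / SECONDS_PER_HOUR
years_u32 = max_u32 / SECONDS_PER_YEAR

print(f"16-bit limit: {max_u16} s = {hours_u16:.1f} hours")
print(f"32-bit limit: {max_u32} s = {years_u32:.1f} years")
# prints:
# 16-bit limit: 65535 s = 18.2 hours
# 32-bit limit: 4294967295 s = 136.2 years
```

So a 32-bit variable would raise the ceiling from roughly 18 hours to well over a century, which is effectively "as long as possible" for this use case.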



So, could you please store "IMMSV_SC_ABSENCE_ALLOWED" in a 32-bit unsigned 
variable so that payloads can keep running for more than 18 hours in 
"headless" mode? Thanks!



Any comments are much appreciated!



Regards,

Jianfeng Dong

--

Check out the vibrant tech community on one of the world's most

engaging tech sites, SlashDot.org! http://sdm.link/slashdot

___

Opensaf-users mailing list

Opensaf-users@lists.sourceforge.net<mailto:Opensaf-users@lists.sourceforge.net>

https://lists.sourceforge.net/lists/listinfo/opensaf-users



Re: [users] How to prevent system auto-reboot if "osafamfd/osafamfnd" is killed?

2016-10-24 Thread Jianfeng Dong
Thanks for clearing that up! So we will have to change our design to resolve 
this issue.

Regards,
Jianfeng

-----Original Message-----
From: praveen malviya [mailto:praveen.malv...@oracle.com] 
Sent: Monday, October 24, 2016 7:56 PM
To: Jianfeng Dong <jd...@juniper.net>; Hans Nordeback 
<hans.nordeb...@ericsson.com>; opensaf-users@lists.sourceforge.net
Subject: Re: [users] How to prevent system auto-reboot if "osafamfd/osafamfnd" 
is killed?

There is no way as of now. But there is a ticket in the AMFD ticket database 
where a solution is proposed to allow the user to stop the healthcheck by 
changing the healthcheck period to 0. So when the user modifies it to 0, AMF 
should not give the callback to the component.


Thanks,
Praveen

On 22-Oct-16 3:01 PM, Jianfeng Dong wrote:
> Hi,
>
> Thank you, but is there another way to stop osafamfwd's function at run time 
> (i.e. without changing the config)? We just want to temporarily disable its 
> auto-reboot for a little while, and then re-enable it after that.
>
> Regards,
> Jianfeng
>
> -----Original Message-----
> From: Hans Nordeback [mailto:hans.nordeb...@ericsson.com]
> Sent: Saturday, October 22, 2016 5:19 PM
> To: Jianfeng Dong <jd...@juniper.net>; 
> opensaf-users@lists.sourceforge.net
> Subject: Re: [users] How to prevent system auto-reboot if 
> "osafamfd/osafamfnd" is killed?
>
> Hi,
>
> you can uncomment AMFWDOG_TIMEOUT_MS in /etc/opensaf/amfwdog.conf and change 
> accordingly.
>
> /Thanks HansN
>
>
> On 10/22/2016 11:12 AM, Jianfeng Dong wrote:
>> Hi,
>>
>> We know "osafamfwd" will reboot the local node when "osafamfd/osafamfnd" is 
>> killed, but now we have a special case that requires us to prevent this 
>> auto-reboot for a short time. Could somebody please tell us how to meet 
>> this requirement? Or, how can we stop osafamfwd's auto-reboot function for 
>> a little while at run time?
>> Thanks!
>>
>> Regards,
>> Jianfeng Dong
>>
>
>
>



Re: [users] How to prevent system auto-reboot if "osafamfd/osafamfnd" is killed?

2016-10-22 Thread Jianfeng Dong
Hi,

Thank you, but is there another way to stop osafamfwd's function at run time 
(i.e. without changing the config)? We just want to temporarily disable its 
auto-reboot for a little while, and then re-enable it after that.

Regards,
Jianfeng

-----Original Message-----
From: Hans Nordeback [mailto:hans.nordeb...@ericsson.com] 
Sent: Saturday, October 22, 2016 5:19 PM
To: Jianfeng Dong <jd...@juniper.net>; opensaf-users@lists.sourceforge.net
Subject: Re: [users] How to prevent system auto-reboot if "osafamfd/osafamfnd" 
is killed?

Hi,

you can uncomment AMFWDOG_TIMEOUT_MS in /etc/opensaf/amfwdog.conf and change 
accordingly.
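
As a rough illustration of the "uncomment and change" step, here is a minimal sketch that edits the text of such a config file. It rests on assumptions: that the file holds a commented-out line of the form "#export AMFWDOG_TIMEOUT_MS=<value>" (check your own amfwdog.conf), and that the sample text and the value 120000 below are made up. The helper set_commented_option is hypothetical and not part of OpenSAF.

```python
import re


def set_commented_option(conf_text: str, key: str, value: str) -> str:
    """Uncomment the first 'KEY=...' line (optionally 'export KEY=...')
    and set it to the given value. Raises KeyError if no such line exists."""
    pattern = re.compile(r"^\s*#?\s*(export\s+)?" + re.escape(key) + r"=.*$")
    out = []
    replaced = False
    for line in conf_text.splitlines():
        match = pattern.match(line)
        if match and not replaced:
            prefix = match.group(1) or ""  # keep 'export ' if it was there
            out.append(f"{prefix}{key}={value}")
            replaced = True
        else:
            out.append(line)
    if not replaced:
        raise KeyError(f"{key} not found in config text")
    return "\n".join(out) + "\n"


# Hypothetical file content, for illustration only.
sample = (
    "# Watchdog settings (hypothetical sample)\n"
    "#export AMFWDOG_TIMEOUT_MS=60000\n"
)
print(set_commented_option(sample, "AMFWDOG_TIMEOUT_MS", "120000"))
```

In practice you would read /etc/opensaf/amfwdog.conf, pass its text through a helper like this, and write the result back. Note that this is still a config change, so it does not by itself answer the question of disabling the watchdog at run time.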

/Thanks HansN


On 10/22/2016 11:12 AM, Jianfeng Dong wrote:
> Hi,
>
> We know "osafamfwd" will reboot the local node when "osafamfd/osafamfnd" is 
> killed, but now we have a special case that requires us to prevent this 
> auto-reboot for a short time. Could somebody please tell us how to meet this 
> requirement? Or, how can we stop osafamfwd's auto-reboot function for a 
> little while at run time?
> Thanks!
>
> Regards,
> Jianfeng Dong
>




Re: [users] Max timeout limit problem in the "headless cluster" feature

2016-10-21 Thread Jianfeng Dong
Thanks!

Regards,
Jianfeng

From: Hung Nguyen [mailto:hung.d.ngu...@dektech.com.au]
Sent: Friday, October 21, 2016 11:11 AM
To: Jianfeng Dong <jd...@juniper.net>; opensaf-users@lists.sourceforge.net
Subject: Re: [users] Max timeout limit problem in the "headless cluster" feature


Yes, it will be pushed to all branches, including 5.1.



Regards,


Hung Nguyen - DEK Technologies




From: Jianfeng Dong jd...@juniper.net<mailto:jd...@juniper.net>

Sent: Thursday, October 20, 2016 6:17PM

To: Hung Nguyen, Opensaf-users

hung.d.ngu...@dektech.com.au<mailto:hung.d.ngu...@dektech.com.au>, 
opensaf-users@lists.sourceforge.net<mailto:opensaf-users@lists.sourceforge.net>

Cc:



Subject: RE: [users] Max timeout limit problem in the "headless cluster" feature




Much appreciated!

I notice the milestone is 5.0.2, but the fix for this ticket will go into 
5.1.0 as well, right?

Regards,
Jianfeng

From: Hung Nguyen [mailto:hung.d.ngu...@dektech.com.au]
Sent: Thursday, October 20, 2016 6:48 PM
To: Jianfeng Dong <jd...@juniper.net><mailto:jd...@juniper.net>; 
opensaf-users@lists.sourceforge.net<mailto:opensaf-users@lists.sourceforge.net>
Subject: Re: [users] Max timeout limit problem in the "headless cluster" feature


Hi,



I opened a defect ticket for this.

https://sourceforge.net/p/opensaf/tickets/2130/



Regards,

Hung Nguyen - DEK Technologies


----

From: Jianfeng Dong jd...@juniper.net<mailto:jd...@juniper.net>

Sent: Tuesday, October 18, 2016 11:47PM

To: Opensaf-users


opensaf-users@lists.sourceforge.net<mailto:opensaf-users@lists.sourceforge.net>

Cc: Jianfeng Dong

jd...@juniper.net<mailto:jd...@juniper.net>

Subject: [users] Max timeout limit problem in the "headless cluster" feature





Hi,



We are now trying to use OpenSAF's new "headless cluster" feature in our 
product, but we found that "IMMSV_SC_ABSENCE_ALLOWED" (stored in a 16-bit 
unsigned variable) cannot be set higher than 65535 seconds, which is only about 
18 hours. We think this is too short for our product in some special cases, 
where we want the payload cards to keep working as long as possible in 
"headless" mode. We checked the OpenSAF source code and found that this 16-bit 
unsigned variable could be changed to a 32-bit unsigned type. With this change 
we could keep the payloads from auto-rebooting for long enough until the 
controllers come back, and the change should not be very risky.



So, could you please store "IMMSV_SC_ABSENCE_ALLOWED" in a 32-bit unsigned 
variable so that payloads can keep running for more than 18 hours in 
"headless" mode? Thanks!



Any comments are much appreciated!



Regards,

Jianfeng Dong





Re: [users] Max timeout limit problem in the "headless cluster" feature

2016-10-20 Thread Jianfeng Dong
Much appreciated!

I notice the milestone is 5.0.2, but the fix for this ticket will go into 
5.1.0 as well, right?

Regards,
Jianfeng

From: Hung Nguyen [mailto:hung.d.ngu...@dektech.com.au]
Sent: Thursday, October 20, 2016 6:48 PM
To: Jianfeng Dong <jd...@juniper.net>; opensaf-users@lists.sourceforge.net
Subject: Re: [users] Max timeout limit problem in the "headless cluster" feature


Hi,



I opened a defect ticket for this.

https://sourceforge.net/p/opensaf/tickets/2130/



Regards,

Hung Nguyen - DEK Technologies




From: Jianfeng Dong jd...@juniper.net<mailto:jd...@juniper.net>

Sent: Tuesday, October 18, 2016 11:47PM

To: Opensaf-users


opensaf-users@lists.sourceforge.net<mailto:opensaf-users@lists.sourceforge.net>

Cc: Jianfeng Dong

jd...@juniper.net<mailto:jd...@juniper.net>

Subject: [users] Max timeout limit problem in the "headless cluster" feature





Hi,



We are now trying to use OpenSAF's new "headless cluster" feature in our 
product, but we found that "IMMSV_SC_ABSENCE_ALLOWED" (stored in a 16-bit 
unsigned variable) cannot be set higher than 65535 seconds, which is only about 
18 hours. We think this is too short for our product in some special cases, 
where we want the payload cards to keep working as long as possible in 
"headless" mode. We checked the OpenSAF source code and found that this 16-bit 
unsigned variable could be changed to a 32-bit unsigned type. With this change 
we could keep the payloads from auto-rebooting for long enough until the 
controllers come back, and the change should not be very risky.



So, could you please store "IMMSV_SC_ABSENCE_ALLOWED" in a 32-bit unsigned 
variable so that payloads can keep running for more than 18 hours in 
"headless" mode? Thanks!



Any comments are much appreciated!



Regards,

Jianfeng Dong




Re: [users] OpenSAF release 5.0.1 can not promote SC after enable "headless cluster" feature

2016-10-12 Thread Jianfeng Dong
Hi Anders,

You are right, the problem is that the controller's slot_id is higher than the 
payload's in our case.

Today we ran a test: we built a system based on release 5.1.0 and configured it 
in a special setup to make sure the controllers' slot_id is lower than the 
payloads'. The 'headless' feature then worked perfectly: after we rebooted both 
controllers on purpose, none of the payload cards reloaded automatically, and 
they recovered together with both controllers once the controllers had finished 
rebooting. 

We can say the "headless cluster" feature meets our customer requirement 
completely. We are now looking forward to your patch very much! Thanks in 
advance.


Thanks,
Jianfeng

-----Original Message-----
From: Jianfeng Dong 
Sent: Tuesday, October 11, 2016 8:48 PM
To: 'Anders Widell' <anders.wid...@ericsson.com>; Neelakanta Reddy 
<reddy.neelaka...@oracle.com>; opensaf-users@lists.sourceforge.net
Subject: RE: [users] OpenSAF release 5.0.1 can not promote SC after enable 
"headless cluster" feature

OK, I would like to try your patch; please let me know when it is ready. 

Many thanks to you, Anders! And thanks also to Neel for your help!

Have a good day guys!

Regards,
Jianfeng

-----Original Message-----
From: Anders Widell [mailto:anders.wid...@ericsson.com]
Sent: Tuesday, October 11, 2016 5:59 PM
To: Jianfeng Dong <jd...@juniper.net>; Neelakanta Reddy 
<reddy.neelaka...@oracle.com>; opensaf-users@lists.sourceforge.net
Subject: Re: [users] OpenSAF release 5.0.1 can not promote SC after enable 
"headless cluster" feature

I can send you a patch within the next few days and let you try it out.

regards,

Anders Widell


On 10/11/2016 11:36 AM, Jianfeng Dong wrote:
> Do you have a clear plan to remove this requirement?
> If we can't change the node_id due to our architecture, we would like to know 
> when we could get a release without this limitation to upgrade to. After all, 
> our products have been deployed to many customers, so we have to think about 
> upgrade and compatibility issues.
>
> Thanks,
> Jianfeng
>
> -----Original Message-----
> From: Anders Widell [mailto:anders.wid...@ericsson.com]
> Sent: Tuesday, October 11, 2016 4:10 PM
> To: Jianfeng Dong <jd...@juniper.net>; Neelakanta Reddy 
> <reddy.neelaka...@oracle.com>; opensaf-users@lists.sourceforge.net
> Subject: Re: [users] OpenSAF release 5.0.1 can not promote SC after 
> enable "headless cluster" feature
>
> Yes, this is required with the current implementation. It might be possible 
> to remove this requirement - I will think about how it can be done.
>
> regards,
>
> Anders Widell
>
>
> On 10/11/2016 09:06 AM, Jianfeng Dong wrote:
>> Is it obligatory that the controllers have a lower slot_id than the payloads 
>> if we want to enable the "headless" feature?
>> If it is obligatory, it seems to be a big change to our architecture, but I 
>> will at least give it a try.
>>
>> Thanks,
>> Jianfeng
>>
>> -----Original Message-----
>> From: Anders Widell [mailto:anders.wid...@ericsson.com]
>> Sent: Tuesday, October 11, 2016 2:30 PM
>> To: Jianfeng Dong <jd...@juniper.net>; Neelakanta Reddy 
>> <reddy.neelaka...@oracle.com>; opensaf-users@lists.sourceforge.net
>> Subject: Re: [users] OpenSAF release 5.0.1 can not promote SC after 
>> enable "headless cluster" feature
>>
>> There is a one-to-one mapping between /etc/opensaf/slot_id and the node_id. 
>> Simply make sure that all your system controller nodes have lower slot_id 
>> than any of your payloads. This file is read when the node is booted. You 
>> should be able to do an in-service renumbering of your nodes - just be 
>> careful so that you never have two nodes with the same node_id at the same 
>> time.
>>
>> Yes, the assumption is there in 5.1.0 as well.
>>
>> regards,
>>
>> Anders Widell
>>
>>
>> On 10/11/2016 04:29 AM, Jianfeng Dong wrote:
>>> Yes, in our product payload's node_id is lower than SC, could you please 
>>> tell us how to configure it?
>>>
>>> And, does this assumption exist in OpenSAF 5.1.0 as well?
>>>
>>> Thanks,
>>> Jianfeng
>>>
>>> -----Original Message-----
>>> From: Anders Widell [mailto:anders.wid...@ericsson.com]
>>> Sent: Tuesday, October 11, 2016 12:55 AM
>>> To: Jianfeng Dong <jd...@juniper.net>; Neelakanta Reddy 
>>> <reddy.neelaka...@oracle.com>; opensaf-users@lists.sourceforge.net
>>> Subject: Re: [users] OpenSAF release 5.0.1 can not promote SC after 
>>> enable "headless cluster" feature
>>>
>>> There is a (probably not so well documented :-) assumption that the

Re: [users] OpenSAF release 5.0.1 can not promote SC after enable "headless cluster" feature

2016-10-11 Thread Jianfeng Dong
Do you have a clear plan to remove this requirement? 
If we can't change the node_id due to our architecture, we would like to know 
when we could get a release without this limitation to upgrade to. After all, 
our products have been deployed to many customers, so we have to think about 
upgrade and compatibility issues.

Thanks,
Jianfeng

-----Original Message-----
From: Anders Widell [mailto:anders.wid...@ericsson.com] 
Sent: Tuesday, October 11, 2016 4:10 PM
To: Jianfeng Dong <jd...@juniper.net>; Neelakanta Reddy 
<reddy.neelaka...@oracle.com>; opensaf-users@lists.sourceforge.net
Subject: Re: [users] OpenSAF release 5.0.1 can not promote SC after enable 
"headless cluster" feature

Yes, this is required with the current implementation. It might be possible to 
remove this requirement - I will think about how it can be done.

regards,

Anders Widell


On 10/11/2016 09:06 AM, Jianfeng Dong wrote:
> Is it obligatory that the controllers have a lower slot_id than the payloads 
> if we want to enable the "headless" feature?
> If it is obligatory, it seems to be a big change to our architecture, but I 
> will at least give it a try.
>
> Thanks,
> Jianfeng
>
> -----Original Message-----
> From: Anders Widell [mailto:anders.wid...@ericsson.com]
> Sent: Tuesday, October 11, 2016 2:30 PM
> To: Jianfeng Dong <jd...@juniper.net>; Neelakanta Reddy 
> <reddy.neelaka...@oracle.com>; opensaf-users@lists.sourceforge.net
> Subject: Re: [users] OpenSAF release 5.0.1 can not promote SC after 
> enable "headless cluster" feature
>
> There is a one-to-one mapping between /etc/opensaf/slot_id and the node_id. 
> Simply make sure that all your system controller nodes have lower slot_id 
> than any of your payloads. This file is read when the node is booted. You 
> should be able to do an in-service renumbering of your nodes - just be 
> careful so that you never have two nodes with the same node_id at the same 
> time.
>
> Yes, the assumption is there in 5.1.0 as well.
>
> regards,
>
> Anders Widell
>
>
> On 10/11/2016 04:29 AM, Jianfeng Dong wrote:
>> Yes, in our product payload's node_id is lower than SC, could you please 
>> tell us how to configure it?
>>
>> And, does this assumption exist in OpenSAF 5.1.0 as well?
>>
>> Thanks,
>> Jianfeng
>>
>> -----Original Message-----
>> From: Anders Widell [mailto:anders.wid...@ericsson.com]
>> Sent: Tuesday, October 11, 2016 12:55 AM
>> To: Jianfeng Dong <jd...@juniper.net>; Neelakanta Reddy 
>> <reddy.neelaka...@oracle.com>; opensaf-users@lists.sourceforge.net
>> Subject: Re: [users] OpenSAF release 5.0.1 can not promote SC after 
>> enable "headless cluster" feature
>>
>> There is a (probably not so well documented :-) assumption that the system 
>> controllers are configured with a lower node_id than the payloads. From what 
>> I can see in the logs you sent, I think it looks like you have configured 
>> the payload with a lower node_id than the system controllers.
>>
>> By the way, the headless feature has been improved in OpenSAF 5.1.0 so I 
>> would suggest that you upgrade to that version if possible.
>>
>> regards,
>>
>> Anders Widell
>>
>>
>> On 10/10/2016 06:04 PM, Jianfeng Dong wrote:
>>> I tried with sufficient drive space but got same result, neither of the two 
>>> SCs can be promoted to be controller until the payload reboot.
>>>
>>> I also checked the network link between SC and payload, they can PING each 
>>> other when this issue happened. I suspect too the problem is caused by 
>>> IMMD/IMMND link among those nodes, but don't know how to prove it.
>>>
>>> From: Neelakanta Reddy [mailto:reddy.neelaka...@oracle.com]
>>> Sent: Monday, October 10, 2016 8:39 PM
>>> To: Jianfeng Dong <jd...@juniper.net>; 
>>> opensaf-users@lists.sourceforge.net
>>> Subject: Re: [users] OpenSAF release 5.0.1 can not promote SC after 
>>> enable "headless cluster" feature
>>>
>>> Hi,
>>>
>>> After "headless" operation, once any of the controllers has started, the 
>>> IMMND on the payload will send the intro message to IMMD.
>>> It looks like this did not happen; the following is the log from the payload:
>>>
>>> 2016-10-10T11:09:18.507851+08:00 pld0101 osafimmnd[3141]: message 
>>> repeated 2 times: [ logtrace: write failed, No space left on device]
>>> 2016-10-10T11:09:18.507883+08:00 pld0101 osafimmnd[3141]: NO 
>>> Re-introduce-me highestProcessed:23839 highestReceived:23839
>>> 2016-10-10T11:09:18.508011+08:00 pld0101

Re: [users] OpenSAF release 5.0.1 can not promote SC after enable "headless cluster" feature

2016-10-11 Thread Jianfeng Dong
Is it obligatory that the controllers have a lower slot_id than the payloads if 
we want to enable the "headless" feature? 
If it is obligatory, it seems to be a big change to our architecture, but I 
will at least give it a try.

Thanks,
Jianfeng

-----Original Message-----
From: Anders Widell [mailto:anders.wid...@ericsson.com] 
Sent: Tuesday, October 11, 2016 2:30 PM
To: Jianfeng Dong <jd...@juniper.net>; Neelakanta Reddy 
<reddy.neelaka...@oracle.com>; opensaf-users@lists.sourceforge.net
Subject: Re: [users] OpenSAF release 5.0.1 can not promote SC after enable 
"headless cluster" feature

There is a one-to-one mapping between /etc/opensaf/slot_id and the node_id. 
Simply make sure that all your system controller nodes have lower slot_id than 
any of your payloads. This file is read when the node is booted. You should be 
able to do an in-service renumbering of your nodes - just be careful so that 
you never have two nodes with the same node_id at the same time.
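
The two rules above (every system controller has a lower slot_id than every payload, and no two nodes share an id) can be checked before renumbering. The following is only a sketch under stated assumptions: the node names and slot_id values are made up, and check_slot_ids is a hypothetical helper, not an OpenSAF tool.

```python
from typing import Dict, List


def check_slot_ids(controllers: Dict[str, int], payloads: Dict[str, int]) -> List[str]:
    """Return a list of problems; an empty list means both rules hold."""
    problems = []
    all_ids = list(controllers.values()) + list(payloads.values())

    # Rule 1: no two nodes may share a slot_id (and hence a node_id).
    for dup in sorted({i for i in all_ids if all_ids.count(i) > 1}):
        problems.append(f"slot_id {dup} is used by more than one node")

    # Rule 2: every controller slot_id must be lower than every payload slot_id.
    if controllers and payloads:
        highest_sc = max(controllers.values())
        lowest_payload = min(payloads.values())
        if highest_sc >= lowest_payload:
            problems.append(
                f"controller slot_id {highest_sc} is not lower than "
                f"payload slot_id {lowest_payload}"
            )
    return problems


# Hypothetical layouts, for illustration only.
good = check_slot_ids({"SC-1": 1, "SC-2": 2}, {"PL-3": 3, "PL-4": 4})
bad = check_slot_ids({"SC-1": 10, "SC-2": 11}, {"PL-3": 3, "PL-4": 4})
print(good)  # an empty list: nothing to fix
print(bad)   # one ordering problem reported
```

During an in-service renumbering the same check could be run against each intermediate layout, so that no step introduces a duplicate id.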

Yes, the assumption is there in 5.1.0 as well.

regards,

Anders Widell


On 10/11/2016 04:29 AM, Jianfeng Dong wrote:
> Yes, in our product payload's node_id is lower than SC, could you please tell 
> us how to configure it?
>
> And, does this assumption exist in OpenSAF 5.1.0 as well?
>
> Thanks,
> Jianfeng
>
> -----Original Message-----
> From: Anders Widell [mailto:anders.wid...@ericsson.com]
> Sent: Tuesday, October 11, 2016 12:55 AM
> To: Jianfeng Dong <jd...@juniper.net>; Neelakanta Reddy 
> <reddy.neelaka...@oracle.com>; opensaf-users@lists.sourceforge.net
> Subject: Re: [users] OpenSAF release 5.0.1 can not promote SC after 
> enable "headless cluster" feature
>
> There is a (probably not so well documented :-) assumption that the system 
> controllers are configured with a lower node_id than the payloads. From what 
> I can see in the logs you sent, I think it looks like you have configured the 
> payload with a lower node_id than the system controllers.
>
> By the way, the headless feature has been improved in OpenSAF 5.1.0 so I 
> would suggest that you upgrade to that version if possible.
>
> regards,
>
> Anders Widell
>
>
> On 10/10/2016 06:04 PM, Jianfeng Dong wrote:
>> I tried with sufficient drive space but got same result, neither of the two 
>> SCs can be promoted to be controller until the payload reboot.
>>
>> I also checked the network link between SC and payload, they can PING each 
>> other when this issue happened. I suspect too the problem is caused by 
>> IMMD/IMMND link among those nodes, but don't know how to prove it.
>>
>> From: Neelakanta Reddy [mailto:reddy.neelaka...@oracle.com]
>> Sent: Monday, October 10, 2016 8:39 PM
>> To: Jianfeng Dong <jd...@juniper.net>; 
>> opensaf-users@lists.sourceforge.net
>> Subject: Re: [users] OpenSAF release 5.0.1 can not promote SC after 
>> enable "headless cluster" feature
>>
>> Hi,
>>
>> After "headless" operation, once any of the controllers has started, the 
>> IMMND on the payload will send the intro message to IMMD.
>> It looks like this did not happen; the following is the log from the payload:
>>
>> 2016-10-10T11:09:18.507851+08:00 pld0101 osafimmnd[3141]: message 
>> repeated 2 times: [ logtrace: write failed, No space left on device]
>> 2016-10-10T11:09:18.507883+08:00 pld0101 osafimmnd[3141]: NO 
>> Re-introduce-me highestProcessed:23839 highestReceived:23839
>> 2016-10-10T11:09:18.508011+08:00 pld0101 osafimmnd[3141]: logtrace:
>> write failed, No space left on device
>> 2016-10-10T11:09:18.508129+08:00 pld0101 osafimmnd[3141]: logtrace:
>> write failed, No space left on device
>> 2016-10-10T11:09:18.508501+08:00 pld0101 osafimmnd[3141]: WA MDS Send 
>> Failed to service:IMMD rc:2
>>
>>
>> Retry, again with the sufficient space in payload.
>>
>> /Neel.
>>
>> On 2016/10/10 03:59 PM, Jianfeng Dong wrote:
>>
>> Hi,
>>
>>
>>
>> For several years we have used OpenSAF (4.5.2 now) to provide HA service in 
>> our product (which includes 2 SCs and several payload cards), but our 
>> customers keep asking that the payload cards NOT be rebooted even if both SCs 
>> reload or hang.
>>
>>
>>
>> We just learned that the new release 5.0.0 provides this feature (i.e. 
>> "headless cluster"), so we installed 5.0.0 in our product and enabled the 
>> "headless" feature by setting "IMMSV_SC_ABSENCE_ALLOWED" to 900 seconds. 
>> After installation we found that it worked fine: our system with the new 
>> OpenSAF release starts successfully, all SCs and payload cards come "UP", 
>

[users] OpenSAF release 5.0.1 can not promote SC after enable "headless cluster" feature

2016-10-10 Thread Jianfeng Dong
Hi,

For several years we have used OpenSAF (4.5.2 now) to provide HA service in our 
product (which includes 2 SCs and several payload cards), but our customers 
keep asking that the payload cards NOT be rebooted even if both SCs reload or 
hang.

We just learned that the new release 5.0.0 provides this feature (i.e. 
"headless cluster"), so we installed 5.0.0 in our product and enabled the 
"headless" feature by setting "IMMSV_SC_ABSENCE_ALLOWED" to 900 seconds. After 
installation we found that it worked fine: our system with the new OpenSAF 
release starts successfully, all SCs and payload cards come "UP", and the 
payload cards do NOT reboot immediately after we reload both SCs.

However, we have a problem: neither of the two SCs can be promoted to 
controller after reboot until the "headless" payload reboots due to the 
'IMMSV_SC_ABSENCE_ALLOWED' timeout after 900 seconds. It seems the OpenSAF 
modules on both SCs just wait there and do nothing until the payload reboots 
due to the timeout; then OpenSAF on the SCs continues to run and the whole 
system finally recovers.

We thought ticket #1828 might have resolved this issue, so we tried again with 
release 5.0.1 but got the same result.

Could you please tell us why, in our case, OpenSAF on both SCs could not run 
until the payload card (in "headless" state) rebooted due to the timeout?
Besides 'IMMSV_SC_ABSENCE_ALLOWED', is there any other variable or parameter 
that needs to be set or modified to enable the 'headless cluster' feature? Are 
we missing anything?
Attached are the syslogs of the SC and the payload card from when this problem 
happened; we hope the log files help to find the root cause.

Any comments are much appreciated, thanks!

Regards,
Jianfeng Dong



SC.log
Description: SC.log


payload.log
Description: payload.log