Re: [users] Payload card reboot due to a short time network break

2018-04-10 Thread Jianfeng Dong
Anders, you are right that we need to consider the other nodes in the whole system; we already have to keep IMMA_SYNCR_TIMEOUT larger than the TIPC tolerance, a change we made to fix another issue we had in the past.
Regarding multi-hop communication: fortunately, in our system every PLD connects directly to every SC, so we probably don't need to worry about it.
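A minimal sketch of the two settings side by side, using the values discussed in this thread. The placement of IMMA_SYNCR_TIMEOUT (typically exported in the environment of the OpenSAF processes, e.g. via the configuration files under /etc/opensaf/) and its unit are version-dependent (some OpenSAF versions read it in units of 10 ms rather than seconds), so verify both against your version's IMM documentation:

# TIPC link tolerance in milliseconds, with the modern tipc(8) tool:
tipc link set tolerance 15000 link <link-name>

# IMM agent synchronous-call timeout; keep it larger than the tolerance.
# If your OpenSAF version reads this in 10 ms units, 20 s would be 2000:
export IMMA_SYNCR_TIMEOUT=2000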

I will run some tests on the change in our system, and I will also re-read the description of the parameter in OpenSAF's docs in case I missed something there.

Much appreciated!

Regards,
Jianfeng

Re: [users] Payload card reboot due to a short time network break

2018-04-09 Thread Anders Widell
The only way to be sure if it is appropriate is to test under realistic 
conditions. I agree that it makes sense to increase it so that it is 
larger than the TIPC link tolerance. It should be noted that the IMM 
agent always communicates directly with the IMM node director running on 
the same node, and for this communication I don't think the TIPC link 
tolerance is relevant (you will immediately detect if the IMM node 
director process goes away). However, the IMM node director may in turn 
have to communicate with IMM processes running on other nodes in the 
cluster in order to fulfill your request, and for that communication the 
TIPC link tolerance comes into play. If it needs to communicate in 
several hops it may even make sense to have a time-out which is several 
times the TIPC link tolerance (compare with the default values for these 
time-outs: link tolerance=1.5 seconds and IMMA time-out=10 seconds).


regards,

Anders Widell


Re: [users] Payload card reboot due to a short time network break

2018-04-09 Thread Jianfeng Dong
Hi Anders,

Now we want to increase the TIPC link tolerance from the current 10 seconds to 12 or 15 seconds, so we also need to increase the OpenSAF parameter ‘IMMA_SYNCR_TIMEOUT’ from its current 12 seconds to a larger value (maybe 20). Do you think 20 seconds is appropriate for this parameter?
Thanks.

Regards,
Jianfeng


Re: [users] Payload card reboot due to a short time network break

2018-03-13 Thread Jianfeng Dong
Anders,

As you can see in those logs, we had already set the TIPC link tolerance to 10 seconds; I'm just not sure what value is appropriate, especially for this case.
At the very least, I can try switching TIPC to run on the raw Ethernet interfaces instead.
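A minimal sketch of that switch with the modern tipc(8) tool, assuming the bond's slave NICs are eth0 and eth1 (the interface names here are illustrative):

tipc bearer disable media eth device bond0   # stop using the bonded bearer
tipc bearer enable media eth device eth0     # let TIPC manage each link itself
tipc bearer enable media eth device eth1

With two bearers, TIPC maintains parallel links to each peer and fails over between them on its own.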
Thanks for your comments on the CLM design idea; I understand such a change would definitely not be easy to make.

Thanks,
Jianfeng


Re: [users] Payload card reboot due to a short time network break

2018-03-12 Thread Anders Widell
We also tried running TIPC on a bonded interface but ended up having to 
change it since it never worked well. When you have two redundant 
Ethernet interfaces, TIPC will tolerate failures in one of them 
seamlessly without losing connectivity. But when you run TIPC on a 
bonded interface it doesn't work, as you can see in your case. I guess 
the reason is that you have two separate mechanisms on top of each 
other, trying to achieve the same thing. One possible workaround is to 
increase the TIPC link tolerance.
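A sketch of that workaround with the modern tipc(8) tool; the link name below is a placeholder (tipc link list shows the real names), and older systems would use tipc-config instead:

tipc link set tolerance 10000 link <peer-link-name>   # existing link, value in ms
tipc media set tolerance 10000 media eth              # default for new eth links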


When we lose connectivity with a node in the cluster, we are expecting 
that it happened because the other node went down (rebooted or 
permanently died). We don't expect to re-establish connectivity with the 
same node unless it has rebooted in between. It would be possible to 
introduce a grace time to allow a node to stay in the CLM cluster for a 
while after the connectivity with it has been lost, and allow it to 
continue as a cluster member if connectivity is re-established before 
this grace time has expired. However, this is not so easy and it is much 
easier to increase the TIPC link tolerance and let TIPC handle this for us.


regards,

Anders Widell



Re: [users] Payload card reboot due to a short time network break

2018-03-09 Thread Mathi N P
This is an interesting case (and 'rare' :-))

2018-02-16T17:56:41.172791+00:00 scm2 osafamfd[3312]: WA Sending node
reboot order to node:safAmfNode=PLD0114,safAmfCluster=myAmfCluster, due to
late node_up_msg after cluster startup timeout
From 2018-02-16T17:56:11 to 2018-02-16T17:56:41 the PLD saw nothing except an error; then it got the reboot command from the SC and thus rebooted itself.

Given that the node has not completely 'instantiated', and that a reboot order
can be treated as a 'failed start up' based on the current AMF state,
AMF could make a decision by reading the
'saAmfNodeFailfastOnInstantiationFailure' (or perhaps 'saAmfNodeAutoRepair')
attribute on whether to reboot or not, and report a node instantiation failure
(back to the rc script and the other events associated with that state).
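If AMF did go down that route, those attributes live on the AMF node object and could be inspected and adjusted with the standard IMM tools. A hypothetical sketch, using the node name from the logs above:

immlist safAmfNode=PLD0114,safAmfCluster=myAmfCluster
immcfg -a saAmfNodeAutoRepair=0 safAmfNode=PLD0114,safAmfCluster=myAmfCluster

Whether AMF actually consults these attributes on this particular code path is exactly the open question raised here.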

Thanks,
Mathi.



Re: [users] Payload card reboot due to a short time network break

2018-03-09 Thread Jianfeng Dong
Thanks Anders, much appreciated.

And yes, on the PLDs we run TIPC on a bonded interface which comprises two Ethernet 
interfaces.
I'm wondering why a bonded interface can't provide protection similar to what TIPC does. Is it because TIPC is more robust, or something else? I'm not sure it is right to change the low-level design of our product at this point; I will discuss this change with my colleagues and look for more details in the TIPC 
manual.
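One way to see how the bond itself behaved during such a blip, assuming the standard Linux bonding driver that the kernel logs suggest:

cat /proc/net/bonding/bond0   # shows mode, miimon/updelay/downdelay, per-slave link state

The miimon/updelay/downdelay parameters control how quickly the bond reacts to link flaps, which is the layer that overlaps with TIPC's own failover.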

Regarding the OpenSAF part: do you think it would be possible for the SC not to force a reboot of the PLD in this case? After all, the connection recovered quickly.

Regards,
Jianfeng


Re: [users] Payload card reboot due to a short time network break

2018-03-08 Thread Anders Widell

Hi!

Are you running TIPC on a bonded interface? I wouldn't recommend this. 
Instead, you should run TIPC on the raw Ethernet interfaces and let TIPC 
handle the link fail-over in case of a failure in one of them. TIPC 
should be able to do this without ever losing the connectivity between 
the nodes.
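Once the bearers are on the raw interfaces, the resulting redundant links can be checked with the tipc(8) tool (a small sketch; bearer names follow the media:interface convention):

tipc bearer list   # expect eth:eth0 and eth:eth1
tipc link list     # one link per bearer and peer, on network planes A and B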


regards,

Anders Widell


On 03/08/2018 10:43 AM, Jianfeng Dong wrote:

Hi,

Several days ago we hit a payload card reboot issue in a customer deployment: a PLD 
lost its connection with the SC for a short while (about 10 seconds), and the SC then 
forced the PLD to reboot even though the PLD was going into “SC Absent Mode”.

System summary:
Our product is a system with 2 SC boards and at most 14 PLD cards, running 
OpenSAF 5.1.0 with the “SC Absent Mode” feature enabled; the SCs connect with 
the PLDs via Ethernet and TIPC.

Course of the issue:
1. The PLD’s internal network went down due to a hardware/driver problem, but 
recovered quickly, within 2 seconds.

2018-02-16T17:55:58.343287+00:00 pld0114 kernel: bonding: bond0: link status 
definitely down for interface eth0, disabling it
2018-02-16T17:56:00.743201+00:00 pld0114 kernel: bonding: bond0: link status up 
for interface eth0, enabling it in 6 ms.

2. 10 seconds later the TIPC link still broke, even though the network had recovered.

2018-02-16T17:56:10.050386+00:00 pld0114 kernel: tipc: Resetting link 
<1.1.14:bond0-1.1.16:eth2>, peer not responding
2018-02-16T17:56:10.050428+00:00 pld0114 kernel: tipc: Lost link 
<1.1.14:bond0-1.1.16:eth2> on network plane A
2018-02-16T17:56:10.050440+00:00 pld0114 kernel: tipc: Lost contact with 
<1.1.16>

3. The SC observed that the PLD had left the cluster.

2018-02-16T17:56:10.050704+00:00 scm2 osafimmd[3095]: NO MDS event from svc_id 
25 (change:4, dest:296935520731140)
2018-02-16T17:56:10.052770+00:00 scm2 osafclmd[3302]: NO Node 69135 went down. 
Not sending track callback for agents on that node
2018-02-16T17:56:10.054411+00:00 scm2 osafimmnd[3106]: NO Global discard node 
received for nodeId:10e0f pid:3516
2018-02-16T17:56:10.054505+00:00 scm2 osafimmnd[3106]: NO Implementer disconnected 15 
<0, 10e0f(down)> (MsgQueueService69135)
2018-02-16T17:56:10.055158+00:00 scm2 osafamfd[3312]: NO Node 'PLD0114' left 
the cluster

4. One second later, the TIPC link also recovered.

2018-02-16T17:56:11.054553+00:00 pld0114 kernel: tipc: Established link 
<1.1.14:bond0-1.1.16:eth2> on network plane A

5. However, the PLD was still impacted by the network issue and was trying to go 
into ‘SC Absent Mode’.

2018-02-16T17:56:11.057260+00:00 pld0114 osafamfnd[3626]: NO AVD NEW_ACTIVE, 
adest:1
2018-02-16T17:56:11.057407+00:00 pld0114 osafamfnd[3626]: NO Sending node up 
due to NCSMDS_NEW_ACTIVE
2018-02-16T17:56:11.057684+00:00 pld0114 osafamfnd[3626]: NO 19 SISU states sent
2018-02-16T17:56:11.057715+00:00 pld0114 osafamfnd[3626]: NO 22 SU states sent
2018-02-16T17:56:11.057775+00:00 pld0114 osafimmnd[3516]: NO Sleep done 
registering IMMND with MDS
2018-02-16T17:56:11.058243+00:00 pld0114 osafmsgnd[3665]: ER saClmDispatch 
Failed with error 9
2018-02-16T17:56:11.058283+00:00 pld0114 osafckptnd[3697]: NO Bad CLM handle. 
Reinitializing.
2018-02-16T17:56:11.059054+00:00 pld0114 osafimmnd[3516]: NO SUCCESS IN 
REGISTERING IMMND WITH MDS
2018-02-16T17:56:11.059116+00:00 pld0114 osafimmnd[3516]: NO Re-introduce-me 
highestProcessed:26209 highestReceived:26209
2018-02-16T17:56:11.059699+00:00 pld0114 osafimmnd[3516]: NO IMMD service is UP 
... ScAbsenseAllowed?:31536 introduced?:2
2018-02-16T17:56:11.059932+00:00 pld0114 osafimmnd[3516]: NO MDS: 
mds_register_callback: dest 10e0fb03c0010 already exist
2018-02-16T17:56:11.060297+00:00 pld0114 osafimmnd[3516]: NO Re-introduce-me 
highestProcessed:26209 highestReceived:26209
2018-02-16T17:56:11.062053+00:00 pld0114 osafamfnd[3626]: NO 25 CSICOMP states 
synced
2018-02-16T17:56:11.062102+00:00 pld0114 osafamfnd[3626]: NO 28 SU states sent
2018-02-16T17:56:11.064418+00:00 pld0114 osafimmnd[3516]: ER MESSAGE:26438 OUT 
OF ORDER my highest processed:26209 - exiting
2018-02-16T17:56:11.160121+00:00 pld0114 osafckptnd[3697]: NO CLM selection 
object was updated. (12)
2018-02-16T17:56:11.166764+00:00 pld0114 osafamfnd[3626]: NO saClmDispatch 
BAD_HANDLE
2018-02-16T17:56:11.167030+00:00 pld0114 osafamfnd[3626]: NO 
'safSu=PLD0114,safSg=NoRed,safApp=OpenSAF' component restart probation timer 
started (timeout: 600 ns)
2018-02-16T17:56:11.167102+00:00 pld0114 osafamfnd[3626]: NO Restarting a 
component of 'safSu=PLD0114,safSg=NoRed,safApp=OpenSAF' (comp restart count: 1)
2018-02-16T17:56:11.167135+00:00 pld0114 osafamfnd[3626]: NO 
'safComp=IMMND,safSu=PLD0114,safSg=NoRed,safApp=OpenSAF' faulted due to 
'avaDown' : Recovery is 'componentRestart'

6. The SC received messages from the PLD, and then forced the PLD to reboot (due to 
the node sync timeout?).

2018-02-16T17:56:11.058121+00:00 scm2 osafimmd[3095]: NO MDS event from svc_id 
25 (change:3, dest:296935520731140)
2018-02-16T17:56:11.058515+00:00 scm2 osafsmfd[3391]: ER