Re: [controller-dev] Circuit Breaker timed out

2017-11-14 Thread Muthukumaran K
Correct, Ajay. We experimented with increasing the timeout and have not seen 
the issue recur.

We considered the following options:

1)  Recreate the affected shard

a.   We tried this using a “simulated shard-kill”: upon observing the 
shard-kill, Shard Manager recreates the shard, and it did work. But we are 
staying away from this solution, mainly because of DTCN side-effects. We do 
not know how many applications would tolerate receiving DTCNs all of a sudden 
and unexpectedly (similar to a restart of the node). A silent shard restart 
implies that applications have to be thoroughly idempotent in handling DTCNs 
across shards, so that recreating a single shard does not affect whatever 
state they build up internally via DTCNs.

2)  Restart the entire controller

a.   A more intrusive change would be to perform a bundle-0 stop upon 
onPersistFailure and let the restart logic take care of restarting the node, 
depending upon the environment (systemd, pacemaker, etc.), because the system 
would be useless anyway if one shard stops completely. We are trying to get 
this working correctly but have been unsuccessful so far.

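
As an illustration of the idempotency concern in option 1, here is a minimal 
sketch (in Python, not the controller's Java) of a listener that tolerates a 
full replay of DTCNs (data tree change notifications) after a silent shard 
recreate. The class IdempotentListener and the notification shape (a path 
plus the data now present at that path) are hypothetical and are not ODL 
APIs. The only point is that state is derived from what each notification 
says is present, so a replayed burst changes nothing.

```python
class IdempotentListener:
    """Hypothetical listener that builds a path -> data view from notifications."""

    def __init__(self):
        self.state = {}
        self.side_effects = 0  # counts real changes, e.g. flows actually programmed

    def on_change(self, path, data):
        """Apply one notification; data=None means the node was deleted."""
        if data is None:
            if path in self.state:
                del self.state[path]
                self.side_effects += 1
            return
        if self.state.get(path) == data:
            return  # replayed notification: nothing to do
        self.state[path] = data
        self.side_effects += 1


listener = IdempotentListener()
initial = [("/flows/1", "drop"), ("/flows/2", "forward")]
for path, data in initial:
    listener.on_change(path, data)
effects_before = listener.side_effects

# Simulate the burst of notifications seen after a shard recreate.
for path, data in initial:
    listener.on_change(path, data)
```

An application written this way sees the same internal state and triggers no 
extra work after the replay; one that blindly counts or re-applies every 
notification would not.
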
Regards
Muthu


From: Ajay L [mailto:ajayl@gmail.com]
Sent: Wednesday, November 15, 2017 1:46 AM
To: controller-dev@lists.opendaylight.org
Cc: Muthukumaran K; Srini Seetharaman; Robert Varga; ajaysl...@gmail.com; Sai 
MarapaReddy
Subject: Re: [controller-dev] Circuit Breaker timed out

Hi All,

We are also seeing the "circuit breaker" error under heavy load. When this 
happens, the affected shard is stopped and never restarted and I think the only 
way to recover is to restart the node. I have opened 
https://jira.opendaylight.org/browse/CONTROLLER-1789 to request better recovery 
behavior. Increasing the akka journal persistence circuit-breaker call-timeout 
value (default is 10s) does help in making it more tolerant to outage

Regards
Ajay

On Wed, Aug 16, 2017 at 2:23 AM, Robert Varga <n...@hq.sk> wrote:
On 16/08/17 08:37, Muthukumaran K wrote:
> We have not tried on master branch (Nitrogen  / Akka 2.5). Not sure if
> such an issue would go away with Akka 2.5 because the circuit breaker is
> primarily with LevelDB plugin.
>

Nitrogen is on akka-2.4.18. akka-2.5.x (and others) are staged for Oxygen.

Bye,
Robert


___
controller-dev mailing list
controller-dev@lists.opendaylight.org
https://lists.opendaylight.org/mailman/listinfo/controller-dev



Re: [controller-dev] Circuit Breaker timed out

2017-11-14 Thread Ajay L
Hi All,

We are also seeing the "circuit breaker" error under heavy load. When this
happens, the affected shard is stopped and never restarted, and I think the
only way to recover is to restart the node. I have opened
https://jira.opendaylight.org/browse/CONTROLLER-1789 to request better
recovery behavior. Increasing the akka journal persistence circuit-breaker
call-timeout value (default is 10s) does help make it more tolerant of such
outages.

Regards
Ajay

On Wed, Aug 16, 2017 at 2:23 AM, Robert Varga wrote:

> On 16/08/17 08:37, Muthukumaran K wrote:
> > We have not tried on master branch (Nitrogen  / Akka 2.5). Not sure if
> > such an issue would go away with Akka 2.5 because the circuit breaker is
> > primarily with LevelDB plugin.
> >
>
> Nitrogen is on akka-2.4.18. akka-2.5.x (and others) are staged for Oxygen.
>
> Bye,
> Robert
>
>


Re: [controller-dev] Circuit Breaker timed out

2017-08-16 Thread Robert Varga
On 16/08/17 08:37, Muthukumaran K wrote:
> We have not tried on master branch (Nitrogen  / Akka 2.5). Not sure if
> such an issue would go away with Akka 2.5 because the circuit breaker is
> primarily with LevelDB plugin.
> 

Nitrogen is on akka-2.4.18. akka-2.5.x (and others) are staged for Oxygen.

Bye,
Robert





Re: [controller-dev] Circuit Breaker timed out

2017-08-16 Thread Muthukumaran K
Sorry for the delay in responding, Srini.

We have not tried this on the master branch (Nitrogen / Akka 2.5). I am not 
sure such an issue would go away with Akka 2.5, because the circuit breaker 
is primarily with the LevelDB plugin.

For about 20 days we have not been able to reproduce this issue consistently; 
we have seen it only once, on one of the cluster nodes. We are using plain 
Ubuntu VMs to bring up the cluster. ‘dmesg’ also did not indicate any disk 
issues.

Some “theoretical” candidates that we have started to suspect are:

a)  Compaction of LevelDB colliding with incoming writes, i.e. heavy 
compaction delaying the incoming writes in LevelDB

b)  A difference between the VM’s disk performance and the host’s: we may 
have to test this by running heavy ‘dd’ on the disks to see whether disk 
writes are slow at a different level altogether

We have yet to confirm either. Consulting the Akka Google groups did not help 
either, because the usual recommendation is that LevelDB is not meant for 
production. But even if we used Cassandra (for example) for journal 
persistence, a timeout is logically still possible; perhaps the Cassandra 
plugin handles such cases in a better manner.
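
As a complement to the ‘dd’ test for candidate (b), here is a rough Python 
sketch that times small synchronous writes in a given directory. It does not 
reproduce how the LevelDB plugin writes; it only gives a feel for how slow the 
disk can get. The function name fsync_write_latencies, the 4 KiB block size 
and the sample count are arbitrary choices made for this illustration.

```python
import os
import tempfile
import time


def fsync_write_latencies(directory, samples=20, block_size=4096):
    """Return per-write latencies (seconds) for fsync'd writes in `directory`."""
    latencies = []
    payload = b"\0" * block_size
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(samples):
            start = time.monotonic()
            os.write(fd, payload)
            os.fsync(fd)  # force the data to disk before stopping the clock
            latencies.append(time.monotonic() - start)
    finally:
        os.close(fd)
        os.remove(path)
    return latencies


if __name__ == "__main__":
    results = fsync_write_latencies(tempfile.gettempdir())
    print("max %.4fs, mean %.4fs" % (max(results), sum(results) / len(results)))
```

To be meaningful it should be pointed at the directory that actually holds the 
journal, and run both inside the VM and on the host while the system is under 
load; worst-case values approaching the 10s call-timeout would support the 
disk theory.
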

Regards
Muthu



From: srini...@gmail.com [mailto:srini...@gmail.com] On Behalf Of Srini 
Seetharaman
Sent: Saturday, August 12, 2017 1:55 AM
To: Muthukumaran K
Cc: Tom Pantelis; controller-dev@lists.opendaylight.org
Subject: Re: [controller-dev] Circuit Breaker timed out

Or was there a real disk issue in that machine you were using?

On Fri, Aug 11, 2017 at 10:58 AM, Srini Seetharaman 
<srini.seethara...@gmail.com> wrote:
Muthu,
It's worrisome to hear that you've seen this too. Did it go away with Nitrogen 
or with moving to Akka 2.5 persistence?

I am referring to the following params within the persistence section of 
akka.conf


  circuit-breaker {
    max-failures = 10
    call-timeout = 10s
    reset-timeout = 30s
  }


On Thu, Aug 10, 2017 at 10:17 PM, Muthukumaran K 
<muthukumara...@ericsson.com> wrote:
Hi Tom, Srini,

We have also noticed this with Boron very sporadically even without any 
explicit action taken on shard like Srini did

Srini,

Are you referring “journal-plugin-fallback” from 
http://doc.akka.io/docs/akka/current/scala/general/configuration.html#config-akka-persistence
 ?

Regards
Muthu

From: controller-dev-boun...@lists.opendaylight.org 
[mailto:controller-dev-boun...@lists.opendaylight.org] On Behalf Of Srini 
Seetharaman
Sent: Friday, August 11, 2017 9:40 AM
To: Tom Pantelis
Cc: controller-dev@lists.opendaylight.org
Subject: Re: [controller-dev] Circuit Breaker timed out

Thanks Tom. I will investigate further on why the local disk operation failed. 
Seems strange though because I haven't seen anything in dmesg.

The default value for the call-timeout is 10s in akka.conf.

On Thu, Aug 10, 2017 at 3:20 PM, Tom Pantelis <tompante...@gmail.com> wrote:
That error is from  akka persistence. It happens if the backend persistence 
plugin doesn't respond back in time. I've only seen this in a CSIT environment 
whose disk activity was overloaded. The timeouts can be tweaked - I don't 
recall exactly what they are but you can find them in the akka docs (names 
contain circuit-breaker).

On Thu, Aug 10, 2017 at 6:01 PM, Srini Seetharaman 
<srini.seethara...@gmail.com> wrote:
Hi Tom,
In our ODL deployment that is running in standalone mode with operational store 
persistence enabled, we saw the following error being printed. Once the 
member-1-default-operational shard is shutdown, all write transactions after 
that fail and the system becomes unstable. At this point, we were probably 
doing less than 10 transactions per second. Any idea what is causing this? Has 
anyone seen this before?


2017-08-07 19:15:59,622 | ERROR | lt-dispatcher-23 | Shard | 176 - com.typesafe.akka.slf4j - 2.4.7 | Failed to persist event type [org.opendaylight.controller.cluster.raft.ReplicatedLogImplEntry] with sequence number [9897493] for persistenceId [member-1-shard-default-operational].
akka.pattern.CircuitBreaker$$anon$1: Circuit Breaker Timed out.
2017-08-07 19:15:59,628 | INFO  | lt-dispatcher-24 | Shard | 188 - org.opendaylight.controller.sal-akka-raft - 1.4.2.Boron-SR2 | Stopping Shard member-1-shard-default-operational
2017-08-07 19:15:59,629 | ERROR | lt-dispatcher-23 | LocalThreePhaseCommitCohort | 193 - org.opendaylight.controller.sal-distributed-datastore - 1.4.2.Boron-SR2 | Failed to prepare transaction member-1-datastore-

Re: [controller-dev] Circuit Breaker timed out

2017-08-11 Thread Srini Seetharaman
Or was there a real disk issue in that machine you were using?

On Fri, Aug 11, 2017 at 10:58 AM, Srini Seetharaman <
srini.seethara...@gmail.com> wrote:

> Muthu,
> It's worrisome to hear that you've seen this too. Did it go away with
> Nitrogen or with moving to Akka 2.5 persistence?
>
> I am referring to the following params within the persistence section of
> akka.conf
>
>   circuit-breaker {
>     max-failures = 10
>     call-timeout = 10s
>     reset-timeout = 30s
>   }
>
>
>
> On Thu, Aug 10, 2017 at 10:17 PM, Muthukumaran K <
> muthukumara...@ericsson.com> wrote:
>
>> Hi Tom, Srini,
>>
>>
>>
>> We have also noticed this with Boron very sporadically even without any
>> explicit action taken on shard like Srini did
>>
>>
>>
>> Srini,
>>
>>
>>
>> Are you referring “journal-plugin-fallback” from
>> http://doc.akka.io/docs/akka/current/scala/general/configura
>> tion.html#config-akka-persistence ?
>>
>>
>>
>> Regards
>>
>> Muthu
>>
>>
>>
>> *From:* controller-dev-boun...@lists.opendaylight.org [mailto:
>> controller-dev-boun...@lists.opendaylight.org] *On Behalf Of *Srini
>> Seetharaman
>> *Sent:* Friday, August 11, 2017 9:40 AM
>> *To:* Tom Pantelis
>> *Cc:* controller-dev@lists.opendaylight.org
>> *Subject:* Re: [controller-dev] Circuit Breaker timed out
>>
>>
>>
>> Thanks Tom. I will investigate further on why the local disk operation
>> failed. Seems strange though because I haven't seen anything in dmesg.
>>
>>
>>
>> The default value for the call-timeout is 10s in akka.conf.
>>
>>
>>
>> On Thu, Aug 10, 2017 at 3:20 PM, Tom Pantelis <tompante...@gmail.com>
>> wrote:
>>
>> That error is from  akka persistence. It happens if the backend
>> persistence plugin doesn't respond back in time. I've only seen this in a
>> CSIT environment whose disk activity was overloaded. The timeouts can be
>> tweaked - I don't recall exactly what they are but you can find them in the
>> akka docs (names contain circuit-breaker).
>>
>>
>>
>> On Thu, Aug 10, 2017 at 6:01 PM, Srini Seetharaman <
>> srini.seethara...@gmail.com> wrote:
>>
>> Hi Tom,
>>
>> In our ODL deployment that is running in standalone mode with operational
>> store persistence enabled, we saw the following error being printed. Once
>> the member-1-default-operational shard is shutdown, all write transactions
>> after that fail and the system becomes unstable. At this point, we were
>> probably doing less than 10 transactions per second. Any idea what is
>> causing this? Has anyone seen this before?
>>
>>
>>
>>
>>
>> 2017-08-07 19:15:59,622 | ERROR | lt-dispatcher-23 | Shard | 176 - com.typesafe.akka.slf4j - 2.4.7 | Failed to persist event type [org.opendaylight.controller.cluster.raft.ReplicatedLogImplEntry] with sequence number [9897493] for persistenceId [member-1-shard-default-operational].
>> akka.pattern.CircuitBreaker$$anon$1: Circuit Breaker Timed out.
>> 2017-08-07 19:15:59,628 | INFO  | lt-dispatcher-24 | Shard | 188 - org.opendaylight.controller.sal-akka-raft - 1.4.2.Boron-SR2 | Stopping Shard member-1-shard-default-operational
>> 2017-08-07 19:15:59,629 | ERROR | lt-dispatcher-23 | LocalThreePhaseCommitCohort | 193 - org.opendaylight.controller.sal-distributed-datastore - 1.4.2.Boron-SR2 | Failed to prepare transaction member-1-datastore-operational-fe-5-txn-791019 on backend
>> java.lang.RuntimeException: Transaction aborted due to shutdown.
>> at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.abortPendingTransactions(ShardCommitCoordinator.java:399)[193:org.opendaylight.controller.sal-distributed-datastore:1.4.2.Boron-SR2]
>> at org.opendaylight.controller.cluster.datastore.Shard.postStop(Shard.java:211)[193:org.opendaylight.controller.sal-distributed-datastore:1.4.2.Boron-SR2]
>> at akka.actor.Actor$class.aroundPostStop(Actor.scala:494)[175:com.typesafe.akka.actor:2.4.7]
>> at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundPostStop(PersistentActor.scala:168)[181:com.typesafe.akka.persistence:2.4.7]
>> at akka.persistence.Eventsourced$class.aroundPostStop(Eventsourced.scal

Re: [controller-dev] Circuit Breaker timed out

2017-08-11 Thread Srini Seetharaman
Muthu,
It's worrisome to hear that you've seen this too. Did it go away with
Nitrogen or with moving to Akka 2.5 persistence?

I am referring to the following params within the persistence section of
akka.conf

  circuit-breaker {
    max-failures = 10
    call-timeout = 10s
    reset-timeout = 30s
  }
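
For readers unfamiliar with these three settings, here is a toy model in 
Python of how a breaker of this kind behaves. It is a sketch of the general 
circuit-breaker pattern, not Akka's implementation; the class ToyBreaker and 
its simulated clock are made up for illustration. A call slower than 
call-timeout counts as a failure; after max-failures consecutive failures the 
breaker opens and rejects calls immediately; once reset-timeout has elapsed it 
lets one trial call through.

```python
class ToyBreaker:
    """Toy circuit breaker; time is simulated, so nothing actually sleeps."""

    def __init__(self, max_failures=10, call_timeout=10.0, reset_timeout=30.0):
        self.max_failures = max_failures
        self.call_timeout = call_timeout
        self.reset_timeout = reset_timeout
        self.now = 0.0          # simulated clock, in seconds
        self.failures = 0       # consecutive failures while closed
        self.opened_at = None   # None means the breaker is closed

    def call(self, duration):
        """Simulate one journal write taking `duration` seconds.

        Returns "ok", "timeout", or "rejected" (failed fast while open).
        """
        half_open = False
        if self.opened_at is not None:
            if self.now - self.opened_at < self.reset_timeout:
                return "rejected"
            half_open = True  # reset-timeout elapsed: allow one trial call

        # The caller never waits longer than call-timeout.
        self.now += min(duration, self.call_timeout)
        if duration <= self.call_timeout:
            self.failures = 0
            self.opened_at = None
            return "ok"

        self.failures += 1
        if half_open or self.failures >= self.max_failures:
            self.opened_at = self.now
        return "timeout"

    def wait(self, seconds):
        """Advance the simulated clock without making a call."""
        self.now += seconds


breaker = ToyBreaker(max_failures=2, call_timeout=10.0, reset_timeout=30.0)
results = [breaker.call(12.0), breaker.call(12.0), breaker.call(0.1)]
breaker.wait(30.0)
results.append(breaker.call(0.1))
print(results)  # ['timeout', 'timeout', 'rejected', 'ok']
```

Note that in the log quoted in this thread a single timed-out write was 
already followed by "Stopping Shard", so for the shard the value that matters 
most appears to be call-timeout rather than the open/half-open behaviour.
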



On Thu, Aug 10, 2017 at 10:17 PM, Muthukumaran K <
muthukumara...@ericsson.com> wrote:

> Hi Tom, Srini,
>
>
>
> We have also noticed this with Boron very sporadically even without any
> explicit action taken on shard like Srini did
>
>
>
> Srini,
>
>
>
> Are you referring “journal-plugin-fallback” from
> http://doc.akka.io/docs/akka/current/scala/general/
> configuration.html#config-akka-persistence ?
>
>
>
> Regards
>
> Muthu
>
>
>
> *From:* controller-dev-boun...@lists.opendaylight.org [mailto:
> controller-dev-boun...@lists.opendaylight.org] *On Behalf Of *Srini
> Seetharaman
> *Sent:* Friday, August 11, 2017 9:40 AM
> *To:* Tom Pantelis
> *Cc:* controller-dev@lists.opendaylight.org
> *Subject:* Re: [controller-dev] Circuit Breaker timed out
>
>
>
> Thanks Tom. I will investigate further on why the local disk operation
> failed. Seems strange though because I haven't seen anything in dmesg.
>
>
>
> The default value for the call-timeout is 10s in akka.conf.
>
>
>
> On Thu, Aug 10, 2017 at 3:20 PM, Tom Pantelis <tompante...@gmail.com>
> wrote:
>
> That error is from  akka persistence. It happens if the backend
> persistence plugin doesn't respond back in time. I've only seen this in a
> CSIT environment whose disk activity was overloaded. The timeouts can be
> tweaked - I don't recall exactly what they are but you can find them in the
> akka docs (names contain circuit-breaker).
>
>
>
> On Thu, Aug 10, 2017 at 6:01 PM, Srini Seetharaman <
> srini.seethara...@gmail.com> wrote:
>
> Hi Tom,
>
> In our ODL deployment that is running in standalone mode with operational
> store persistence enabled, we saw the following error being printed. Once
> the member-1-default-operational shard is shutdown, all write transactions
> after that fail and the system becomes unstable. At this point, we were
> probably doing less than 10 transactions per second. Any idea what is
> causing this? Has anyone seen this before?
>
>
>
>
>
> 2017-08-07 19:15:59,622 | ERROR | lt-dispatcher-23 | Shard | 176 - com.typesafe.akka.slf4j - 2.4.7 | Failed to persist event type [org.opendaylight.controller.cluster.raft.ReplicatedLogImplEntry] with sequence number [9897493] for persistenceId [member-1-shard-default-operational].
> akka.pattern.CircuitBreaker$$anon$1: Circuit Breaker Timed out.
> 2017-08-07 19:15:59,628 | INFO  | lt-dispatcher-24 | Shard | 188 - org.opendaylight.controller.sal-akka-raft - 1.4.2.Boron-SR2 | Stopping Shard member-1-shard-default-operational
> 2017-08-07 19:15:59,629 | ERROR | lt-dispatcher-23 | LocalThreePhaseCommitCohort | 193 - org.opendaylight.controller.sal-distributed-datastore - 1.4.2.Boron-SR2 | Failed to prepare transaction member-1-datastore-operational-fe-5-txn-791019 on backend
> java.lang.RuntimeException: Transaction aborted due to shutdown.
> at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.abortPendingTransactions(ShardCommitCoordinator.java:399)[193:org.opendaylight.controller.sal-distributed-datastore:1.4.2.Boron-SR2]
> at org.opendaylight.controller.cluster.datastore.Shard.postStop(Shard.java:211)[193:org.opendaylight.controller.sal-distributed-datastore:1.4.2.Boron-SR2]
> at akka.actor.Actor$class.aroundPostStop(Actor.scala:494)[175:com.typesafe.akka.actor:2.4.7]
> at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundPostStop(PersistentActor.scala:168)[181:com.typesafe.akka.persistence:2.4.7]
> at akka.persistence.Eventsourced$class.aroundPostStop(Eventsourced.scala:223)[181:com.typesafe.akka.persistence:2.4.7]
> at akka.persistence.UntypedPersistentActor.aroundPostStop(PersistentActor.scala:168)[181:com.typesafe.akka.persistence:2.4.7]
> at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)[175:com.typesafe.akka.actor:2.4.7]
> at akka.actor.dungeon.FaultHandling$class.handleChildTerminated(FaultHandling.scala:293)[175:com.typesafe.akka.actor:2.4.7]
> at akka.actor.ActorCell.handleChildTerminated(ActorCell.scala:374)[175:com.typesafe.akka.actor:2.4.7]
> at akka.actor.dun

Re: [controller-dev] Circuit Breaker timed out

2017-08-10 Thread Muthukumaran K
Hi Tom, Srini,

We have also noticed this with Boron, very sporadically, even without any 
explicit action being taken on the shard as Srini did.

Srini,

Are you referring to “journal-plugin-fallback” from 
http://doc.akka.io/docs/akka/current/scala/general/configuration.html#config-akka-persistence
?

Regards
Muthu

From: controller-dev-boun...@lists.opendaylight.org 
[mailto:controller-dev-boun...@lists.opendaylight.org] On Behalf Of Srini 
Seetharaman
Sent: Friday, August 11, 2017 9:40 AM
To: Tom Pantelis
Cc: controller-dev@lists.opendaylight.org
Subject: Re: [controller-dev] Circuit Breaker timed out

Thanks Tom. I will investigate further on why the local disk operation failed. 
Seems strange though because I haven't seen anything in dmesg.

The default value for the call-timeout is 10s in akka.conf.

On Thu, Aug 10, 2017 at 3:20 PM, Tom Pantelis <tompante...@gmail.com> wrote:
That error is from  akka persistence. It happens if the backend persistence 
plugin doesn't respond back in time. I've only seen this in a CSIT environment 
whose disk activity was overloaded. The timeouts can be tweaked - I don't 
recall exactly what they are but you can find them in the akka docs (names 
contain circuit-breaker).

On Thu, Aug 10, 2017 at 6:01 PM, Srini Seetharaman 
<srini.seethara...@gmail.com> wrote:
Hi Tom,
In our ODL deployment that is running in standalone mode with operational store 
persistence enabled, we saw the following error being printed. Once the 
member-1-default-operational shard is shutdown, all write transactions after 
that fail and the system becomes unstable. At this point, we were probably 
doing less than 10 transactions per second. Any idea what is causing this? Has 
anyone seen this before?


2017-08-07 19:15:59,622 | ERROR | lt-dispatcher-23 | Shard | 176 - com.typesafe.akka.slf4j - 2.4.7 | Failed to persist event type [org.opendaylight.controller.cluster.raft.ReplicatedLogImplEntry] with sequence number [9897493] for persistenceId [member-1-shard-default-operational].
akka.pattern.CircuitBreaker$$anon$1: Circuit Breaker Timed out.
2017-08-07 19:15:59,628 | INFO  | lt-dispatcher-24 | Shard | 188 - org.opendaylight.controller.sal-akka-raft - 1.4.2.Boron-SR2 | Stopping Shard member-1-shard-default-operational
2017-08-07 19:15:59,629 | ERROR | lt-dispatcher-23 | LocalThreePhaseCommitCohort | 193 - org.opendaylight.controller.sal-distributed-datastore - 1.4.2.Boron-SR2 | Failed to prepare transaction member-1-datastore-operational-fe-5-txn-791019 on backend
java.lang.RuntimeException: Transaction aborted due to shutdown.
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.abortPendingTransactions(ShardCommitCoordinator.java:399)[193:org.opendaylight.controller.sal-distributed-datastore:1.4.2.Boron-SR2]
at org.opendaylight.controller.cluster.datastore.Shard.postStop(Shard.java:211)[193:org.opendaylight.controller.sal-distributed-datastore:1.4.2.Boron-SR2]
at akka.actor.Actor$class.aroundPostStop(Actor.scala:494)[175:com.typesafe.akka.actor:2.4.7]
at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundPostStop(PersistentActor.scala:168)[181:com.typesafe.akka.persistence:2.4.7]
at akka.persistence.Eventsourced$class.aroundPostStop(Eventsourced.scala:223)[181:com.typesafe.akka.persistence:2.4.7]
at akka.persistence.UntypedPersistentActor.aroundPostStop(PersistentActor.scala:168)[181:com.typesafe.akka.persistence:2.4.7]
at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)[175:com.typesafe.akka.actor:2.4.7]
at akka.actor.dungeon.FaultHandling$class.handleChildTerminated(FaultHandling.scala:293)[175:com.typesafe.akka.actor:2.4.7]
at akka.actor.ActorCell.handleChildTerminated(ActorCell.scala:374)[175:com.typesafe.akka.actor:2.4.7]
at akka.actor.dungeon.DeathWatch$class.watchedActorTerminated(DeathWatch.scala:61)[175:com.typesafe.akka.actor:2.4.7]
at akka.actor.ActorCell.watchedActorTerminated(ActorCell.scala:374)[175:com.typesafe.akka.actor:2.4.7]
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:460)[175:com.typesafe.akka.actor:2.4.7]
at akka.actor.ActorCell.systemInvoke(ActorCell.scala:483)[175:com.typesafe.akka.actor:2.4.7]
at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:282)[175:com.typesafe.akka.actor:2.4.7]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:260)[175:com.typesafe.akka.actor:2.4.7]
at akka.dispatch.Mailbox.run(Mailbox.scala:224)[175:com.typesafe.akka.actor:2.4.7]
at a

Re: [controller-dev] Circuit Breaker timed out

2017-08-10 Thread Srini Seetharaman
Thanks, Tom. I will investigate further why the local disk operation
failed. It seems strange, though, because I haven't seen anything in dmesg.

The default value for the call-timeout is 10s in akka.conf.

On Thu, Aug 10, 2017 at 3:20 PM, Tom Pantelis wrote:

> That error is from  akka persistence. It happens if the backend
> persistence plugin doesn't respond back in time. I've only seen this in a
> CSIT environment whose disk activity was overloaded. The timeouts can be
> tweaked - I don't recall exactly what they are but you can find them in the
> akka docs (names contain circuit-breaker).
>
> On Thu, Aug 10, 2017 at 6:01 PM, Srini Seetharaman <
> srini.seethara...@gmail.com> wrote:
>
>> Hi Tom,
>> In our ODL deployment that is running in standalone mode with operational
>> store persistence enabled, we saw the following error being printed. Once
>> the member-1-default-operational shard is shutdown, all write transactions
>> after that fail and the system becomes unstable. At this point, we were
>> probably doing less than 10 transactions per second. Any idea what is
>> causing this? Has anyone seen this before?
>>
>>
>> 2017-08-07 19:15:59,622 | ERROR | lt-dispatcher-23 | Shard | 176 - com.typesafe.akka.slf4j - 2.4.7 | Failed to persist event type [org.opendaylight.controller.cluster.raft.ReplicatedLogImplEntry] with sequence number [9897493] for persistenceId [member-1-shard-default-operational].
>> akka.pattern.CircuitBreaker$$anon$1: Circuit Breaker Timed out.
>> 2017-08-07 19:15:59,628 | INFO  | lt-dispatcher-24 | Shard | 188 - org.opendaylight.controller.sal-akka-raft - 1.4.2.Boron-SR2 | Stopping Shard member-1-shard-default-operational
>> 2017-08-07 19:15:59,629 | ERROR | lt-dispatcher-23 | LocalThreePhaseCommitCohort | 193 - org.opendaylight.controller.sal-distributed-datastore - 1.4.2.Boron-SR2 | Failed to prepare transaction member-1-datastore-operational-fe-5-txn-791019 on backend
>> java.lang.RuntimeException: Transaction aborted due to shutdown.
>> at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.abortPendingTransactions(ShardCommitCoordinator.java:399)[193:org.opendaylight.controller.sal-distributed-datastore:1.4.2.Boron-SR2]
>> at org.opendaylight.controller.cluster.datastore.Shard.postStop(Shard.java:211)[193:org.opendaylight.controller.sal-distributed-datastore:1.4.2.Boron-SR2]
>> at akka.actor.Actor$class.aroundPostStop(Actor.scala:494)[175:com.typesafe.akka.actor:2.4.7]
>> at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundPostStop(PersistentActor.scala:168)[181:com.typesafe.akka.persistence:2.4.7]
>> at akka.persistence.Eventsourced$class.aroundPostStop(Eventsourced.scala:223)[181:com.typesafe.akka.persistence:2.4.7]
>> at akka.persistence.UntypedPersistentActor.aroundPostStop(PersistentActor.scala:168)[181:com.typesafe.akka.persistence:2.4.7]
>> at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)[175:com.typesafe.akka.actor:2.4.7]
>> at akka.actor.dungeon.FaultHandling$class.handleChildTerminated(FaultHandling.scala:293)[175:com.typesafe.akka.actor:2.4.7]
>> at akka.actor.ActorCell.handleChildTerminated(ActorCell.scala:374)[175:com.typesafe.akka.actor:2.4.7]
>> at akka.actor.dungeon.DeathWatch$class.watchedActorTerminated(DeathWatch.scala:61)[175:com.typesafe.akka.actor:2.4.7]
>> at akka.actor.ActorCell.watchedActorTerminated(ActorCell.scala:374)[175:com.typesafe.akka.actor:2.4.7]
>> at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:460)[175:com.typesafe.akka.actor:2.4.7]
>> at akka.actor.ActorCell.systemInvoke(ActorCell.scala:483)[175:com.typesafe.akka.actor:2.4.7]
>> at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:282)[175:com.typesafe.akka.actor:2.4.7]
>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:260)[175:com.typesafe.akka.actor:2.4.7]
>> at akka.dispatch.Mailbox.run(Mailbox.scala:224)[175:com.typesafe.akka.actor:2.4.7]
>> at akka.dispatch.Mailbox.exec(Mailbox.scala:234)[175:com.typesafe.akka.actor:2.4.7]
>> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)[171:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
>> at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)[171:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
>> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)[171:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
>> at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinW

Re: [controller-dev] Circuit Breaker timed out

2017-08-10 Thread Tom Pantelis
That error is from akka persistence. It happens if the backend persistence
plugin doesn't respond in time. I've only seen this in a CSIT environment
whose disk activity was overloaded. The timeouts can be tweaked; I don't
recall exactly what they are, but you can find them in the akka docs (the
names contain circuit-breaker).

On Thu, Aug 10, 2017 at 6:01 PM, Srini Seetharaman <
srini.seethara...@gmail.com> wrote:

> Hi Tom,
> In our ODL deployment that is running in standalone mode with operational
> store persistence enabled, we saw the following error being printed. Once
> the member-1-default-operational shard is shutdown, all write transactions
> after that fail and the system becomes unstable. At this point, we were
> probably doing less than 10 transactions per second. Any idea what is
> causing this? Has anyone seen this before?
>
>
> 2017-08-07 19:15:59,622 | ERROR | lt-dispatcher-23 | Shard | 176 - com.typesafe.akka.slf4j - 2.4.7 | Failed to persist event type [org.opendaylight.controller.cluster.raft.ReplicatedLogImplEntry] with sequence number [9897493] for persistenceId [member-1-shard-default-operational].
> akka.pattern.CircuitBreaker$$anon$1: Circuit Breaker Timed out.
> 2017-08-07 19:15:59,628 | INFO  | lt-dispatcher-24 | Shard | 188 - org.opendaylight.controller.sal-akka-raft - 1.4.2.Boron-SR2 | Stopping Shard member-1-shard-default-operational
> 2017-08-07 19:15:59,629 | ERROR | lt-dispatcher-23 | LocalThreePhaseCommitCohort | 193 - org.opendaylight.controller.sal-distributed-datastore - 1.4.2.Boron-SR2 | Failed to prepare transaction member-1-datastore-operational-fe-5-txn-791019 on backend
> java.lang.RuntimeException: Transaction aborted due to shutdown.
> at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.abortPendingTransactions(ShardCommitCoordinator.java:399)[193:org.opendaylight.controller.sal-distributed-datastore:1.4.2.Boron-SR2]
> at org.opendaylight.controller.cluster.datastore.Shard.postStop(Shard.java:211)[193:org.opendaylight.controller.sal-distributed-datastore:1.4.2.Boron-SR2]
> at akka.actor.Actor$class.aroundPostStop(Actor.scala:494)[175:com.typesafe.akka.actor:2.4.7]
> at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundPostStop(PersistentActor.scala:168)[181:com.typesafe.akka.persistence:2.4.7]
> at akka.persistence.Eventsourced$class.aroundPostStop(Eventsourced.scala:223)[181:com.typesafe.akka.persistence:2.4.7]
> at akka.persistence.UntypedPersistentActor.aroundPostStop(PersistentActor.scala:168)[181:com.typesafe.akka.persistence:2.4.7]
> at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)[175:com.typesafe.akka.actor:2.4.7]
> at akka.actor.dungeon.FaultHandling$class.handleChildTerminated(FaultHandling.scala:293)[175:com.typesafe.akka.actor:2.4.7]
> at akka.actor.ActorCell.handleChildTerminated(ActorCell.scala:374)[175:com.typesafe.akka.actor:2.4.7]
> at akka.actor.dungeon.DeathWatch$class.watchedActorTerminated(DeathWatch.scala:61)[175:com.typesafe.akka.actor:2.4.7]
> at akka.actor.ActorCell.watchedActorTerminated(ActorCell.scala:374)[175:com.typesafe.akka.actor:2.4.7]
> at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:460)[175:com.typesafe.akka.actor:2.4.7]
> at akka.actor.ActorCell.systemInvoke(ActorCell.scala:483)[175:com.typesafe.akka.actor:2.4.7]
> at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:282)[175:com.typesafe.akka.actor:2.4.7]
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:260)[175:com.typesafe.akka.actor:2.4.7]
> at akka.dispatch.Mailbox.run(Mailbox.scala:224)[175:com.typesafe.akka.actor:2.4.7]
> at akka.dispatch.Mailbox.exec(Mailbox.scala:234)[175:com.typesafe.akka.actor:2.4.7]
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)[171:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
> at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)[171:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)[171:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
> at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)[171:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
> 2017-08-07 19:15:59,629 | WARN  | ult-dispatcher-3 | ConcurrentDOMDataBroker | 193 - org.opendaylight.controller.sal-distributed-datastore - 1.4.2.Boron-SR2 | Tx: DOM-956840 Error during phase CAN_COMMIT, starting Abort
> java.lang.RuntimeException: Transaction aborted due to shutdown.
> at 

[controller-dev] Circuit Breaker timed out

2017-08-10 Thread Srini Seetharaman
Hi Tom,
In our ODL deployment, which runs in standalone mode with operational
store persistence enabled, we saw the error below. Once the
member-1-shard-default-operational shard is shut down, all subsequent write
transactions fail and the system becomes unstable. At the time, we were
probably doing fewer than 10 transactions per second. Any idea what is
causing this? Has anyone seen this before?


2017-08-07 19:15:59,622 | ERROR | lt-dispatcher-23 | Shard | 176 - com.typesafe.akka.slf4j - 2.4.7 | Failed to persist event type [org.opendaylight.controller.cluster.raft.ReplicatedLogImplEntry] with sequence number [9897493] for persistenceId [member-1-shard-default-operational].
akka.pattern.CircuitBreaker$$anon$1: Circuit Breaker Timed out.
2017-08-07 19:15:59,628 | INFO  | lt-dispatcher-24 | Shard | 188 - org.opendaylight.controller.sal-akka-raft - 1.4.2.Boron-SR2 | Stopping Shard member-1-shard-default-operational
2017-08-07 19:15:59,629 | ERROR | lt-dispatcher-23 | LocalThreePhaseCommitCohort  | 193 - org.opendaylight.controller.sal-distributed-datastore - 1.4.2.Boron-SR2 | Failed to prepare transaction member-1-datastore-operational-fe-5-txn-791019 on backend
java.lang.RuntimeException: Transaction aborted due to shutdown.
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.abortPendingTransactions(ShardCommitCoordinator.java:399)[193:org.opendaylight.controller.sal-distributed-datastore:1.4.2.Boron-SR2]
at org.opendaylight.controller.cluster.datastore.Shard.postStop(Shard.java:211)[193:org.opendaylight.controller.sal-distributed-datastore:1.4.2.Boron-SR2]
at akka.actor.Actor$class.aroundPostStop(Actor.scala:494)[175:com.typesafe.akka.actor:2.4.7]
at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundPostStop(PersistentActor.scala:168)[181:com.typesafe.akka.persistence:2.4.7]
at akka.persistence.Eventsourced$class.aroundPostStop(Eventsourced.scala:223)[181:com.typesafe.akka.persistence:2.4.7]
at akka.persistence.UntypedPersistentActor.aroundPostStop(PersistentActor.scala:168)[181:com.typesafe.akka.persistence:2.4.7]
at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)[175:com.typesafe.akka.actor:2.4.7]
at akka.actor.dungeon.FaultHandling$class.handleChildTerminated(FaultHandling.scala:293)[175:com.typesafe.akka.actor:2.4.7]
at akka.actor.ActorCell.handleChildTerminated(ActorCell.scala:374)[175:com.typesafe.akka.actor:2.4.7]
at akka.actor.dungeon.DeathWatch$class.watchedActorTerminated(DeathWatch.scala:61)[175:com.typesafe.akka.actor:2.4.7]
at akka.actor.ActorCell.watchedActorTerminated(ActorCell.scala:374)[175:com.typesafe.akka.actor:2.4.7]
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:460)[175:com.typesafe.akka.actor:2.4.7]
at akka.actor.ActorCell.systemInvoke(ActorCell.scala:483)[175:com.typesafe.akka.actor:2.4.7]
at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:282)[175:com.typesafe.akka.actor:2.4.7]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:260)[175:com.typesafe.akka.actor:2.4.7]
at akka.dispatch.Mailbox.run(Mailbox.scala:224)[175:com.typesafe.akka.actor:2.4.7]
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)[175:com.typesafe.akka.actor:2.4.7]
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)[171:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)[171:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)[171:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)[171:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
2017-08-07 19:15:59,629 | WARN  | ult-dispatcher-3 | ConcurrentDOMDataBroker  | 193 - org.opendaylight.controller.sal-distributed-datastore - 1.4.2.Boron-SR2 | Tx: DOM-956840 Error during phase CAN_COMMIT, starting Abort
java.lang.RuntimeException: Transaction aborted due to shutdown.
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.abortPendingTransactions(ShardCommitCoordinator.java:399)[193:org.opendaylight.controller.sal-distributed-datastore:1.4.2.Boron-SR2]
at org.opendaylight.controller.cluster.datastore.Shard.postStop(Shard.java:211)[193:org.opendaylight.controller.sal-distributed-datastore:1.4.2.Boron-SR2]
at akka.actor.Actor$class.aroundPostStop(Actor.scala:494)[175:com.typesafe.akka.actor:2.4.7]
at