Re: [controller-dev] Circuit Breaker timed out
Correct, Ajay. We experimented with increasing the timeout and have not seen a recurrence since. We also considered the following options:

1) Recreate the affected shard
   a. We tried this using a "simulated shard-kill". On observing the shard kill, the Shard Manager recreates the shard, and it did work. But we are staying away from this solution, mainly because of DTCN side effects. We do not know how many applications would tolerate receiving DTCNs all of a sudden (similar to a node restart) in an unexpected manner. A silent shard restart implies that applications must be thoroughly idempotent in handling DTCNs across shards, so that recreating a single shard does not affect whatever state they build up internally via DTCNs.

2) Restart the entire controller
   a. A more intrusive change would be to perform a bundle 0 stop upon onPersistFailure and let the restart logic (systemd, pacemaker, etc., depending on the environment) take care of restarting the node, because the system would be useless anyway if one shard stops completely. We are trying to get this working correctly but have been unsuccessful so far.

Regards
Muthu

From: Ajay L [mailto:ajayl@gmail.com]
Sent: Wednesday, November 15, 2017 1:46 AM
To: controller-dev@lists.opendaylight.org
Cc: Muthukumaran K; Srini Seetharaman; Robert Varga; ajaysl...@gmail.com; Sai MarapaReddy
Subject: Re: [controller-dev] Circuit Breaker timed out

Hi All,

We are also seeing the "circuit breaker" error under heavy load. When this happens, the affected shard is stopped and never restarted and I think the only way to recover is to restart the node. I have opened https://jira.opendaylight.org/browse/CONTROLLER-1789 to request better recovery behavior.

Increasing the akka journal persistence circuit-breaker call-timeout value (default is 10s) does help in making it more tolerant to outage

Regards
Ajay

On Wed, Aug 16, 2017 at 2:23 AM, Robert Varga <n...@hq.sk> wrote:

On 16/08/17 08:37, Muthukumaran K wrote:
> We have not tried on master branch (Nitrogen / Akka 2.5). Not sure if
> such an issue would go away with Akka 2.5 because the circuit breaker is
> primarily with LevelDB plugin.

Nitrogen is on akka-2.4.18. akka-2.5.x (and others) are staged for Oxygen.

Bye,
Robert

___
controller-dev mailing list
controller-dev@lists.opendaylight.org
https://lists.opendaylight.org/mailman/listinfo/controller-dev
Re: [controller-dev] Circuit Breaker timed out
Hi All,

We are also seeing the "circuit breaker" error under heavy load. When this happens, the affected shard is stopped and never restarted, and I think the only way to recover is to restart the node. I have opened https://jira.opendaylight.org/browse/CONTROLLER-1789 to request better recovery behavior.

Increasing the akka journal persistence circuit-breaker call-timeout value (default is 10s) does help make it more tolerant of outages.

Regards
Ajay

On Wed, Aug 16, 2017 at 2:23 AM, Robert Varga wrote:

> On 16/08/17 08:37, Muthukumaran K wrote:
> > We have not tried on master branch (Nitrogen / Akka 2.5). Not sure if
> > such an issue would go away with Akka 2.5 because the circuit breaker is
> > primarily with LevelDB plugin.
>
> Nitrogen is on akka-2.4.18. akka-2.5.x (and others) are staged for Oxygen.
>
> Bye,
> Robert
Re: [controller-dev] Circuit Breaker timed out
On 16/08/17 08:37, Muthukumaran K wrote:
> We have not tried on master branch (Nitrogen / Akka 2.5). Not sure if
> such an issue would go away with Akka 2.5 because the circuit breaker is
> primarily with LevelDB plugin.

Nitrogen is on akka-2.4.18. akka-2.5.x (and others) are staged for Oxygen.

Bye,
Robert
Re: [controller-dev] Circuit Breaker timed out
Sorry for the delayed response, Srini.

We have not tried the master branch (Nitrogen / Akka 2.5). I am not sure such an issue would go away with Akka 2.5, because the circuit breaker primarily concerns the LevelDB plugin.

In about 20 days we have not been able to reproduce this issue consistently; we have seen it only once, on one of the cluster nodes. We are using plain Ubuntu VMs to bring up the cluster. 'dmesg' did not indicate any disk issues either.

Some "theoretical" candidates we have started to suspect:

a) LevelDB compaction colliding with incoming writes, i.e. heavy compaction delaying incoming writes in LevelDB
b) A difference between the VM's disk and the host's disk performance; we may have to test this by running heavy 'dd' on the disks to see whether disk writes are slow at a different level altogether

We have yet to confirm either.

Consulting the Akka Google group did not help either, because most of the recommendations are that LevelDB is not meant for production. But even if we use Cassandra (for example) for journal persistence, a timeout is logically still possible; perhaps the Cassandra plugin handles such cases in a better manner.

Regards
Muthu

From: srini...@gmail.com [mailto:srini...@gmail.com] On Behalf Of Srini Seetharaman
Sent: Saturday, August 12, 2017 1:55 AM
To: Muthukumaran K
Cc: Tom Pantelis; controller-dev@lists.opendaylight.org
Subject: Re: [controller-dev] Circuit Breaker timed out

Or was there a real disk issue in that machine you were using?

On Fri, Aug 11, 2017 at 10:58 AM, Srini Seetharaman <srini.seethara...@gmail.com> wrote:

Muthu,
It's worrisome to hear that you've seen this too. Did it go away with Nitrogen or with moving to Akka 2.5 persistence?
I am referring to the following params within the persistence section of akka.conf

circuit-breaker {
  max-failures = 10
  call-timeout = 10s
  reset-timeout = 30s
}

On Thu, Aug 10, 2017 at 10:17 PM, Muthukumaran K <muthukumara...@ericsson.com> wrote:

Hi Tom, Srini,

We have also noticed this with Boron very sporadically even without any explicit action taken on shard like Srini did

Srini,

Are you referring “journal-plugin-fallback” from http://doc.akka.io/docs/akka/current/scala/general/configuration.html#config-akka-persistence ?

Regards
Muthu

From: controller-dev-boun...@lists.opendaylight.org [mailto:controller-dev-boun...@lists.opendaylight.org] On Behalf Of Srini Seetharaman
Sent: Friday, August 11, 2017 9:40 AM
To: Tom Pantelis
Cc: controller-dev@lists.opendaylight.org
Subject: Re: [controller-dev] Circuit Breaker timed out

Thanks Tom. I will investigate further on why the local disk operation failed. Seems strange though because I haven't seen anything in dmesg.

The default value for the call-timeout is 10s in akka.conf.

On Thu, Aug 10, 2017 at 3:20 PM, Tom Pantelis <tompante...@gmail.com> wrote:

That error is from akka persistence. It happens if the backend persistence plugin doesn't respond back in time. I've only seen this in a CSIT environment whose disk activity was overloaded. The timeouts can be tweaked - I don't recall exactly what they are but you can find them in the akka docs (names contain circuit-breaker).

On Thu, Aug 10, 2017 at 6:01 PM, Srini Seetharaman <srini.seethara...@gmail.com> wrote:

Hi Tom,
In our ODL deployment that is running in standalone mode with operational store persistence enabled, we saw the following error being printed.
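Candidate (b) in Muthu's message above, slow disk writes inside the VM, could also be checked by timing small synchronous writes, which is closer to what a journal does than a bulk 'dd'. The following is only a minimal sketch under stated assumptions: the helper name `probe_fsync_latency`, the sample count and the payload size are made up for illustration and are not part of ODL, Akka or LevelDB.

```python
import os
import tempfile
import time


def probe_fsync_latency(directory, samples=50, payload_size=4096):
    """Time `samples` small write+fsync cycles in `directory`.

    Returns the per-write latencies in seconds. Hypothetical helper,
    not part of ODL or Akka.
    """
    payload = b"x" * payload_size
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(samples):
            start = time.monotonic()
            os.write(fd, payload)
            os.fsync(fd)  # force the write through to the disk
            latencies.append(time.monotonic() - start)
    finally:
        os.close(fd)
        os.remove(path)
    return latencies


if __name__ == "__main__":
    # Point this at the directory that holds the journal in a real check.
    results = probe_fsync_latency(tempfile.gettempdir())
    print("max fsync latency: %.4f s" % max(results))
    print("avg fsync latency: %.4f s" % (sum(results) / len(results)))
```

Running it on the VM and on the host while the cluster is under load would show whether worst-case write latency gets anywhere near the 10s call-timeout discussed in this thread.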
Re: [controller-dev] Circuit Breaker timed out
Or was there a real disk issue in that machine you were using?

On Fri, Aug 11, 2017 at 10:58 AM, Srini Seetharaman <srini.seethara...@gmail.com> wrote:

> Muthu,
> It's worrisome to hear that you've seen this too. Did it go away with
> Nitrogen or with moving to Akka 2.5 persistence?
>
> I am referring to the following params within the persistence section of
> akka.conf
>
> circuit-breaker {
>   max-failures = 10
>   call-timeout = 10s
>   reset-timeout = 30s
> }
>
> On Thu, Aug 10, 2017 at 10:17 PM, Muthukumaran K <muthukumara...@ericsson.com> wrote:
>
>> Hi Tom, Srini,
>>
>> We have also noticed this with Boron very sporadically even without any
>> explicit action taken on shard like Srini did
>>
>> Srini,
>>
>> Are you referring “journal-plugin-fallback” from
>> http://doc.akka.io/docs/akka/current/scala/general/configuration.html#config-akka-persistence ?
>>
>> Regards
>> Muthu
>>
>> From: controller-dev-boun...@lists.opendaylight.org [mailto:controller-dev-boun...@lists.opendaylight.org] On Behalf Of Srini Seetharaman
>> Sent: Friday, August 11, 2017 9:40 AM
>> To: Tom Pantelis
>> Cc: controller-dev@lists.opendaylight.org
>> Subject: Re: [controller-dev] Circuit Breaker timed out
>>
>> Thanks Tom. I will investigate further on why the local disk operation
>> failed. Seems strange though because I haven't seen anything in dmesg.
>>
>> The default value for the call-timeout is 10s in akka.conf.
>>
>> On Thu, Aug 10, 2017 at 3:20 PM, Tom Pantelis <tompante...@gmail.com> wrote:
>>
>> That error is from akka persistence. It happens if the backend
>> persistence plugin doesn't respond back in time. I've only seen this in a
>> CSIT environment whose disk activity was overloaded. The timeouts can be
>> tweaked - I don't recall exactly what they are but you can find them in the
>> akka docs (names contain circuit-breaker).
Re: [controller-dev] Circuit Breaker timed out
Muthu,
It's worrisome to hear that you've seen this too. Did it go away with Nitrogen or with moving to Akka 2.5 persistence?

I am referring to the following params within the persistence section of akka.conf

circuit-breaker {
  max-failures = 10
  call-timeout = 10s
  reset-timeout = 30s
}

On Thu, Aug 10, 2017 at 10:17 PM, Muthukumaran K <muthukumara...@ericsson.com> wrote:

> Hi Tom, Srini,
>
> We have also noticed this with Boron very sporadically even without any
> explicit action taken on shard like Srini did
>
> Srini,
>
> Are you referring “journal-plugin-fallback” from
> http://doc.akka.io/docs/akka/current/scala/general/configuration.html#config-akka-persistence ?
>
> Regards
> Muthu
>
> From: controller-dev-boun...@lists.opendaylight.org [mailto:controller-dev-boun...@lists.opendaylight.org] On Behalf Of Srini Seetharaman
> Sent: Friday, August 11, 2017 9:40 AM
> To: Tom Pantelis
> Cc: controller-dev@lists.opendaylight.org
> Subject: Re: [controller-dev] Circuit Breaker timed out
>
> Thanks Tom. I will investigate further on why the local disk operation
> failed. Seems strange though because I haven't seen anything in dmesg.
>
> The default value for the call-timeout is 10s in akka.conf.
>
> On Thu, Aug 10, 2017 at 3:20 PM, Tom Pantelis <tompante...@gmail.com> wrote:
>
> That error is from akka persistence. It happens if the backend
> persistence plugin doesn't respond back in time. I've only seen this in a
> CSIT environment whose disk activity was overloaded. The timeouts can be
> tweaked - I don't recall exactly what they are but you can find them in the
> akka docs (names contain circuit-breaker).
>
> On Thu, Aug 10, 2017 at 6:01 PM, Srini Seetharaman <srini.seethara...@gmail.com> wrote:
>
> Hi Tom,
> In our ODL deployment that is running in standalone mode with operational
> store persistence enabled, we saw the following error being printed.
Re: [controller-dev] Circuit Breaker timed out
Hi Tom, Srini,

We have also noticed this with Boron, very sporadically, even without any explicit action taken on the shard like the one Srini took.

Srini,

Are you referring to "journal-plugin-fallback" from http://doc.akka.io/docs/akka/current/scala/general/configuration.html#config-akka-persistence ?

Regards
Muthu

From: controller-dev-boun...@lists.opendaylight.org [mailto:controller-dev-boun...@lists.opendaylight.org] On Behalf Of Srini Seetharaman
Sent: Friday, August 11, 2017 9:40 AM
To: Tom Pantelis
Cc: controller-dev@lists.opendaylight.org
Subject: Re: [controller-dev] Circuit Breaker timed out

Thanks Tom. I will investigate further on why the local disk operation failed. Seems strange though because I haven't seen anything in dmesg.

The default value for the call-timeout is 10s in akka.conf.

On Thu, Aug 10, 2017 at 3:20 PM, Tom Pantelis <tompante...@gmail.com> wrote:

That error is from akka persistence. It happens if the backend persistence plugin doesn't respond back in time. I've only seen this in a CSIT environment whose disk activity was overloaded. The timeouts can be tweaked - I don't recall exactly what they are but you can find them in the akka docs (names contain circuit-breaker).

On Thu, Aug 10, 2017 at 6:01 PM, Srini Seetharaman <srini.seethara...@gmail.com> wrote:

Hi Tom,
In our ODL deployment that is running in standalone mode with operational store persistence enabled, we saw the following error being printed. Once the member-1-default-operational shard is shutdown, all write transactions after that fail and the system becomes unstable. At this point, we were probably doing less than 10 transactions per second. Any idea what is causing this? Has anyone seen this before?
Re: [controller-dev] Circuit Breaker timed out
Thanks Tom. I will investigate further on why the local disk operation failed. Seems strange though because I haven't seen anything in dmesg.

The default value for the call-timeout is 10s in akka.conf.

On Thu, Aug 10, 2017 at 3:20 PM, Tom Pantelis wrote:

> That error is from akka persistence. It happens if the backend
> persistence plugin doesn't respond back in time. I've only seen this in a
> CSIT environment whose disk activity was overloaded. The timeouts can be
> tweaked - I don't recall exactly what they are but you can find them in the
> akka docs (names contain circuit-breaker).
>
> On Thu, Aug 10, 2017 at 6:01 PM, Srini Seetharaman <srini.seethara...@gmail.com> wrote:
>
>> Hi Tom,
>> In our ODL deployment that is running in standalone mode with operational
>> store persistence enabled, we saw the following error being printed. Once
>> the member-1-default-operational shard is shutdown, all write transactions
>> after that fail and the system becomes unstable. At this point, we were
>> probably doing less than 10 transactions per second. Any idea what is
>> causing this? Has anyone seen this before?
>>
>> 2017-08-07 19:15:59,622 | ERROR | lt-dispatcher-23 | Shard | 176 - com.typesafe.akka.slf4j - 2.4.7 | Failed to persist event type [org.opendaylight.controller.cluster.raft.ReplicatedLogImplEntry] with sequence number [9897493] for persistenceId [member-1-shard-default-operational].
>> akka.pattern.CircuitBreaker$$anon$1: Circuit Breaker Timed out.
Re: [controller-dev] Circuit Breaker timed out
That error is from akka persistence. It happens if the backend persistence plugin doesn't respond back in time. I've only seen this in a CSIT environment whose disk activity was overloaded. The timeouts can be tweaked - I don't recall exactly what they are but you can find them in the akka docs (names contain circuit-breaker).

On Thu, Aug 10, 2017 at 6:01 PM, Srini Seetharaman <srini.seethara...@gmail.com> wrote:

> Hi Tom,
> In our ODL deployment that is running in standalone mode with operational
> store persistence enabled, we saw the following error being printed. Once
> the member-1-default-operational shard is shutdown, all write transactions
> after that fail and the system becomes unstable. At this point, we were
> probably doing less than 10 transactions per second. Any idea what is
> causing this? Has anyone seen this before?
>
> 2017-08-07 19:15:59,622 | ERROR | lt-dispatcher-23 | Shard | 176 - com.typesafe.akka.slf4j - 2.4.7 | Failed to persist event type [org.opendaylight.controller.cluster.raft.ReplicatedLogImplEntry] with sequence number [9897493] for persistenceId [member-1-shard-default-operational].
> akka.pattern.CircuitBreaker$$anon$1: Circuit Breaker Timed out.
> 2017-08-07 19:15:59,628 | INFO | lt-dispatcher-24 | Shard | 188 - org.opendaylight.controller.sal-akka-raft - 1.4.2.Boron-SR2 | Stopping Shard member-1-shard-default-operational
> 2017-08-07 19:15:59,629 | ERROR | lt-dispatcher-23 | LocalThreePhaseCommitCohort | 193 - org.opendaylight.controller.sal-distributed-datastore - 1.4.2.Boron-SR2 | Failed to prepare transaction member-1-datastore-operational-fe-5-txn-791019 on backend
> java.lang.RuntimeException: Transaction aborted due to shutdown.
> at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.abortPendingTransactions(ShardCommitCoordinator.java:399)[193:org.opendaylight.controller.sal-distributed-datastore:1.4.2.Boron-SR2]
> at org.opendaylight.controller.cluster.datastore.Shard.postStop(Shard.java:211)[193:org.opendaylight.controller.sal-distributed-datastore:1.4.2.Boron-SR2]
> at akka.actor.Actor$class.aroundPostStop(Actor.scala:494)[175:com.typesafe.akka.actor:2.4.7]
> at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundPostStop(PersistentActor.scala:168)[181:com.typesafe.akka.persistence:2.4.7]
> at akka.persistence.Eventsourced$class.aroundPostStop(Eventsourced.scala:223)[181:com.typesafe.akka.persistence:2.4.7]
> at akka.persistence.UntypedPersistentActor.aroundPostStop(PersistentActor.scala:168)[181:com.typesafe.akka.persistence:2.4.7]
> at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)[175:com.typesafe.akka.actor:2.4.7]
> at akka.actor.dungeon.FaultHandling$class.handleChildTerminated(FaultHandling.scala:293)[175:com.typesafe.akka.actor:2.4.7]
> at akka.actor.ActorCell.handleChildTerminated(ActorCell.scala:374)[175:com.typesafe.akka.actor:2.4.7]
> at akka.actor.dungeon.DeathWatch$class.watchedActorTerminated(DeathWatch.scala:61)[175:com.typesafe.akka.actor:2.4.7]
> at akka.actor.ActorCell.watchedActorTerminated(ActorCell.scala:374)[175:com.typesafe.akka.actor:2.4.7]
> at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:460)[175:com.typesafe.akka.actor:2.4.7]
> at akka.actor.ActorCell.systemInvoke(ActorCell.scala:483)[175:com.typesafe.akka.actor:2.4.7]
> at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:282)[175:com.typesafe.akka.actor:2.4.7]
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:260)[175:com.typesafe.akka.actor:2.4.7]
> at akka.dispatch.Mailbox.run(Mailbox.scala:224)[175:com.typesafe.akka.actor:2.4.7]
> at akka.dispatch.Mailbox.exec(Mailbox.scala:234)[175:com.typesafe.akka.actor:2.4.7]
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)[171:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
> at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)[171:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)[171:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
> at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)[171:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
> 2017-08-07 19:15:59,629 | WARN | ult-dispatcher-3 | ConcurrentDOMDataBroker | 193 - org.opendaylight.controller.sal-distributed-datastore - 1.4.2.Boron-SR2 | Tx: DOM-956840 Error during phase CAN_COMMIT, starting Abort
> java.lang.RuntimeException: Transaction aborted due to shutdown.
> at
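Tom's description, together with the three parameters quoted elsewhere in this thread (max-failures, call-timeout, reset-timeout), corresponds to the usual circuit-breaker state machine. The sketch below is a simplified, single-threaded illustration of that pattern, not Akka's implementation: the class and exception names are hypothetical, and a slow call is only detected after it returns rather than being cut off at the timeout.

```python
import time


class CircuitBreakerOpen(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreakerTimeout(Exception):
    """Raised when a call took longer than call_timeout."""


class SimpleCircuitBreaker:
    """Illustrative breaker; parameter names mirror the akka.conf keys."""

    def __init__(self, max_failures=10, call_timeout=10.0,
                 reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.call_timeout = call_timeout
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitBreakerOpen("failing fast")
            # reset_timeout has elapsed: half-open, allow one trial call
        start = self.clock()
        try:
            result = operation()
        except Exception:
            self._record_failure()
            raise
        if self.clock() - start > self.call_timeout:
            self._record_failure()
            raise CircuitBreakerTimeout("Circuit Breaker Timed out.")
        # A successful call closes the breaker and clears the count.
        self.failures = 0
        self.opened_at = None
        return result

    def _record_failure(self):
        self.failures += 1
        # A failed trial call in half-open state re-opens immediately.
        if self.opened_at is not None or self.failures >= self.max_failures:
            self.opened_at = self.clock()
```

In the log quoted in this thread the shard is stopped right after the first timed-out persist, so for the shard itself call-timeout is the value that matters; raising it trades slower failure detection for more tolerance of a briefly overloaded disk.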
[controller-dev] Circuit Breaker timed out
Hi Tom,

In our ODL deployment, which runs in standalone mode with operational store persistence enabled, we saw the error below. Once the member-1-shard-default-operational shard is shut down, every subsequent write transaction fails and the system becomes unstable. At that point we were probably doing fewer than 10 transactions per second.

Any idea what is causing this? Has anyone seen this before?

2017-08-07 19:15:59,622 | ERROR | lt-dispatcher-23 | Shard | 176 - com.typesafe.akka.slf4j - 2.4.7 | Failed to persist event type [org.opendaylight.controller.cluster.raft.ReplicatedLogImplEntry] with sequence number [9897493] for persistenceId [member-1-shard-default-operational].
akka.pattern.CircuitBreaker$$anon$1: Circuit Breaker Timed out.
2017-08-07 19:15:59,628 | INFO | lt-dispatcher-24 | Shard | 188 - org.opendaylight.controller.sal-akka-raft - 1.4.2.Boron-SR2 | Stopping Shard member-1-shard-default-operational
2017-08-07 19:15:59,629 | ERROR | lt-dispatcher-23 | LocalThreePhaseCommitCohort | 193 - org.opendaylight.controller.sal-distributed-datastore - 1.4.2.Boron-SR2 | Failed to prepare transaction member-1-datastore-operational-fe-5-txn-791019 on backend
java.lang.RuntimeException: Transaction aborted due to shutdown.
    at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.abortPendingTransactions(ShardCommitCoordinator.java:399)[193:org.opendaylight.controller.sal-distributed-datastore:1.4.2.Boron-SR2]
    at org.opendaylight.controller.cluster.datastore.Shard.postStop(Shard.java:211)[193:org.opendaylight.controller.sal-distributed-datastore:1.4.2.Boron-SR2]
    at akka.actor.Actor$class.aroundPostStop(Actor.scala:494)[175:com.typesafe.akka.actor:2.4.7]
    at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundPostStop(PersistentActor.scala:168)[181:com.typesafe.akka.persistence:2.4.7]
    at akka.persistence.Eventsourced$class.aroundPostStop(Eventsourced.scala:223)[181:com.typesafe.akka.persistence:2.4.7]
    at akka.persistence.UntypedPersistentActor.aroundPostStop(PersistentActor.scala:168)[181:com.typesafe.akka.persistence:2.4.7]
    at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)[175:com.typesafe.akka.actor:2.4.7]
    at akka.actor.dungeon.FaultHandling$class.handleChildTerminated(FaultHandling.scala:293)[175:com.typesafe.akka.actor:2.4.7]
    at akka.actor.ActorCell.handleChildTerminated(ActorCell.scala:374)[175:com.typesafe.akka.actor:2.4.7]
    at akka.actor.dungeon.DeathWatch$class.watchedActorTerminated(DeathWatch.scala:61)[175:com.typesafe.akka.actor:2.4.7]
    at akka.actor.ActorCell.watchedActorTerminated(ActorCell.scala:374)[175:com.typesafe.akka.actor:2.4.7]
    at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:460)[175:com.typesafe.akka.actor:2.4.7]
    at akka.actor.ActorCell.systemInvoke(ActorCell.scala:483)[175:com.typesafe.akka.actor:2.4.7]
    at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:282)[175:com.typesafe.akka.actor:2.4.7]
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:260)[175:com.typesafe.akka.actor:2.4.7]
    at akka.dispatch.Mailbox.run(Mailbox.scala:224)[175:com.typesafe.akka.actor:2.4.7]
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)[175:com.typesafe.akka.actor:2.4.7]
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)[171:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)[171:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)[171:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)[171:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
2017-08-07 19:15:59,629 | WARN | ult-dispatcher-3 | ConcurrentDOMDataBroker | 193 - org.opendaylight.controller.sal-distributed-datastore - 1.4.2.Boron-SR2 | Tx: DOM-956840 Error during phase CAN_COMMIT, starting Abort
java.lang.RuntimeException: Transaction aborted due to shutdown.
    at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.abortPendingTransactions(ShardCommitCoordinator.java:399)[193:org.opendaylight.controller.sal-distributed-datastore:1.4.2.Boron-SR2]
    at org.opendaylight.controller.cluster.datastore.Shard.postStop(Shard.java:211)[193:org.opendaylight.controller.sal-distributed-datastore:1.4.2.Boron-SR2]
    at akka.actor.Actor$class.aroundPostStop(Actor.scala:494)[175:com.typesafe.akka.actor:2.4.7]
    at