[jira] [Commented] (AURORA-1840) Issue with Curator-backed discovery under heavy load

2017-04-20 Thread David McLaughlin (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15977710#comment-15977710
 ] 

David McLaughlin commented on AURORA-1840:
--

Thanks John. Good to know we can customize it. I think that should be a 
blocker to making Curator the default and removing the twitter-commons 
implementation. 

> Issue with Curator-backed discovery under heavy load
> 
>
> Key: AURORA-1840
> URL: https://issues.apache.org/jira/browse/AURORA-1840
> Project: Aurora
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: David McLaughlin
>Assignee: David McLaughlin
>Priority: Blocker
> Fix For: 0.17.0
>
>
> We've been having some performance issues recently with our production 
> clusters at Twitter. A side-effect of these is occasional stop-the-world GC 
> pauses for up to 15 seconds. This has been happening at our scale for quite 
> some time, but previous versions of the Scheduler were resilient to this and 
> no leadership change would occur. 
> Since we moved to Curator, we are no longer resilient to these GC pauses. The 
> Scheduler is now failing over any time we see a GC pause, even though these 
> pauses are within the session timeout. Here is an example pause in the 
> scheduler logs with the associated ZK session timeout that leads to a 
> failover:
> {code}
> I1118 19:40:16.871801 51800 sched.cpp:1025] Scheduler::statusUpdate took 
> 586236ns
> I1118 19:40:16.902 [TaskGroupBatchWorker, StateMachine$Builder:389] 
> redacted-9f565b4-067e-422f-b641-c6000f9ae2c8 state machine transition PENDING 
> -> ASSIGNED 
> I1118 19:40:16.903 [TaskGroupBatchWorker, TaskStateMachine:474] Adding work 
> command SAVE_STATE for redacted-0-49f565b4-067e-422f-b641-c6000f9ae2c8 
> I1118 19:40:16.903 [TaskGroupBatchWorker, TaskAssigner$TaskAssignerImpl:130] 
> Offer on agent redacted (id 566ae347-c1b6-44ce-8551-b7a6cda72989-S7579) is 
> being assigned task redacted-0-49f565b4-067e-422f-b641-c6000f9ae2c8. 
> W1118 19:40:31.744 [main-SendThread(redacted:2181), 
> ClientCnxn$SendThread:1108] Client session timed out, have not heard from 
> server in 20743ms for sessionid 0x6584fd2b34ede86 
> {code}
> As you can see from the timestamps, there was a 15s GC pause (confirmed in 
> our GC logs - a CMS promotion failure caused the pause) and this triggers a 
> session timeout of 20s to fire. Note: we have seen GC pauses as little as 7s 
> cause the same behavior. Removed: my ZK was rusty. 20s is 2/3 of our 30s ZK 
> timeout, so our session timeout is being wired through fine. 
> We have confirmed that the Scheduler no longer fails over when deploying from 
> HEAD with these two commits reverted and setting zk_use_curator to false:
> https://github.com/apache/aurora/commit/b417be38fe1fcae6b85f7e91cea961ab272adf3f
> https://github.com/apache/aurora/commit/69cba786efc2628eab566201dfea46836a1d9af5
> This is a pretty big blocker for us given how expensive Scheduler failovers 
> are (currently several minutes for us). 





[jira] [Commented] (AURORA-1840) Issue with Curator-backed discovery under heavy load

2017-04-20 Thread John Sirois (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15977689#comment-15977689
 ] 

John Sirois commented on AURORA-1840:
-

[~davmclau] - I think the Curator patch I linked above 
(https://issues.apache.org/jira/browse/CURATOR-248) addresses this for the 3.x 
series - you can install a custom error handler that behaves in the way Aurora 
desires.
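
For illustration, a minimal sketch of what that customization might look like 
(assuming Curator 3.x, where CURATOR-248 added a pluggable 
ConnectionStateErrorPolicy; the connect string, session timeout, and retry 
policy below are placeholders rather than Aurora's actual wiring):

{code}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.state.SessionConnectionStateErrorPolicy;
import org.apache.curator.retry.ExponentialBackoffRetry;

public final class CuratorErrorPolicyExample {
  public static void main(String[] args) {
    // Treat only a genuine session loss (LOST) as an error. Under the default
    // (standard) policy, SUSPENDED (a transient disconnect such as a long GC
    // pause) is also treated as an error, which is what makes recipes give up
    // leadership even though the ZooKeeper session is still alive on the server.
    CuratorFramework client = CuratorFrameworkFactory.builder()
        .connectString("zk1:2181,zk2:2181,zk3:2181")   // placeholder ensemble
        .sessionTimeoutMs(30_000)                      // e.g. to match -zk_session_timeout
        .retryPolicy(new ExponentialBackoffRetry(1000, 29))
        .connectionStateErrorPolicy(new SessionConnectionStateErrorPolicy())
        .build();
    client.start();
  }
}
{code}

With the session-based policy, a SUSPENDED connection no longer forces recipes 
into an error state; only LOST does.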



[jira] [Commented] (AURORA-1840) Issue with Curator-backed discovery under heavy load

2017-04-20 Thread David McLaughlin (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15977627#comment-15977627
 ] 

David McLaughlin commented on AURORA-1840:
--

I just wanted to bump this ticket because we've been running into issues 
recently with ZK clients (the one in Mesos) that expire sessions on the client 
side. We're noticing more and more that Aurora is surviving our internal ZK 
upgrades but Mesos is consistently failing over - and it's doing so despite the 
fact that the session would survive if the client didn't decide to end it. 

There is absolutely zero reason for us to expire sessions on the client (our 
partition tolerance is provided by quorum writes to the replicated log, and 
eventual consistency is provided by reconciliation), so self-terminating ZK 
sessions when the ZK cluster goes down just needlessly ties our availability 
to ZooKeeper. For the recipes we have, we only care about ZooKeeper being up 
when we're trying to reach consensus on who the leader is. After that, we can 
survive ZK downtime for as long as we have a quorum of scheduler nodes running. 

The new Curator implementation will change this because it manages session 
state on the client side (despite the ZK docs saying not to do this - sigh). 

From this SO question:

{quote}
In Curator 3.x and above, when a connection to the ensemble is lost, Curator 
sets an internal timer and if that timer passes the negotiated session timeout 
before reconnecting to the ZooKeeper ensemble, Curator changes to LOST and 
"fakes" a session timeout to the internally managed ZooKeeper handle.
{quote}

And from their docs (http://curator.apache.org/errors.html):

{quote}
Curator will set the LOST state when it believes that the ZooKeeper session has 
expired. ZooKeeper connections have a session. When the session expires, 
clients must take appropriate action. In Curator, this is complicated by the 
fact that Curator internally manages the ZooKeeper connection. Curator will set 
the LOST state when any of the following occurs: a) ZooKeeper returns a 
Watcher.Event.KeeperState.Expired or KeeperException.Code.SESSIONEXPIRED; b) 
Curator closes the internally managed ZooKeeper instance; c) The session 
timeout elapses during a network partition. It is possible to get a RECONNECTED 
state after this but you should still consider any locks, etc. as 
dirty/unstable. NOTE: The meaning of LOST has changed since Curator 3.0.0. 
Prior to 3.0.0 LOST only meant that the retry policy had expired.
{quote}

This means you'll see LOST (equivalent to session expired) if the ZK cluster is 
down (or partitioned) for longer than your session timeout. I don't think this 
is suitable for a high-availability production environment.
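
To make that failure mode easier to spot, here is a rough, non-Aurora sketch of 
a Curator connection-state listener; the point is that with the 3.x semantics 
quoted above, LOST can be driven purely by Curator's client-side timer during a 
partition, without the server ever expiring the session:

{code}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.framework.state.ConnectionStateListener;

// Sketch only: log what Curator reports so the SUSPENDED vs. LOST distinction
// is visible in scheduler logs.
public final class StateLogger implements ConnectionStateListener {
  @Override
  public void stateChanged(CuratorFramework client, ConnectionState newState) {
    switch (newState) {
      case SUSPENDED:
        System.out.println("Connection lost; the session may still be alive on the server.");
        break;
      case LOST:
        System.out.println("Curator considers the session expired (possibly client-side only).");
        break;
      case RECONNECTED:
        System.out.println("Reconnected; state acquired earlier should be treated as suspect.");
        break;
      default:
        System.out.println("Connection state: " + newState);
    }
  }
}

// Registration (illustrative):
// client.getConnectionStateListenable().addListener(new StateLogger());
{code}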


[jira] [Commented] (AURORA-1840) Issue with Curator-backed discovery under heavy load

2016-12-01 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15712240#comment-15712240
 ] 

Zameer Manji commented on AURORA-1840:
--

+1

This seems identical to the behaviour of the previous implementation.



[jira] [Commented] (AURORA-1840) Issue with Curator-backed discovery under heavy load

2016-12-01 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15712224#comment-15712224
 ] 

Zameer Manji commented on AURORA-1840:
--

I don't object to reverting this until some analysis can be done.



[jira] [Commented] (AURORA-1840) Issue with Curator-backed discovery under heavy load

2016-12-01 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15712221#comment-15712221
 ] 

Stephan Erb commented on AURORA-1840:
-

Reverting as a tactical solution until our Curator implementation has been 
improved sounds good to me.

Regarding Curator: 
http://curator.apache.org/curator-recipes/leader-election.html seems more 
suitable for our use case. It is callback-based and offers more control:
{quote}
The LeaderSelectorListener class extends ConnectionStateListener. When the 
LeaderSelector is started, it adds the listener to the Curator instance. Users 
of the LeaderSelector must pay attention to any connection state changes. If an 
instance becomes the leader, it should respond to notification of being 
SUSPENDED or LOST. If the SUSPENDED state is reported, the instance must assume 
that it might no longer be the leader until it receives a RECONNECTED state. If 
the LOST state is reported, the instance is no longer the leader and its 
takeLeadership method should exit.

IMPORTANT: The recommended action for receiving SUSPENDED or LOST is to throw 
CancelLeadershipException. This will cause the LeaderSelector instance to 
attempt to interrupt and cancel the thread that is executing the takeLeadership 
method. Because this is so important, you should consider extending 
LeaderSelectorListenerAdapter. LeaderSelectorListenerAdapter has the 
recommended handling already written for you.
{quote}
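
For reference, a bare-bones sketch of that recipe (the ZK path and the 
leadership body are placeholders, not a proposal for Aurora's actual 
implementation):

{code}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.leader.LeaderSelector;
import org.apache.curator.framework.recipes.leader.LeaderSelectorListenerAdapter;

// Sketch only: LeaderSelector is callback-based. takeLeadership() runs only
// while this instance holds leadership; returning from it (or being interrupted
// via CancelLeadershipException from the adapter's stateChanged()) relinquishes
// leadership.
public final class LeaderSelectorExample extends LeaderSelectorListenerAdapter {
  @Override
  public void takeLeadership(CuratorFramework client) throws Exception {
    // Placeholder leadership body: hold leadership until this thread is
    // interrupted, which the adapter triggers on SUSPENDED or LOST.
    Thread.currentThread().join();
  }

  /** Wires the listener into an already-started CuratorFramework client. */
  public static LeaderSelector startElection(CuratorFramework client) {
    LeaderSelector selector =
        new LeaderSelector(client, "/aurora/scheduler/leader", new LeaderSelectorExample());
    selector.autoRequeue();  // re-enter the election after relinquishing leadership
    selector.start();
    return selector;
  }
}
{code}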



[jira] [Commented] (AURORA-1840) Issue with Curator-backed discovery under heavy load

2016-12-01 Thread David McLaughlin (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15711556#comment-15711556
 ] 

David McLaughlin commented on AURORA-1840:
--

I figured this out. The old zk commons approach was a session-based 
SingletonService. 

The new Curator-based recipe is built on Curator's LeaderLatch, which is 
explicitly a connection-based leadership concept:

From http://curator.apache.org/curator-recipes/leader-latch.html

{quote}LeaderLatch instances add a *ConnectionStateListener* to watch for 
connection problems. If SUSPENDED or LOST is reported, the LeaderLatch that is 
the leader will report that it is no longer the leader (i.e. *there will not be 
a leader until the connection is re-established*). If a LOST connection is 
RECONNECTED, the LeaderLatch will delete its previous ZNode and create a new 
one.{quote}

This is a really terrible idea for a Scheduler that can take minutes to fail 
over. We'll need to revert the above commits until someone can come up with a 
better Curator-backed recipe. 
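
For context, a bare-bones sketch of how the LeaderLatch recipe is typically 
wired up (the path and id are placeholders); note that notLeader() fires on 
SUSPENDED or LOST, which is exactly the connection-based behaviour quoted above:

{code}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.recipes.leader.LeaderLatchListener;

// Sketch only: LeaderLatch ties leadership to the connection. A SUSPENDED or
// LOST connection state makes the current leader report notLeader(), even if
// the underlying ZK session later turns out to have survived.
public final class LeaderLatchExample {
  public static LeaderLatch startLatch(CuratorFramework client, String hostId) throws Exception {
    LeaderLatch latch = new LeaderLatch(client, "/aurora/scheduler/leader", hostId);
    latch.addListener(new LeaderLatchListener() {
      @Override
      public void isLeader() {
        System.out.println(hostId + " acquired leadership");
      }

      @Override
      public void notLeader() {
        // Also fires on connection SUSPENDED/LOST, not just on losing the latch node.
        System.out.println(hostId + " lost leadership");
      }
    });
    latch.start();
    return latch;
  }
}
{code}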



[jira] [Commented] (AURORA-1840) Issue with Curator-backed discovery under heavy load

2016-11-30 Thread David McLaughlin (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15711180#comment-15711180
 ] 

David McLaughlin commented on AURORA-1840:
--

Would anyone object to reverting the two commits above on master until we can 
root-cause this? It will be a no-op for anyone using the default settings who 
isn't having issues with Curator. cc [~jsirois] [~zmanji] 


> Issue with Curator-backed discovery under heavy load
> 
>
> Key: AURORA-1840
> URL: https://issues.apache.org/jira/browse/AURORA-1840
> Project: Aurora
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: David McLaughlin
>Priority: Blocker
>
> We've been having some performance issues recently with our production 
> clusters at Twitter. A side-effect of these is occasional stop-the-world GC 
> pauses for up to 15 seconds. This has been happening at our scale for quite 
> some time, but previous versions of the Scheduler were resilient to this and 
> no leadership change would occur. 
> Since we moved to Curator, we are no longer resilient to these GC pauses. The 
> Scheduler is now failing over any time we see a GC pause, even though these 
> pauses are within the session timeout. Here is an example pause in the 
> scheduler logs with the associated ZK session timeout that leads to a 
> failover:
> {code}
> I1118 19:40:16.871801 51800 sched.cpp:1025] Scheduler::statusUpdate took 
> 586236ns
> I1118 19:40:16.902 [TaskGroupBatchWorker, StateMachine$Builder:389] 
> redacted-9f565b4-067e-422f-b641-c6000f9ae2c8 state machine transition PENDING 
> -> ASSIGNED 
> I1118 19:40:16.903 [TaskGroupBatchWorker, TaskStateMachine:474] Adding work 
> command SAVE_STATE for redacted-0-49f565b4-067e-422f-b641-c6000f9ae2c8 
> I1118 19:40:16.903 [TaskGroupBatchWorker, TaskAssigner$TaskAssignerImpl:130] 
> Offer on agent redacted (id 566ae347-c1b6-44ce-8551-b7a6cda72989-S7579) is 
> being assigned task redacted-0-49f565b4-067e-422f-b641-c6000f9ae2c8. 
> W1118 19:40:31.744 [main-SendThread(redacted:2181), 
> ClientCnxn$SendThread:1108] Client session timed out, have not heard from 
> server in 20743ms for sessionid 0x6584fd2b34ede86 
> {code}
> As you can see from the timestamps, there was a 15s GC pause (confirmed in 
> our GC logs - a CMS promotion failure caused the pause) and this somehow 
> triggers a session timeout of 20s to fire. Note: we have seen GC pauses as 
> little as 7s cause the same behavior. 
> That's problem one. The second problem is that we are using the 
> 'zk_session_timeout' flag and have the value set to 30s. This does not seem 
> to be passed to the Curator client as it is getting a value of 20s from 
> somewhere. 
> We have confirmed that the Scheduler no longer fails over when deploying from 
> HEAD with these two commits reverted and setting zk_use_curator to false:
> https://github.com/apache/aurora/commit/b417be38fe1fcae6b85f7e91cea961ab272adf3f
> https://github.com/apache/aurora/commit/69cba786efc2628eab566201dfea46836a1d9af5
> This is a pretty big blocker for us given how expensive Scheduler failovers 
> are (currently several minutes for us). 


