[jira] [Commented] (MESOS-9363) Improve task exec to return correct exit status

2018-11-02 Thread Kevin Klues (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673841#comment-16673841
 ] 

Kevin Klues commented on MESOS-9363:


{noformat}
commit 25beea12f9f12143e6df7b0ad2d272d4116c217c
Author: Kevin Klues 
Date:   Fri Nov 2 19:55:44 2018 -0400

Updated new CLI task attach/exec exit strategy.

This code was pulled directly from:
https://github.com/dcos/dcos-core-cli/blob/
9d10e9d6fb2b16e46b58d67b7e9d79b2505f3451/
python/lib/dcos/dcos/mesos.py

Review: https://reviews.apache.org/r/69208/
{noformat}

> Improve task exec to return correct exit status
> ---
>
> Key: MESOS-9363
> URL: https://issues.apache.org/jira/browse/MESOS-9363
> Project: Mesos
>  Issue Type: Task
>  Components: cli
>Reporter: Armand Grillet
>Assignee: Armand Grillet
>Priority: Major
>
> Whatever the exit status of the executed command, {{mesos task exec}} always 
> returns 0. We need to fix that so it returns the correct status code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9363) Improve task exec to return correct exit status

2018-11-02 Thread Kevin Klues (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673823#comment-16673823
 ] 

Kevin Klues commented on MESOS-9363:


{noformat}
commit 992e3c4efd2f607b60f5e4b5bea7999692a01c0a
Author: Armand Grillet 
Date:   Fri Nov 2 19:42:59 2018 -0400

Updated new CLI to propagate commands exit status properly.

In a later commit, this will be used by the two subcommands 'task
attach' and 'task exec' to return their proper exit statuses.

Review: https://reviews.apache.org/r/69206/
{noformat}
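
A minimal sketch of what propagating an exit status through a CLI entry point can look like. The names {{run_subcommand}} and {{main}} are hypothetical and are not taken from the Mesos CLI source; the only point is that the entry point returns the subcommand's status instead of always returning 0.

```python
def run_subcommand(argv):
    """Hypothetical subcommand: returns the status the CLI should report."""
    if argv and argv[0] == "fail":
        return 3
    return 0


def main(argv):
    """Propagate the subcommand's status instead of always returning 0."""
    status = run_subcommand(argv)
    # Treat a missing status (None) as success, anything else as-is.
    return 0 if status is None else int(status)


# A real entry point would end with: sys.exit(main(sys.argv[1:]))
```

With this shape, a shell caller sees the subcommand's status in {{$?}} rather than a constant 0.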



[jira] [Commented] (MESOS-9343) Add test(s) for `mesos task attach` on task launched with a TTY

2018-11-02 Thread Kevin Klues (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673820#comment-16673820
 ] 

Kevin Klues commented on MESOS-9343:


{noformat}
commit 9441e48338a8c58adc1e88c9fc1804d10f201262
Author: Armand Grillet 
Date:   Fri Nov 2 19:39:06 2018 -0400

Simplified newline handling in 'test_exec()' test for new CLI.

The test was previously using '.strip()' to compare the command stdout
and the result we expect. This check was incorrect because it could
happen that the output ended in a bunch of extra whitespace that we
would then strip off unknowingly. By changing the task command to use
'printf' instead of 'echo' (which artificially inserts an extra newline
in the output), we are able to simplify this assertion and make sure the
output is exactly the same as what we expect.

Review: https://reviews.apache.org/r/69237/
{noformat}
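
The difference described in this commit message can be reproduced outside the test suite. The sketch below is not the actual 'test_exec()' code; the helper {{run}} is hypothetical, and it assumes a POSIX shell with 'echo' and 'printf' available.

```python
import subprocess


def run(cmd):
    """Run a shell command and return its stdout as text."""
    result = subprocess.run(
        cmd, shell=True, stdout=subprocess.PIPE, check=True,
        universal_newlines=True)
    return result.stdout


echo_out = run("echo 'hello'")              # echo appends a newline
printf_out = run("printf 'hello'")          # printf emits exactly its argument
padded_out = run(r"printf 'hello   \n\n'")  # output with trailing whitespace

# A '.strip()' based comparison accepts the padded output as well,
# while an exact comparison only accepts the printf output.
strip_accepts_padded = padded_out.strip() == "hello"
exact_accepts_padded = padded_out == "hello"
```

Here the '.strip()' comparison cannot tell {{padded_out}} from the expected string, which is the weakness the commit removes by comparing the 'printf' output exactly.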

> Add test(s) for `mesos task attach` on task launched with a TTY 
> 
>
> Key: MESOS-9343
> URL: https://issues.apache.org/jira/browse/MESOS-9343
> Project: Mesos
>  Issue Type: Task
>  Components: cli
>Reporter: Armand Grillet
>Assignee: Armand Grillet
>Priority: Major
>
> As a source, we could use the tests in 
> https://github.com/dcos/dcos-core-cli/blob/b930d2004dceb47090004ab658f35cb608bc70e4/python/lib/dcoscli/tests/integrations/test_task.py





[jira] [Commented] (MESOS-9343) Add test(s) for `mesos task attach` on task launched with a TTY

2018-11-02 Thread Kevin Klues (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673818#comment-16673818
 ] 

Kevin Klues commented on MESOS-9343:


{noformat}
commit 6b7b7e6f68ef891febcce2a38077a847288c1c10
Author: Armand Grillet 
Date:   Fri Nov 2 19:17:56 2018 -0400

Refactored 'running_tasks()' call for new CLI tests.

Review: https://reviews.apache.org/r/69207/
{noformat}



[jira] [Commented] (MESOS-9343) Add test(s) for `mesos task attach` on task launched with a TTY

2018-11-02 Thread Kevin Klues (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673806#comment-16673806
 ] 

Kevin Klues commented on MESOS-9343:


{noformat}
commit 6d0cbda19ad8a5453c960e37424a38c4be1924a9
Author: Armand Grillet 
Date:   Fri Nov 2 19:08:24 2018 -0400

Added 'popen_tty()' to test util functions for the new CLI.

This code was pulled directly from:
https://github.com/dcos/dcos-core-cli/blob/
7fd55421939a7782c237e2b8719c0fe2f543acd7/
python/lib/dcoscli/dcoscli/test/common.py

This function will be used by tests requiring a TTY. This will be the
case for tests concerning the 'task attach' subcommand.

Review: https://reviews.apache.org/r/69116/
{noformat}
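
For readers unfamiliar with the technique, here is a rough sketch of starting a process attached to a pseudo-terminal using only the Python standard library. It is not the code pulled from dcos-core-cli; the name {{popen_tty_sketch}} and its return value are assumptions made for illustration, and it assumes a POSIX system where {{pty.openpty()}} is available.

```python
import os
import pty
import subprocess
import sys


def popen_tty_sketch(argv):
    """Start argv with stdin/stdout/stderr attached to a new pseudo-terminal.

    Returns (process, master_fd); the caller reads the child's output
    from master_fd and is responsible for closing it.
    """
    master_fd, slave_fd = pty.openpty()
    proc = subprocess.Popen(
        argv, stdin=slave_fd, stdout=slave_fd, stderr=slave_fd,
        close_fds=True)
    os.close(slave_fd)  # the child holds its own copy of the slave end
    return proc, master_fd


# The child reports whether it believes its stdout is a terminal.
proc, master_fd = popen_tty_sketch(
    [sys.executable, "-c", "import sys; print(sys.stdout.isatty())"])
output = os.read(master_fd, 1024).decode()
proc.wait()
os.close(master_fd)
```

The child sees a terminal on its standard streams, which is the condition tests for 'task attach' need to reproduce.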



[jira] [Created] (MESOS-9369) Avoid blocking `Future::get()` calls

2018-11-02 Thread Chun-Hung Hsiao (JIRA)
Chun-Hung Hsiao created MESOS-9369:
--

 Summary: Avoid blocking `Future::get()` calls
 Key: MESOS-9369
 URL: https://issues.apache.org/jira/browse/MESOS-9369
 Project: Mesos
  Issue Type: Improvement
  Components: libprocess
Reporter: Chun-Hung Hsiao
Assignee: Chun-Hung Hsiao


{{Future::get()}} does a wait if the future is still pending. If this is 
accidentally called in an actor, the actor will be blocked. We should avoid 
calling {{Future::get()}} in the code. The plan would be:
 # Introduce {{Future::value()}}: crash if not READY
 # Make {{Future::operator*}} and {{Future::operator->}} akin to 
{{Future::value()}}
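
To make the difference between the two accessors concrete, here is a toy model in Python. It is not the libprocess API: the class and method names below only mirror the proposal, and raising an exception stands in for the proposed crash.

```python
import threading


class Future:
    """Toy future illustrating a blocking get() versus a fail-fast value()."""

    def __init__(self):
        self._ready = threading.Event()
        self._value = None

    def set(self, value):
        self._value = value
        self._ready.set()

    def is_ready(self):
        return self._ready.is_set()

    def get(self):
        # Blocks the calling thread for as long as the future is pending.
        self._ready.wait()
        return self._value

    def value(self):
        # Fails immediately instead of blocking when the future is not ready.
        if not self.is_ready():
            raise RuntimeError("value() called on a future that is not READY")
        return self._value
```

Calling {{value()}} on a pending future surfaces the mistake at the call site, whereas {{get()}} would silently stall whichever thread (or actor) made the call.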





[jira] [Commented] (MESOS-9258) Consider making Mesos subscribers send heartbeats

2018-11-02 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673558#comment-16673558
 ] 

Joseph Wu commented on MESOS-9258:
--

After some more investigation, requiring two-way streaming will not work for 
browsers (i.e. the WebUI) because two-way streaming requires websockets.  And 
the load balancers that do not close connections (i.e. Elastic LB) do not 
support websockets.

Now, we are considering two other workarounds:
1) Creating a separate {{HEARTBEAT}} API call and having the {{SUBSCRIBE}} 
return a stream ID.  This has the downside of requiring (sometimes) significant 
client-side changes as they would need to parse an additional message type, 
maintain state, and keep a separate thread for heartbeating.  This might also 
be harder to justify in a backport (if necessary).
2) Adding an optional field to the {{SUBSCRIBE}} call which lets the client set 
the maximum lifetime of a connection.  The master would unilaterally close the 
connection after the specified duration.  This change would require the client 
to have retry/reconnect logic (which would be expected anyway).
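
The retry/reconnect logic that option 2 expects from clients could look roughly like the sketch below. Everything in it is hypothetical: {{ConnectionClosed}}, {{consume}}, and the {{subscribe}} callable are stand-ins for whatever transport a client uses, and nothing here is part of the Mesos API.

```python
class ConnectionClosed(Exception):
    """Raised by the hypothetical transport when the master closes the stream."""


def consume(subscribe, handle, max_reconnects):
    """Read events until the stream closes, resubscribing when it is cut.

    subscribe: callable returning an iterator of events.
    handle: callable invoked once per event.
    Returns the number of connections that were opened.
    """
    connections = 0
    for _ in range(max_reconnects + 1):
        connections += 1
        try:
            for event in subscribe():
                handle(event)
        except ConnectionClosed:
            continue  # the connection lifetime expired; subscribe again
        else:
            break  # the stream ended cleanly
    return connections
```

A client written this way treats a master-initiated close as routine, which is why the comment notes that such logic would be expected anyway.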

> Consider making Mesos subscribers send heartbeats
> -
>
> Key: MESOS-9258
> URL: https://issues.apache.org/jira/browse/MESOS-9258
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Gastón Kleiman
>Assignee: Joseph Wu
>Priority: Critical
>  Labels: mesosphere
>
> Some reverse proxies (e.g., ELB using an HTTP listener) won't close the 
> upstream connection to Mesos when they detect that their client is 
> disconnected.
> This can make Mesos leak subscribers, which generates unnecessary 
> authorization requests and affects performance.
> We should evaluate methods (e.g., heartbeats) to enable Mesos to detect that 
> a subscriber is gone, even if the TCP connection is still open.





[jira] [Assigned] (MESOS-9368) The agent can be resending status updates too aggressively and the backoff is not configurable

2018-11-02 Thread Xudong Ni (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xudong Ni reassigned MESOS-9368:


Assignee: Xudong Ni

> The agent can be resending status updates too aggressively and the backoff is 
> not configurable
> --
>
> Key: MESOS-9368
> URL: https://issues.apache.org/jira/browse/MESOS-9368
> Project: Mesos
>  Issue Type: Bug
>Reporter: Yan Xu
>Assignee: Xudong Ni
>Priority: Major
>
> Currently the agent queues status updates in a "stream" that is retried with 
> an exponential backoff window from 10 secs to 10 mins. On each retry only the 
> front of the queue is sent, so if multiple statuses are queued up, subsequent 
> ones are not attempted until the first one is acked. So if the frameworks are 
> for some reason not able to ack at all, there is one update per task in 
> flight at a time.
> If a cluster has 500,000 tasks with pending status updates and the master 
> fails over, each agent starts to send these updates once it has reregistered, 
> so we are looking at 500,000 updates ~immediately + 500,000 updates 10 secs 
> later + 500,000 updates 20, 40, 80, 160, 320, 600 secs later.
> Given that the initial communication of task state is covered by the agent 
> reregistration message and the framework reconciliation requests, it seems 
> that we can safely reduce the retry frequency further, optionally of course. 
> The backoff is not currently configurable, so we need to expose a flag for it.





[jira] [Commented] (MESOS-9368) The agent can be resending status updates too aggressively and the backoff is not configurable

2018-11-02 Thread Yan Xu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673503#comment-16673503
 ] 

Yan Xu commented on MESOS-9368:
---

cc [~fiu]



[jira] [Commented] (MESOS-9368) The agent can be resending status updates too aggressively and the backoff is not configurable

2018-11-02 Thread Yan Xu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673497#comment-16673497
 ] 

Yan Xu commented on MESOS-9368:
---

[~ipronin] [~jasonlai] do you guys feel similarly for your environments?



[jira] [Created] (MESOS-9368) The agent can be resending status updates too aggressively and the backoff is not configurable

2018-11-02 Thread Yan Xu (JIRA)
Yan Xu created MESOS-9368:
-

 Summary: The agent can be resending status updates too 
aggressively and the backoff is not configurable
 Key: MESOS-9368
 URL: https://issues.apache.org/jira/browse/MESOS-9368
 Project: Mesos
  Issue Type: Bug
Reporter: Yan Xu


Currently the agent queues status updates in a "stream" that is retried with an 
exponential backoff window from 10 secs to 10 mins. On each retry only the front 
of the queue is sent, so if multiple statuses are queued up, subsequent ones are 
not attempted until the first one is acked. So if the frameworks are for some 
reason not able to ack at all, there is one update per task in flight at a time.

If a cluster has 500,000 tasks with pending status updates and the master fails 
over, each agent starts to send these updates once it has reregistered, so we 
are looking at 500,000 updates ~immediately + 500,000 updates 10 secs later + 
500,000 updates 20, 40, 80, 160, 320, 600 secs later.

Given that the initial communication of task state is covered by the agent 
reregistration message and the framework reconciliation requests, it seems that 
we can safely reduce the retry frequency further, optionally of course. The 
backoff is not currently configurable, so we need to expose a flag for it.
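
The retry sequence quoted above can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes the delay doubles after each unacked attempt and is capped at the 10 min maximum, and the function name and parameters are hypothetical rather than taken from the agent code.

```python
def retry_delays(initial=10, maximum=600):
    """Delays in seconds between attempts: doubling from initial, capped at maximum."""
    delays = []
    delay = initial
    while delay < maximum:
        delays.append(delay)
        delay *= 2
    delays.append(maximum)
    return delays


delays = retry_delays()

# One immediate send plus one resend per delay, for every pending task,
# assuming no update is ever acknowledged.
pending_tasks = 500000
total_sends = pending_tasks * (1 + len(delays))
```

Under these assumptions the delays come out as 10, 20, 40, 80, 160, 320, 600 secs, and 500,000 unacked tasks produce 4,000,000 update messages before the backoff settles at its maximum.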


