[jira] [Commented] (MESOS-9363) Improve task exec to return correct exit status
[ https://issues.apache.org/jira/browse/MESOS-9363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673841#comment-16673841 ]

Kevin Klues commented on MESOS-9363:

{noformat}
commit 25beea12f9f12143e6df7b0ad2d272d4116c217c
Author: Kevin Klues
Date: Fri Nov 2 19:55:44 2018 -0400

Updated new CLI task attach/exec exit strategy.

This code was pulled directly from:
https://github.com/dcos/dcos-core-cli/blob/
9d10e9d6fb2b16e46b58d67b7e9d79b2505f3451/
python/lib/dcos/dcos/mesos.py

Review: https://reviews.apache.org/r/69208/
{noformat}

> Improve task exec to return correct exit status
> ---
>
> Key: MESOS-9363
> URL: https://issues.apache.org/jira/browse/MESOS-9363
> Project: Mesos
> Issue Type: Task
> Components: cli
> Reporter: Armand Grillet
> Assignee: Armand Grillet
> Priority: Major
>
> Whatever the exit, {{mesos task exec}} always returns 0. We need to fix that
> to return the correct status code.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9363) Improve task exec to return correct exit status
[ https://issues.apache.org/jira/browse/MESOS-9363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673823#comment-16673823 ]

Kevin Klues commented on MESOS-9363:

{noformat}
commit 992e3c4efd2f607b60f5e4b5bea7999692a01c0a
Author: Armand Grillet
Date: Fri Nov 2 19:42:59 2018 -0400

Updated new CLI to propagate commands exit status properly.

In a later commit, this will be used by the two subcommands
'task attach' and 'task exec' to return their proper exit statuses.

Review: https://reviews.apache.org/r/69206/
{noformat}

> Improve task exec to return correct exit status
> ---
>
> Key: MESOS-9363
> URL: https://issues.apache.org/jira/browse/MESOS-9363
> Project: Mesos
> Issue Type: Task
> Components: cli
> Reporter: Armand Grillet
> Assignee: Armand Grillet
> Priority: Major
>
> Whatever the exit, {{mesos task exec}} always returns 0. We need to fix that
> to return the correct status code.
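The idea behind this commit, that a subcommand's status should reach the shell instead of an unconditional 0, can be illustrated with a minimal Python sketch. This is not the code from the review: `run_subcommand`, `fake_task_exec`, and the `--fail` flag are hypothetical names invented for the illustration.

```python
import sys


def run_subcommand(handler, argv):
    """Run a subcommand handler and normalise its result to an exit status.

    `handler` is a hypothetical callable that returns an int status,
    or None when it has nothing to report (treated as success).
    """
    result = handler(argv)
    return 0 if result is None else int(result)


def fake_task_exec(argv):
    """Hypothetical stand-in for a 'task exec' handler.

    Pretends the remote command exited with status 3 when '--fail' is given.
    """
    return 3 if "--fail" in argv else None


def main(argv):
    # Hand the handler's status to the shell instead of always exiting 0.
    sys.exit(run_subcommand(fake_task_exec, argv))
```

Calling `main(["--fail"])` raises `SystemExit` with code 3, so a calling shell script would see a non-zero `$?`.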
[jira] [Commented] (MESOS-9343) Add test(s) for `mesos task attach` on task launched with a TTY
[ https://issues.apache.org/jira/browse/MESOS-9343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673820#comment-16673820 ]

Kevin Klues commented on MESOS-9343:

{noformat}
commit 9441e48338a8c58adc1e88c9fc1804d10f201262
Author: Armand Grillet
Date: Fri Nov 2 19:39:06 2018 -0400

Simplified newline handling in 'test_exec()' test for new CLI.

The test was previously using '.strip()' to compare the command
stdout and the result we expect. This check was incorrect because
it could happen that the output ended in a bunch of extra whitespace
that we would then strip off unknowingly.

By replacing the task command to use 'printf' instead of 'echo'
(which artificially inserts an extra newline in the output), we are
able to simplify this assertion and make sure the output is exactly
the same as what we expect.

Review: https://reviews.apache.org/r/69237/
{noformat}

> Add test(s) for `mesos task attach` on task launched with a TTY
>
>
> Key: MESOS-9343
> URL: https://issues.apache.org/jira/browse/MESOS-9343
> Project: Mesos
> Issue Type: Task
> Components: cli
> Reporter: Armand Grillet
> Assignee: Armand Grillet
> Priority: Major
>
> As a source, we could use the tests in
> https://github.com/dcos/dcos-core-cli/blob/b930d2004dceb47090004ab658f35cb608bc70e4/python/lib/dcoscli/tests/integrations/test_task.py
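The reasoning in this commit message can be shown with plain strings. The sketch below does not run the real test; the three output values are assumed stand-ins for what `echo hello`, `printf hello`, and a misbehaving command would write to stdout.

```python
expected = "hello"

# What `echo hello` writes: the text plus a trailing newline.
echo_output = "hello\n"
# What `printf hello` writes: exactly the text.
printf_output = "hello"
# A wrong output that carries stray trailing whitespace.
noisy_output = "hello  \n\n"

# Stripping makes the echo case pass, but it also hides the stray whitespace.
assert echo_output.strip() == expected
assert noisy_output.strip() == expected  # passes although the output is wrong

# An exact comparison against printf-style output catches the difference.
assert printf_output == expected
assert noisy_output != expected
```

With `printf`, the assertion can be a plain equality check, which is the simplification the commit describes.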
[jira] [Commented] (MESOS-9343) Add test(s) for `mesos task attach` on task launched with a TTY
[ https://issues.apache.org/jira/browse/MESOS-9343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673818#comment-16673818 ]

Kevin Klues commented on MESOS-9343:

{noformat}
commit 6b7b7e6f68ef891febcce2a38077a847288c1c10
Author: Armand Grillet
Date: Fri Nov 2 19:17:56 2018 -0400

Refactored 'running_tasks()' call for new CLI tests.

Review: https://reviews.apache.org/r/69207/
{noformat}

> Add test(s) for `mesos task attach` on task launched with a TTY
>
>
> Key: MESOS-9343
> URL: https://issues.apache.org/jira/browse/MESOS-9343
> Project: Mesos
> Issue Type: Task
> Components: cli
> Reporter: Armand Grillet
> Assignee: Armand Grillet
> Priority: Major
>
> As a source, we could use the tests in
> https://github.com/dcos/dcos-core-cli/blob/b930d2004dceb47090004ab658f35cb608bc70e4/python/lib/dcoscli/tests/integrations/test_task.py
[jira] [Commented] (MESOS-9343) Add test(s) for `mesos task attach` on task launched with a TTY
[ https://issues.apache.org/jira/browse/MESOS-9343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673806#comment-16673806 ]

Kevin Klues commented on MESOS-9343:

{noformat}
commit 6d0cbda19ad8a5453c960e37424a38c4be1924a9
Author: Armand Grillet
Date: Fri Nov 2 19:08:24 2018 -0400

Added 'popen_tty()' to test util functions for the new CLI.

This code was pulled directly from:
https://github.com/dcos/dcos-core-cli/blob/
7fd55421939a7782c237e2b8719c0fe2f543acd7/
python/lib/dcoscli/dcoscli/test/common.py

This function will be used by tests requiring a TTY. This will be
the case for tests concerning the 'task attach' subcommand.

Review: https://reviews.apache.org/r/69116/
{noformat}

> Add test(s) for `mesos task attach` on task launched with a TTY
>
>
> Key: MESOS-9343
> URL: https://issues.apache.org/jira/browse/MESOS-9343
> Project: Mesos
> Issue Type: Task
> Components: cli
> Reporter: Armand Grillet
> Assignee: Armand Grillet
> Priority: Major
>
> As a source, we could use the tests in
> https://github.com/dcos/dcos-core-cli/blob/b930d2004dceb47090004ab658f35cb608bc70e4/python/lib/dcoscli/tests/integrations/test_task.py
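For readers unfamiliar with why tests need a TTY helper, here is a rough sketch of running a child process attached to a pseudo-terminal using only the Python standard library. It is not the `popen_tty()` from the commit or from dcos-core-cli; `run_with_tty` is a hypothetical name, and the sketch assumes a POSIX system where `pty.openpty()` is available.

```python
import os
import pty
import subprocess
import sys


def run_with_tty(argv):
    """Run argv with stdin/stdout/stderr attached to a pseudo-terminal.

    Returns (exit_status, output_bytes). Hypothetical helper, POSIX only.
    """
    master_fd, slave_fd = pty.openpty()
    try:
        proc = subprocess.Popen(
            argv, stdin=slave_fd, stdout=slave_fd, stderr=slave_fd,
            close_fds=True)
    finally:
        # The child holds its own copy of the slave end.
        os.close(slave_fd)

    chunks = []
    try:
        while True:
            try:
                data = os.read(master_fd, 1024)
            except OSError:
                # On Linux, reading the master after the child exits raises EIO.
                break
            if not data:
                break
            chunks.append(data)
    finally:
        os.close(master_fd)

    return proc.wait(), b"".join(chunks)
```

A child started this way sees a terminal on its standard streams, so `run_with_tty([sys.executable, "-c", "import sys; print(sys.stdout.isatty())"])` should report that stdout is a TTY, which a plain pipe would not.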
[jira] [Created] (MESOS-9369) Avoid blocking `Future::get()` calls
Chun-Hung Hsiao created MESOS-9369:
--

Summary: Avoid blocking `Future::get()` calls
Key: MESOS-9369
URL: https://issues.apache.org/jira/browse/MESOS-9369
Project: Mesos
Issue Type: Improvement
Components: libprocess
Reporter: Chun-Hung Hsiao
Assignee: Chun-Hung Hsiao

{{Future::get()}} waits if the future is still pending. If it is accidentally called in an actor, the actor will be blocked. We should avoid calling {{Future::get()}} in the code. The plan would be:
# Introduce {{Future::value()}}, which crashes if the future is not READY
# Make {{Future::operator*}} and {{Future::operator->}} akin to {{Future::value()}}
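The difference between a blocking `get()` and a fail-fast `value()` can be sketched with a toy future in Python. This is only an analogy: libprocess is C++, and `SketchFuture` and its methods are hypothetical names that mirror the plan above rather than any existing API.

```python
import threading


class SketchFuture:
    """Toy future used only to contrast blocking and non-blocking access."""

    def __init__(self):
        self._ready = threading.Event()
        self._value = None

    def set(self, value):
        self._value = value
        self._ready.set()

    def get(self):
        # Blocks the calling thread until the future is set.
        self._ready.wait()
        return self._value

    def value(self):
        # Never blocks: fails loudly if the future is still pending.
        if not self._ready.is_set():
            raise RuntimeError("future is not READY")
        return self._value
```

Calling `value()` on a pending `SketchFuture` raises immediately, whereas `get()` would park the caller, which is the hazard for an actor that must keep processing its queue.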
[jira] [Commented] (MESOS-9258) Consider making Mesos subscribers send heartbeats
[ https://issues.apache.org/jira/browse/MESOS-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673558#comment-16673558 ]

Joseph Wu commented on MESOS-9258:
--

After some more investigation, requiring two-way streaming will not work for browsers (i.e. the WebUI) because two-way streaming requires websockets, and the load balancers that do not close connections (i.e. Elastic LB) do not support websockets.

Now, we are considering two other workarounds:

1) Creating a separate {{HEARTBEAT}} API call and having the {{SUBSCRIBE}} call return a stream ID. This has the downside of requiring (sometimes) significant client-side changes, as clients would need to parse an additional message type, maintain state, and keep a separate thread for heartbeating. This might also be harder to justify in a backport (if necessary).

2) Adding an optional field to the {{SUBSCRIBE}} call which lets the client set the maximum lifetime of a connection. The master would unilaterally close the connection after the specified duration. This change would require the client to have retry/reconnect logic (which would be expected anyway).

> Consider making Mesos subscribers send heartbeats
> -
>
> Key: MESOS-9258
> URL: https://issues.apache.org/jira/browse/MESOS-9258
> Project: Mesos
> Issue Type: Improvement
> Components: HTTP API
> Reporter: Gastón Kleiman
> Assignee: Joseph Wu
> Priority: Critical
> Labels: mesosphere
>
> Some reverse proxies (e.g., ELB using an HTTP listener) won't close the
> upstream connection to Mesos when they detect that their client is
> disconnected.
> This can make Mesos leak subscribers, which generates unnecessary
> authorization requests and affects performance.
> We should evaluate methods (e.g., heartbeats) to enable Mesos to detect that
> a subscriber is gone, even if the TCP connection is still open.
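Workaround 2 amounts to a lifetime check on the master side. A minimal Python sketch of that check follows, assuming the master records when each subscriber connected. The `Subscription` record, its field names, and `should_close` are hypothetical and do not correspond to any existing Mesos API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Subscription:
    """Hypothetical record the master might keep per subscriber."""
    connected_at: float                  # seconds, from a monotonic clock
    max_lifetime_secs: Optional[float]   # None: the client set no limit


def should_close(sub, now):
    """Return True once the connection has outlived the requested lifetime."""
    if sub.max_lifetime_secs is None:
        return False
    return now - sub.connected_at >= sub.max_lifetime_secs
```

A subscriber that leaves the optional field unset is never closed by this check, which keeps the change backward compatible, while clients that set it must be prepared to reconnect.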
[jira] [Assigned] (MESOS-9368) The agent can be resending status updates too aggressively and the backoff is not configurable
[ https://issues.apache.org/jira/browse/MESOS-9368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xudong Ni reassigned MESOS-9368:

Assignee: Xudong Ni

> The agent can be resending status updates too aggressively and the backoff is
> not configurable
> --
>
> Key: MESOS-9368
> URL: https://issues.apache.org/jira/browse/MESOS-9368
> Project: Mesos
> Issue Type: Bug
> Reporter: Yan Xu
> Assignee: Xudong Ni
> Priority: Major
>
> The current behavior is that the agent queues status updates in a
> "stream", which has an exponential backoff window from 10 secs to 10 mins.
> On each retry only the front of the queue is sent, so if multiple statuses
> are queued up, subsequent ones are not attempted until the first one is
> acked. So if the frameworks are for some reason not able to ack at all,
> there is one update per task in flight at a time.
> If in a cluster we have 500,000 tasks with pending status updates and the
> master fails over, each agent starts to send these updates after it is
> reregistered, so we are looking at 500,000 updates ~immediately + 500,000
> updates 10 secs later + 500,000 updates 20, 40, 80, 160, 320, 600 secs later.
> Given that the initial communication of task state is covered by the agent
> reregistration message and the framework reconciliation requests, it seems
> that we can safely reduce the retry frequency further, optionally of course.
> It's not currently configurable so we need to expose a flag for it.
[jira] [Commented] (MESOS-9368) The agent can be resending status updates too aggressively and the backoff is not configurable
[ https://issues.apache.org/jira/browse/MESOS-9368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673503#comment-16673503 ]

Yan Xu commented on MESOS-9368:
---

cc [~fiu]

> The agent can be resending status updates too aggressively and the backoff is
> not configurable
> --
>
> Key: MESOS-9368
> URL: https://issues.apache.org/jira/browse/MESOS-9368
> Project: Mesos
> Issue Type: Bug
> Reporter: Yan Xu
> Priority: Major
>
> The current behavior is that the agent queues status updates in a
> "stream", which has an exponential backoff window from 10 secs to 10 mins.
> On each retry only the front of the queue is sent, so if multiple statuses
> are queued up, subsequent ones are not attempted until the first one is
> acked. So if the frameworks are for some reason not able to ack at all,
> there is one update per task in flight at a time.
> If in a cluster we have 500,000 tasks with pending status updates and the
> master fails over, each agent starts to send these updates after it is
> reregistered, so we are looking at 500,000 updates ~immediately + 500,000
> updates 10 secs later + 500,000 updates 20, 40, 80, 160, 320, 600 secs later.
> Given that the initial communication of task state is covered by the agent
> reregistration message and the framework reconciliation requests, it seems
> that we can safely reduce the retry frequency further, optionally of course.
> It's not currently configurable so we need to expose a flag for it.
[jira] [Commented] (MESOS-9368) The agent can be resending status updates too aggressively and the backoff is not configurable
[ https://issues.apache.org/jira/browse/MESOS-9368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673497#comment-16673497 ]

Yan Xu commented on MESOS-9368:
---

[~ipronin] [~jasonlai] do you guys feel similarly for your environments?

> The agent can be resending status updates too aggressively and the backoff is
> not configurable
> --
>
> Key: MESOS-9368
> URL: https://issues.apache.org/jira/browse/MESOS-9368
> Project: Mesos
> Issue Type: Bug
> Reporter: Yan Xu
> Priority: Major
>
> The current behavior is that the agent queues status updates in a
> "stream", which has an exponential backoff window from 10 secs to 10 mins.
> On each retry only the front of the queue is sent, so if multiple statuses
> are queued up, subsequent ones are not attempted until the first one is
> acked. So if the frameworks are for some reason not able to ack at all,
> there is one update per task in flight at a time.
> If in a cluster we have 500,000 tasks with pending status updates and the
> master fails over, each agent starts to send these updates after it is
> reregistered, so we are looking at 500,000 updates ~immediately + 500,000
> updates 10 secs later + 500,000 updates 20, 40, 80, 160, 320, 600 secs later.
> Given that the initial communication of task state is covered by the agent
> reregistration message and the framework reconciliation requests, it seems
> that we can safely reduce the retry frequency further, optionally of course.
> It's not currently configurable so we need to expose a flag for it.
[jira] [Created] (MESOS-9368) The agent can be resending status updates too aggressively and the backoff is not configurable
Yan Xu created MESOS-9368:
-

Summary: The agent can be resending status updates too aggressively and the backoff is not configurable
Key: MESOS-9368
URL: https://issues.apache.org/jira/browse/MESOS-9368
Project: Mesos
Issue Type: Bug
Reporter: Yan Xu

The current behavior is that the agent queues status updates in a "stream", which has an exponential backoff window from 10 secs to 10 mins. On each retry only the front of the queue is sent, so if multiple statuses are queued up, subsequent ones are not attempted until the first one is acked. So if the frameworks are for some reason not able to ack at all, there is one update per task in flight at a time.

If in a cluster we have 500,000 tasks with pending status updates and the master fails over, each agent starts to send these updates after it is reregistered, so we are looking at 500,000 updates ~immediately + 500,000 updates 10 secs later + 500,000 updates 20, 40, 80, 160, 320, 600 secs later.

Given that the initial communication of task state is covered by the agent reregistration message and the framework reconciliation requests, it seems that we can safely reduce the retry frequency further, optionally of course.

It's not currently configurable so we need to expose a flag for it.
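The arithmetic in the description can be reproduced with a short Python sketch. It assumes the backoff doubles from 10 secs and is capped at 600 secs, as the figures above suggest, that the listed values are the waits between consecutive retries, and that no update is ever acknowledged. The function name and parameters are hypothetical, not agent flags.

```python
def backoff_intervals(initial_secs=10, max_secs=600):
    """Return the successive retry delays until the cap is reached."""
    delays = []
    delay = initial_secs
    while delay < max_secs:
        delays.append(delay)
        delay *= 2
    # Once doubling would exceed the cap, the delay stays at the cap.
    delays.append(max_secs)
    return delays


tasks = 500_000
intervals = backoff_intervals()
# One initial send plus one resend per interval, for every task.
total_sends = tasks * (1 + len(intervals))
print(intervals)       # [10, 20, 40, 80, 160, 320, 600]
print(sum(intervals))  # 1230
print(total_sends)     # 4000000
```

Under these assumptions the master receives about four million update messages within roughly the first 20 minutes after failover, which is the load that a configurable, less aggressive backoff would reduce.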