[jira] [Commented] (MESOS-8248) Expose information about GPU assigned to a task

2017-11-17 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16257840#comment-16257840
 ] 

Benjamin Mahler commented on MESOS-8248:


The suggestion here is to add it to the {{ContainerStatus}}, which should 
surface through the API.

> Expose information about GPU assigned to a task
> ---
>
> Key: MESOS-8248
> URL: https://issues.apache.org/jira/browse/MESOS-8248
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization, gpu
>Reporter: Karthik Anantha Padmanabhan
>  Labels: GPU
>
> As a framework author I'd like information about the GPU that was assigned to 
> a task.
> `nvidia-smi`, for example, provides information such as the GPU UUID, board ID, 
> and minor number. It would be useful to expose this information when a task is 
> assigned to a GPU instance.
> This will make it possible to monitor resource usage for a task on a GPU, which 
> is not possible when



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8248) Expose information about GPU assigned to a task

2017-11-17 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-8248:
---
Issue Type: Improvement  (was: Task)

> Expose information about GPU assigned to a task
> ---
>
> Key: MESOS-8248
> URL: https://issues.apache.org/jira/browse/MESOS-8248
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization, gpu
>Reporter: Karthik Anantha Padmanabhan
>  Labels: GPU
>
> As a framework author I'd like information about the GPU that was assigned to 
> a task.
> `nvidia-smi`, for example, provides information such as the GPU UUID, board ID, 
> and minor number. It would be useful to expose this information when a task is 
> assigned to a GPU instance.
> This will make it possible to monitor resource usage for a task on a GPU, which 
> is not possible when





[jira] [Created] (MESOS-8248) Expose information about assigned GPU

2017-11-17 Thread Karthik Anantha Padmanabhan (JIRA)
Karthik Anantha Padmanabhan created MESOS-8248:
--

 Summary: Expose information about assigned GPU
 Key: MESOS-8248
 URL: https://issues.apache.org/jira/browse/MESOS-8248
 Project: Mesos
  Issue Type: Task
  Components: containerization, gpu
Reporter: Karthik Anantha Padmanabhan


As a framework author I'd like information about the GPU that was assigned to a 
task.

`nvidia-smi`, for example, provides information such as the GPU UUID, board ID, 
and minor number. It would be useful to expose this information when a task is 
assigned to a GPU instance.

This will make it possible to monitor resource usage for a task on a GPU, which 
is not possible when





[jira] [Updated] (MESOS-8248) Expose information about GPU assigned to a task

2017-11-17 Thread Karthik Anantha Padmanabhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Anantha Padmanabhan updated MESOS-8248:
---
Summary: Expose information about GPU assigned to a task  (was: Expose 
information about assigned GPU)

> Expose information about GPU assigned to a task
> ---
>
> Key: MESOS-8248
> URL: https://issues.apache.org/jira/browse/MESOS-8248
> Project: Mesos
>  Issue Type: Task
>  Components: containerization, gpu
>Reporter: Karthik Anantha Padmanabhan
>  Labels: GPU
>
> As a framework author I'd like information about the GPU that was assigned to 
> a task.
> `nvidia-smi`, for example, provides information such as the GPU UUID, board ID, 
> and minor number. It would be useful to expose this information when a task is 
> assigned to a GPU instance.
> This will make it possible to monitor resource usage for a task on a GPU, which 
> is not possible when





[jira] [Assigned] (MESOS-7742) ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky

2017-11-17 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/MESOS-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gastón Kleiman reassigned MESOS-7742:
-

Assignee: (was: Gastón Kleiman)

> ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky
> --
>
> Key: MESOS-7742
> URL: https://issues.apache.org/jira/browse/MESOS-7742
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>  Labels: flaky-test, mesosphere-oncall
> Attachments: AgentAPITest.LaunchNestedContainerSession-badrun.txt, 
> LaunchNestedContainerSessionDisconnected-badrun.txt
>
>
> Observed this on ASF CI and internal Mesosphere CI. Affected tests:
> {noformat}
> AgentAPIStreamingTest.AttachInputToNestedContainerSession
> AgentAPITest.LaunchNestedContainerSession
> AgentAPITest.AttachContainerInputAuthorization/0
> AgentAPITest.LaunchNestedContainerSessionWithTTY/0
> AgentAPITest.LaunchNestedContainerSessionDisconnected/1
> {noformat}
> {code}
> [ RUN  ] 
> ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession/0
> I0629 05:49:33.180673 25301 cluster.cpp:162] Creating default 'local' 
> authorizer
> I0629 05:49:33.182234 25306 master.cpp:436] Master 
> 90ea1640-bdf3-49ba-b78f-b2ba7ea30077 (296af9b598c3) started on 
> 172.17.0.3:45726
> I0629 05:49:33.182289 25306 master.cpp:438] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/a5h5J3/credentials" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/a5h5J3/master" 
> --zk_session_timeout="10secs"
> I0629 05:49:33.182561 25306 master.cpp:488] Master only allowing 
> authenticated frameworks to register
> I0629 05:49:33.182610 25306 master.cpp:502] Master only allowing 
> authenticated agents to register
> I0629 05:49:33.182636 25306 master.cpp:515] Master only allowing 
> authenticated HTTP frameworks to register
> I0629 05:49:33.182656 25306 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/a5h5J3/credentials'
> I0629 05:49:33.182915 25306 master.cpp:560] Using default 'crammd5' 
> authenticator
> I0629 05:49:33.183009 25306 http.cpp:975] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I0629 05:49:33.183151 25306 http.cpp:975] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I0629 05:49:33.183218 25306 http.cpp:975] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I0629 05:49:33.183284 25306 master.cpp:640] Authorization enabled
> I0629 05:49:33.183462 25309 hierarchical.cpp:158] Initialized hierarchical 
> allocator process
> I0629 05:49:33.183504 25309 whitelist_watcher.cpp:77] No whitelist given
> I0629 05:49:33.184311 25308 master.cpp:2161] Elected as the leading master!
> I0629 05:49:33.184341 25308 master.cpp:1700] Recovering from registrar
> I0629 05:49:33.184404 25308 registrar.cpp:345] Recovering registrar
> I0629 05:49:33.184622 25308 registrar.cpp:389] Successfully fetched the 
> registry (0B) in 183040ns
> I0629 05:49:33.184687 25308 registrar.cpp:493] Applied 1 operations in 
> 6441ns; attempting to update the registry
> I0629 05:49:33.184885 25304 registrar.cpp:550] Successfully updated the 
> registry in 147200ns
> I0629 05:49:33.184993 25304 registrar.cpp:422] Successfully recovered 
> registrar
> I0629 05:49:33.185148 25308 master.cpp:1799] Recovered 0 agents from the 
> registry (129B); allowing 10mins for agents to re-register
> I0629 05:49:33.185161 25302 hierarchical.cpp:185] Skipping recovery of 
> hierarchical allocator: nothing to recover
> I0629 05:49:33.186769 25301 containerizer.cpp:221] Using isolation: 
> posix/cpu,posix/mem,filesystem/posix,netw

[jira] [Commented] (MESOS-8185) Tasks can be known to the agent but unknown to the master.

2017-11-17 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16257398#comment-16257398
 ] 

Yan Xu commented on MESOS-8185:
---

I think so. [~ipronin], with MESOS-7215 no tasks will be killed by the master, 
even if the framework is not partition-aware. Will this work?

> Tasks can be known to the agent but unknown to the master.
> --
>
> Key: MESOS-8185
> URL: https://issues.apache.org/jira/browse/MESOS-8185
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>  Labels: reliability
>
> Currently, when the master re-registers an agent that was marked unreachable, 
> it shuts down all non-partition-aware frameworks on that agent. When the master 
> re-registers an agent that is already registered, it doesn't check that all 
> tasks from the agent's re-registration message are known to it.
> It is possible that, due to a transient loss of connectivity, an agent misses 
> the {{SlaveReregisteredMessage}} along with the {{ShutdownFrameworkMessage}} 
> and thus does not kill its non-partition-aware tasks. But the master will mark 
> the agent as registered and will not re-add the tasks that it thought would be 
> killed. The agent may re-register again, this time successfully, before being 
> marked unreachable, while never having terminated the tasks of 
> non-partition-aware frameworks. The master will simply forget those tasks ever 
> existed, because it "removed" them during the previous re-registration.
> Example scenario:
> # The connection from the master to the agent stops working.
> # The agent doesn't see pings from the master and attempts to re-register.
> # The master sends {{SlaveReregisteredMessage}} and 
> {{ShutdownFrameworkMessage}}, which don't reach the agent because of the 
> connection failure. The agent is marked registered.
> # The network issue resolves. The agent retries re-registration.
> # The master thinks the agent has been registered since step (3) and just 
> re-sends {{SlaveReregisteredMessage}}. Tasks remain running on the agent.
> One possible solution would be to compare the list of tasks that the already 
> registered agent reports in {{ReregisterSlaveMessage}} with the list of tasks 
> the master has. In that case, anything that the master doesn't know about 
> should not exist on the agent.





[jira] [Commented] (MESOS-8247) Executor registered message is lost

2017-11-17 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16257291#comment-16257291
 ] 

Andrei Budnik commented on MESOS-8247:
--

Related: https://issues.apache.org/jira/browse/MESOS-3851?

> Executor registered message is lost
> ---
>
> Key: MESOS-8247
> URL: https://issues.apache.org/jira/browse/MESOS-8247
> Project: Mesos
>  Issue Type: Bug
>Reporter: Andrei Budnik
>
> h3. Brief description of successful agent-executor communication.
> The executor sends a `RegisterExecutorMessage` to the agent during its 
> initialization step. The agent responds with an `ExecutorRegisteredMessage` in 
> its `registerExecutor()` method. When the executor receives the 
> `ExecutorRegisteredMessage`, it prints `Executor registered on agent...` to 
> its stderr log.
> h3. Problem description.
> The agent launches the built-in Docker executor, which gets stuck in the 
> `STAGING` state.
> The stderr log of the Docker executor:
> {code}
> I1114 23:03:17.919090 14322 exec.cpp:162] Version: 1.2.3
> {code}
> It doesn't contain a message like `Executor registered on agent...`. At the 
> same time, the agent received the `RegisterExecutorMessage` and sent a 
> `runTask` message to the executor.
> The stdout log consists of the same repeating message:
> {code}
> Received killTask for task ...
> {code}
> Also, the Docker executor process has no child processes.
> Currently, the executor [doesn't 
> attempt|https://github.com/apache/mesos/blob/2a253093ecdc7d743c9c0874d6e01b68f6a813e4/src/exec/exec.cpp#L320]
>  to launch a task if it is not registered with the agent, while [task 
> killing|https://github.com/apache/mesos/blob/2a253093ecdc7d743c9c0874d6e01b68f6a813e4/src/exec/exec.cpp#L343]
>  doesn't have such a check.
> It looks like the `ExecutorRegisteredMessage` has been lost.





[jira] [Created] (MESOS-8247) Executor registered message is lost

2017-11-17 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-8247:


 Summary: Executor registered message is lost
 Key: MESOS-8247
 URL: https://issues.apache.org/jira/browse/MESOS-8247
 Project: Mesos
  Issue Type: Bug
Reporter: Andrei Budnik


h3. Brief description of successful agent-executor communication.
The executor sends a `RegisterExecutorMessage` to the agent during its 
initialization step. The agent responds with an `ExecutorRegisteredMessage` in 
its `registerExecutor()` method. When the executor receives the 
`ExecutorRegisteredMessage`, it prints `Executor registered on agent...` to its 
stderr log.

h3. Problem description.
The agent launches the built-in Docker executor, which gets stuck in the 
`STAGING` state.
The stderr log of the Docker executor:
{code}
I1114 23:03:17.919090 14322 exec.cpp:162] Version: 1.2.3
{code}
It doesn't contain a message like `Executor registered on agent...`. At the 
same time, the agent received the `RegisterExecutorMessage` and sent a `runTask` 
message to the executor.

The stdout log consists of the same repeating message:
{code}
Received killTask for task ...
{code}
Also, the Docker executor process has no child processes.

Currently, the executor [doesn't 
attempt|https://github.com/apache/mesos/blob/2a253093ecdc7d743c9c0874d6e01b68f6a813e4/src/exec/exec.cpp#L320]
 to launch a task if it is not registered with the agent, while [task 
killing|https://github.com/apache/mesos/blob/2a253093ecdc7d743c9c0874d6e01b68f6a813e4/src/exec/exec.cpp#L343]
 doesn't have such a check.

It looks like the `ExecutorRegisteredMessage` has been lost.





[jira] [Commented] (MESOS-7506) Multiple tests leave orphan containers.

2017-11-17 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16256897#comment-16256897
 ] 

Andrei Budnik commented on MESOS-7506:
--

https://reviews.apache.org/r/63887/
https://reviews.apache.org/r/63888/

> Multiple tests leave orphan containers.
> ---
>
> Key: MESOS-7506
> URL: https://issues.apache.org/jira/browse/MESOS-7506
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: containerizer, flaky-test, mesosphere
> Attachments: KillMultipleTasks-badrun.txt, 
> ROOT_IsolatorFlags-badrun.txt, ResourceLimitation-badrun.txt, 
> ResourceLimitation-badrun2.txt, 
> RestartSlaveRequireExecutorAuthentication-badrun.txt, 
> TaskWithFileURI-badrun.txt
>
>
> I've observed a number of flaky tests that leave orphan containers upon 
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}
> All currently affected tests:
> {noformat}
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.TaskWithFileURI/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.ResourceLimitation/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillMultipleTasks/0
> SlaveTest.RestartSlaveRequireExecutorAuthentication
> LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags
> {noformat}





[jira] [Updated] (MESOS-8246) ProvisionerDockerBackendTest.ROOT_INTERNET_CURL_Overwrite/1 is flaky.

2017-11-17 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-8246:
---
Attachment: ROOT_INTERNET_CURL_Overwrite-badrun.txt

> ProvisionerDockerBackendTest.ROOT_INTERNET_CURL_Overwrite/1 is flaky.
> -
>
> Key: MESOS-8246
> URL: https://issues.apache.org/jira/browse/MESOS-8246
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: CentOS 7
>Reporter: Alexander Rukletsov
>  Labels: flaky-test
> Attachments: ROOT_INTERNET_CURL_Overwrite-badrun.txt
>
>
> Observed this today in our CI. The container was not able to start, with the 
> following error message:
> {noformat}
> E1115 20:44:24.533658   746 slave.cpp:5410] Container 
> '0f7a91a5-c200-472f-8b5a-5ae495ada598' for executor 
> 'b1a891f8-9705-47d1-a877-b6b64a430e13' of framework 
> e4d2eda1-ea7a-411a-9f55-4de6ccaa1c9d- failed to start: Collect failed: 
> Failed to perform 'curl': curl: (52) Empty reply from server
> {noformat}
> Full log attached.





[jira] [Updated] (MESOS-8246) ProvisionerDockerBackendTest.ROOT_INTERNET_CURL_Overwrite is flaky.

2017-11-17 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-8246:
---
Summary: ProvisionerDockerBackendTest.ROOT_INTERNET_CURL_Overwrite is 
flaky.  (was: ProvisionerDockerBackendTest.ROOT_INTERNET_CURL_Overwrite/1 is 
flaky.)

> ProvisionerDockerBackendTest.ROOT_INTERNET_CURL_Overwrite is flaky.
> ---
>
> Key: MESOS-8246
> URL: https://issues.apache.org/jira/browse/MESOS-8246
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: CentOS 7
>Reporter: Alexander Rukletsov
>  Labels: flaky-test
> Attachments: ROOT_INTERNET_CURL_Overwrite-badrun.txt
>
>
> Observed this today in our CI. The container was not able to start, with the 
> following error message:
> {noformat}
> E1115 20:44:24.533658   746 slave.cpp:5410] Container 
> '0f7a91a5-c200-472f-8b5a-5ae495ada598' for executor 
> 'b1a891f8-9705-47d1-a877-b6b64a430e13' of framework 
> e4d2eda1-ea7a-411a-9f55-4de6ccaa1c9d- failed to start: Collect failed: 
> Failed to perform 'curl': curl: (52) Empty reply from server
> {noformat}
> Full log attached.





[jira] [Created] (MESOS-8246) ProvisionerDockerBackendTest.ROOT_INTERNET_CURL_Overwrite/1 is flaky.

2017-11-17 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-8246:
--

 Summary: 
ProvisionerDockerBackendTest.ROOT_INTERNET_CURL_Overwrite/1 is flaky.
 Key: MESOS-8246
 URL: https://issues.apache.org/jira/browse/MESOS-8246
 Project: Mesos
  Issue Type: Bug
  Components: test
 Environment: CentOS 7
Reporter: Alexander Rukletsov


Observed this today in our CI. The container was not able to start, with the 
following error message:
{noformat}
E1115 20:44:24.533658   746 slave.cpp:5410] Container 
'0f7a91a5-c200-472f-8b5a-5ae495ada598' for executor 
'b1a891f8-9705-47d1-a877-b6b64a430e13' of framework 
e4d2eda1-ea7a-411a-9f55-4de6ccaa1c9d- failed to start: Collect failed: 
Failed to perform 'curl': curl: (52) Empty reply from server
{noformat}
Full log attached.





[jira] [Updated] (MESOS-8096) Enqueueing events in MockHTTPScheduler can lead to segfaults.

2017-11-17 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-8096:
---
Attachment: scheduler-shutdown-invalid-driver.txt

> Enqueueing events in MockHTTPScheduler can lead to segfaults.
> -
>
> Key: MESOS-8096
> URL: https://issues.apache.org/jira/browse/MESOS-8096
> Project: Mesos
>  Issue Type: Bug
>  Components: scheduler driver, test
> Environment: Fedora 23, Ubuntu 14.04, Ubuntu 16
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>  Labels: flaky-test, mesosphere
> Attachments: AsyncExecutorProcess-badrun-1.txt, 
> AsyncExecutorProcess-badrun-2.txt, AsyncExecutorProcess-badrun-3.txt, 
> scheduler-shutdown-invalid-driver.txt
>
>
> Various tests segfault for an as-yet-unknown reason. Comparing the logs 
> (attached) hints that the problem might be in the scheduler's event queue.





[jira] [Commented] (MESOS-8245) SlaveRecoveryTest/0.ReconnectExecutor is flaky.

2017-11-17 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16256760#comment-16256760
 ] 

Alexander Rukletsov commented on MESOS-8245:


The issue is likely that the executor manages to send {{TASK_RUNNING}} before 
the agent is restarted.

> SlaveRecoveryTest/0.ReconnectExecutor is flaky.
> ---
>
> Key: MESOS-8245
> URL: https://issues.apache.org/jira/browse/MESOS-8245
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: Ubuntu 17.04
>Reporter: Alexander Rukletsov
>Assignee: Benno Evers
>  Labels: flaky-test
> Attachments: ReconnectExecutor-badrun.txt, 
> ReconnectExecutor-goodrun.txt
>
>
> Observed it today in our CI. Logs attached.





[jira] [Updated] (MESOS-8245) SlaveRecoveryTest/0.ReconnectExecutor is flaky.

2017-11-17 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-8245:
---
Attachment: ReconnectExecutor-badrun.txt
ReconnectExecutor-goodrun.txt

> SlaveRecoveryTest/0.ReconnectExecutor is flaky.
> ---
>
> Key: MESOS-8245
> URL: https://issues.apache.org/jira/browse/MESOS-8245
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: Ubuntu 17.04
>Reporter: Alexander Rukletsov
>Assignee: Benno Evers
>  Labels: flaky-test
> Attachments: ReconnectExecutor-badrun.txt, 
> ReconnectExecutor-goodrun.txt
>
>
> Observed it today in our CI. Logs attached.





[jira] [Created] (MESOS-8245) SlaveRecoveryTest/0.ReconnectExecutor is flaky.

2017-11-17 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-8245:
--

 Summary: SlaveRecoveryTest/0.ReconnectExecutor is flaky.
 Key: MESOS-8245
 URL: https://issues.apache.org/jira/browse/MESOS-8245
 Project: Mesos
  Issue Type: Bug
  Components: test
 Environment: Ubuntu 17.04
Reporter: Alexander Rukletsov
Assignee: Benno Evers


Observed it today in our CI. Logs attached.


