[jira] [Updated] (MESOS-7488) Add `--ip6` and `--ip6_discovery_command` flag to Mesos agent

2018-01-10 Thread Qian Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qian Zhang updated MESOS-7488:
--
Description: As a first step to support IPv6 containers on Mesos, we need 
to provide {{--ip6}} and {{--ip6_discovery_command}} flags to the agent so that 
the operator can specify an IPv6 address for the {{libprocess}} actor on the 
agent. In this ticket we do not aim to add IPv6 communication support to 
Mesos; rather, we aim to use the IPv6 address provided by the operator to fill in 
the v6 address for any containers running on the host network in a dual-stack 
environment.  (was: As a first step to support IPv6 containers on Mesos, we 
need to provide `--ip6` and `--ip6_discovery_command` flags to the agent so 
that the operator can  specify an IPv6 address for the `libprocess` actor on 
the agent. In this ticket we will not aim to add IPv6 communication support for 
Mesos but will aim to use the IPv6 address provided by the operator to fill in 
the v6 address for any containers running on the host network in a dual stack 
environment.)
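
A rough sketch of how an operator might use the proposed flags (the address and discovery script below are hypothetical):
{noformat}
# Specify the IPv6 address directly:
mesos-agent --ip6=2001:db8::1 ...

# Or let the agent discover it at startup via a script:
mesos-agent --ip6_discovery_command=/usr/local/bin/detect-ipv6.sh ...
{noformat}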

> Add `--ip6` and `--ip6_discovery_command` flag to Mesos agent
> -
>
> Key: MESOS-7488
> URL: https://issues.apache.org/jira/browse/MESOS-7488
> Project: Mesos
>  Issue Type: Task
>  Components: agent
>Reporter: Avinash Sridharan
>Assignee: Avinash Sridharan
>  Labels: mesosphere
> Fix For: 1.4.0
>
>
> As a first step to support IPv6 containers on Mesos, we need to provide 
> {{--ip6}} and {{--ip6_discovery_command}} flags to the agent so that the 
> operator can specify an IPv6 address for the {{libprocess}} actor on the 
> agent. In this ticket we do not aim to add IPv6 communication support to 
> Mesos; rather, we aim to use the IPv6 address provided by the operator to fill 
> in the v6 address for any containers running on the host network in a 
> dual-stack environment.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-8375) Use protobuf reflection to simplify upgrading of resources.

2018-01-10 Thread Michael Park (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16321431#comment-16321431
 ] 

Michael Park edited comment on MESOS-8375 at 1/11/18 12:20 AM:
---

{noformat}
commit 3685c011bb71da0ba2af75691101ea383eeb2ccd
Author: Michael Park 
Date:   Sat Jan 6 00:48:58 2018 -0800

Replaced `os::read` with `state::read`.

The `state::checkpoint` utility was updated to checkpoint
the automatically downgraded resources. This was done to mitigate
the need to manually invoke `downgradeResources` prior to
checkpointing. `state::read` was introduced to provide symmetric
functionality for `state::checkpoint`. Specifically, it will
upgrade resources upon reading the checkpointed resource state.

This patch updates the previous uses of `os::read` which was used to
read the state that was written by `state::checkpoint(path, string)`.
While there is no functional change, it completes the picture where
`state::read` is used to read state written by `state::checkpoint`.

Review: https://reviews.apache.org/r/65025
{noformat}
{noformat}
commit 80f66061e343fa26dd7e3b9613f0fa8e0b9b4a36
Author: Michael Park 
Date:   Fri Jan 5 18:03:42 2018 -0800

Replaced `protobuf::read` with `state::read`.

The `state::checkpoint` utility was updated to checkpoint
the automatically downgraded resources. This was done to mitigate
the need to manually invoke `downgradeResources` prior to
checkpointing. `state::read` was introduced to provide symmetric
functionality for `state::checkpoint`. Specifically, it will
upgrade resources upon reading the checkpointed resource state.

This patch updates the previous uses of `protobuf::read` accompanied
by calls to `convertResourceFormat`. Rather than reading the protobufs
then upgrading the resources, we now simply call `state::read`, which
performs the reverse operation of `state::checkpoint`.

Review: https://reviews.apache.org/r/65024
{noformat}
{noformat}
commit ef303905e0be96f28d79a569afd004ec75f98296
Author: Michael Park 
Date:   Fri Jan 5 09:52:51 2018 -0800

Added `state::read` to complement `state::checkpoint`.

Review: https://reviews.apache.org/r/65023
{noformat}
{noformat}
commit fda054b50ff7cdd2d7a60d31cfe24ce42bfbfaa5
Author: Michael Park 
Date:   Fri Jan 5 17:44:41 2018 -0800

Updated uses of `protobuf::read(path)` which now returns `Try`.

Since the path version of `protobuf::read` now returns `Try`,
much of the existing code is removed and/or simplified.

Review: https://reviews.apache.org/r/65022
{noformat}
{noformat}
commit 4f9cda17e1a747bc3c4ab3667569304e09600b29
Author: Michael Park 
Date:   Fri Jan 5 16:49:53 2018 -0800

Returned `Try` from `protobuf::read(path)` rather than `Result`.

The path version of `protobuf::read` used to return `Result` and
returned `None` only when the file is empty (`ignorePartial` is always
`false`). The `None` return represents EOF for the "streaming" version
of `protobuf::read` that takes an FD, but for the path version an empty
file when we expected to read `T` is simply an error. Thus, we map the
`None` return to an `Error` for the path version and return a `Try`.

Review: https://reviews.apache.org/r/65021
{noformat}
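
As a rough sketch of the simplification described in the last commit (the message type `RegistryState`, the `use()` helper, and the call sites are illustrative placeholders, not the exact Mesos code), a caller of the path version no longer has to handle a separate empty-file case:
{code}
// Before: the path version returned Result<T>, so callers had to handle
// error, none (empty file), and some separately.
Result<RegistryState> result = protobuf::read<RegistryState>(path);
if (result.isError()) {
  return Error(result.error());
}
if (result.isNone()) {
  return Error("Unexpected empty file '" + path + "'");
}
use(result.get());

// After: the path version returns Try<T>; the empty-file case is folded
// into the error case, so the extra branch disappears.
Try<RegistryState> state = protobuf::read<RegistryState>(path);
if (state.isError()) {
  return Error(state.error());
}
use(state.get());
{code}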


was (Author: mcypark):
{noformat}
commit 3685c011bb71da0ba2af75691101ea383eeb2ccd
Author: Michael Park 
Date:   Sat Jan 6 00:48:58 2018 -0800

Replaced `os::read` with `state::read`.

The `state::checkpoint` utility was updated to checkpoint
the automatically downgraded resources. This was done to mitigate
the need to manually invoke `downgradeResources` prior to
checkpointing. `state::read` was introduced to provide a symmetric
functionality for `state::checkpoint`. Specifically, it will perform
upgrade resources upon reading the checkpointed resources state.

This patch updates the previous uses of `os::read` which was used to
read the state that was written by `state::checkpoint(path, string)`.
While there is no functional change, it completes the picture where
`state::read` is used to read state written by `state::checkpoint`.

Review: https://reviews.apache.org/r/65025
{noformat}
{noformat}
commit 80f66061e343fa26dd7e3b9613f0fa8e0b9b4a36
Author: Michael Park 
Date:   Fri Jan 5 18:03:42 2018 -0800

Replaced `protobuf::read` with `state::read`.

The `state::checkpoint` utility was updated to checkpoint
the automatically downgraded resources. This was done to mitigate
the need to manually invoke `downgradeResources` prior to
checkpointing. `state::read` was introduced to provide a symmetric
functionality for `state::checkpoint`. Specifically, 

[jira] [Commented] (MESOS-8375) Use protobuf reflection to simplify upgrading of resources.

2018-01-10 Thread Michael Park (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16321431#comment-16321431
 ] 

Michael Park commented on MESOS-8375:
-

{noformat}
commit 3685c011bb71da0ba2af75691101ea383eeb2ccd
Author: Michael Park 
Date:   Sat Jan 6 00:48:58 2018 -0800

Replaced `os::read` with `state::read`.

The `state::checkpoint` utility was updated to checkpoint
the automatically downgraded resources. This was done to mitigate
the need to manually invoke `downgradeResources` prior to
checkpointing. `state::read` was introduced to provide symmetric
functionality for `state::checkpoint`. Specifically, it will
upgrade resources upon reading the checkpointed resource state.

This patch updates the previous uses of `os::read` which was used to
read the state that was written by `state::checkpoint(path, string)`.
While there is no functional change, it completes the picture where
`state::read` is used to read state written by `state::checkpoint`.

Review: https://reviews.apache.org/r/65025
{noformat}
{noformat}
commit 80f66061e343fa26dd7e3b9613f0fa8e0b9b4a36
Author: Michael Park 
Date:   Fri Jan 5 18:03:42 2018 -0800

Replaced `protobuf::read` with `state::read`.

The `state::checkpoint` utility was updated to checkpoint
the automatically downgraded resources. This was done to mitigate
the need to manually invoke `downgradeResources` prior to
checkpointing. `state::read` was introduced to provide symmetric
functionality for `state::checkpoint`. Specifically, it will
upgrade resources upon reading the checkpointed resource state.

This patch updates the previous uses of `protobuf::read` accompanied
by calls to `convertResourceFormat`. Rather than reading the protobufs
then upgrading the resources, we now simply call `state::read`, which
performs the reverse operation of `state::checkpoint`.

Review: https://reviews.apache.org/r/65024
{noformat}
{noformat}
commit ef303905e0be96f28d79a569afd004ec75f98296
Author: Michael Park 
Date:   Fri Jan 5 09:52:51 2018 -0800

Added `state::read` to complement `state::checkpoint`.

Review: https://reviews.apache.org/r/65023
{noformat}
{noformat}
commit fda054b50ff7cdd2d7a60d31cfe24ce42bfbfaa5
Author: Michael Park 
Date:   Fri Jan 5 17:44:41 2018 -0800

Updated uses of `protobuf::read(path)` which now returns `Try`.

Since the path version of `protobuf::read` now returns `Try`,
much of the existing code is removed and/or simplified.

Review: https://reviews.apache.org/r/65022
{noformat}
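
A minimal sketch of the symmetry these commits describe; `ResourceState`, `resourceState`, and `path` are illustrative placeholders, and the exact signatures are in the linked reviews:
{code}
// Writing: `state::checkpoint` downgrades resources as needed before
// persisting them.
Try<Nothing> checkpointed = state::checkpoint(path, resourceState);

// Reading: `state::read` performs the reverse, upgrading resources after
// reading them back, so call sites no longer pair `protobuf::read` (or
// `os::read`) with `convertResourceFormat`.
Try<ResourceState> recovered = state::read<ResourceState>(path);
{code}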

> Use protobuf reflection to simplify upgrading of resources.
> ---
>
> Key: MESOS-8375
> URL: https://issues.apache.org/jira/browse/MESOS-8375
> Project: Mesos
>  Issue Type: Task
>Reporter: Michael Park
>Assignee: Michael Park
>Priority: Blocker
>
> This is the {{upgradeResources}} half of the protobuf-reflection-based 
> upgrade/downgrade of resources: 
> https://issues.apache.org/jira/browse/MESOS-8221
> We will also add {{state::read}} to complement {{state::checkpoint}} which 
> will be used to read protobufs from disk rather than {{protobuf::read}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8433) Design doc for unified artifact store

2018-01-10 Thread Qian Zhang (JIRA)
Qian Zhang created MESOS-8433:
-

 Summary: Design doc for unified artifact store
 Key: MESOS-8433
 URL: https://issues.apache.org/jira/browse/MESOS-8433
 Project: Mesos
  Issue Type: Task
Reporter: Qian Zhang


This ticket is for the design of the unified artifact store.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-8433) Design doc for unified artifact store

2018-01-10 Thread Qian Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qian Zhang reassigned MESOS-8433:
-

Assignee: Qian Zhang

> Design doc for unified artifact store
> -
>
> Key: MESOS-8433
> URL: https://issues.apache.org/jira/browse/MESOS-8433
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>
> This ticket is for the design of the unified artifact store.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8432) Introduce a unified artifact store

2018-01-10 Thread Qian Zhang (JIRA)
Qian Zhang created MESOS-8432:
-

 Summary: Introduce a unified artifact store
 Key: MESOS-8432
 URL: https://issues.apache.org/jira/browse/MESOS-8432
 Project: Mesos
  Issue Type: Epic
Reporter: Qian Zhang


Currently in Mesos, there are several separate caching mechanisms in different 
components, e.g., the cache in the fetcher and the image layer cache in the 
image store. We'd like to create a unified artifact store and merge the various 
cache and store instances into it, while providing a consistent interface for 
managing disk space and garbage collecting unused artifacts.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8428) SLRP recovery tests leak file descriptors.

2018-01-10 Thread Chun-Hung Hsiao (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16321341#comment-16321341
 ] 

Chun-Hung Hsiao commented on MESOS-8428:


Reproduced the problem with the following unit test:
https://reviews.apache.org/r/65085/

> SLRP recovery tests leak file descriptors.
> --
>
> Key: MESOS-8428
> URL: https://issues.apache.org/jira/browse/MESOS-8428
> Project: Mesos
>  Issue Type: Bug
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>  Labels: mesosphere, storage
>
> The {{CreateDestroyVolumeRecovery}} (formerly {{NewVolumeRecovery}}) and 
> {{PublishResourcesRecovery}} (formerly {{LaunchTaskRecovery}}) tests leak 
> fds. When running them repeatedly, either the following error will 
> manifest:
> {noformat}
> rocess_posix.hpp:257] CHECK_SOME(pipe): Too many open files
> {noformat}
> or the plugin container will exit, possibly because it runs out of fds.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8428) SLRP recovery tests leak file descriptors.

2018-01-10 Thread Chun-Hung Hsiao (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16321277#comment-16321277
 ] 

Chun-Hung Hsiao commented on MESOS-8428:


The open fds seem to be related to the agent's v1 HTTP API endpoint.

Test without fd leakage:
https://github.com/apache/mesos/blob/master/src/tests/storage_local_resource_provider_tests.cpp#L178
Test with leakage:
https://github.com/apache/mesos/blob/master/src/tests/storage_local_resource_provider_tests.cpp#L355
Suspicious `http::post`: 
https://github.com/apache/mesos/blob/master/src/slave/container_daemon.cpp#L209
which is used by: 
https://github.com/apache/mesos/blob/master/src/resource_provider/storage/provider.cpp#L1796

> SLRP recovery tests leak file descriptors.
> --
>
> Key: MESOS-8428
> URL: https://issues.apache.org/jira/browse/MESOS-8428
> Project: Mesos
>  Issue Type: Bug
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>  Labels: mesosphere, storage
>
> The {{CreateDestroyVolumeRecovery}} (formerly {{NewVolumeRecovery}}) and 
> {{PublishResourcesRecovery}} (formerly {{LaunchTaskRecovery}}) tests leak 
> fds. When running them repeatedly, either the following error will 
> manifest:
> {noformat}
> rocess_posix.hpp:257] CHECK_SOME(pipe): Too many open files
> {noformat}
> or the plugin container will exit, possibly because it runs out of fds.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8431) Non-leading master should reply to some API requests

2018-01-10 Thread Greg Mann (JIRA)
Greg Mann created MESOS-8431:


 Summary: Non-leading master should reply to some API requests
 Key: MESOS-8431
 URL: https://issues.apache.org/jira/browse/MESOS-8431
 Project: Mesos
  Issue Type: Improvement
Reporter: Greg Mann


Currently, non-leading masters forward all v1 API requests to the leader. 
However, responses for some requests like {{GET_FLAGS}}, {{GET_HEALTH}}, 
{{GET_MASTER}}, and {{GET_VERSION}} could be provided by non-leading masters as well.

We should update the master code to reply to such requests directly rather than 
redirecting them.
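
For reference, such a request is just a v1 operator API call against the master's {{/api/v1}} endpoint; a minimal sketch (host and port are placeholders):
{noformat}
POST /api/v1 HTTP/1.1
Host: <master>:5050
Content-Type: application/json
Accept: application/json

{"type": "GET_HEALTH"}
{noformat}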



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8430) Race between operation status updates and agent update

2018-01-10 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-8430:

Affects Version/s: 1.5.0

> Race between operation status updates and agent update
> --
>
> Key: MESOS-8430
> URL: https://issues.apache.org/jira/browse/MESOS-8430
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.5.0
>Reporter: Benjamin Bannier
>
> Currently, there exists a possible race between operation status updates 
> triggered by a status update manager in the agent and updates to the agent's 
> resources.
> Consider a master failover where an agent has a resource provider with an 
> operation which was not terminal. Now let the operation succeed and become 
> terminal in the agent, but have the master fail over before it processes the 
> update. After master failover, the new master would learn about the resource 
> provider resources via an {{UpdateSlaveMessage}}. Simultaneously, a status 
> update manager in the agent could inform the master about the unacknowledged, 
> successful operation. If the operation status update arrives in the master 
> before the {{UpdateSlaveMessage}}, the operation status update handler could 
> attempt to apply the operation to resources not yet known to it. This would 
> likely trigger a {{CHECK}} failure in a contains check in the master.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8430) Race between operation status updates and agent update

2018-01-10 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-8430:

Issue Type: Bug  (was: Task)

> Race between operation status updates and agent update
> --
>
> Key: MESOS-8430
> URL: https://issues.apache.org/jira/browse/MESOS-8430
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.5.0
>Reporter: Benjamin Bannier
>
> Currently, there exists a possible race between operation status updates 
> triggered by a status update manager in the agent and updates to the agent's 
> resources.
> Consider a master failover where an agent has a resource provider with an 
> operation which was not terminal. Now let the operation succeed and become 
> terminal in the agent, but have the master fail over before it processes the 
> update. After master failover, the new master would learn about the resource 
> provider resources via an {{UpdateSlaveMessage}}. Simultaneously, a status 
> update manager in the agent could inform the master about the unacknowledged, 
> successful operation. If the operation status update arrives in the master 
> before the {{UpdateSlaveMessage}}, the operation status update handler could 
> attempt to apply the operation to resources not yet known to it. This would 
> likely trigger a {{CHECK}} failure in a contains check in the master.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8430) Race between operation status updates and agent update

2018-01-10 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-8430:

Description: 
Currently, there exists a possible race between operation status updates 
triggered by a status update manager in the agent and updates to the agent's 
resources.

Consider a master failover where an agent has a resource provider with an 
operation which was not terminal. Now let the operation succeed and become 
terminal in the agent, but have the master fail over before it processes the 
update. After master failover, the new master would learn about the resource 
provider resources via an {{UpdateSlaveMessage}}. Simultaneously, a status 
update manager in the agent could inform the master about the unacknowledged, 
successful operation. If the operation status update arrives in the master 
before the {{UpdateSlaveMessage}}, the operation status update handler could 
attempt to apply the operation to resources not yet known to it. This would 
likely trigger a {{CHECK}} failure in a contains check in the master.

  was:
Currently, there exists a possible race between operation status updates 
triggered by a status update manager in the agent and updates to the agent's 
resources.

Consider a master failover where an agent has a resource provider with an 
operation which was not terminal. Now let the operation succeed and become 
terminal in the agent, but have the master failover before it processes the 
update. After master failover, the new master would learn about the resource 
provider resources via an `UpdateSlaveMessage`. Simultaneously, a status update 
manager in the agent could inform the master about the unacknowledged, 
successful operation. If the operation status update arrives in the master 
before the `UpdateSlaveMessage`, the operation status update handler could 
attempt to apply the operation on resources unknown to it, yet. This would 
likely trigger a `CHECK` failure in a contains check.


> Race between operation status updates and agent update
> --
>
> Key: MESOS-8430
> URL: https://issues.apache.org/jira/browse/MESOS-8430
> Project: Mesos
>  Issue Type: Task
>  Components: agent
>Reporter: Benjamin Bannier
>
> Currently, there exists a possible race between operation status updates 
> triggered by a status update manager in the agent and updates to the agent's 
> resources.
> Consider a master failover where an agent has a resource provider with an 
> operation which was not terminal. Now let the operation succeed and become 
> terminal in the agent, but have the master fail over before it processes the 
> update. After master failover, the new master would learn about the resource 
> provider resources via an {{UpdateSlaveMessage}}. Simultaneously, a status 
> update manager in the agent could inform the master about the unacknowledged, 
> successful operation. If the operation status update arrives in the master 
> before the {{UpdateSlaveMessage}}, the operation status update handler could 
> attempt to apply the operation to resources not yet known to it. This would 
> likely trigger a {{CHECK}} failure in a contains check in the master.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-8375) Use protobuf reflection to simplify upgrading of resources.

2018-01-10 Thread Michael Park (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16321151#comment-16321151
 ] 

Michael Park edited comment on MESOS-8375 at 1/10/18 9:36 PM:
--

{noformat}

commit 93c6809122382412c1fc9324aa3cb67c54577e4e
Author: Michael Park mp...@apache.org
Date:   Mon Jan 8 11:50:06 2018 -0800

Added `vector` overloads for `(down/up)gradeResources`.

Review: https://reviews.apache.org/r/65029
{noformat}


was (Author: mcypark):
{noformat}

commit 93c6809122382412c1fc9324aa3cb67c54577e4e
Author: Michael Park mp...@apache.org
Date:   Mon Jan 8 11:50:06 2018 -0800


Added `vector` overloads for `(down/up)gradeResources`.

Review: https://reviews.apache.org/r/65029
{noformat}

> Use protobuf reflection to simplify upgrading of resources.
> ---
>
> Key: MESOS-8375
> URL: https://issues.apache.org/jira/browse/MESOS-8375
> Project: Mesos
>  Issue Type: Task
>Reporter: Michael Park
>Assignee: Michael Park
>Priority: Blocker
>
> This is the {{upgradeResources}} half of the protobuf-reflection-based 
> upgrade/downgrade of resources: 
> https://issues.apache.org/jira/browse/MESOS-8221
> We will also add {{state::read}} to complement {{state::checkpoint}} which 
> will be used to read protobufs from disk rather than {{protobuf::read}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8430) Race between operation status updates and agent update

2018-01-10 Thread Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-8430:
---

 Summary: Race between operation status updates and agent update
 Key: MESOS-8430
 URL: https://issues.apache.org/jira/browse/MESOS-8430
 Project: Mesos
  Issue Type: Task
  Components: agent
Reporter: Benjamin Bannier


Currently, there exists a possible race between operation status updates 
triggered by a status update manager in the agent and updates to the agent's 
resources.

Consider a master failover where an agent has a resource provider with an 
operation which was not terminal. Now let the operation succeed and become 
terminal in the agent, but have the master fail over before it processes the 
update. After master failover, the new master would learn about the resource 
provider resources via an `UpdateSlaveMessage`. Simultaneously, a status update 
manager in the agent could inform the master about the unacknowledged, 
successful operation. If the operation status update arrives in the master 
before the `UpdateSlaveMessage`, the operation status update handler could 
attempt to apply the operation to resources not yet known to it. This would 
likely trigger a `CHECK` failure in a contains check.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8375) Use protobuf reflection to simplify upgrading of resources.

2018-01-10 Thread Michael Park (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16321151#comment-16321151
 ] 

Michael Park commented on MESOS-8375:
-

{noformat}

commit 93c6809122382412c1fc9324aa3cb67c54577e4e
Author: Michael Park mp...@apache.org
Date:   Mon Jan 8 11:50:06 2018 -0800


Added `vector` overloads for `(down/up)gradeResources`.

Review: https://reviews.apache.org/r/65029
{noformat}

> Use protobuf reflection to simplify upgrading of resources.
> ---
>
> Key: MESOS-8375
> URL: https://issues.apache.org/jira/browse/MESOS-8375
> Project: Mesos
>  Issue Type: Task
>Reporter: Michael Park
>Assignee: Michael Park
>Priority: Blocker
>
> This is the {{upgradeResources}} half of the protobuf-reflection-based 
> upgrade/downgrade of resources: 
> https://issues.apache.org/jira/browse/MESOS-8221
> We will also add {{state::read}} to complement {{state::checkpoint}} which 
> will be used to read protobufs from disk rather than {{protobuf::read}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8410) Reconfiguration policy fails to handle mount disk resources.

2018-01-10 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16321116#comment-16321116
 ] 

Vinod Kone commented on MESOS-8410:
---

Should this be a blocker for 1.5.0?

> Reconfiguration policy fails to handle mount disk resources.
> 
>
> Key: MESOS-8410
> URL: https://issues.apache.org/jira/browse/MESOS-8410
> Project: Mesos
>  Issue Type: Bug
>Reporter: James Peach
>Assignee: Benno Evers
>
> We deployed {{--reconfiguration_policy="additive"}} on a number of Mesos 
> agents that had mount disk resources configured, and it looks like the agent 
> confused the size of the mount disk with the size of the work directory 
> resource:
> {noformat}
> E0106 01:54:15.000123 1310889 slave.cpp:6733] EXIT with status 1: Failed to 
> perform recovery: Configuration change not permitted under 'additive' policy: 
> Value of scalar resource 'disk' decreased from 183 to 868000
> {noformat}
> The {{--resources}} flag is
> {noformat}
> --resources="[
>   {
> "name": "disk",
> "type": "SCALAR",
> "scalar": {
>   "value": 868000
> }
>   }
>   ,
>   {
> "name": "disk",
> "type": "SCALAR",
> "scalar": {
>   "value": 183
> },
> "disk": {
>   "source": {
> "type": "MOUNT",
> "mount": {
>   "root" : "/srv/mesos/volumes/a"
> }
>   }
> }
>   }
>   ,
>   {
> "name": "disk",
> "type": "SCALAR",
> "scalar": {
>   "value": 183
> },
> "disk": {
>   "source": {
> "type": "MOUNT",
> "mount": {
>   "root" : "/srv/mesos/volumes/b"
> }
>   }
> }
>   }
>   ,
>   {
> "name": "disk",
> "type": "SCALAR",
> "scalar": {
>   "value": 183
> },
> "disk": {
>   "source": {
> "type": "MOUNT",
> "mount": {
>   "root" : "/srv/mesos/volumes/c"
> }
>   }
> }
>   }
>   ,
>   {
> "name": "disk",
> "type": "SCALAR",
> "scalar": {
>   "value": 183
> },
> "disk": {
>   "source": {
> "type": "MOUNT",
> "mount": {
>   "root" : "/srv/mesos/volumes/d"
> }
>   }
> }
>   }
>   ,
>   {
> "name": "disk",
> "type": "SCALAR",
> "scalar": {
>   "value": 183
> },
> "disk": {
>   "source": {
> "type": "MOUNT",
> "mount": {
>   "root" : "/srv/mesos/volumes/e"
> }
>   }
> }
>   }
>   ,
>   {
> "name": "disk",
> "type": "SCALAR",
> "scalar": {
>   "value": 183
> },
> "disk": {
>   "source": {
> "type": "MOUNT",
> "mount": {
>   "root" : "/srv/mesos/volumes/f"
> }
>   }
> }
>   }
>   ,
>   {
> "name": "disk",
> "type": "SCALAR",
> "scalar": {
>   "value": 183
> },
> "disk": {
>   "source": {
> "type": "MOUNT",
> "mount": {
>   "root" : "/srv/mesos/volumes/g"
> }
>   }
> }
>   }
>   ,
>   {
> "name": "disk",
> "type": "SCALAR",
> "scalar": {
>   "value": 183
> },
> "disk": {
>   "source": {
> "type": "MOUNT",
> "mount": {
>   "root" : "/srv/mesos/volumes/h"
> }
>   }
> }
>   }
> ]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8429) Clean up endpoint socket if the container daemon is destroyed while waiting.

2018-01-10 Thread Chun-Hung Hsiao (JIRA)
Chun-Hung Hsiao created MESOS-8429:
--

 Summary: Clean up endpoint socket if the container daemon is 
destroyed while waiting.
 Key: MESOS-8429
 URL: https://issues.apache.org/jira/browse/MESOS-8429
 Project: Mesos
  Issue Type: Bug
Reporter: Chun-Hung Hsiao
Assignee: Chun-Hung Hsiao


SLRP uses a post-stop hook to ask the container daemon to clean up the endpoint 
socket after its plugin container is terminated. However, if the container 
daemon is destructed while waiting for its mount, and the plugin container is 
terminated after that, the socket file will remain there, making the SLRP 
unable to recover.

There are two possible solutions:
1. During SLRP recovery, check if the plugin container is still running.
2. Start the container daemon in the waiting phase.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8428) SLRP recovery tests leak file descriptors.

2018-01-10 Thread Chun-Hung Hsiao (JIRA)
Chun-Hung Hsiao created MESOS-8428:
--

 Summary: SLRP recovery tests leak file descriptors.
 Key: MESOS-8428
 URL: https://issues.apache.org/jira/browse/MESOS-8428
 Project: Mesos
  Issue Type: Bug
Reporter: Chun-Hung Hsiao
Assignee: Chun-Hung Hsiao


The {{CreateDestroyVolumeRecovery}} (formerly {{NewVolumeRecovery}}) and 
{{PublishResourcesRecovery}} (formerly {{LaunchTaskRecovery}}) tests leak fds. 
When running them repeatedly, either the following error will manifest:
{noformat}
rocess_posix.hpp:257] CHECK_SOME(pipe): Too many open files
{noformat}
or the plugin container will exit, possibly because it runs out of fds.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7742) ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky

2018-01-10 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-7742:
-
Sprint: Mesosphere Sprint 58, Mesosphere Sprint 72  (was: Mesosphere Sprint 
58)

> ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky
> --
>
> Key: MESOS-7742
> URL: https://issues.apache.org/jira/browse/MESOS-7742
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Assignee: Andrei Budnik
>  Labels: flaky-test, mesosphere-oncall
> Attachments: AgentAPITest.LaunchNestedContainerSession-badrun.txt, 
> LaunchNestedContainerSessionDisconnected-badrun.txt
>
>
> Observed this on ASF CI and internal Mesosphere CI. Affected tests:
> {noformat}
> AgentAPIStreamingTest.AttachInputToNestedContainerSession
> AgentAPITest.LaunchNestedContainerSession
> AgentAPITest.AttachContainerInputAuthorization/0
> AgentAPITest.LaunchNestedContainerSessionWithTTY/0
> AgentAPITest.LaunchNestedContainerSessionDisconnected/1
> {noformat}
> This issue comes in at least three different flavours. Take 
> {{AgentAPIStreamingTest.AttachInputToNestedContainerSession}} as an example.
> h5. Flavour 1
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "503 Service Unavailable"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}
> h5. Flavour 2
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: "Disconnected"
> {noformat}
> h5. Flavour 3
> {noformat}
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-16.04/mesos/src/tests/api_tests.cpp:6367
> Value of: (sessionResponse).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8427) Clean up residual CSI endpoints for SLRP tests.

2018-01-10 Thread Chun-Hung Hsiao (JIRA)
Chun-Hung Hsiao created MESOS-8427:
--

 Summary: Clean up residual CSI endpoints for SLRP tests.
 Key: MESOS-8427
 URL: https://issues.apache.org/jira/browse/MESOS-8427
 Project: Mesos
  Issue Type: Improvement
Reporter: Chun-Hung Hsiao
Assignee: Chun-Hung Hsiao


Since the CSI endpoints are not in the sandbox directory of the unit tests, 
they need to be explicitly cleaned up.
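
A minimal sketch of such an explicit cleanup step in the test fixture (the directory variable is an illustrative placeholder for the out-of-sandbox endpoint path):
{code}
// Remove the residual endpoint directory after each test so leftover
// sockets from one run cannot interfere with the next.
Try<Nothing> rmdir = os::rmdir(csiEndpointsDir);
ASSERT_SOME(rmdir);
{code}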



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8426) Speed up SLRP tests

2018-01-10 Thread Chun-Hung Hsiao (JIRA)
Chun-Hung Hsiao created MESOS-8426:
--

 Summary: Speed up SLRP tests
 Key: MESOS-8426
 URL: https://issues.apache.org/jira/browse/MESOS-8426
 Project: Mesos
  Issue Type: Improvement
Reporter: Chun-Hung Hsiao
Assignee: Chun-Hung Hsiao


Each of the current SLRP unit tests takes seconds to run. This can be improved 
by reducing the allocation interval and declining offers with filters.
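
A rough sketch of both ideas in a test body; the flag and helper names follow common Mesos test patterns but are illustrative here rather than taken from the SLRP tests:
{code}
// Shorten the allocation interval so offers arrive quickly.
master::Flags masterFlags = CreateMasterFlags();
masterFlags.allocation_interval = Milliseconds(50);

// Decline offers the test is not interested in with a long filter,
// so they are not immediately re-offered.
Filters filters;
filters.set_refuse_seconds(Days(365).secs());
driver.declineOffer(offer.id(), filters);
{code}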



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8425) Validation for resource provider config agent API calls.

2018-01-10 Thread Chun-Hung Hsiao (JIRA)
Chun-Hung Hsiao created MESOS-8425:
--

 Summary: Validation for resource provider config agent API calls.
 Key: MESOS-8425
 URL: https://issues.apache.org/jira/browse/MESOS-8425
 Project: Mesos
  Issue Type: Bug
Reporter: Chun-Hung Hsiao
Assignee: Chun-Hung Hsiao


Currently the API returns 200 OK if the config is put in the resource provider 
config directory, even if the config is not valid (e.g., it doesn't specify a 
controller plugin). We should consider validating the config when the call is 
processed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8410) Reconfiguration policy fails to handle mount disk resources.

2018-01-10 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16320849#comment-16320849
 ] 

Benno Evers commented on MESOS-8410:


The issue was caused by incorrect handling of multiple resources with the 
same name. I've opened a review with a fix at 
https://reviews.apache.org/r/65074/

> Reconfiguration policy fails to handle mount disk resources.
> 
>
> Key: MESOS-8410
> URL: https://issues.apache.org/jira/browse/MESOS-8410
> Project: Mesos
>  Issue Type: Bug
>Reporter: James Peach
>Assignee: Benno Evers
>
> We deployed {{--reconfiguration_policy="additive"}} on a number of Mesos 
> agents that had mount disk resources configured, and it looks like the agent 
> confused the size of the mount disk with the size of the work directory 
> resource:
> {noformat}
> E0106 01:54:15.000123 1310889 slave.cpp:6733] EXIT with status 1: Failed to 
> perform recovery: Configuration change not permitted under 'additive' policy: 
> Value of scalar resource 'disk' decreased from 183 to 868000
> {noformat}
> The {{--resources}} flag is
> {noformat}
> --resources="[
>   {
> "name": "disk",
> "type": "SCALAR",
> "scalar": {
>   "value": 868000
> }
>   }
>   ,
>   {
> "name": "disk",
> "type": "SCALAR",
> "scalar": {
>   "value": 183
> },
> "disk": {
>   "source": {
> "type": "MOUNT",
> "mount": {
>   "root" : "/srv/mesos/volumes/a"
> }
>   }
> }
>   }
>   ,
>   {
> "name": "disk",
> "type": "SCALAR",
> "scalar": {
>   "value": 183
> },
> "disk": {
>   "source": {
> "type": "MOUNT",
> "mount": {
>   "root" : "/srv/mesos/volumes/b"
> }
>   }
> }
>   }
>   ,
>   {
> "name": "disk",
> "type": "SCALAR",
> "scalar": {
>   "value": 183
> },
> "disk": {
>   "source": {
> "type": "MOUNT",
> "mount": {
>   "root" : "/srv/mesos/volumes/c"
> }
>   }
> }
>   }
>   ,
>   {
> "name": "disk",
> "type": "SCALAR",
> "scalar": {
>   "value": 183
> },
> "disk": {
>   "source": {
> "type": "MOUNT",
> "mount": {
>   "root" : "/srv/mesos/volumes/d"
> }
>   }
> }
>   }
>   ,
>   {
> "name": "disk",
> "type": "SCALAR",
> "scalar": {
>   "value": 183
> },
> "disk": {
>   "source": {
> "type": "MOUNT",
> "mount": {
>   "root" : "/srv/mesos/volumes/e"
> }
>   }
> }
>   }
>   ,
>   {
> "name": "disk",
> "type": "SCALAR",
> "scalar": {
>   "value": 183
> },
> "disk": {
>   "source": {
> "type": "MOUNT",
> "mount": {
>   "root" : "/srv/mesos/volumes/f"
> }
>   }
> }
>   }
>   ,
>   {
> "name": "disk",
> "type": "SCALAR",
> "scalar": {
>   "value": 183
> },
> "disk": {
>   "source": {
> "type": "MOUNT",
> "mount": {
>   "root" : "/srv/mesos/volumes/g"
> }
>   }
> }
>   }
>   ,
>   {
> "name": "disk",
> "type": "SCALAR",
> "scalar": {
>   "value": 183
> },
> "disk": {
>   "source": {
> "type": "MOUNT",
> "mount": {
>   "root" : "/srv/mesos/volumes/h"
> }
>   }
> }
>   }
> ]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-8391) Mesos agent doesn't notice that a pod task exits or crashes after the agent restart

2018-01-10 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16320409#comment-16320409
 ] 

Andrei Budnik edited comment on MESOS-8391 at 1/10/18 6:47 PM:
---

https://reviews.apache.org/r/65071/
https://reviews.apache.org/r/65077/


was (Author: abudnik):
https://reviews.apache.org/r/65071/

> Mesos agent doesn't notice that a pod task exits or crashes after the agent 
> restart
> ---
>
> Key: MESOS-8391
> URL: https://issues.apache.org/jira/browse/MESOS-8391
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization, executor
>Affects Versions: 1.5.0
>Reporter: Ivan Chernetsky
>Assignee: Andrei Budnik
>Priority: Blocker
> Attachments: testing-log-2.tar.gz
>
>
> h4. (1) Agent doesn't detect that a pod task exits/crashes
> # Create a Marathon pod with two containers which just do {{sleep 1}}.
> # Restart the Mesos agent on the node the pod got launched.
> # Kill one of the pod tasks
> *Expected result*: The Mesos agent detects that one of the tasks got killed, 
> and forwards {{TASK_FAILED}} status to Marathon.
> *Actual result*: The Mesos agent does nothing, and the Mesos master thinks 
> that both tasks are running just fine. Marathon doesn't take any action 
> because it doesn't receive any update from Mesos.
> h4. (2) After the agent restart, it detects that the task crashed, forwards 
> the correct status update, but the other task stays in {{TASK_KILLING}} state 
> forever
> # Perform steps in (1).
> # Restart the Mesos agent
> *Expected result*: The Mesos agent detects that one of the tasks crashed, 
> forwards the corresponding status update, and kills the other task too.
> *Actual result*: The Mesos agent detects that one of the tasks crashed, 
> forwards the corresponding status update, but the other task stays in 
> {{TASK_KILLING}} state forever.
> Please note that after another agent restart, the other task finally gets 
> killed and the correct status updates get propagated all the way to Marathon.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8414) DockerContainerizerTest.ROOT_DOCKER_Logs fails on CentOS 6

2018-01-10 Thread Armand Grillet (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Armand Grillet updated MESOS-8414:
--
Shepherd: Alexander Rukletsov

> DockerContainerizerTest.ROOT_DOCKER_Logs fails on CentOS 6
> --
>
> Key: MESOS-8414
> URL: https://issues.apache.org/jira/browse/MESOS-8414
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: CentOS 6, Docker version 1.7.1, build 786b29d
>Reporter: Armand Grillet
>Assignee: Armand Grillet
> Attachments: 
> centos6-ssl-DockerContainerizerTest.ROOT_DOCKER_Logs.txt, centos6-vlog2.txt, 
> docker-inspect.json, docker-logs.txt
>
>
> You can find the verbose logs attached.
> The most interesting part:
> {code}
> I0108 16:35:45.887037 17805 sched.cpp:897] Received 1 offers
> I0108 16:35:45.887070 17805 sched.cpp:921] Scheduler::resourceOffers took 
> 12130ns
> I0108 16:35:45.985957 17808 docker.cpp:349] Unable to detect IP Address at 
> 'NetworkSettings.Networks.host.IPAddress', attempting deprecated field
> I0108 16:35:45.986428 17809 task_status_update_manager.cpp:328] Received task 
> status update TASK_FAILED (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) 
> for task 1 of framework f09c89e1-aa62-4662-bda8-15a2c87f412e-
> I0108 16:35:45.986552 17809 task_status_update_manager.cpp:383] Forwarding 
> task status update TASK_FAILED (Status UUID: 
> 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e- to the agent
> I0108 16:35:45.986654 17809 slave.cpp:5209] Forwarding the update TASK_FAILED 
> (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e- to master@172.16.10.110:37252
> I0108 16:35:45.986795 17809 slave.cpp:5102] Task status update manager 
> successfully handled status update TASK_FAILED (Status UUID: 
> 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e-
> I0108 16:35:45.986829 17809 slave.cpp:5118] Sending acknowledgement for 
> status update TASK_FAILED (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) 
> for task 1 of framework f09c89e1-aa62-4662-bda8-15a2c87f412e- to 
> executor(1)@172.16.10.110:38499
> I0108 16:35:45.986901 17805 master.cpp:7890] Status update TASK_FAILED 
> (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e- from agent 
> f09c89e1-aa62-4662-bda8-15a2c87f412e-S0 at slave(1)@172.16.10.110:37252 
> (ip-172-16-10-110.ec2.internal)
> I0108 16:35:45.986928 17805 master.cpp:7946] Forwarding status update 
> TASK_FAILED (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of 
> framework f09c89e1-aa62-4662-bda8-15a2c87f412e-
> I0108 16:35:45.986984 17805 master.cpp:10193] Updating the state of task 1 of 
> framework f09c89e1-aa62-4662-bda8-15a2c87f412e- (latest state: 
> TASK_FAILED, status update state: TASK_FAILED)
> I0108 16:35:45.987047 17805 sched.cpp:990] Received status update TASK_FAILED 
> (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e- from slave(1)@172.16.10.110:37252
> I0108 16:35:45.987103 17805 sched.cpp:1029] Scheduler::statusUpdate took 
> 30948ns
> I0108 16:35:45.987112 17805 sched.cpp:1048] Sending ACK for status update 
> TASK_FAILED (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of 
> framework f09c89e1-aa62-4662-bda8-15a2c87f412e- to 
> master@172.16.10.110:37252
> I0108 16:35:45.987221 17805 master.cpp:5826] Processing ACKNOWLEDGE call 
> 7f544700-215b-4d27-ab43-b48e19592d00 for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e- (default) at 
> scheduler-4ad5073e-c1db-4c34-9c43-e656c280a724@172.16.10.110:37252 on agent 
> f09c89e1-aa62-4662-bda8-15a2c87f412e-S0
> I0108 16:35:45.987267 17805 master.cpp:10299] Removing task 1 with resources 
> cpus(allocated: *):2; mem(allocated: *):1024; disk(allocated: *):1024; 
> ports(allocated: *):[31000-32000] of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e- on agent 
> f09c89e1-aa62-4662-bda8-15a2c87f412e-S0 at slave(1)@172.16.10.110:37252 
> (ip-172-16-10-110.ec2.internal)
> I0108 16:35:45.987473 17807 task_status_update_manager.cpp:401] Received task 
> status update acknowledgement (UUID: 7f544700-215b-4d27-ab43-b48e19592d00) 
> for task 1 of framework f09c89e1-aa62-4662-bda8-15a2c87f412e-
> I0108 16:35:45.987561 17807 task_status_update_manager.cpp:538] Cleaning up 
> status update stream for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e-
> I0108 16:35:45.987814 17807 slave.cpp:3974] Task status update manager 
> successfully handled status update acknowledgement (UUID: 
> 7f544700-215b-4d27-ab43-b48e19592d00) for 

[jira] [Updated] (MESOS-8414) DockerContainerizerTest.ROOT_DOCKER_Logs fails on CentOS 6

2018-01-10 Thread Armand Grillet (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Armand Grillet updated MESOS-8414:
--
Sprint: Mesosphere Sprint 72

> DockerContainerizerTest.ROOT_DOCKER_Logs fails on CentOS 6
> --
>
> Key: MESOS-8414
> URL: https://issues.apache.org/jira/browse/MESOS-8414
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: CentOS 6, Docker version 1.7.1, build 786b29d
>Reporter: Armand Grillet
>Assignee: Armand Grillet
> Attachments: 
> centos6-ssl-DockerContainerizerTest.ROOT_DOCKER_Logs.txt, centos6-vlog2.txt, 
> docker-inspect.json, docker-logs.txt
>
>
> You can find the verbose logs attached.
> The most interesting part:
> {code}
> I0108 16:35:45.887037 17805 sched.cpp:897] Received 1 offers
> I0108 16:35:45.887070 17805 sched.cpp:921] Scheduler::resourceOffers took 
> 12130ns
> I0108 16:35:45.985957 17808 docker.cpp:349] Unable to detect IP Address at 
> 'NetworkSettings.Networks.host.IPAddress', attempting deprecated field
> I0108 16:35:45.986428 17809 task_status_update_manager.cpp:328] Received task 
> status update TASK_FAILED (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) 
> for task 1 of framework f09c89e1-aa62-4662-bda8-15a2c87f412e-
> I0108 16:35:45.986552 17809 task_status_update_manager.cpp:383] Forwarding 
> task status update TASK_FAILED (Status UUID: 
> 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e- to the agent
> I0108 16:35:45.986654 17809 slave.cpp:5209] Forwarding the update TASK_FAILED 
> (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e- to master@172.16.10.110:37252
> I0108 16:35:45.986795 17809 slave.cpp:5102] Task status update manager 
> successfully handled status update TASK_FAILED (Status UUID: 
> 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e-
> I0108 16:35:45.986829 17809 slave.cpp:5118] Sending acknowledgement for 
> status update TASK_FAILED (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) 
> for task 1 of framework f09c89e1-aa62-4662-bda8-15a2c87f412e- to 
> executor(1)@172.16.10.110:38499
> I0108 16:35:45.986901 17805 master.cpp:7890] Status update TASK_FAILED 
> (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e- from agent 
> f09c89e1-aa62-4662-bda8-15a2c87f412e-S0 at slave(1)@172.16.10.110:37252 
> (ip-172-16-10-110.ec2.internal)
> I0108 16:35:45.986928 17805 master.cpp:7946] Forwarding status update 
> TASK_FAILED (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of 
> framework f09c89e1-aa62-4662-bda8-15a2c87f412e-
> I0108 16:35:45.986984 17805 master.cpp:10193] Updating the state of task 1 of 
> framework f09c89e1-aa62-4662-bda8-15a2c87f412e- (latest state: 
> TASK_FAILED, status update state: TASK_FAILED)
> I0108 16:35:45.987047 17805 sched.cpp:990] Received status update TASK_FAILED 
> (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e- from slave(1)@172.16.10.110:37252
> I0108 16:35:45.987103 17805 sched.cpp:1029] Scheduler::statusUpdate took 
> 30948ns
> I0108 16:35:45.987112 17805 sched.cpp:1048] Sending ACK for status update 
> TASK_FAILED (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of 
> framework f09c89e1-aa62-4662-bda8-15a2c87f412e- to 
> master@172.16.10.110:37252
> I0108 16:35:45.987221 17805 master.cpp:5826] Processing ACKNOWLEDGE call 
> 7f544700-215b-4d27-ab43-b48e19592d00 for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e- (default) at 
> scheduler-4ad5073e-c1db-4c34-9c43-e656c280a724@172.16.10.110:37252 on agent 
> f09c89e1-aa62-4662-bda8-15a2c87f412e-S0
> I0108 16:35:45.987267 17805 master.cpp:10299] Removing task 1 with resources 
> cpus(allocated: *):2; mem(allocated: *):1024; disk(allocated: *):1024; 
> ports(allocated: *):[31000-32000] of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e- on agent 
> f09c89e1-aa62-4662-bda8-15a2c87f412e-S0 at slave(1)@172.16.10.110:37252 
> (ip-172-16-10-110.ec2.internal)
> I0108 16:35:45.987473 17807 task_status_update_manager.cpp:401] Received task 
> status update acknowledgement (UUID: 7f544700-215b-4d27-ab43-b48e19592d00) 
> for task 1 of framework f09c89e1-aa62-4662-bda8-15a2c87f412e-
> I0108 16:35:45.987561 17807 task_status_update_manager.cpp:538] Cleaning up 
> status update stream for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e-
> I0108 16:35:45.987814 17807 slave.cpp:3974] Task status update manager 
> successfully handled status update acknowledgement (UUID: 
> 7f544700-215b-4d27-ab43-b48e19592d00) for 

[jira] [Assigned] (MESOS-8414) DockerContainerizerTest.ROOT_DOCKER_Logs fails on CentOS 6

2018-01-10 Thread Armand Grillet (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Armand Grillet reassigned MESOS-8414:
-

Assignee: Armand Grillet

> DockerContainerizerTest.ROOT_DOCKER_Logs fails on CentOS 6
> --
>
> Key: MESOS-8414
> URL: https://issues.apache.org/jira/browse/MESOS-8414
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: CentOS 6, Docker version 1.7.1, build 786b29d
>Reporter: Armand Grillet
>Assignee: Armand Grillet
> Attachments: 
> centos6-ssl-DockerContainerizerTest.ROOT_DOCKER_Logs.txt, centos6-vlog2.txt, 
> docker-inspect.json, docker-logs.txt
>
>
> You can find the verbose logs attached.
> The most interesting part:
> {code}
> I0108 16:35:45.887037 17805 sched.cpp:897] Received 1 offers
> I0108 16:35:45.887070 17805 sched.cpp:921] Scheduler::resourceOffers took 
> 12130ns
> I0108 16:35:45.985957 17808 docker.cpp:349] Unable to detect IP Address at 
> 'NetworkSettings.Networks.host.IPAddress', attempting deprecated field
> I0108 16:35:45.986428 17809 task_status_update_manager.cpp:328] Received task 
> status update TASK_FAILED (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) 
> for task 1 of framework f09c89e1-aa62-4662-bda8-15a2c87f412e-
> I0108 16:35:45.986552 17809 task_status_update_manager.cpp:383] Forwarding 
> task status update TASK_FAILED (Status UUID: 
> 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e- to the agent
> I0108 16:35:45.986654 17809 slave.cpp:5209] Forwarding the update TASK_FAILED 
> (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e- to master@172.16.10.110:37252
> I0108 16:35:45.986795 17809 slave.cpp:5102] Task status update manager 
> successfully handled status update TASK_FAILED (Status UUID: 
> 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e-
> I0108 16:35:45.986829 17809 slave.cpp:5118] Sending acknowledgement for 
> status update TASK_FAILED (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) 
> for task 1 of framework f09c89e1-aa62-4662-bda8-15a2c87f412e- to 
> executor(1)@172.16.10.110:38499
> I0108 16:35:45.986901 17805 master.cpp:7890] Status update TASK_FAILED 
> (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e- from agent 
> f09c89e1-aa62-4662-bda8-15a2c87f412e-S0 at slave(1)@172.16.10.110:37252 
> (ip-172-16-10-110.ec2.internal)
> I0108 16:35:45.986928 17805 master.cpp:7946] Forwarding status update 
> TASK_FAILED (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of 
> framework f09c89e1-aa62-4662-bda8-15a2c87f412e-
> I0108 16:35:45.986984 17805 master.cpp:10193] Updating the state of task 1 of 
> framework f09c89e1-aa62-4662-bda8-15a2c87f412e- (latest state: 
> TASK_FAILED, status update state: TASK_FAILED)
> I0108 16:35:45.987047 17805 sched.cpp:990] Received status update TASK_FAILED 
> (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e- from slave(1)@172.16.10.110:37252
> I0108 16:35:45.987103 17805 sched.cpp:1029] Scheduler::statusUpdate took 
> 30948ns
> I0108 16:35:45.987112 17805 sched.cpp:1048] Sending ACK for status update 
> TASK_FAILED (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of 
> framework f09c89e1-aa62-4662-bda8-15a2c87f412e- to 
> master@172.16.10.110:37252
> I0108 16:35:45.987221 17805 master.cpp:5826] Processing ACKNOWLEDGE call 
> 7f544700-215b-4d27-ab43-b48e19592d00 for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e- (default) at 
> scheduler-4ad5073e-c1db-4c34-9c43-e656c280a724@172.16.10.110:37252 on agent 
> f09c89e1-aa62-4662-bda8-15a2c87f412e-S0
> I0108 16:35:45.987267 17805 master.cpp:10299] Removing task 1 with resources 
> cpus(allocated: *):2; mem(allocated: *):1024; disk(allocated: *):1024; 
> ports(allocated: *):[31000-32000] of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e- on agent 
> f09c89e1-aa62-4662-bda8-15a2c87f412e-S0 at slave(1)@172.16.10.110:37252 
> (ip-172-16-10-110.ec2.internal)
> I0108 16:35:45.987473 17807 task_status_update_manager.cpp:401] Received task 
> status update acknowledgement (UUID: 7f544700-215b-4d27-ab43-b48e19592d00) 
> for task 1 of framework f09c89e1-aa62-4662-bda8-15a2c87f412e-
> I0108 16:35:45.987561 17807 task_status_update_manager.cpp:538] Cleaning up 
> status update stream for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e-
> I0108 16:35:45.987814 17807 slave.cpp:3974] Task status update manager 
> successfully handled status update acknowledgement (UUID: 
> 7f544700-215b-4d27-ab43-b48e19592d00) 

[jira] [Assigned] (MESOS-8391) Mesos agent doesn't notice that a pod task exits or crashes after the agent restart

2018-01-10 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-8391:


Assignee: Andrei Budnik  (was: Gilbert Song)

> Mesos agent doesn't notice that a pod task exits or crashes after the agent 
> restart
> ---
>
> Key: MESOS-8391
> URL: https://issues.apache.org/jira/browse/MESOS-8391
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization, executor
>Affects Versions: 1.5.0
>Reporter: Ivan Chernetsky
>Assignee: Andrei Budnik
>Priority: Blocker
> Attachments: testing-log-2.tar.gz
>
>
> h4. (1) Agent doesn't detect that a pod task exits/crashes
> # Create a Marathon pod with two containers which just do {{sleep 1}}.
> # Restart the Mesos agent on the node the pod got launched.
> # Kill one of the pod tasks
> *Expected result*: The Mesos agent detects that one of the tasks got killed, 
> and forwards {{TASK_FAILED}} status to Marathon.
> *Actual result*: The Mesos agent does nothing, and the Mesos master thinks 
> that both tasks are running just fine. Marathon doesn't take any action 
> because it doesn't receive any update from Mesos.
> h4. (2) After the agent restart, it detects that the task crashed, forwards 
> the correct status update, but the other task stays in {{TASK_KILLING}} state 
> forever
> # Perform steps in (1).
> # Restart the Mesos agent
> *Expected result*: The Mesos agent detects that one of the tasks crashed, 
> forwards the corresponding status update, and kills the other task too.
> *Actual result*: The Mesos agent detects that one of the tasks crashed, 
> forwards the corresponding status update, but the other task stays in 
> {{TASK_KILLING}} state forever.
> Please note that after another agent restart, the other task finally gets 
> killed and the correct status updates get propagated all the way to Marathon.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8422) Master's UpdateSlave handler not correctly updating terminated operations

2018-01-10 Thread Benjamin Bannier (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16320357#comment-16320357
 ] 

Benjamin Bannier commented on MESOS-8422:
-

The issue here seems to be that the resource provider's resources sent as part 
of the {{UpdateSlaveMessage}} after failover already have the operation 
applied; applying the same operation again when an offer operation status 
update is received cannot work.

We should update the master to transition newly terminal offer operations when 
it learns about them from the {{UpdateSlaveMessage}}. Since the master only 
updates resource state in the offer operation status update handler when the 
operation is _newly_ terminal, this would prevent this setup from becoming a 
problem.
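
A minimal, self-contained sketch of that rule follows. It is not the actual 
master code; the types, the map-based bookkeeping, and the UUIDs are made up, 
but it shows why transitioning only _newly_ terminal operations keeps the 
conversion from being applied twice:

{code}
// Illustrative only; not Mesos master code. Shows the "only newly terminal"
// rule: an operation already known to be terminal is not transitioned again,
// so its resource conversion is not applied a second time.
#include <iostream>
#include <map>
#include <string>

enum class OperationState { PENDING, FINISHED, FAILED };

bool isTerminal(OperationState state)
{
  return state == OperationState::FINISHED || state == OperationState::FAILED;
}

int main()
{
  // The master's current view of operations, keyed by a made-up UUID.
  std::map<std::string, OperationState> known = {
      {"op-1", OperationState::PENDING},
      {"op-2", OperationState::FINISHED}};

  // States carried by the UpdateSlaveMessage after failover; the message's
  // resources already reflect these operations having been applied.
  const std::map<std::string, OperationState> reported = {
      {"op-1", OperationState::FINISHED},
      {"op-2", OperationState::FINISHED}};

  for (const auto& [uuid, state] : reported) {
    auto it = known.find(uuid);
    if (it == known.end()) {
      continue;  // unknown operation; out of scope for this sketch
    }

    if (!isTerminal(it->second) && isTerminal(state)) {
      // Newly terminal: record the state now, so a later offer operation
      // status update is acknowledged without re-applying the conversion.
      std::cout << "Transitioning " << uuid << " to its terminal state\n";
      it->second = state;
    } else {
      std::cout << "Skipping " << uuid << " (already terminal or unchanged)\n";
    }
  }
}
{code}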

> Master's UpdateSlave handler not correctly updating terminated operations
> -
>
> Key: MESOS-8422
> URL: https://issues.apache.org/jira/browse/MESOS-8422
> Project: Mesos
>  Issue Type: Bug
>Reporter: Gastón Kleiman
>Assignee: Benjamin Bannier
>  Labels: mesosphere
>
> I created a test that verifies that operation status updates are resent to 
> the master after being dropped en route to it (MESOS-8420).
> The test does the following:
> # Creates a volume from a RAW disk resource.
> # Drops the first `UpdateOperationStatusMessage` message from the agent to 
> the master, so that it isn't acknowledged by the master.
> # Restarts the agent.
> # Verifies that the agent resends the operation status update.
> The good news is that the agent does resend the operation status update; 
> the bad news is that it triggers a CHECK failure that crashes the master.
> Here are the relevant sections of the log produced by the test:
> {noformat}
> [ RUN  ] 
> StorageLocalResourceProviderTest.ROOT_RetryOperationStatusUpdateAfterRecovery
> [...]
> I0109 16:36:08.515882 24106 master.cpp:4284] Processing ACCEPT call for 
> offers: [ 046b3f21-6e97-4a56-9a13-773f7d481efd-O0 ] on agent 
> 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(2)@10.0.49.2:40681 
> (core-dev) for framework 046b3f21-6e97-4a56-9a13-773f7d481efd- (default) 
> at scheduler-2a48a684-64b4-4b4d-a396-6491adb4f2b1@10.0.49.2:40681
> I0109 16:36:08.516487 24106 master.cpp:5260] Processing CREATE_VOLUME 
> operation with source disk(allocated: storage)(reservations: 
> [(DYNAMIC,storage)])[RAW(,volume-default)]:4096 from framework 
> 046b3f21-6e97-4a56-9a13-773f7d481efd- (default) at 
> scheduler-2a48a684-64b4-4b4d-a396-6491adb4f2b1@10.0.49.2:40681 to agent 
> 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(2)@10.0.49.2:40681 (core-dev)
> I0109 16:36:08.518704 24106 master.cpp:10622] Sending operation '' (uuid: 
> 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) to agent 
> 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(2)@10.0.49.2:40681 (core-dev)
> I0109 16:36:08.521210 24130 provider.cpp:504] Received APPLY_OPERATION event
> I0109 16:36:08.521276 24130 provider.cpp:1368] Received CREATE_VOLUME 
> operation '' (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408)
> I0109 16:36:08.523131 24432 test_csi_plugin.cpp:305] CreateVolumeRequest 
> '{"version":{"minor":1},"name":"18b4c4a5-d162-4dcf-bb21-a13c6ee0f408","capacityRange":{"requiredBytes":"4294967296","limitBytes":"4294967296"},"volumeCapabilities":[{"mount":{},"accessMode":{"mode":"SINGLE_NODE_WRITER"}}]}'
> I0109 16:36:08.525806 24152 provider.cpp:2635] Applying conversion from 
> 'disk(allocated: storage)(reservations: 
> [(DYNAMIC,storage)])[RAW(,volume-default)]:4096' to 'disk(allocated: 
> storage)(reservations: 
> [(DYNAMIC,storage)])[MOUNT(18b4c4a5-d162-4dcf-bb21-a13c6ee0f408,volume-default):./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408]:4096'
>  for operation (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408)
> I0109 16:36:08.528725 24134 status_update_manager_process.hpp:152] Received 
> operation status update OPERATION_FINISHED (Status UUID: 
> 0c79cdf2-b89d-453b-bb62-57766e968dd0) for operation UUID 
> 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408 of framework 
> '046b3f21-6e97-4a56-9a13-773f7d481efd-' on agent 
> 046b3f21-6e97-4a56-9a13-773f7d481efd-S0
> I0109 16:36:08.529207 24134 status_update_manager_process.hpp:929] 
> Checkpointing UPDATE for operation status update OPERATION_FINISHED (Status 
> UUID: 0c79cdf2-b89d-453b-bb62-57766e968dd0) for operation UUID 
> 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408 of framework 
> '046b3f21-6e97-4a56-9a13-773f7d481efd-' on agent 
> 046b3f21-6e97-4a56-9a13-773f7d481efd-S0
> I0109 16:36:08.573177 24150 http.cpp:1185] HTTP POST for 
> /slave(2)/api/v1/resource_provider from 10.0.49.2:53598
> I0109 16:36:08.573974 24139 slave.cpp:7065] Handling resource provider 
> message 'UPDATE_OPERATION_STATUS: (uuid: 
> 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) for framework 
> 

[jira] [Updated] (MESOS-8414) DockerContainerizerTest.ROOT_DOCKER_Logs fails on CentOS 6

2018-01-10 Thread Armand Grillet (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Armand Grillet updated MESOS-8414:
--
Description: 
You can find the verbose logs attached.

The most interesting part:
{code}
I0108 16:35:45.887037 17805 sched.cpp:897] Received 1 offers
I0108 16:35:45.887070 17805 sched.cpp:921] Scheduler::resourceOffers took 
12130ns
I0108 16:35:45.985957 17808 docker.cpp:349] Unable to detect IP Address at 
'NetworkSettings.Networks.host.IPAddress', attempting deprecated field
I0108 16:35:45.986428 17809 task_status_update_manager.cpp:328] Received task 
status update TASK_FAILED (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) 
for task 1 of framework f09c89e1-aa62-4662-bda8-15a2c87f412e-
I0108 16:35:45.986552 17809 task_status_update_manager.cpp:383] Forwarding task 
status update TASK_FAILED (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) 
for task 1 of framework f09c89e1-aa62-4662-bda8-15a2c87f412e- to the agent
I0108 16:35:45.986654 17809 slave.cpp:5209] Forwarding the update TASK_FAILED 
(Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
f09c89e1-aa62-4662-bda8-15a2c87f412e- to master@172.16.10.110:37252
I0108 16:35:45.986795 17809 slave.cpp:5102] Task status update manager 
successfully handled status update TASK_FAILED (Status UUID: 
7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
f09c89e1-aa62-4662-bda8-15a2c87f412e-
I0108 16:35:45.986829 17809 slave.cpp:5118] Sending acknowledgement for status 
update TASK_FAILED (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 
1 of framework f09c89e1-aa62-4662-bda8-15a2c87f412e- to 
executor(1)@172.16.10.110:38499
I0108 16:35:45.986901 17805 master.cpp:7890] Status update TASK_FAILED (Status 
UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
f09c89e1-aa62-4662-bda8-15a2c87f412e- from agent 
f09c89e1-aa62-4662-bda8-15a2c87f412e-S0 at slave(1)@172.16.10.110:37252 
(ip-172-16-10-110.ec2.internal)
I0108 16:35:45.986928 17805 master.cpp:7946] Forwarding status update 
TASK_FAILED (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of 
framework f09c89e1-aa62-4662-bda8-15a2c87f412e-
I0108 16:35:45.986984 17805 master.cpp:10193] Updating the state of task 1 of 
framework f09c89e1-aa62-4662-bda8-15a2c87f412e- (latest state: TASK_FAILED, 
status update state: TASK_FAILED)
I0108 16:35:45.987047 17805 sched.cpp:990] Received status update TASK_FAILED 
(Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
f09c89e1-aa62-4662-bda8-15a2c87f412e- from slave(1)@172.16.10.110:37252
I0108 16:35:45.987103 17805 sched.cpp:1029] Scheduler::statusUpdate took 30948ns
I0108 16:35:45.987112 17805 sched.cpp:1048] Sending ACK for status update 
TASK_FAILED (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of 
framework f09c89e1-aa62-4662-bda8-15a2c87f412e- to 
master@172.16.10.110:37252
I0108 16:35:45.987221 17805 master.cpp:5826] Processing ACKNOWLEDGE call 
7f544700-215b-4d27-ab43-b48e19592d00 for task 1 of framework 
f09c89e1-aa62-4662-bda8-15a2c87f412e- (default) at 
scheduler-4ad5073e-c1db-4c34-9c43-e656c280a724@172.16.10.110:37252 on agent 
f09c89e1-aa62-4662-bda8-15a2c87f412e-S0
I0108 16:35:45.987267 17805 master.cpp:10299] Removing task 1 with resources 
cpus(allocated: *):2; mem(allocated: *):1024; disk(allocated: *):1024; 
ports(allocated: *):[31000-32000] of framework 
f09c89e1-aa62-4662-bda8-15a2c87f412e- on agent 
f09c89e1-aa62-4662-bda8-15a2c87f412e-S0 at slave(1)@172.16.10.110:37252 
(ip-172-16-10-110.ec2.internal)
I0108 16:35:45.987473 17807 task_status_update_manager.cpp:401] Received task 
status update acknowledgement (UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for 
task 1 of framework f09c89e1-aa62-4662-bda8-15a2c87f412e-
I0108 16:35:45.987561 17807 task_status_update_manager.cpp:538] Cleaning up 
status update stream for task 1 of framework 
f09c89e1-aa62-4662-bda8-15a2c87f412e-
I0108 16:35:45.987814 17807 slave.cpp:3974] Task status update manager 
successfully handled status update acknowledgement (UUID: 
7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
f09c89e1-aa62-4662-bda8-15a2c87f412e-
I0108 16:35:45.987849 17807 slave.cpp:8935] Completing task 1
{code}

After further testing, 
https://github.com/apache/mesos/blob/51a3bd95bd2d740a39b55634251abeadb561e5c8/src/docker/docker.cpp#L384
 appears to never be reached as {{ipAddressValue->value}} is an empty string.
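
For illustration, here is a small standalone sketch, not the actual 
{{src/docker/docker.cpp}} logic and with hypothetical helper names, of how an 
IP field that parses to an empty string fails the usability check, so the code 
that would consume the address (the analogue of the line linked above) never 
runs:

{code}
// Illustrative only; not the actual src/docker/docker.cpp implementation.
#include <iostream>
#include <optional>
#include <string>

// Hypothetical helper: stands in for the value parsed out of the
// 'docker inspect' JSON for the container's IP address field.
std::optional<std::string> parseIpField(const std::string& json)
{
  // With Docker 1.7.1 on CentOS 6 the field exists but is an empty string.
  (void)json;
  return std::string("");
}

int main()
{
  std::optional<std::string> ipAddressValue = parseIpField("{...}");

  if (!ipAddressValue || ipAddressValue->empty()) {
    // This branch is taken, so the code that would record the container's
    // IP address is never reached.
    std::cout << "No usable IP address in docker inspect output\n";
    return 0;
  }

  std::cout << "Container IP: " << *ipAddressValue << '\n';
}
{code}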

  was:
You can find the verbose logs attached.

The most interesting part:
{code}
I0108 16:35:45.887037 17805 sched.cpp:897] Received 1 offers
I0108 16:35:45.887070 17805 sched.cpp:921] Scheduler::resourceOffers took 
12130ns
I0108 16:35:45.985957 17808 docker.cpp:349] Unable to detect IP Address at 
'NetworkSettings.Networks.host.IPAddress', attempting deprecated field
I0108 16:35:45.986428 17809 

[jira] [Commented] (MESOS-8247) Executor registered message is lost

2018-01-10 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16320309#comment-16320309
 ] 

Alexander Rukletsov commented on MESOS-8247:


https://reviews.apache.org/r/65048/ — not a fix, logging improvement.

> Executor registered message is lost
> ---
>
> Key: MESOS-8247
> URL: https://issues.apache.org/jira/browse/MESOS-8247
> Project: Mesos
>  Issue Type: Bug
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>
> h3. Brief description of successful agent-executor communication.
> The executor sends a `RegisterExecutorMessage` to the agent during its 
> initialization step. The agent sends an `ExecutorRegisteredMessage` in 
> response to the executor from its `registerExecutor()` method. Whenever the 
> executor receives `ExecutorRegisteredMessage`, it prints `Executor registered 
> on agent...` to its stderr logs.
> h3. Problem description.
> The agent launches the built-in docker executor, which gets stuck in the 
> `STAGING` state.
> The stderr log of the docker executor:
> {code}
> I1114 23:03:17.919090 14322 exec.cpp:162] Version: 1.2.3
> {code}
> It doesn't contain a message like `Executor registered on agent...`. At the 
> same time, the agent received `RegisterExecutorMessage` and sent a `runTask` 
> message to the executor.
> The stdout log consists of the same repeating message:
> {code}
> Received killTask for task ...
> {code}
> Also, the docker executor process doesn't have any child processes.
> Currently, the executor [doesn't 
> attempt|https://github.com/apache/mesos/blob/2a253093ecdc7d743c9c0874d6e01b68f6a813e4/src/exec/exec.cpp#L320]
>  to launch a task if it is not registered with the agent, while [task 
> killing|https://github.com/apache/mesos/blob/2a253093ecdc7d743c9c0874d6e01b68f6a813e4/src/exec/exec.cpp#L343]
>  doesn't have such a check.
> It looks like the `ExecutorRegisteredMessage` has been lost.
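
The asymmetry called out in the quoted description can be illustrated with a 
small, self-contained sketch using made-up types rather than the real executor 
driver: launching is guarded on registration, killing is not, so a lost 
{{ExecutorRegisteredMessage}} leaves kill requests repeating with no effect.

{code}
// Illustrative only; not the actual exec.cpp code.
#include <iostream>
#include <string>

class ExecutorDriverSketch
{
public:
  void registered() { isRegistered = true; }

  void launchTask(const std::string& taskId)
  {
    if (!isRegistered) {
      // Mirrors the guard referenced above: no launch before registration.
      std::cout << "Ignoring launchTask for " << taskId
                << " because the executor is not registered\n";
      return;
    }
    std::cout << "Launching " << taskId << '\n';
  }

  void killTask(const std::string& taskId)
  {
    // No registration check here: with a lost ExecutorRegisteredMessage the
    // task was never launched, so there is nothing to kill, yet
    // "Received killTask for task ..." keeps being printed.
    std::cout << "Received killTask for task " << taskId << '\n';
  }

private:
  bool isRegistered = false;
};

int main()
{
  ExecutorDriverSketch driver;  // registration message was lost

  driver.launchTask("1");  // ignored: not registered
  driver.killTask("1");    // handled anyway, repeatedly, to no effect
}
{code}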



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8414) DockerContainerizerTest.ROOT_DOCKER_Logs fails on CentOS 6

2018-01-10 Thread Armand Grillet (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Armand Grillet updated MESOS-8414:
--
Attachment: docker-inspect.json

> DockerContainerizerTest.ROOT_DOCKER_Logs fails on CentOS 6
> --
>
> Key: MESOS-8414
> URL: https://issues.apache.org/jira/browse/MESOS-8414
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: CentOS 6, Docker version 1.7.1, build 786b29d
>Reporter: Armand Grillet
> Attachments: 
> centos6-ssl-DockerContainerizerTest.ROOT_DOCKER_Logs.txt, centos6-vlog2.txt, 
> docker-inspect.json, docker-logs.txt
>
>
> You can find the verbose logs attached.
> The most interesting part:
> {code}
> I0108 16:35:45.887037 17805 sched.cpp:897] Received 1 offers
> I0108 16:35:45.887070 17805 sched.cpp:921] Scheduler::resourceOffers took 
> 12130ns
> I0108 16:35:45.985957 17808 docker.cpp:349] Unable to detect IP Address at 
> 'NetworkSettings.Networks.host.IPAddress', attempting deprecated field
> I0108 16:35:45.986428 17809 task_status_update_manager.cpp:328] Received task 
> status update TASK_FAILED (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) 
> for task 1 of framework f09c89e1-aa62-4662-bda8-15a2c87f412e-
> I0108 16:35:45.986552 17809 task_status_update_manager.cpp:383] Forwarding 
> task status update TASK_FAILED (Status UUID: 
> 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e- to the agent
> I0108 16:35:45.986654 17809 slave.cpp:5209] Forwarding the update TASK_FAILED 
> (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e- to master@172.16.10.110:37252
> I0108 16:35:45.986795 17809 slave.cpp:5102] Task status update manager 
> successfully handled status update TASK_FAILED (Status UUID: 
> 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e-
> I0108 16:35:45.986829 17809 slave.cpp:5118] Sending acknowledgement for 
> status update TASK_FAILED (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) 
> for task 1 of framework f09c89e1-aa62-4662-bda8-15a2c87f412e- to 
> executor(1)@172.16.10.110:38499
> I0108 16:35:45.986901 17805 master.cpp:7890] Status update TASK_FAILED 
> (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e- from agent 
> f09c89e1-aa62-4662-bda8-15a2c87f412e-S0 at slave(1)@172.16.10.110:37252 
> (ip-172-16-10-110.ec2.internal)
> I0108 16:35:45.986928 17805 master.cpp:7946] Forwarding status update 
> TASK_FAILED (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of 
> framework f09c89e1-aa62-4662-bda8-15a2c87f412e-
> I0108 16:35:45.986984 17805 master.cpp:10193] Updating the state of task 1 of 
> framework f09c89e1-aa62-4662-bda8-15a2c87f412e- (latest state: 
> TASK_FAILED, status update state: TASK_FAILED)
> I0108 16:35:45.987047 17805 sched.cpp:990] Received status update TASK_FAILED 
> (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e- from slave(1)@172.16.10.110:37252
> I0108 16:35:45.987103 17805 sched.cpp:1029] Scheduler::statusUpdate took 
> 30948ns
> I0108 16:35:45.987112 17805 sched.cpp:1048] Sending ACK for status update 
> TASK_FAILED (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of 
> framework f09c89e1-aa62-4662-bda8-15a2c87f412e- to 
> master@172.16.10.110:37252
> I0108 16:35:45.987221 17805 master.cpp:5826] Processing ACKNOWLEDGE call 
> 7f544700-215b-4d27-ab43-b48e19592d00 for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e- (default) at 
> scheduler-4ad5073e-c1db-4c34-9c43-e656c280a724@172.16.10.110:37252 on agent 
> f09c89e1-aa62-4662-bda8-15a2c87f412e-S0
> I0108 16:35:45.987267 17805 master.cpp:10299] Removing task 1 with resources 
> cpus(allocated: *):2; mem(allocated: *):1024; disk(allocated: *):1024; 
> ports(allocated: *):[31000-32000] of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e- on agent 
> f09c89e1-aa62-4662-bda8-15a2c87f412e-S0 at slave(1)@172.16.10.110:37252 
> (ip-172-16-10-110.ec2.internal)
> I0108 16:35:45.987473 17807 task_status_update_manager.cpp:401] Received task 
> status update acknowledgement (UUID: 7f544700-215b-4d27-ab43-b48e19592d00) 
> for task 1 of framework f09c89e1-aa62-4662-bda8-15a2c87f412e-
> I0108 16:35:45.987561 17807 task_status_update_manager.cpp:538] Cleaning up 
> status update stream for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e-
> I0108 16:35:45.987814 17807 slave.cpp:3974] Task status update manager 
> successfully handled status update acknowledgement (UUID: 
> 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
> 

[jira] [Updated] (MESOS-8414) DockerContainerizerTest.ROOT_DOCKER_Logs fails on CentOS 6

2018-01-10 Thread Armand Grillet (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Armand Grillet updated MESOS-8414:
--
Description: 
You can find the verbose logs attached.

The most interesting part:
{code}
I0108 16:35:45.887037 17805 sched.cpp:897] Received 1 offers
I0108 16:35:45.887070 17805 sched.cpp:921] Scheduler::resourceOffers took 
12130ns
I0108 16:35:45.985957 17808 docker.cpp:349] Unable to detect IP Address at 
'NetworkSettings.Networks.host.IPAddress', attempting deprecated field
I0108 16:35:45.986428 17809 task_status_update_manager.cpp:328] Received task 
status update TASK_FAILED (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) 
for task 1 of framework f09c89e1-aa62-4662-bda8-15a2c87f412e-
I0108 16:35:45.986552 17809 task_status_update_manager.cpp:383] Forwarding task 
status update TASK_FAILED (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) 
for task 1 of framework f09c89e1-aa62-4662-bda8-15a2c87f412e- to the agent
I0108 16:35:45.986654 17809 slave.cpp:5209] Forwarding the update TASK_FAILED 
(Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
f09c89e1-aa62-4662-bda8-15a2c87f412e- to master@172.16.10.110:37252
I0108 16:35:45.986795 17809 slave.cpp:5102] Task status update manager 
successfully handled status update TASK_FAILED (Status UUID: 
7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
f09c89e1-aa62-4662-bda8-15a2c87f412e-
I0108 16:35:45.986829 17809 slave.cpp:5118] Sending acknowledgement for status 
update TASK_FAILED (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 
1 of framework f09c89e1-aa62-4662-bda8-15a2c87f412e- to 
executor(1)@172.16.10.110:38499
I0108 16:35:45.986901 17805 master.cpp:7890] Status update TASK_FAILED (Status 
UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
f09c89e1-aa62-4662-bda8-15a2c87f412e- from agent 
f09c89e1-aa62-4662-bda8-15a2c87f412e-S0 at slave(1)@172.16.10.110:37252 
(ip-172-16-10-110.ec2.internal)
I0108 16:35:45.986928 17805 master.cpp:7946] Forwarding status update 
TASK_FAILED (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of 
framework f09c89e1-aa62-4662-bda8-15a2c87f412e-
I0108 16:35:45.986984 17805 master.cpp:10193] Updating the state of task 1 of 
framework f09c89e1-aa62-4662-bda8-15a2c87f412e- (latest state: TASK_FAILED, 
status update state: TASK_FAILED)
I0108 16:35:45.987047 17805 sched.cpp:990] Received status update TASK_FAILED 
(Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
f09c89e1-aa62-4662-bda8-15a2c87f412e- from slave(1)@172.16.10.110:37252
I0108 16:35:45.987103 17805 sched.cpp:1029] Scheduler::statusUpdate took 30948ns
I0108 16:35:45.987112 17805 sched.cpp:1048] Sending ACK for status update 
TASK_FAILED (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of 
framework f09c89e1-aa62-4662-bda8-15a2c87f412e- to 
master@172.16.10.110:37252
I0108 16:35:45.987221 17805 master.cpp:5826] Processing ACKNOWLEDGE call 
7f544700-215b-4d27-ab43-b48e19592d00 for task 1 of framework 
f09c89e1-aa62-4662-bda8-15a2c87f412e- (default) at 
scheduler-4ad5073e-c1db-4c34-9c43-e656c280a724@172.16.10.110:37252 on agent 
f09c89e1-aa62-4662-bda8-15a2c87f412e-S0
I0108 16:35:45.987267 17805 master.cpp:10299] Removing task 1 with resources 
cpus(allocated: *):2; mem(allocated: *):1024; disk(allocated: *):1024; 
ports(allocated: *):[31000-32000] of framework 
f09c89e1-aa62-4662-bda8-15a2c87f412e- on agent 
f09c89e1-aa62-4662-bda8-15a2c87f412e-S0 at slave(1)@172.16.10.110:37252 
(ip-172-16-10-110.ec2.internal)
I0108 16:35:45.987473 17807 task_status_update_manager.cpp:401] Received task 
status update acknowledgement (UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for 
task 1 of framework f09c89e1-aa62-4662-bda8-15a2c87f412e-
I0108 16:35:45.987561 17807 task_status_update_manager.cpp:538] Cleaning up 
status update stream for task 1 of framework 
f09c89e1-aa62-4662-bda8-15a2c87f412e-
I0108 16:35:45.987814 17807 slave.cpp:3974] Task status update manager 
successfully handled status update acknowledgement (UUID: 
7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
f09c89e1-aa62-4662-bda8-15a2c87f412e-
I0108 16:35:45.987849 17807 slave.cpp:8935] Completing task 1
{code}

After further testing, 
https://github.com/apache/mesos/blob/51a3bd95bd2d740a39b55634251abeadb561e5c8/src/docker/docker.cpp#L384
 appears to never be reached.

  was:
You can find the verbose logs attached.

The most interesting part:
{code}
I0108 16:35:45.887037 17805 sched.cpp:897] Received 1 offers
I0108 16:35:45.887070 17805 sched.cpp:921] Scheduler::resourceOffers took 
12130ns
I0108 16:35:45.985957 17808 docker.cpp:349] Unable to detect IP Address at 
'NetworkSettings.Networks.host.IPAddress', attempting deprecated field
I0108 16:35:45.986428 17809 task_status_update_manager.cpp:328] Received task 
status 

[jira] [Created] (MESOS-8424) Test that operations are correctly reported following a master failover

2018-01-10 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-8424:
---

 Summary: Test that operations are correctly reported following a 
master failover
 Key: MESOS-8424
 URL: https://issues.apache.org/jira/browse/MESOS-8424
 Project: Mesos
  Issue Type: Task
  Components: master
Reporter: Jan Schlicht
Assignee: Jan Schlicht


As the master keeps track of operations running on a resource provider, it 
needs to be updated on these operations when agents reregister after a master 
failover. E.g., an operation that has finished during the failover should be 
reported as finished by the master after the agent on which the resource 
provider is running has reregistered.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8424) Test that operations are correctly reported following a master failover

2018-01-10 Thread Jan Schlicht (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht updated MESOS-8424:

  Sprint: Mesosphere Sprint 72
Story Points: 3

> Test that operations are correctly reported following a master failover
> ---
>
> Key: MESOS-8424
> URL: https://issues.apache.org/jira/browse/MESOS-8424
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>
> As the master keeps track of operations running on a resource provider, it 
> needs to be updated on these operations when agents reregister after a master 
> failover. E.g., an operation that has finished during the failover should be 
> reported as finished by the master after the agent on which the resource 
> provider is running has reregistered.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8414) DockerContainerizerTest.ROOT_DOCKER_Logs fails on CentOS 6

2018-01-10 Thread Armand Grillet (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Armand Grillet updated MESOS-8414:
--
Attachment: docker-logs.txt

> DockerContainerizerTest.ROOT_DOCKER_Logs fails on CentOS 6
> --
>
> Key: MESOS-8414
> URL: https://issues.apache.org/jira/browse/MESOS-8414
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: CentOS 6, Docker version 1.7.1, build 786b29d
>Reporter: Armand Grillet
> Attachments: 
> centos6-ssl-DockerContainerizerTest.ROOT_DOCKER_Logs.txt, centos6-vlog2.txt, 
> docker-logs.txt
>
>
> You can find the verbose logs attached.
> The most interesting part:
> {code}
> I0108 16:35:45.887037 17805 sched.cpp:897] Received 1 offers
> I0108 16:35:45.887070 17805 sched.cpp:921] Scheduler::resourceOffers took 
> 12130ns
> I0108 16:35:45.985957 17808 docker.cpp:349] Unable to detect IP Address at 
> 'NetworkSettings.Networks.host.IPAddress', attempting deprecated field
> I0108 16:35:45.986428 17809 task_status_update_manager.cpp:328] Received task 
> status update TASK_FAILED (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) 
> for task 1 of framework f09c89e1-aa62-4662-bda8-15a2c87f412e-
> I0108 16:35:45.986552 17809 task_status_update_manager.cpp:383] Forwarding 
> task status update TASK_FAILED (Status UUID: 
> 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e- to the agent
> I0108 16:35:45.986654 17809 slave.cpp:5209] Forwarding the update TASK_FAILED 
> (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e- to master@172.16.10.110:37252
> I0108 16:35:45.986795 17809 slave.cpp:5102] Task status update manager 
> successfully handled status update TASK_FAILED (Status UUID: 
> 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e-
> I0108 16:35:45.986829 17809 slave.cpp:5118] Sending acknowledgement for 
> status update TASK_FAILED (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) 
> for task 1 of framework f09c89e1-aa62-4662-bda8-15a2c87f412e- to 
> executor(1)@172.16.10.110:38499
> I0108 16:35:45.986901 17805 master.cpp:7890] Status update TASK_FAILED 
> (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e- from agent 
> f09c89e1-aa62-4662-bda8-15a2c87f412e-S0 at slave(1)@172.16.10.110:37252 
> (ip-172-16-10-110.ec2.internal)
> I0108 16:35:45.986928 17805 master.cpp:7946] Forwarding status update 
> TASK_FAILED (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of 
> framework f09c89e1-aa62-4662-bda8-15a2c87f412e-
> I0108 16:35:45.986984 17805 master.cpp:10193] Updating the state of task 1 of 
> framework f09c89e1-aa62-4662-bda8-15a2c87f412e- (latest state: 
> TASK_FAILED, status update state: TASK_FAILED)
> I0108 16:35:45.987047 17805 sched.cpp:990] Received status update TASK_FAILED 
> (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e- from slave(1)@172.16.10.110:37252
> I0108 16:35:45.987103 17805 sched.cpp:1029] Scheduler::statusUpdate took 
> 30948ns
> I0108 16:35:45.987112 17805 sched.cpp:1048] Sending ACK for status update 
> TASK_FAILED (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of 
> framework f09c89e1-aa62-4662-bda8-15a2c87f412e- to 
> master@172.16.10.110:37252
> I0108 16:35:45.987221 17805 master.cpp:5826] Processing ACKNOWLEDGE call 
> 7f544700-215b-4d27-ab43-b48e19592d00 for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e- (default) at 
> scheduler-4ad5073e-c1db-4c34-9c43-e656c280a724@172.16.10.110:37252 on agent 
> f09c89e1-aa62-4662-bda8-15a2c87f412e-S0
> I0108 16:35:45.987267 17805 master.cpp:10299] Removing task 1 with resources 
> cpus(allocated: *):2; mem(allocated: *):1024; disk(allocated: *):1024; 
> ports(allocated: *):[31000-32000] of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e- on agent 
> f09c89e1-aa62-4662-bda8-15a2c87f412e-S0 at slave(1)@172.16.10.110:37252 
> (ip-172-16-10-110.ec2.internal)
> I0108 16:35:45.987473 17807 task_status_update_manager.cpp:401] Received task 
> status update acknowledgement (UUID: 7f544700-215b-4d27-ab43-b48e19592d00) 
> for task 1 of framework f09c89e1-aa62-4662-bda8-15a2c87f412e-
> I0108 16:35:45.987561 17807 task_status_update_manager.cpp:538] Cleaning up 
> status update stream for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e-
> I0108 16:35:45.987814 17807 slave.cpp:3974] Task status update manager 
> successfully handled status update acknowledgement (UUID: 
> 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
> 

[jira] [Updated] (MESOS-8414) DockerContainerizerTest.ROOT_DOCKER_Logs fails on CentOS 6

2018-01-10 Thread Armand Grillet (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Armand Grillet updated MESOS-8414:
--
Attachment: centos6-vlog2.txt

> DockerContainerizerTest.ROOT_DOCKER_Logs fails on CentOS 6
> --
>
> Key: MESOS-8414
> URL: https://issues.apache.org/jira/browse/MESOS-8414
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: CentOS 6, Docker version 1.7.1, build 786b29d
>Reporter: Armand Grillet
> Attachments: 
> centos6-ssl-DockerContainerizerTest.ROOT_DOCKER_Logs.txt, centos6-vlog2.txt
>
>
> You can find the verbose logs attached.
> The most interesting part:
> {code}
> I0108 16:35:45.887037 17805 sched.cpp:897] Received 1 offers
> I0108 16:35:45.887070 17805 sched.cpp:921] Scheduler::resourceOffers took 
> 12130ns
> I0108 16:35:45.985957 17808 docker.cpp:349] Unable to detect IP Address at 
> 'NetworkSettings.Networks.host.IPAddress', attempting deprecated field
> I0108 16:35:45.986428 17809 task_status_update_manager.cpp:328] Received task 
> status update TASK_FAILED (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) 
> for task 1 of framework f09c89e1-aa62-4662-bda8-15a2c87f412e-
> I0108 16:35:45.986552 17809 task_status_update_manager.cpp:383] Forwarding 
> task status update TASK_FAILED (Status UUID: 
> 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e- to the agent
> I0108 16:35:45.986654 17809 slave.cpp:5209] Forwarding the update TASK_FAILED 
> (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e- to master@172.16.10.110:37252
> I0108 16:35:45.986795 17809 slave.cpp:5102] Task status update manager 
> successfully handled status update TASK_FAILED (Status UUID: 
> 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e-
> I0108 16:35:45.986829 17809 slave.cpp:5118] Sending acknowledgement for 
> status update TASK_FAILED (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) 
> for task 1 of framework f09c89e1-aa62-4662-bda8-15a2c87f412e- to 
> executor(1)@172.16.10.110:38499
> I0108 16:35:45.986901 17805 master.cpp:7890] Status update TASK_FAILED 
> (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e- from agent 
> f09c89e1-aa62-4662-bda8-15a2c87f412e-S0 at slave(1)@172.16.10.110:37252 
> (ip-172-16-10-110.ec2.internal)
> I0108 16:35:45.986928 17805 master.cpp:7946] Forwarding status update 
> TASK_FAILED (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of 
> framework f09c89e1-aa62-4662-bda8-15a2c87f412e-
> I0108 16:35:45.986984 17805 master.cpp:10193] Updating the state of task 1 of 
> framework f09c89e1-aa62-4662-bda8-15a2c87f412e- (latest state: 
> TASK_FAILED, status update state: TASK_FAILED)
> I0108 16:35:45.987047 17805 sched.cpp:990] Received status update TASK_FAILED 
> (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e- from slave(1)@172.16.10.110:37252
> I0108 16:35:45.987103 17805 sched.cpp:1029] Scheduler::statusUpdate took 
> 30948ns
> I0108 16:35:45.987112 17805 sched.cpp:1048] Sending ACK for status update 
> TASK_FAILED (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of 
> framework f09c89e1-aa62-4662-bda8-15a2c87f412e- to 
> master@172.16.10.110:37252
> I0108 16:35:45.987221 17805 master.cpp:5826] Processing ACKNOWLEDGE call 
> 7f544700-215b-4d27-ab43-b48e19592d00 for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e- (default) at 
> scheduler-4ad5073e-c1db-4c34-9c43-e656c280a724@172.16.10.110:37252 on agent 
> f09c89e1-aa62-4662-bda8-15a2c87f412e-S0
> I0108 16:35:45.987267 17805 master.cpp:10299] Removing task 1 with resources 
> cpus(allocated: *):2; mem(allocated: *):1024; disk(allocated: *):1024; 
> ports(allocated: *):[31000-32000] of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e- on agent 
> f09c89e1-aa62-4662-bda8-15a2c87f412e-S0 at slave(1)@172.16.10.110:37252 
> (ip-172-16-10-110.ec2.internal)
> I0108 16:35:45.987473 17807 task_status_update_manager.cpp:401] Received task 
> status update acknowledgement (UUID: 7f544700-215b-4d27-ab43-b48e19592d00) 
> for task 1 of framework f09c89e1-aa62-4662-bda8-15a2c87f412e-
> I0108 16:35:45.987561 17807 task_status_update_manager.cpp:538] Cleaning up 
> status update stream for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e-
> I0108 16:35:45.987814 17807 slave.cpp:3974] Task status update manager 
> successfully handled status update acknowledgement (UUID: 
> 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
> f09c89e1-aa62-4662-bda8-15a2c87f412e-
> I0108 

[jira] [Commented] (MESOS-8078) Some fields went missing with no replacement in api/v1

2018-01-10 Thread Dmitrii Rozhkov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16319957#comment-16319957
 ] 

Dmitrii Rozhkov commented on MESOS-8078:


Thanks Greg!

> Some fields went missing with no replacement in api/v1
> --
>
> Key: MESOS-8078
> URL: https://issues.apache.org/jira/browse/MESOS-8078
> Project: Mesos
>  Issue Type: Story
>  Components: HTTP API
>Reporter: Dmitrii Rozhkov
>Assignee: Greg Mann
>Priority: Critical
>  Labels: mesosphere
>
> Hi friends,
> These fields are available via state.json but went missing in v1 of the API:
> -leader_info- -> available via GET_MASTER, which should always return the 
> leading master's info
> start_time
> elected_time
> As we're showing them on the Overview page of the DC/OS UI but would like to 
> stop using state.json, it would be great to have them somewhere in v1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8423) Improving debug logging in Mesos Containerizer.

2018-01-10 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-8423:
---

 Summary: Improving debug logging in Mesos Containerizer.
 Key: MESOS-8423
 URL: https://issues.apache.org/jira/browse/MESOS-8423
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: Gilbert Song






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8391) Mesos agent doesn't notice that a pod task exits or crashes after the agent restart

2018-01-10 Thread Gilbert Song (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16319920#comment-16319920
 ] 

Gilbert Song commented on MESOS-8391:
-

The root cause has been found. This bug was introduced by this patch: 
https://reviews.apache.org/r/63887/

Basically, Marathon does not have the correct task update because the master 
does not send it, and the master itself does not have the correct task status. 
The agent v1 API {{WAIT_NESTED_CONTAINER}} is called by the default executor 
and the agent forwards it to the composing containerizer, but 
{{ComposingContainerizer::wait()}} skips it.
https://github.com/apache/mesos/blob/master/src/slave/containerizer/composing.cpp#L585~#L587

This bug in the composing containerizer is only reproducible after the agent 
restarts and a task is killed. When a task is killed, the hashmap 
{{containers_}} in the composing containerizer is not maintained correctly and 
the termination future is returned instead. Before r/63887 there was no such 
problem, because the composing containerizer called the underlying 
containerizer's wait() directly.
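
A loose, self-contained illustration of that failure mode, using simplified 
stand-in types rather than the real {{ComposingContainerizer}} and underlying 
containerizer interfaces: a wrapper that answers {{wait()}} from its own 
(possibly stale) bookkeeping instead of delegating can leave the caller waiting 
on a termination that never reflects the real exit.

{code}
// Not the actual Mesos code paths; a simplified illustration only.
#include <iostream>
#include <map>
#include <optional>
#include <string>

struct Termination { int exitStatus; };

// Stands in for the underlying containerizer, which knows the real exit.
struct UnderlyingContainerizer
{
  std::optional<Termination> wait(const std::string& containerId) const
  {
    return Termination{137};  // the nested container was killed
  }
};

// Stands in for a composing wrapper that keeps its own bookkeeping.
struct ComposingContainerizer
{
  // After an agent restart, this map may hold stale entries whose
  // termination is never filled in (the situation described above).
  std::map<std::string, std::optional<Termination>> containers_;
  UnderlyingContainerizer underlying;

  std::optional<Termination> wait(const std::string& containerId) const
  {
    auto it = containers_.find(containerId);
    if (it != containers_.end()) {
      // Short-circuits on the wrapper's own (stale) state: the caller
      // never observes the real exit reported by the underlying layer.
      return it->second;
    }

    // Pre-r/63887 style behavior: always ask the underlying containerizer.
    return underlying.wait(containerId);
  }
};

int main()
{
  ComposingContainerizer composing;
  composing.containers_["pod.nested-1"] = std::nullopt;  // stale entry

  std::optional<Termination> result = composing.wait("pod.nested-1");
  std::cout << (result.has_value() ? "terminated" : "still waiting") << '\n';
}
{code}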

> Mesos agent doesn't notice that a pod task exits or crashes after the agent 
> restart
> ---
>
> Key: MESOS-8391
> URL: https://issues.apache.org/jira/browse/MESOS-8391
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization, executor
>Affects Versions: 1.5.0
>Reporter: Ivan Chernetsky
>Assignee: Gilbert Song
>Priority: Blocker
> Attachments: testing-log-2.tar.gz
>
>
> h4. (1) Agent doesn't detect that a pod task exits/crashes
> # Create a Marathon pod with two containers which just do {{sleep 1}}.
> # Restart the Mesos agent on the node where the pod got launched.
> # Kill one of the pod tasks.
> *Expected result*: The Mesos agent detects that one of the tasks got killed, 
> and forwards {{TASK_FAILED}} status to Marathon.
> *Actual result*: The Mesos agent does nothing, and the Mesos master thinks 
> that both tasks are running just fine. Marathon doesn't take any action 
> because it doesn't receive any update from Mesos.
> h4. (2) After the agent restart, it detects that the task crashed, forwards 
> the correct status update, but the other task stays in {{TASK_KILLING}} state 
> forever
> # Perform steps in (1).
> # Restart the Mesos agent
> *Expected result*: The Mesos agent detects that one of the tasks crashed, 
> forwards the corresponding status update, and kills the other task too.
> *Actual result*: The Mesos agent detects that one of the tasks crashed, 
> forwards the corresponding status update, but the other task stays in the 
> `TASK_KILLING` state forever.
> Please note that after another agent restart, the other task finally gets 
> killed and the correct status updates get propagated all the way to Marathon.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)