[jira] [Created] (MESOS-9295) Nested container launch could fail if the agent upgrades with new cgroup subsystems.

2018-10-04 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9295:
---

 Summary: Nested container launch could fail if the agent upgrades 
with new cgroup subsystems.
 Key: MESOS-9295
 URL: https://issues.apache.org/jira/browse/MESOS-9295
 Project: Mesos
  Issue Type: Bug
Reporter: Gilbert Song


A nested container launch could fail if the agent is upgraded with new cgroup 
subsystems, because the new subsystems do not exist in the parent container's 
cgroup hierarchy.
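The failing precondition can be sketched as a set difference (hypothetical helper and names, not Mesos code): any subsystem the upgraded agent manages that is absent from the parent container's hierarchy will make the nested launch fail.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper (not Mesos code): which cgroup subsystems does the
// upgraded agent expect that the parent container's hierarchy lacks?
// A nested container launch would fail for any subsystem returned here.
std::vector<std::string> missingSubsystems(
    const std::set<std::string>& agentSubsystems,
    const std::set<std::string>& parentSubsystems)
{
  std::vector<std::string> missing;
  std::set_difference(
      agentSubsystems.begin(), agentSubsystems.end(),
      parentSubsystems.begin(), parentSubsystems.end(),
      std::back_inserter(missing));
  return missing;
}
```

For example, an agent upgraded to additionally manage the `pids` subsystem would report `pids` as missing for a parent container created before the upgrade; a fix would create the missing cgroups rather than failing the nested launch.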



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9283) Docker containerizer actor can get backlogged with large number of containers.

2018-10-04 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16639075#comment-16639075
 ] 

Greg Mann commented on MESOS-9283:
--

This has been merged into master, but not yet backported. I plan to backport 
tomorrow.

> Docker containerizer actor can get backlogged with large number of containers.
> --
>
> Key: MESOS-9283
> URL: https://issues.apache.org/jira/browse/MESOS-9283
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.5.1, 1.6.1, 1.7.0
>Reporter: Jie Yu
>Assignee: Greg Mann
>Priority: Major
>  Labels: performance
> Fix For: 1.8.0
>
> Attachments: Screen Shot 2018-10-01 at 10.54.18 PM.png
>
>
> We observed this during some internal scale testing.
> When launching 300+ Docker containers on a single agent box, it's possible 
> that the Docker containerizer actor gets backlogged. As a result, API 
> processing like `GET_CONTAINERS` will become unresponsive. It will also block 
> the Mesos containerizer from launching containers if one specified 
> `--containerizers=docker,mesos`, because the Docker containerizer launch will 
> be invoked first by the composing containerizer (and queued).
> Profiling results show that the bottleneck is `os::killtree`, which is 
> invoked when the Docker commands are discarded (e.g., on client disconnect).
> For this particular case, killtree is not really necessary because the docker 
> command does not fork additional subprocesses. If we use the argv version of 
> `subprocess` to launch docker commands, we can simply use `os::kill` instead. 
> We confirmed that, by switching to `os::kill`, the performance issue goes 
> away, and the agent can easily scale up to 300+ containers.
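The difference can be illustrated with a minimal POSIX sketch (illustrative only, not Mesos/stout code): when the launched command forks no children, a single kill(2) on the known pid suffices, whereas a killtree must first walk the process tree to discover descendants.

```cpp
#include <csignal>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Illustrative only (not Mesos/stout code): terminate a child that forks
// no grandchildren with a single kill(2) -- no process-tree walk needed.
// Returns true if the child was reaped after dying from SIGKILL.
bool killAndReap(pid_t pid)
{
  if (::kill(pid, SIGKILL) != 0) {      // O(1), unlike a killtree walk
    return false;
  }

  int status = 0;
  if (::waitpid(pid, &status, 0) != pid) {
    return false;
  }

  return WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL;
}
```

The design point is that the argv version of `subprocess` avoids an intermediate shell, so the launched docker command is the only process to reap and a plain kill is safe.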





[jira] [Created] (MESOS-9294) Document which fields are ignored during framework reregistration

2018-10-04 Thread Greg Mann (JIRA)
Greg Mann created MESOS-9294:


 Summary: Document which fields are ignored during framework 
reregistration
 Key: MESOS-9294
 URL: https://issues.apache.org/jira/browse/MESOS-9294
 Project: Mesos
  Issue Type: Documentation
  Components: scheduler api
Affects Versions: 1.7.0
Reporter: Greg Mann


When a framework reregisters, some fields in the {{FrameworkInfo}} are ignored, 
and the master simply uses whatever values were provided during the initial 
registration. We should document this behavior.
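The behavior to document could be sketched roughly as follows (simplified structs and an assumed immutable field for illustration, not the real v1 protobufs or the master's actual field list):

```cpp
#include <string>

// Simplified sketch (not the real v1 protobufs): on framework
// reregistration, the master keeps the originally registered values for
// fields it treats as immutable, and only honors updates to the rest.
struct FrameworkInfo
{
  std::string user;   // assumed immutable after initial registration
  std::string name;   // assumed updatable, for illustration
};

FrameworkInfo onReregistration(
    const FrameworkInfo& original,   // from initial registration
    const FrameworkInfo& update)     // from the reregistration attempt
{
  FrameworkInfo result = update;
  result.user = original.user;       // new value is silently ignored
  return result;
}
```

Which fields fall into each category is exactly what the documentation task should pin down.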





[jira] [Commented] (MESOS-9283) Docker containerizer actor can get backlogged with large number of containers.

2018-10-04 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16639037#comment-16639037
 ] 

Greg Mann commented on MESOS-9283:
--

I tested the backport to 1.5.x, and the conflicts were not bad. I squashed both 
of the above patches into one, which I will merge and backport: 
https://reviews.apache.org/r/68923/

> Docker containerizer actor can get backlogged with large number of containers.
> --
>
> Key: MESOS-9283
> URL: https://issues.apache.org/jira/browse/MESOS-9283
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.5.1, 1.6.1, 1.7.0
>Reporter: Jie Yu
>Assignee: Greg Mann
>Priority: Major
>  Labels: performance
> Attachments: Screen Shot 2018-10-01 at 10.54.18 PM.png
>
>
> We observed this during some internal scale testing.
> When launching 300+ Docker containers on a single agent box, it's possible 
> that the Docker containerizer actor gets backlogged. As a result, API 
> processing like `GET_CONTAINERS` will become unresponsive. It will also block 
> the Mesos containerizer from launching containers if one specified 
> `--containerizers=docker,mesos`, because the Docker containerizer launch will 
> be invoked first by the composing containerizer (and queued).
> Profiling results show that the bottleneck is `os::killtree`, which is 
> invoked when the Docker commands are discarded (e.g., on client disconnect).
> For this particular case, killtree is not really necessary because the docker 
> command does not fork additional subprocesses. If we use the argv version of 
> `subprocess` to launch docker commands, we can simply use `os::kill` instead. 
> We confirmed that, by switching to `os::kill`, the performance issue goes 
> away, and the agent can easily scale up to 300+ containers.





[jira] [Comment Edited] (MESOS-9288) Allow operators to limit the number of Docker containers on a host

2018-10-04 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16639018#comment-16639018
 ] 

Greg Mann edited comment on MESOS-9288 at 10/4/18 11:13 PM:


It's worth noting that the Mesos master already does provide a 
{{max_executors_per_agent}} flag, but only when configured with 
{{with-network-isolator}}.


was (Author: greggomann):
It's worth noting that the Mesos master already does provide a 
{{--max_executors_per_agent}} flag, but only when configured with 
{{--with-network-isolator}}.

> Allow operators to limit the number of Docker containers on a host
> --
>
> Key: MESOS-9288
> URL: https://issues.apache.org/jira/browse/MESOS-9288
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Affects Versions: 1.7.0
>Reporter: Greg Mann
>Priority: Major
>  Labels: containerizer, docker, mesosphere
>
> Since we have observed performance issues on machines where a large number 
> (hundreds) of Docker containers are running simultaneously, we should 
> consider adding a Mesos configuration flag to allow operators to limit the 
> number of Docker containers that can be launched on a single host.
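The proposed check could be as simple as the following sketch (the flag name `--max_docker_containers` is assumed here, not an existing Mesos flag):

```cpp
#include <cstddef>

// Hypothetical sketch of the proposed flag's enforcement: refuse new
// Docker container launches once the per-host limit is reached.
bool canLaunchDockerContainer(std::size_t running, std::size_t limit)
{
  return running < limit;
}
```

A launch request arriving when the limit is reached would then fail fast instead of further backlogging the Docker containerizer actor.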





[jira] [Commented] (MESOS-9288) Allow operators to limit the number of Docker containers on a host

2018-10-04 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16639018#comment-16639018
 ] 

Greg Mann commented on MESOS-9288:
--

It's worth noting that the Mesos master already does provide a 
{{--max_executors_per_agent}} flag, but only when configured with 
{{--with-network-isolator}}.

> Allow operators to limit the number of Docker containers on a host
> --
>
> Key: MESOS-9288
> URL: https://issues.apache.org/jira/browse/MESOS-9288
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Affects Versions: 1.7.0
>Reporter: Greg Mann
>Priority: Major
>  Labels: containerizer, docker, mesosphere
>
> Since we have observed performance issues on machines where a large number 
> (hundreds) of Docker containers are running simultaneously, we should 
> consider adding a Mesos configuration flag to allow operators to limit the 
> number of Docker containers that can be launched on a single host.





[jira] [Assigned] (MESOS-9293) OperationStatus messages sent to framework should include both agent ID and resource provider ID

2018-10-04 Thread Vinod Kone (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-9293:
-

Assignee: Gastón Kleiman

> OperationStatus messages sent to framework should include both agent ID and 
> resource provider ID
> 
>
> Key: MESOS-9293
> URL: https://issues.apache.org/jira/browse/MESOS-9293
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.7.0
>Reporter: James DeFelice
>Assignee: Gastón Kleiman
>Priority: Major
>  Labels: mesosphere, operation-feedback
>
> Normally, frameworks are expected to checkpoint agent ID and resource 
> provider ID before accepting an offer with an OfferOperation. From this 
> expectation comes the requirement in the v1 scheduler API that a framework 
> must provide the agent ID and resource provider ID when acknowledging an 
> offer operation status update. However, this expectation breaks down:
> 1. The framework might lose its checkpointed data; it then no longer 
> remembers the agent ID or the resource provider ID.
> 2. Even if the framework checkpoints data, it could be sent a stale update: 
> maybe the original ACK it sent to Mesos was lost, and it needs to re-ACK. If 
> a framework deleted its checkpointed data after sending the ACK (which was 
> dropped), then upon replay of the status update it no longer has the agent ID 
> or resource provider ID for the operation.
> An easy remedy would be to add the agent ID and resource provider ID to the 
> OperationStatus message received by the scheduler so that a framework can 
> build a proper ACK for the update, even if it doesn't have access to its 
> previously checkpointed information.
> I'm filing this as a BUG because there's no way to reliably use the offer 
> operation status API until this has been fixed.





[jira] [Created] (MESOS-9293) OperationStatus messages sent to framework should include both agent ID and resource provider ID

2018-10-04 Thread James DeFelice (JIRA)
James DeFelice created MESOS-9293:
-

 Summary: OperationStatus messages sent to framework should include 
both agent ID and resource provider ID
 Key: MESOS-9293
 URL: https://issues.apache.org/jira/browse/MESOS-9293
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.7.0
Reporter: James DeFelice


Normally, frameworks are expected to checkpoint agent ID and resource provider 
ID before accepting an offer with an OfferOperation. From this expectation 
comes the requirement in the v1 scheduler API that a framework must provide the 
agent ID and resource provider ID when acknowledging an offer operation status 
update. However, this expectation breaks down:

1. The framework might lose its checkpointed data; it then no longer remembers 
the agent ID or the resource provider ID.

2. Even if the framework checkpoints data, it could be sent a stale update: 
maybe the original ACK it sent to Mesos was lost, and it needs to re-ACK. If a 
framework deleted its checkpointed data after sending the ACK (which was 
dropped), then upon replay of the status update it no longer has the agent ID 
or resource provider ID for the operation.

An easy remedy would be to add the agent ID and resource provider ID to the 
OperationStatus message received by the scheduler so that a framework can build 
a proper ACK for the update, even if it doesn't have access to its previously 
checkpointed information.

I'm filing this as a BUG because there's no way to reliably use the offer 
operation status API until this has been fixed.





[jira] [Created] (MESOS-9292) Rejected quotas should include a reason in their error message

2018-10-04 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9292:
--

 Summary: Rejected quotas should include a reason in their error 
message
 Key: MESOS-9292
 URL: https://issues.apache.org/jira/browse/MESOS-9292
 Project: Mesos
  Issue Type: Improvement
Reporter: Benno Evers


If we reject a quota request due to not having enough available resources, we 
fail with the following error:
{noformat}
Not enough available cluster capacity to reasonably satisfy quota
request; the force flag can be used to override this check
{noformat}

but we don't print *which* resource was not available. This can be confusing to 
operators when the quota request tried to set limits for multiple resources at 
once.
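A sketch of the suggested improvement (hypothetical helper, not the master's actual code): name the unsatisfiable resources in the rejection message.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch: build the rejection message so it names which
// resources lacked capacity, instead of a generic one-size-fits-all error.
std::string quotaRejectionMessage(const std::vector<std::string>& insufficient)
{
  std::string message =
      "Not enough available cluster capacity to reasonably satisfy quota "
      "request for resources [";
  for (std::size_t i = 0; i < insufficient.size(); ++i) {
    if (i > 0) {
      message += ", ";
    }
    message += insufficient[i];
  }
  message += "]; the force flag can be used to override this check";
  return message;
}
```

A request spanning cpus, mem, and disk that only falls short on mem would then tell the operator exactly that.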





[jira] [Comment Edited] (MESOS-9274) v1 JAVA scheduler library can drop TEARDOWN upon destruction.

2018-10-04 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630527#comment-16630527
 ] 

Alexander Rukletsov edited comment on MESOS-9274 at 10/4/18 9:01 AM:
-

I see several possible solutions here:
* Ensure the JAVA scheduler library is not destructed after {{TEARDOWN}} is 
sent. This is out of our control, hence it does not seem like a good solution 
or user experience.
* Add {{sleep(5)}} in 
[{{V1Mesos::finalize()}}|https://github.com/apache/mesos/blob/270c4cb62f5680bcf952bfb7ec8dfc10843f21e0/src/java/jni/org_apache_mesos_v1_scheduler_V1Mesos.cpp#L258].
 This is a hacky solution, but it [_follows the 
pattern_|https://github.com/apache/mesos/blob/86653356d763fee79e9467cf7b07bebb449e8aff/src/launcher/default_executor.cpp#L1082]
 ;).
* Use {{Mesos::call()}} instead of {{Mesos::send()}} and wait for the response 
in {{V1Mesos::send()}}. This seems like the cleanest solution.


was (Author: alexr):
I see several possible solutions here:
* Ensure the JAVA scheduler library is not destructed after {{TEARDOWN}} is 
sent. This is out of our control hence does not seem like a good solution or 
user experience
* Add {{sleep(5)}} in 
[{{V1Mesos::finalize()}}|https://github.com/apache/mesos/blob/270c4cb62f5680bcf952bfb7ec8dfc10843f21e0/src/java/jni/org_apache_mesos_v1_scheduler_V1Mesos.cpp#L258].
 This is a hacky solution but it [_follows the 
pattern_|https://github.com/apache/mesos/blob/86653356d763fee79e9467cf7b07bebb449e8aff/src/launcher/default_executor.cpp#L1082]
 ;).
* Use {[Mesos::call()}} instead of {{Mesos::send()}} and wait for the response 
in {{v1Mesos::send()}}. This seems like the cleanest solution.

> v1 JAVA scheduler library can drop TEARDOWN upon destruction.
> -
>
> Key: MESOS-9274
> URL: https://issues.apache.org/jira/browse/MESOS-9274
> Project: Mesos
>  Issue Type: Bug
>  Components: java api, scheduler driver
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>Priority: Major
>  Labels: api, mesosphere, scheduler
>
> Currently, the v1 JAVA scheduler library neither ensures {{Call}}s are sent 
> to the master nor waits for responses. This can be problematic if the library 
> is destroyed (or garbage collected) right after sending a {{TEARDOWN}} call: 
> destruction of the underlying {{Mesos}} actor races with sending the call.
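The "wait for the response" option can be modeled with a minimal std::future sketch (illustrative only, not the actual JNI/libprocess code; the class and method names are invented, and std::launch::deferred stands in for the real asynchronous round trip):

```cpp
#include <future>
#include <string>

// Illustrative model only, not the actual JNI/libprocess code. call()
// returns a future for the master's response; blocking on it before the
// library is destroyed removes the race with the in-flight TEARDOWN.
class SchedulerLibrary
{
public:
  std::future<std::string> call(const std::string& request)
  {
    return std::async(std::launch::deferred, [request]() {
      return "acknowledged: " + request;   // stand-in master response
    });
  }

  void teardownAndWait()
  {
    // Unlike a fire-and-forget send(), this cannot return before the
    // TEARDOWN round trip has completed.
    call("TEARDOWN").get();
  }
};
```

With this shape, even immediate destruction (or garbage collection) of the wrapper after teardown cannot drop the call, because the wrapper only becomes destructible once the response has arrived.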


