[jira] [Created] (MESOS-8417) Mesos can get "stuck" when a Process throws an exception.

2018-01-08 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-8417:
--

 Summary: Mesos can get "stuck" when a Process throws an exception.
 Key: MESOS-8417
 URL: https://issues.apache.org/jira/browse/MESOS-8417
 Project: Mesos
  Issue Type: Bug
  Components: libprocess
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler


When a {{Process}} throws an exception, we log it, terminate the throwing 
{{Process}}, and continue to run. However, I'm not aware of any user-level 
code that currently handles the unexpected termination caused by an uncaught 
exception.

Generally, this means that when an exception is thrown (e.g. a bad call to 
{{std::map::at}}), the {{Process}} terminates with a log message but things get 
"stuck" and the user has to debug what is wrong / kill the process.

Libprocess would likely need to provide some primitives to better support 
handling unexpected termination of a {{Process}} in order for us to provide a 
strategy where we continue running.

In the short term, it would be prudent to abort libprocess if any {{Process}} 
throws an exception so that users can observe the issue and we can get it fixed.
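For illustration only (this is standalone code, not Mesos source), here is a 
minimal sketch of the failure mode and of the proposed short-term policy: a bad 
{{std::map::at}} call throws {{std::out_of_range}}, and rather than terminating 
only the throwing {{Process}} and continuing, the whole process is aborted so 
the issue is observable.

{code:cpp}
// Standalone sketch: a bad std::map::at call throws, and the handler aborts
// the whole process instead of letting execution continue in a "stuck" state.
#include <cstdlib>
#include <iostream>
#include <map>
#include <stdexcept>
#include <string>

int main()
{
  std::map<std::string, int> tasks{{"task-1", 42}};

  try {
    // Throws std::out_of_range because the key does not exist.
    std::cout << tasks.at("task-2") << std::endl;
  } catch (const std::exception& e) {
    // Analogous to the short-term proposal: log and abort so that users can
    // observe the issue, rather than continuing to run.
    std::cerr << "Uncaught exception: " << e.what() << std::endl;
    std::abort();
  }

  return 0;
}
{code}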



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8279) Persistent volumes are not visible in Mesos UI using default executor on Linux.

2018-01-08 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16317514#comment-16317514
 ] 

Qian Zhang commented on MESOS-8279:
---

[~vinodkone] proposed a solution: calling {{Files::attach()}} to attach the 
executor's PV directory to the task's PV directory, so that when users browse 
the task's volume directory in the Mesos UI, what they actually browse is the 
executor's volume directory.

Here is the RR:
https://reviews.apache.org/r/64978/

> Persistent volumes are not visible in Mesos UI using default executor on 
> Linux.
> ---
>
> Key: MESOS-8279
> URL: https://issues.apache.org/jira/browse/MESOS-8279
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.2, 1.3.1, 1.4.1
>Reporter: Jie Yu
>Assignee: Qian Zhang
>
> The reason is that on Linux, if multiple containers in a default executor 
> want to share a persistent volume, a SANDBOX_PATH volume source with type 
> PARENT is used. This is translated into a bind mount in the nested 
> container's mount namespace and is thus not visible in the host mount 
> namespace, while the Mesos UI operates in the host mount namespace.
> One potential solution is to create a symlink (instead of just a mkdir) in 
> the sandbox. The symlink will be shadowed by the bind mount in the nested 
> container, but in the host mount namespace it'll point to the corresponding 
> persistent volume.
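
A minimal standalone sketch of the symlink idea mentioned in the description 
above; the paths below are hypothetical and this is not the agent's actual 
code:

{code:cpp}
// Standalone sketch of the symlink idea: link the sandbox path to the
// persistent volume instead of creating a plain directory, so the host mount
// namespace (and thus the UI) can resolve it. Paths below are hypothetical.
#include <unistd.h>

#include <cerrno>
#include <cstring>
#include <iostream>
#include <string>

int main()
{
  const std::string volume =
    "/var/lib/mesos/slave/volumes/roles/test/persistent-volume-id";
  const std::string sandboxPath =
    "/var/lib/mesos/slave/slaves/S0/frameworks/F0/executors/E0/runs/latest/pv";

  // Instead of `mkdir(sandboxPath)`: in the nested container this symlink is
  // shadowed by the bind mount, but in the host namespace it points to the
  // persistent volume.
  if (::symlink(volume.c_str(), sandboxPath.c_str()) != 0) {
    std::cerr << "Failed to create symlink: " << std::strerror(errno)
              << std::endl;
    return 1;
  }

  return 0;
}
{code}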



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8279) Persistent volumes are not visible in Mesos UI using default executor on Linux.

2018-01-08 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-8279:
--
Shepherd: Vinod Kone
Story Points: 3

> Persistent volumes are not visible in Mesos UI using default executor on 
> Linux.
> ---
>
> Key: MESOS-8279
> URL: https://issues.apache.org/jira/browse/MESOS-8279
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.2, 1.3.1, 1.4.1
>Reporter: Jie Yu
>Assignee: Qian Zhang
>
> The reason is that on Linux, if multiple containers in a default executor 
> want to share a persistent volume, a SANDBOX_PATH volume source with type 
> PARENT is used. This is translated into a bind mount in the nested 
> container's mount namespace and is thus not visible in the host mount 
> namespace, while the Mesos UI operates in the host mount namespace.
> One potential solution is to create a symlink (instead of just a mkdir) in 
> the sandbox. The symlink will be shadowed by the bind mount in the nested 
> container, but in the host mount namespace it'll point to the corresponding 
> persistent volume.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8279) Persistent volumes are not visible in Mesos UI using default executor on Linux.

2018-01-08 Thread Qian Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qian Zhang updated MESOS-8279:
--
Sprint: Mesosphere Sprint 72

> Persistent volumes are not visible in Mesos UI using default executor on 
> Linux.
> ---
>
> Key: MESOS-8279
> URL: https://issues.apache.org/jira/browse/MESOS-8279
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.2, 1.3.1, 1.4.1
>Reporter: Jie Yu
>Assignee: Qian Zhang
>
> The reason is that on Linux, if multiple containers in a default executor 
> want to share a persistent volume, a SANDBOX_PATH volume source with type 
> PARENT is used. This is translated into a bind mount in the nested 
> container's mount namespace and is thus not visible in the host mount 
> namespace, while the Mesos UI operates in the host mount namespace.
> One potential solution is to create a symlink (instead of just a mkdir) in 
> the sandbox. The symlink will be shadowed by the bind mount in the nested 
> container, but in the host mount namespace it'll point to the corresponding 
> persistent volume.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-8279) Persistent volumes are not visible in Mesos UI using default executor on Linux.

2018-01-08 Thread Qian Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qian Zhang reassigned MESOS-8279:
-

Assignee: Qian Zhang

> Persistent volumes are not visible in Mesos UI using default executor on 
> Linux.
> ---
>
> Key: MESOS-8279
> URL: https://issues.apache.org/jira/browse/MESOS-8279
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.2, 1.3.1, 1.4.1
>Reporter: Jie Yu
>Assignee: Qian Zhang
>
> The reason is that on Linux, if multiple containers in a default executor 
> want to share a persistent volume, a SANDBOX_PATH volume source with type 
> PARENT is used. This is translated into a bind mount in the nested 
> container's mount namespace and is thus not visible in the host mount 
> namespace, while the Mesos UI operates in the host mount namespace.
> One potential solution is to create a symlink (instead of just a mkdir) in 
> the sandbox. The symlink will be shadowed by the bind mount in the nested 
> container, but in the host mount namespace it'll point to the corresponding 
> persistent volume.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8411) Killing a queued task can lead to the command executor never terminating.

2018-01-08 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-8411:
--
Priority: Critical  (was: Major)

> Killing a queued task can lead to the command executor never terminating.
> -
>
> Key: MESOS-8411
> URL: https://issues.apache.org/jira/browse/MESOS-8411
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Benjamin Mahler
>Assignee: Meng Zhu
>Priority: Critical
>
> If a task is killed while the executor is re-registering, we will remove it 
> from queued tasks and shut down the executor if none of its initial tasks 
> could be delivered. However, there is a case (within {{Slave::___run}}) 
> where we leave the executor running. The race is:
> # Command-executor task launched.
> # Command executor sends registration message. Agent tells containerizer to 
> update the resources before it sends the tasks to the executor.
> # Kill arrives, and we synchronously remove the task from queued tasks.
> # Containerizer finishes updating the resources, and in {{Slave::___run}} the 
> killed task is ignored.
> # Command executor stays running!
> Executors could have a timeout to handle this case, but it's not clear that 
> all executors will implement this correctly. It would be better to have a 
> defensive policy that shuts down an executor if all of its initial batch 
> of tasks were killed prior to delivery.
> In order to implement this, one approach discussed with [~vinodkone] is to 
> look at the running + terminated but unacked + completed tasks, and if empty, 
> shut the executor down in the {{Slave::___run}} path. This will require us to 
> check that the completed task cache size is set to at least 1, and it also 
> assumes that the completed tasks are not cleared based on time or during 
> agent recovery.
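
A hypothetical sketch of that check; the type and member names below are 
illustrative only, not the agent's actual data structures:

{code:cpp}
// Hypothetical sketch of the defensive check discussed above: if an executor
// has no running, no terminated-but-unacknowledged, and no completed tasks
// by the time Slave::___run runs, shut it down.
#include <string>
#include <unordered_map>

struct ExecutorState
{
  std::unordered_map<std::string, int> launchedTasks;    // running
  std::unordered_map<std::string, int> terminatedTasks;  // terminal, unacked
  std::unordered_map<std::string, int> completedTasks;   // terminal, acked
};

// Returns true if the executor should be shut down because every task in its
// initial batch was killed before delivery.
bool shouldShutdown(const ExecutorState& executor)
{
  return executor.launchedTasks.empty() &&
         executor.terminatedTasks.empty() &&
         executor.completedTasks.empty();
}
{code}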



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-8411) Killing a queued task can lead to the command executor never terminating.

2018-01-08 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-8411:
-

Shepherd: Benjamin Mahler
Assignee: Meng Zhu
Story Points: 3
  Sprint: Mesosphere Sprint 72

> Killing a queued task can lead to the command executor never terminating.
> -
>
> Key: MESOS-8411
> URL: https://issues.apache.org/jira/browse/MESOS-8411
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Benjamin Mahler
>Assignee: Meng Zhu
>
> If a task is killed while the executor is re-registering, we will remove it 
> from queued tasks and shut down the executor if none of its initial tasks 
> could be delivered. However, there is a case (within {{Slave::___run}}) 
> where we leave the executor running. The race is:
> # Command-executor task launched.
> # Command executor sends registration message. Agent tells containerizer to 
> update the resources before it sends the tasks to the executor.
> # Kill arrives, and we synchronously remove the task from queued tasks.
> # Containerizer finishes updating the resources, and in {{Slave::___run}} the 
> killed task is ignored.
> # Command executor stays running!
> Executors could have a timeout to handle this case, but it's not clear that 
> all executors will implement this correctly. It would be better to have a 
> defensive policy that shuts down an executor if all of its initial batch 
> of tasks were killed prior to delivery.
> In order to implement this, one approach discussed with [~vinodkone] is to 
> look at the running + terminated but unacked + completed tasks, and if empty, 
> shut the executor down in the {{Slave::___run}} path. This will require us to 
> check that the completed task cache size is set to at least 1, and it also 
> assumes that the completed tasks are not cleared based on time or during 
> agent recovery.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8416) CHECK failure if trying to recover nested containers but the framework checkpointing is not enabled.

2018-01-08 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-8416:
--
Priority: Critical  (was: Major)

> CHECK failure if trying to recover nested containers but the framework 
> checkpointing is not enabled.
> 
>
> Key: MESOS-8416
> URL: https://issues.apache.org/jira/browse/MESOS-8416
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Gilbert Song
>Priority: Critical
>  Labels: containerizer
>
> {noformat}
> I0108 23:05:25.313344 31743 slave.cpp:620] Agent attributes: [  ]
> I0108 23:05:25.313832 31743 slave.cpp:629] Agent hostname: 
> vagrant-ubuntu-wily-64
> I0108 23:05:25.314916 31763 task_status_update_manager.cpp:181] Pausing 
> sending task status updates
> I0108 23:05:25.323496 31766 state.cpp:66] Recovering state from 
> '/var/lib/mesos/slave/meta'
> I0108 23:05:25.323639 31766 state.cpp:724] No committed checkpointed 
> resources found at '/var/lib/mesos/slave/meta/resources/resources.info'
> I0108 23:05:25.326169 31760 task_status_update_manager.cpp:207] Recovering 
> task status update manager
> I0108 23:05:25.326954 31759 containerizer.cpp:674] Recovering containerizer
> F0108 23:05:25.331529 31759 containerizer.cpp:919] 
> CHECK_SOME(container->directory): is NONE 
> *** Check failure stack trace: ***
> @ 0x7f769dbc98bd  google::LogMessage::Fail()
> @ 0x7f769dbc8c8e  google::LogMessage::SendToLog()
> @ 0x7f769dbc958d  google::LogMessage::Flush()
> @ 0x7f769dbcca08  google::LogMessageFatal::~LogMessageFatal()
> @ 0x556cb4c2b937  _CheckFatal::~_CheckFatal()
> @ 0x7f769c5ac653  
> mesos::internal::slave::MesosContainerizerProcess::recover()
> {noformat}
> If the framework does not enable checkpointing, no slave state is 
> checkpointed. But containers are still checkpointed in the runtime dir, 
> which means recovering a nested container causes this CHECK failure because 
> its parent's sandbox dir is unknown.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8416) CHECK failure if trying to recover nested containers but the framework checkpointing is not enabled.

2018-01-08 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-8416:
---

 Summary: CHECK failure if trying to recover nested containers but 
the framework checkpointing is not enabled.
 Key: MESOS-8416
 URL: https://issues.apache.org/jira/browse/MESOS-8416
 Project: Mesos
  Issue Type: Bug
  Components: containerization
Reporter: Gilbert Song


{noformat}
I0108 23:05:25.313344 31743 slave.cpp:620] Agent attributes: [  ]
I0108 23:05:25.313832 31743 slave.cpp:629] Agent hostname: 
vagrant-ubuntu-wily-64
I0108 23:05:25.314916 31763 task_status_update_manager.cpp:181] Pausing sending 
task status updates
I0108 23:05:25.323496 31766 state.cpp:66] Recovering state from 
'/var/lib/mesos/slave/meta'
I0108 23:05:25.323639 31766 state.cpp:724] No committed checkpointed resources 
found at '/var/lib/mesos/slave/meta/resources/resources.info'
I0108 23:05:25.326169 31760 task_status_update_manager.cpp:207] Recovering task 
status update manager
I0108 23:05:25.326954 31759 containerizer.cpp:674] Recovering containerizer
F0108 23:05:25.331529 31759 containerizer.cpp:919] 
CHECK_SOME(container->directory): is NONE 
*** Check failure stack trace: ***
@ 0x7f769dbc98bd  google::LogMessage::Fail()
@ 0x7f769dbc8c8e  google::LogMessage::SendToLog()
@ 0x7f769dbc958d  google::LogMessage::Flush()
@ 0x7f769dbcca08  google::LogMessageFatal::~LogMessageFatal()
@ 0x556cb4c2b937  _CheckFatal::~_CheckFatal()
@ 0x7f769c5ac653  
mesos::internal::slave::MesosContainerizerProcess::recover()
{noformat}

If the framework does not enable checkpointing, no slave state is checkpointed. 
But containers are still checkpointed in the runtime dir, which means 
recovering a nested container causes this CHECK failure because its parent's 
sandbox dir is unknown.
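
A hypothetical sketch of a more tolerant recovery path (illustrative only, not 
the actual fix): skip recovering a nested container whose parent sandbox 
directory was never checkpointed instead of CHECK-failing on it.

{code:cpp}
// Hypothetical sketch: tolerate a missing checkpointed sandbox directory
// during recovery instead of CHECK-failing.
#include <iostream>
#include <string>

// Stand-in for an optional value; in Mesos this would be Option<std::string>.
struct OptionalString
{
  bool isSome = false;
  std::string value;
};

void recoverContainer(const std::string& containerId,
                      const OptionalString& directory)
{
  if (!directory.isSome) {
    // Instead of `CHECK_SOME(container->directory)`, skip recovery when the
    // parent sandbox is unknown (e.g. framework checkpointing disabled).
    std::cerr << "Skipping recovery of container " << containerId
              << ": parent sandbox directory is unknown" << std::endl;
    return;
  }

  std::cout << "Recovering container " << containerId
            << " with sandbox " << directory.value << std::endl;
}
{code}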



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8362) Verify end-to-end operation status update retry after RP failover

2018-01-08 Thread Gastón Kleiman (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gastón Kleiman updated MESOS-8362:
--
Story Points: 5

> Verify end-to-end operation status update retry after RP failover
> -
>
> Key: MESOS-8362
> URL: https://issues.apache.org/jira/browse/MESOS-8362
> Project: Mesos
>  Issue Type: Task
>Reporter: Gastón Kleiman
>Assignee: Gastón Kleiman
>  Labels: mesosphere
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8405) Update master task loss handling.

2018-01-08 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-8405:
--
Description: 
From [~vinodkone] in [r/64940|https://reviews.apache.org/r/64940/]:

{quote}
Ideally, we want terminal but unacknowledged tasks to still be marked 
unreachable in some way, either via task state being TASK_UNREACHABLE or task 
being present in unreachableTasks. This allows, for example, the WebUI to not 
show sandbox links for unreachable tasks irrespective of whether they were 
terminal or not before going unreachable. 

But doing this is tricky for various reasons:

--> updateTask() doesn't allow a terminal state to be transitioned to 
TASK_UNREACHABLE. Right now when we call updateTask for a terminal task, it 
adds TASK_UNREACHABLE status to Task.statuses and also sends it to operator API 
stream subscribers which looks incorrect. The fact that updateTask internally 
deals with already terminal tasks is a bad design decision in retrospect. I 
think the callers shouldn't call it for terminal tasks instead.

--> It's not clear to our users what a completed task means. The intention was 
for this to hold a cache of terminal and acknowledged tasks for storing recent 
history. The users of the WebUI probably equate "Completed Tasks" to terminal 
tasks irrespective of their acknowledgement status, which is why it is 
confusing for them to see terminal but unacknowledged tasks in the "Active 
tasks" section in the WebUI.

--> When a framework reconciles the state of a task on an unreachable agent, 
master replies with TASK_UNREACHABLE irrespective of whether the task was in a 
non-terminal state or terminal but un-acknowledged state or terminal and 
acknowledged state when the agent went unreachable.  

I think the direction we want to go towards is

--> Completed tasks should consist of terminal unacknowledged and terminal 
acknowledged tasks, likely in two different data structures.
--> Unreachable tasks should consist of all non-complete tasks on an 
unreachable agent.  All the tasks in this map should be in TASK_UNREACHABLE 
state.
{quote}

  was:
From [~agentvindo.dev] in [r/64940|https://reviews.apache.org/r/64940/]:

{quote}
Ideally, we want terminal but unacknowledged tasks to still be marked 
unreachable in some way, either via task state being TASK_UNREACHABLE or task 
being present in unreachableTasks. This allows, for example, the WebUI to not 
show sandbox links for unreachable tasks irrespective of whether they were 
terminal or not before going unreachable. 

But doing this is tricky for various reasons:

--> updateTask() doesn't allow a terminal state to be transitioned to 
TASK_UNREACHABLE. Right now when we call updateTask for a terminal task, it 
adds TASK_UNREACHABLE status to Task.statuses and also sends it to operator API 
stream subscribers which looks incorrect. The fact that updateTask internally 
deals with already terminal tasks is a bad design decision in retrospect. I 
think the callers shouldn't call it for terminal tasks instead.

--> It's not clear to our users what a completed task means. The intention was 
for this to hold a cache of terminal and acknowledged tasks for storing recent 
history. The users of the WebUI probably equate "Completed Tasks" to terminal 
tasks irrespective of their acknowledgement status, which is why it is 
confusing for them to see terminal but unacknowledged tasks in the "Active 
tasks" section in the WebUI.

--> When a framework reconciles the state of a task on an unreachable agent, 
master replies with TASK_UNREACHABLE irrespective of whether the task was in a 
non-terminal state or terminal but un-acknowledged state or terminal and 
acknowledged state when the agent went unreachable.  

I think the direction we want to go towards is

--> Completed tasks should consist of terminal unacknowledged and terminal 
acknowledged tasks, likely in two different data structures.
--> Unreachable tasks should consist of all non-complete tasks on an 
unreachable agent.  All the tasks in this map should be in TASK_UNREACHABLE 
state.
{quote}


> Update master task loss handling.
> -
>
> Key: MESOS-8405
> URL: https://issues.apache.org/jira/browse/MESOS-8405
> Project: Mesos
>  Issue Type: Bug
>Reporter: James Peach
>
> From [~vinodkone] in [r/64940|https://reviews.apache.org/r/64940/]:
> {quote}
> Ideally, we want terminal but unacknowledged tasks to still be marked 
> unreachable in some way, either via task state being TASK_UNREACHABLE or task 
> being present in unreachableTasks. This allows, for example, the WebUI to not 
> show sandbox links for unreachable tasks irrespective of whether they were 
> terminal or not before going unreachable. 
> But doing this is tricky for various reasons:
> --> updateTask() doesn't allow a terminal state to be transitioned to 

[jira] [Commented] (MESOS-8362) Verify end-to-end operation status update retry after RP failover

2018-01-08 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16317061#comment-16317061
 ] 

Adam B commented on MESOS-8362:
---

Story points, please?

> Verify end-to-end operation status update retry after RP failover
> -
>
> Key: MESOS-8362
> URL: https://issues.apache.org/jira/browse/MESOS-8362
> Project: Mesos
>  Issue Type: Task
>Reporter: Gastón Kleiman
>Assignee: Gastón Kleiman
>  Labels: mesosphere
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8382) Master should bookkeep local resource providers.

2018-01-08 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16317060#comment-16317060
 ] 

Adam B commented on MESOS-8382:
---

Story points, please?

> Master should bookkeep local resource providers.
> 
>
> Key: MESOS-8382
> URL: https://issues.apache.org/jira/browse/MESOS-8382
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Benjamin Bannier
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> This will simplify the handling of `UpdateSlaveMessage`. Also, it'll simplify 
> the endpoint serving.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8357) Example frameworks have an inconsistent UX.

2018-01-08 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-8357:
--
Sprint: Mesosphere Sprint 71, Mesosphere Sprint 72  (was: Mesosphere Sprint 
71)

> Example frameworks have an inconsistent UX.
> ---
>
> Key: MESOS-8357
> URL: https://issues.apache.org/jira/browse/MESOS-8357
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 1.5.0
>Reporter: Till Toenshoff
>Assignee: Till Toenshoff
>Priority: Minor
>  Labels: mesosphere
>
> Our example frameworks are a bit inconsistent when it comes to specifying 
> things like the framework principal / secret etc. 
> Many of these examples have great value in testing a Mesos cluster. Unifying 
> their parameterization would improve the user experience when testing Mesos.
> {{MESOS_AUTHENTICATE_FRAMEWORKS}} is used by many examples for enabling / 
> disabling authentication. {{load_generator_framework}}, as one example, 
> however uses {{MESOS_AUTHENTICATE}} for that purpose. The credentials 
> themselves are most commonly expected in the environment variables 
> {{DEFAULT_PRINCIPAL}} and {{DEFAULT_SECRET}}, while in some cases we chose to 
> use {{MESOS_PRINCIPAL}} and {{MESOS_SECRET}} instead.
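
As a rough illustration of what a unified convention could look like (the 
helper below is hypothetical and not existing example-framework code), an 
example framework could read a single set of variables:

{code:cpp}
// Hypothetical sketch of unified credential handling for an example
// framework, using the variable names mentioned above.
#include <cstdlib>
#include <iostream>
#include <string>

// Returns the value of `name`, or `fallback` if it is unset.
static std::string getenvOr(const char* name, const std::string& fallback)
{
  const char* value = std::getenv(name);
  return value != nullptr ? std::string(value) : fallback;
}

int main()
{
  // Single switch for enabling framework authentication.
  const bool authenticate =
    std::getenv("MESOS_AUTHENTICATE_FRAMEWORKS") != nullptr;

  if (authenticate) {
    const std::string principal = getenvOr("DEFAULT_PRINCIPAL", "");
    const std::string secret = getenvOr("DEFAULT_SECRET", "");

    if (principal.empty() || secret.empty()) {
      std::cerr << "Expecting DEFAULT_PRINCIPAL and DEFAULT_SECRET when "
                << "MESOS_AUTHENTICATE_FRAMEWORKS is set" << std::endl;
      return 1;
    }

    std::cout << "Authenticating as principal '" << principal << "'"
              << std::endl;
  }

  return 0;
}
{code}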



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-5362) Add authentication to example frameworks

2018-01-08 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-5362:
--
Sprint: Mesosphere Sprint 71, Mesosphere Sprint 72  (was: Mesosphere Sprint 
71)

> Add authentication to example frameworks
> 
>
> Key: MESOS-5362
> URL: https://issues.apache.org/jira/browse/MESOS-5362
> Project: Mesos
>  Issue Type: Improvement
>  Components: security
>Reporter: Greg Mann
>Assignee: Till Toenshoff
>  Labels: authentication, mesosphere, security
>
> Some example frameworks do not have the ability to authenticate with the 
> master. Adding authentication to the example frameworks that don't already 
> have it implemented would allow us to use these frameworks for testing in 
> authenticated/authorized scenarios.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8375) Use protobuf reflection to simplify upgrading of resources.

2018-01-08 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-8375:
--
Sprint: Mesosphere Sprint 71, Mesosphere Sprint 72  (was: Mesosphere Sprint 
71)

> Use protobuf reflection to simplify upgrading of resources.
> ---
>
> Key: MESOS-8375
> URL: https://issues.apache.org/jira/browse/MESOS-8375
> Project: Mesos
>  Issue Type: Task
>Reporter: Michael Park
>Assignee: Michael Park
>Priority: Blocker
>
> This is the {{upgradeResources}} half of the protobuf-reflection-based 
> upgrade/downgrade of resources: 
> https://issues.apache.org/jira/browse/MESOS-8221
> We will also add {{state::read}} to complement {{state::checkpoint}} which 
> will be used to read protobufs from disk rather than {{protobuf::read}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8373) Test reconciliation after operation is dropped en route to agent

2018-01-08 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-8373:
--
Sprint: Mesosphere Sprint 71, Mesosphere Sprint 72  (was: Mesosphere Sprint 
71)

> Test reconciliation after operation is dropped en route to agent
> 
>
> Key: MESOS-8373
> URL: https://issues.apache.org/jira/browse/MESOS-8373
> Project: Mesos
>  Issue Type: Task
>  Components: agent, master
>Reporter: Greg Mann
>Assignee: Greg Mann
>  Labels: mesosphere
>
> Since new code paths were added to handle operations on resources in 1.5, we 
> should test that such operations are reconciled correctly after an operation 
> is dropped on the way from the master to the agent.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-5333) GET /master/maintenance/schedule/ produces 404.

2018-01-08 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-5333:
--
Sprint: Mesosphere Sprint 70, Mesosphere Sprint 71, Mesosphere Sprint 72  
(was: Mesosphere Sprint 70, Mesosphere Sprint 71)

> GET /master/maintenance/schedule/ produces 404.
> ---
>
> Key: MESOS-5333
> URL: https://issues.apache.org/jira/browse/MESOS-5333
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API, libprocess
>Reporter: Nathan Handler
>Assignee: Alexander Rukletsov
>Priority: Minor
>  Labels: mesosphere
>
> Attempts to make a GET request to /master/maintenance/schedule/ result in a 
> 404. However, if I make a GET request to /master/maintenance/schedule 
> (without the trailing /), it works. My current (untested) theory is that this 
> might be related to the fact that there is also a 
> /master/maintenance/schedule/status endpoint (an endpoint built on top of a 
> functioning endpoint), as requests to /help and /help/ (with and without the 
> trailing slash) produce the same functioning result.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8291) Add documentation about fault domains

2018-01-08 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-8291:
--
Sprint: Mesosphere Sprint 70, Mesosphere Sprint 71, Mesosphere Sprint 72  
(was: Mesosphere Sprint 70, Mesosphere Sprint 71)

> Add documentation about fault domains
> -
>
> Key: MESOS-8291
> URL: https://issues.apache.org/jira/browse/MESOS-8291
> Project: Mesos
>  Issue Type: Documentation
>Reporter: Vinod Kone
>Assignee: Benno Evers
>
> We need some user docs for fault domains.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7790) Design hierarchical quota allocation.

2018-01-08 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-7790:
--
Sprint: Mesosphere Sprint 66, Mesosphere Sprint 67, Mesosphere Sprint 68, 
Mesosphere Sprint 69, Mesosphere Sprint 70, Mesosphere Sprint 71, Mesosphere 
Sprint 72  (was: Mesosphere Sprint 66, Mesosphere Sprint 67, Mesosphere Sprint 
68, Mesosphere Sprint 69, Mesosphere Sprint 70, Mesosphere Sprint 71)

> Design hierarchical quota allocation.
> -
>
> Key: MESOS-7790
> URL: https://issues.apache.org/jira/browse/MESOS-7790
> Project: Mesos
>  Issue Type: Task
>  Components: allocation
>Reporter: Benjamin Mahler
>Assignee: Michael Park
>  Labels: multitenancy
>
> When quota is assigned in the role hierarchy (see MESOS-6375), it's possible 
> for there to be "undelegated" quota for a role. For example:
> {noformat}
>              ^
>            /   \
>          /       \
>   eng (90 cpus)   sales (10 cpus)
>        ^
>      /   \
>    /       \
>  ads (50 cpus)   build (10 cpus)
> {noformat}
> Here, the "eng" role has 60 of its 90 cpus of quota delegated to its 
> children, and 30 cpus remain undelegated. We need to design how to allocate 
> these 30 undelegated cpus. Are they allocated entirely to the "eng" 
> role? Are they allocated to the "eng" role tree? If so, how do we determine 
> how much is allocated to each role in the "eng" tree (i.e. "eng", "eng/ads", 
> "eng/build")?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8382) Master should bookkeep local resource providers.

2018-01-08 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-8382:
--
Sprint: Mesosphere Sprint 71, Mesosphere Sprint 72  (was: Mesosphere Sprint 
71)

> Master should bookkeep local resource providers.
> 
>
> Key: MESOS-8382
> URL: https://issues.apache.org/jira/browse/MESOS-8382
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Benjamin Bannier
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> This will simplify the handling of `UpdateSlaveMessage`. Also, it'll simplify 
> the endpoint serving.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8361) Example frameworks to support launching mesos-local.

2018-01-08 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-8361:
--
Sprint: Mesosphere Sprint 71, Mesosphere Sprint 72  (was: Mesosphere Sprint 
71)

> Example frameworks to support launching mesos-local.
> 
>
> Key: MESOS-8361
> URL: https://issues.apache.org/jira/browse/MESOS-8361
> Project: Mesos
>  Issue Type: Improvement
>  Components: framework
>Affects Versions: 1.5.0
>Reporter: Till Toenshoff
>Assignee: Till Toenshoff
>Priority: Minor
>  Labels: mesosphere
>
> The scheduler driver and library support implicit launching of mesos-local 
> for a convenient test setup. Some of our example frameworks account for this 
> in supporting implicit ACL rendering and more. 
> We should unify the experience by documenting this behaviour and adding it to 
> all example frameworks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8415) Add an SLRP test for agent reboot.

2018-01-08 Thread Chun-Hung Hsiao (JIRA)
Chun-Hung Hsiao created MESOS-8415:
--

 Summary: Add an SLRP test for agent reboot.
 Key: MESOS-8415
 URL: https://issues.apache.org/jira/browse/MESOS-8415
 Project: Mesos
  Issue Type: Task
Reporter: Chun-Hung Hsiao
Assignee: Chun-Hung Hsiao


We should add a test for the following scenario: when an agent is rebooted, all 
previously published CSI volumes become unmounted, so the SLRP should remount 
them when a task is going to use the volumes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-8356) Persistent volume ownership is set to root despite of sandbox owner (frameworkInfo.user) when docker executor is used

2018-01-08 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-8356:
-

Assignee: Jie Yu

> Persistent volume ownership is set to root despite of sandbox owner 
> (frameworkInfo.user) when docker executor is used
> -
>
> Key: MESOS-8356
> URL: https://issues.apache.org/jira/browse/MESOS-8356
> Project: Mesos
>  Issue Type: Bug
> Environment: Centos 7, Mesos 1.4.1, Docker Engine 1.13
>Reporter: Konstantin Kalin
>Assignee: Jie Yu
>  Labels: persistent-volumes
>
> PersistentVolume ownership is not set to match the sandbox user when the 
> docker executor is used. Looks like the issue was introduced by 
> https://reviews.apache.org/r/45963/
> I didn't check the universal containerizer yet. 
> As far as I understand, the following code is supposed to check that a volume 
> is not already being used by other tasks/containers.
> src/slave/containerizer/docker.cpp
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource)) {
>     isVolumeInUse = true;
>     break;
>   }
> }
> {code}
> But it doesn't exclude the container about to be launched (in my case I have 
> only one container - no task group). Thus the ownership of the 
> PersistentVolume stays "root" (I run mesos-agent as root) and it's impossible 
> to use the volume inside the container. We always run processes inside Docker 
> containers as an unprivileged user. 
> A small patch that excludes the container being launched fixes the issue:
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource) &&
>       containerId != container->id) {
>     isVolumeInUse = true;
>     break;
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-7742) ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky

2018-01-08 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-7742:


Assignee: Andrei Budnik

> ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky
> --
>
> Key: MESOS-7742
> URL: https://issues.apache.org/jira/browse/MESOS-7742
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Assignee: Andrei Budnik
>  Labels: flaky-test, mesosphere-oncall
> Attachments: AgentAPITest.LaunchNestedContainerSession-badrun.txt, 
> LaunchNestedContainerSessionDisconnected-badrun.txt
>
>
> Observed this on ASF CI and internal Mesosphere CI. Affected tests:
> {noformat}
> AgentAPIStreamingTest.AttachInputToNestedContainerSession
> AgentAPITest.LaunchNestedContainerSession
> AgentAPITest.AttachContainerInputAuthorization/0
> AgentAPITest.LaunchNestedContainerSessionWithTTY/0
> AgentAPITest.LaunchNestedContainerSessionDisconnected/1
> {noformat}
> This issue comes in at least three different flavours. Take 
> {{AgentAPIStreamingTest.AttachInputToNestedContainerSession}} as an example.
> h5. Flavour 1
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "503 Service Unavailable"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}
> h5. Flavour 2
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: "Disconnected"
> {noformat}
> h5. Flavour 3
> {noformat}
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-16.04/mesos/src/tests/api_tests.cpp:6367
> Value of: (sessionResponse).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8125) Agent should properly handle recovering an executor when its pid is reused

2018-01-08 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-8125:
--
Target Version/s: 1.6.0

> Agent should properly handle recovering an executor when its pid is reused
> --
>
> Key: MESOS-8125
> URL: https://issues.apache.org/jira/browse/MESOS-8125
> Project: Mesos
>  Issue Type: Bug
>Reporter: Gastón Kleiman
>Priority: Critical
>
> We know that all executors will be gone once the host on which an agent is 
> running is rebooted, so there's no need to try to recover these executors.
> Trying to recover stopped executors can lead to problems if another process 
> is assigned the same pid that the executor had before the reboot. In this 
> case the agent will unsuccessfully try to reregister with the executor, and 
> then transition it to a {{TERMINATING}} state. The executor will sadly get 
> stuck in that state, and the tasks that it started will get stuck in whatever 
> state they were in at the time of the reboot.
> One way of getting rid of stuck executors is to remove the {{latest}} symlink 
> under {{work_dir/meta/slaves/latest/frameworks/<framework id>/executors/<executor id>/runs}}.
> Here's how to reproduce this issue:
> # Start a task using the Docker containerizer (the same will probably happen 
> with the command executor).
> # Stop the corresponding Mesos agent while the task is running.
> # Change the executor's checkpointed forked pid, which is located in the meta 
> directory, e.g., 
> {{/var/lib/mesos/slave/meta/slaves/latest/frameworks/19faf6e0-3917-48ab-8b8e-97ec4f9ed41e-0001/executors/foo.13faee90-b5f0-11e7-8032-e607d2b4348c/runs/latest/pids/forked.pid}}.
>  I used pid 2, which is normally used by {{kthreadd}}.
> # Reboot the host



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8125) Agent should properly handle recovering an executor when its pid is reused

2018-01-08 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-8125:
--
Summary: Agent should properly handle recovering an executor when its pid 
is reused  (was: Agent shouldn't try to recover executors after a reboot)

> Agent should properly handle recovering an executor when its pid is reused
> --
>
> Key: MESOS-8125
> URL: https://issues.apache.org/jira/browse/MESOS-8125
> Project: Mesos
>  Issue Type: Bug
>Reporter: Gastón Kleiman
>
> We know that all executors will be gone once the host on which an agent is 
> running is rebooted, so there's no need to try to recover these executors.
> Trying to recover stopped executors can lead to problems if another process 
> is assigned the same pid that the executor had before the reboot. In this 
> case the agent will unsuccessfully try to reregister with the executor, and 
> then transition it to a {{TERMINATING}} state. The executor will sadly get 
> stuck in that state, and the tasks that it started will get stuck in whatever 
> state they were in at the time of the reboot.
> One way of getting rid of stuck executors is to remove the {{latest}} symlink 
> under {{work_dir/meta/slaves/latest/frameworks/<framework id>/executors/<executor id>/runs}}.
> Here's how to reproduce this issue:
> # Start a task using the Docker containerizer (the same will probably happen 
> with the command executor).
> # Stop the corresponding Mesos agent while the task is running.
> # Change the executor's checkpointed forked pid, which is located in the meta 
> directory, e.g., 
> {{/var/lib/mesos/slave/meta/slaves/latest/frameworks/19faf6e0-3917-48ab-8b8e-97ec4f9ed41e-0001/executors/foo.13faee90-b5f0-11e7-8032-e607d2b4348c/runs/latest/pids/forked.pid}}.
>  I used pid 2, which is normally used by {{kthreadd}}.
> # Reboot the host



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8125) Agent should properly handle recovering an executor when its pid is reused

2018-01-08 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-8125:
--
Priority: Critical  (was: Major)

> Agent should properly handle recovering an executor when its pid is reused
> --
>
> Key: MESOS-8125
> URL: https://issues.apache.org/jira/browse/MESOS-8125
> Project: Mesos
>  Issue Type: Bug
>Reporter: Gastón Kleiman
>Priority: Critical
>
> We know that all executors will be gone once the host on which an agent is 
> running is rebooted, so there's no need to try to recover these executors.
> Trying to recover stopped executors can lead to problems if another process 
> is assigned the same pid that the executor had before the reboot. In this 
> case the agent will unsuccessfully try to reregister with the executor, and 
> then transition it to a {{TERMINATING}} state. The executor will sadly get 
> stuck in that state, and the tasks that it started will get stuck in whatever 
> state they were in at the time of the reboot.
> One way of getting rid of stuck executors is to remove the {{latest}} symlink 
> under {{work_dir/meta/slaves/latest/frameworks/<framework id>/executors/<executor id>/runs}}.
> Here's how to reproduce this issue:
> # Start a task using the Docker containerizer (the same will probably happen 
> with the command executor).
> # Stop the corresponding Mesos agent while the task is running.
> # Change the executor's checkpointed forked pid, which is located in the meta 
> directory, e.g., 
> {{/var/lib/mesos/slave/meta/slaves/latest/frameworks/19faf6e0-3917-48ab-8b8e-97ec4f9ed41e-0001/executors/foo.13faee90-b5f0-11e7-8032-e607d2b4348c/runs/latest/pids/forked.pid}}.
>  I used pid 2, which is normally used by {{kthreadd}}.
> # Reboot the host



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-6893) Track total docker image layer size in store

2018-01-08 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-6893:
-
   Priority: Minor  (was: Major)
Description: We want to give cluster operators some insight into the total 
size of docker image layers in the store so we can use it for monitoring 
purposes.
Component/s: containerization
 Issue Type: Improvement  (was: Task)
Summary: Track total docker image layer size in store  (was: Track 
docker layer size and access time)

> Track total docker image layer size in store
> 
>
> Key: MESOS-6893
> URL: https://issues.apache.org/jira/browse/MESOS-6893
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>Priority: Minor
>
> We want to give cluster operators some insight into the total size of docker 
> image layers in the store so we can use it for monitoring purposes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-4945) Garbage collect unused docker layers in the store.

2018-01-08 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16316816#comment-16316816
 ] 

Zhitao Li commented on MESOS-4945:
--

That one is not necessarily part of this epic. I'll move it out.

> Garbage collect unused docker layers in the store.
> --
>
> Key: MESOS-4945
> URL: https://issues.apache.org/jira/browse/MESOS-4945
> Project: Mesos
>  Issue Type: Epic
>Reporter: Jie Yu
>Assignee: Zhitao Li
>  Labels: Mesosphere
> Fix For: 1.5.0
>
>
> Right now, we don't have any garbage collection in place for docker layers. 
> It's not straightforward to implement because we don't know what container is 
> currently using the layer. We probably need a way to track the current usage 
> of layers.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-7358) Ensure checks and health checks work on Windows.

2018-01-08 Thread Andrew Schwartzmeyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Schwartzmeyer reassigned MESOS-7358:
---

Assignee: Andrew Schwartzmeyer

> Ensure checks and health checks work on Windows.
> 
>
> Key: MESOS-7358
> URL: https://issues.apache.org/jira/browse/MESOS-7358
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, executor
>Reporter: Alexander Rukletsov
>Assignee: Andrew Schwartzmeyer
>  Labels: check, health-check, mesosphere
>
> As we improve Windows support, we should ensure all checks and health checks 
> features that make sense on Windows work there.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-6733) Windows: Enable authentication to the master

2018-01-08 Thread Andrew Schwartzmeyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Schwartzmeyer reassigned MESOS-6733:
---

Assignee: John Kordich  (was: Jeff Coffler)

> Windows: Enable authentication to the master
> 
>
> Key: MESOS-6733
> URL: https://issues.apache.org/jira/browse/MESOS-6733
> Project: Mesos
>  Issue Type: Task
>  Components: agent, master
>Reporter: Alex Clemmer
>Assignee: John Kordich
>  Labels: microsoft, windows-mvp
> Fix For: 1.5.0
>
>
> It is critical for Windows agents to support authenticating with the master. 
> Right now, we don't.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8375) Use protobuf reflection to simplify upgrading of resources.

2018-01-08 Thread Michael Park (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16316694#comment-16316694
 ] 

Michael Park commented on MESOS-8375:
-

{noformat}
commit 8be2b0f585f0e00d0ab62547fccd0cd160454a07
Author: Michael Park 
Date:   Wed Jan 3 10:58:20 2018 -0800

Introduced `upgradeResources` to complement `downgradeResources`.

Review: https://reviews.apache.org/r/64920
{noformat}
{noformat}
commit 02eefa2c74e19708a0fe3b9e1d0011b152e01ea6
Author: Michael Park 
Date:   Wed Jan 3 10:56:32 2018 -0800

Updated the comment for `precomputeResourcesContainment`.

`precomputeResourcesContainment` used to return a `bool`, but was modified
to return `void` since the `bool` is already included in the `result`.
This fixes the corresponding comment that was not adjusted accordingly.

Review: https://reviews.apache.org/r/64919
{noformat}

> Use protobuf reflection to simplify upgrading of resources.
> ---
>
> Key: MESOS-8375
> URL: https://issues.apache.org/jira/browse/MESOS-8375
> Project: Mesos
>  Issue Type: Task
>Reporter: Michael Park
>Assignee: Michael Park
>Priority: Blocker
>
> This is the {{upgradeResources}} half of the protobuf-reflection-based 
> upgrade/downgrade of resources: 
> https://issues.apache.org/jira/browse/MESOS-8221
> We will also add {{state::read}} to complement {{state::checkpoint}} which 
> will be used to read protobufs from disk rather than {{protobuf::read}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8414) DockerContainerizerTest.ROOT_DOCKER_Logs fails on CentOS 6

2018-01-08 Thread Armand Grillet (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Armand Grillet updated MESOS-8414:
--
Description: 
You can find the verbose logs attached.

The most interesting part:
{code}
I0108 16:35:45.887037 17805 sched.cpp:897] Received 1 offers
I0108 16:35:45.887070 17805 sched.cpp:921] Scheduler::resourceOffers took 
12130ns
I0108 16:35:45.985957 17808 docker.cpp:349] Unable to detect IP Address at 
'NetworkSettings.Networks.host.IPAddress', attempting deprecated field
I0108 16:35:45.986428 17809 task_status_update_manager.cpp:328] Received task 
status update TASK_FAILED (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) 
for task 1 of framework f09c89e1-aa62-4662-bda8-15a2c87f412e-
I0108 16:35:45.986552 17809 task_status_update_manager.cpp:383] Forwarding task 
status update TASK_FAILED (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) 
for task 1 of framework f09c89e1-aa62-4662-bda8-15a2c87f412e- to the agent
I0108 16:35:45.986654 17809 slave.cpp:5209] Forwarding the update TASK_FAILED 
(Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
f09c89e1-aa62-4662-bda8-15a2c87f412e- to master@172.16.10.110:37252
I0108 16:35:45.986795 17809 slave.cpp:5102] Task status update manager 
successfully handled status update TASK_FAILED (Status UUID: 
7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
f09c89e1-aa62-4662-bda8-15a2c87f412e-
I0108 16:35:45.986829 17809 slave.cpp:5118] Sending acknowledgement for status 
update TASK_FAILED (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 
1 of framework f09c89e1-aa62-4662-bda8-15a2c87f412e- to 
executor(1)@172.16.10.110:38499
I0108 16:35:45.986901 17805 master.cpp:7890] Status update TASK_FAILED (Status 
UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
f09c89e1-aa62-4662-bda8-15a2c87f412e- from agent 
f09c89e1-aa62-4662-bda8-15a2c87f412e-S0 at slave(1)@172.16.10.110:37252 
(ip-172-16-10-110.ec2.internal)
I0108 16:35:45.986928 17805 master.cpp:7946] Forwarding status update 
TASK_FAILED (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of 
framework f09c89e1-aa62-4662-bda8-15a2c87f412e-
I0108 16:35:45.986984 17805 master.cpp:10193] Updating the state of task 1 of 
framework f09c89e1-aa62-4662-bda8-15a2c87f412e- (latest state: TASK_FAILED, 
status update state: TASK_FAILED)
I0108 16:35:45.987047 17805 sched.cpp:990] Received status update TASK_FAILED 
(Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
f09c89e1-aa62-4662-bda8-15a2c87f412e- from slave(1)@172.16.10.110:37252
I0108 16:35:45.987103 17805 sched.cpp:1029] Scheduler::statusUpdate took 30948ns
I0108 16:35:45.987112 17805 sched.cpp:1048] Sending ACK for status update 
TASK_FAILED (Status UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of 
framework f09c89e1-aa62-4662-bda8-15a2c87f412e- to 
master@172.16.10.110:37252
I0108 16:35:45.987221 17805 master.cpp:5826] Processing ACKNOWLEDGE call 
7f544700-215b-4d27-ab43-b48e19592d00 for task 1 of framework 
f09c89e1-aa62-4662-bda8-15a2c87f412e- (default) at 
scheduler-4ad5073e-c1db-4c34-9c43-e656c280a724@172.16.10.110:37252 on agent 
f09c89e1-aa62-4662-bda8-15a2c87f412e-S0
I0108 16:35:45.987267 17805 master.cpp:10299] Removing task 1 with resources 
cpus(allocated: *):2; mem(allocated: *):1024; disk(allocated: *):1024; 
ports(allocated: *):[31000-32000] of framework 
f09c89e1-aa62-4662-bda8-15a2c87f412e- on agent 
f09c89e1-aa62-4662-bda8-15a2c87f412e-S0 at slave(1)@172.16.10.110:37252 
(ip-172-16-10-110.ec2.internal)
I0108 16:35:45.987473 17807 task_status_update_manager.cpp:401] Received task 
status update acknowledgement (UUID: 7f544700-215b-4d27-ab43-b48e19592d00) for 
task 1 of framework f09c89e1-aa62-4662-bda8-15a2c87f412e-
I0108 16:35:45.987561 17807 task_status_update_manager.cpp:538] Cleaning up 
status update stream for task 1 of framework 
f09c89e1-aa62-4662-bda8-15a2c87f412e-
I0108 16:35:45.987814 17807 slave.cpp:3974] Task status update manager 
successfully handled status update acknowledgement (UUID: 
7f544700-215b-4d27-ab43-b48e19592d00) for task 1 of framework 
f09c89e1-aa62-4662-bda8-15a2c87f412e-
I0108 16:35:45.987849 17807 slave.cpp:8935] Completing task 1
{code}

  was:
You can find the verbose logs attached.

The interesting part:


> DockerContainerizerTest.ROOT_DOCKER_Logs fails on CentOS 6
> --
>
> Key: MESOS-8414
> URL: https://issues.apache.org/jira/browse/MESOS-8414
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: CentOS 6, Docker version 1.7.1, build 786b29d
>Reporter: Armand Grillet
> Attachments: centos6-ssl-DockerContainerizerTest.ROOT_DOCKER_Logs.txt
>
>
> You can find the verbose logs attached.
> Th

[jira] [Updated] (MESOS-8414) DockerContainerizerTest.ROOT_DOCKER_Logs fails on CentOS 6

2018-01-08 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-8414:
---
Environment: CentOS 6, Docker version 1.7.1, build 786b29d

> DockerContainerizerTest.ROOT_DOCKER_Logs fails on CentOS 6
> --
>
> Key: MESOS-8414
> URL: https://issues.apache.org/jira/browse/MESOS-8414
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: CentOS 6, Docker version 1.7.1, build 786b29d
>Reporter: Armand Grillet
> Attachments: centos6-ssl-DockerContainerizerTest.ROOT_DOCKER_Logs.txt
>
>
> You can find the verbose logs attached.
> The interesting part:



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8414) DockerContainerizerTest.ROOT_DOCKER_Logs fails on CentOS 6

2018-01-08 Thread Armand Grillet (JIRA)
Armand Grillet created MESOS-8414:
-

 Summary: DockerContainerizerTest.ROOT_DOCKER_Logs fails on CentOS 6
 Key: MESOS-8414
 URL: https://issues.apache.org/jira/browse/MESOS-8414
 Project: Mesos
  Issue Type: Bug
  Components: test
Reporter: Armand Grillet
 Attachments: centos6-ssl-DockerContainerizerTest.ROOT_DOCKER_Logs.txt

You can find the verbose logs attached.

The interesting part:



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8335) ProvisionerDockerTest.ROOT_INTERNET_CURL_SimpleCommand/2 fails on Debian 9 and CentOS 6.

2018-01-08 Thread Armand Grillet (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Armand Grillet updated MESOS-8335:
--
Summary: ProvisionerDockerTest.ROOT_INTERNET_CURL_SimpleCommand/2 fails on 
Debian 9  and CentOS 6.  (was: ProvisionerDockerTest fails on Debian 9  and 
CentOS 6.)

> ProvisionerDockerTest.ROOT_INTERNET_CURL_SimpleCommand/2 fails on Debian 9  
> and CentOS 6.
> -
>
> Key: MESOS-8335
> URL: https://issues.apache.org/jira/browse/MESOS-8335
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Armand Grillet
>Assignee: Armand Grillet
> Attachments: centos-6-curl-7.19.7.txt, centos-6-curl-7.57.txt
>
>
> Version of Docker used: Docker version 17.11.0-ce, build 1caf76c
> Version of Curl used: curl 7.52.1 (x86_64-pc-linux-gnu) libcurl/7.52.1 
> OpenSSL/1.0.2l zlib/1.2.8 libidn2/0.16 libpsl/0.17.0 (+libidn2/0.16) 
> libssh2/1.7.0 nghttp2/1.18.1 librtmp/2.3
> Error:
> {code}
> [ RUN  ] 
> ImageAlpine/ProvisionerDockerTest.ROOT_INTERNET_CURL_SimpleCommand/2
> I1215 00:09:28.694677 19343 cluster.cpp:172] Creating default 'local' 
> authorizer
> I1215 00:09:28.697144 30867 master.cpp:456] Master 
> 75b48a47-7b6b-4e60-82d3-dfdc0cf8bff3 (ip-172-16-10-160.ec2.internal) started 
> on 127.0.1.1:35029
> I1215 00:09:28.697163 30867 master.cpp:458] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/4RYdF1/credentials" 
> --filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
> --hostname_lookup="true" --http_authenticators="basic" 
> --http_framework_authenticators="basic" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/4RYdF1/master" 
> --zk_session_timeout="10secs"
> I1215 00:09:28.697413 30867 master.cpp:507] Master only allowing 
> authenticated frameworks to register
> I1215 00:09:28.697422 30867 master.cpp:513] Master only allowing 
> authenticated agents to register
> I1215 00:09:28.697427 30867 master.cpp:519] Master only allowing 
> authenticated HTTP frameworks to register
> I1215 00:09:28.697433 30867 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/4RYdF1/credentials'
> I1215 00:09:28.697654 30867 master.cpp:563] Using default 'crammd5' 
> authenticator
> I1215 00:09:28.697806 30867 http.cpp:1045] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I1215 00:09:28.697962 30867 http.cpp:1045] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I1215 00:09:28.698076 30867 http.cpp:1045] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I1215 00:09:28.698194 30867 master.cpp:642] Authorization enabled
> I1215 00:09:28.698468 30864 hierarchical.cpp:175] Initialized hierarchical 
> allocator process
> I1215 00:09:28.698563 30864 whitelist_watcher.cpp:77] No whitelist given
> I1215 00:09:28.701695 30871 master.cpp:2209] Elected as the leading master!
> I1215 00:09:28.701723 30871 master.cpp:1689] Recovering from registrar
> I1215 00:09:28.701859 30869 registrar.cpp:347] Recovering registrar
> I1215 00:09:28.702401 30869 registrar.cpp:391] Successfully fetched the 
> registry (0B) in 507904ns
> I1215 00:09:28.702495 30869 registrar.cpp:495] Applied 1 operations in 
> 28977ns; attempting to update the registry
> I1215 00:09:28.702997 30869 registrar.cpp:552] Successfully updated the 
> registry in 464896ns
> I1215 00:09:28.703086 30869 registrar.cpp:424] Successfully recovered 
> registrar
> I1215 00:09:28.703640 30865 master.cpp:1802] Recovered 0 agents from the 
> registry (167B); allowing 10mins for agents to re-register
> I1215 00:09:28.703661 30869 hierarchical.cpp:213] Skipping recovery of 
> hierarchical allocator: nothing to recover
> W1215 00:09:28.706816 19343 process.cpp:2756] Attempted to spawn 

[jira] [Commented] (MESOS-8359) Health checks are flapping for all tasks on the slave if one task does not have enough resources to run

2018-01-08 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16316519#comment-16316519
 ] 

Benno Evers commented on MESOS-8359:


From what I gather, the following conditions need to be met to reproduce:

- The other tasks on the slave need to be health-checked by a `COMMAND`-type 
health check
- Docker executor must be used for all launched tasks

I'm also wondering which command was actually used for the command health 
check, and whether the executor and/or master logs from the time the bug was 
observed show anything interesting.

Finally, since I'm not very experienced with Marathon, can you give some more 
details on what exactly it means to "create a marathon application from your 
image"?

> Health checks are flapping for all tasks on the slave if one task does not 
> have enough resources to run
> 
>
> Key: MESOS-8359
> URL: https://issues.apache.org/jira/browse/MESOS-8359
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.3.2
>Reporter: Viacheslav Valyavskiy
> Attachments: logs2
>
>
> I have attached some logs from the affected slave 
> (newappmv_qagame_testapp.green_csahttp is the name of the 'bad' application).
> Steps to reproduce:
> 1. Run multiple tasks on the slave.
> 2. Create a Marathon application from our image ({{docker pull 
> vvalyavskiy/csa-http}}) and set its memory limit to 16MB.
> 3. Wait some time, then observe that all tasks on the slave where our task 
> is started begin to flap.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-8413) Zookeeper configuration passwords are shown in clear text

2018-01-08 Thread Alexander Rojas (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rojas reassigned MESOS-8413:
--

Assignee: Alexander Rojas

> Zookeeper configuration passwords are shown in clear text
> -
>
> Key: MESOS-8413
> URL: https://issues.apache.org/jira/browse/MESOS-8413
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.4.1
>Reporter: Alexander Rojas
>Assignee: Alexander Rojas
>  Labels: mesosphere, security
>
> No matter how one configures Mesos, whether by passing the ZooKeeper flag on 
> the command line or via a file, as follows:
> {noformat}
> ./bin/mesos-master.sh --work_dir=/tmp/$USER/mesos/master 
> --log_dir=/tmp/$USER/mesos/master/log 
> --zk=zk://${zk_username}:${zk_password}@${zk_addr}/mesos --quorum=1
> {noformat}
> {noformat}
> echo "zk://${zk_username}:${zk_password}@${zk_addr}/mesos" > 
> /tmp/${USER}/mesos/zk_config.txt
> ./bin/mesos-master.sh --work_dir=/tmp/$USER/mesos/master 
> --log_dir=/tmp/$USER/mesos/master/log --zk=/tmp/${USER}/mesos/zk_config.txt
> {noformat}
> both the logs and the output of the {{/flags}} endpoint will show the 
> resolved flag values, including the ZooKeeper password, i.e.:
> {noformat}
> I0108 10:12:50.387522 28579 master.cpp:458] Flags at startup: 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="false" --authenticate_frameworks="false" 
> --authenticate_http_frameworks="false" --authenticate_http_readonly="false" 
> --authenticate_http_readwrite="false" --authenticators="crammd5" 
> --authorizers="local" --filter_gpu_resources="true" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --http_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --log_dir="/tmp/user/mesos/master/log" --logbufsecs="0" 
> --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
> --quorum="1" --recovery_agent_removal_limit="100%" 
> --registry="replicated_log" --registry_fetch_timeout="1mins" 
> --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" 
> --registry_max_agent_count="102400" --registry_store_timeout="20secs" 
> --registry_strict="false" --require_agent_domain="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/home/user/mesos/build/../src/webui" 
> --work_dir="/tmp/user/mesos/master" 
> --zk="zk://user@passwd:127.0.0.1:2181/mesos" --zk_session_timeout="10secs"
> {noformat}
> {noformat}
> HTTP/1.1 200 OK
> Content-Encoding: gzip
> Content-Length: 591
> Content-Type: application/json
> Date: Mon, 08 Jan 2018 15:12:53 GMT
> {
> "flags": {
> "agent_ping_timeout": "15secs",
> "agent_reregister_timeout": "10mins",
> "allocation_interval": "1secs",
> "allocator": "HierarchicalDRF",
> "authenticate_agents": "false",
> "authenticate_frameworks": "false",
> "authenticate_http_frameworks": "false",
> "authenticate_http_readonly": "false",
> "authenticate_http_readwrite": "false",
> "authenticators": "crammd5",
> "authorizers": "local",
> "filter_gpu_resources": "true",
> "framework_sorter": "drf",
> "help": "false",
> "hostname_lookup": "true",
> "http_authenticators": "basic",
> "initialize_driver_logging": "true",
> "log_auto_initialize": "true",
> "log_dir": "/tmp/user/mesos/master/log",
> "logbufsecs": "0",
> "logging_level": "INFO",
> "max_agent_ping_timeouts": "5",
> "max_completed_frameworks": "50",
> "max_completed_tasks_per_framework": "1000",
> "max_unreachable_tasks_per_framework": "1000",
> "port": "5050",
> "quiet": "false",
> "quorum": "1",
> "recovery_agent_removal_limit": "100%",
> "registry": "replicated_log",
> "registry_fetch_timeout": "1mins",
> "registry_gc_interval": "15mins",
> "registry_max_agent_age": "2weeks",
> "registry_max_agent_count": "102400",
> "registry_store_timeout": "20secs",
> "registry_strict": "false",
> "require_agent_domain": "false",
> "root_submissions": "true",
> "user_sorter": "drf",
> "version": "false",
> "webui_dir": "/home/user/mesos/build/../src/webui",
> "work_dir": "/tmp/user/mesos/master",
> "zk": "zk://user@passwd:127.0.0.1:2181/mesos",
> "zk_session_timeout": "10secs"
> }
> }
> {noformat}
> This leaves no effective way to prevent the passwords from being exposed to 
> anyone who can access the logs or the {{/flags}} endpoint, regardless of 
> which configuration method is used.

[jira] [Commented] (MESOS-8413) Zookeeper configuration passwords are shown in clear text

2018-01-08 Thread Alexander Rojas (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16316503#comment-16316503
 ] 

Alexander Rojas commented on MESOS-8413:


After doing some research, I can conclude this is a bug. The other flags that 
store passwords ({{--credentials}} in the master, and {{--credential}} and 
{{--credentials}} in the agent) all behave the same way: if a file is given, 
the file path is shown instead of the contents of the file.

I see no reason why {{--zk}} should behave differently.
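
For the case where the connection string is passed directly on the command 
line rather than through a file, the credential portion could additionally be 
redacted before the value is logged or served by {{/flags}}. A minimal C++ 
sketch of that idea, using a hypothetical {{redactZk}} helper (this is a 
sketch, not existing Mesos code):

{code}
#include <iostream>
#include <string>

// Strip "user:password" from a ZooKeeper URL such as
// "zk://user:passwd@127.0.0.1:2181/mesos" before it is logged or exposed.
std::string redactZk(const std::string& zk)
{
  const std::string scheme = "zk://";
  const size_t at = zk.find('@');

  // No credentials present, or not a zk:// URL; nothing to redact.
  if (at == std::string::npos || zk.compare(0, scheme.size(), scheme) != 0) {
    return zk;
  }

  // Keep the scheme and everything after '@', drop "user:password".
  return scheme + "<redacted>@" + zk.substr(at + 1);
}

int main()
{
  // Prints: zk://<redacted>@127.0.0.1:2181/mesos
  std::cout << redactZk("zk://user:passwd@127.0.0.1:2181/mesos") << std::endl;
  return 0;
}
{code}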

> Zookeeper configuration passwords are shown in clear text
> -
>
> Key: MESOS-8413
> URL: https://issues.apache.org/jira/browse/MESOS-8413
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.4.1
>Reporter: Alexander Rojas
>  Labels: mesosphere, security
>
> No matter how one configures Mesos, whether by passing the ZooKeeper flag on 
> the command line or via a file, as follows:
> {noformat}
> ./bin/mesos-master.sh --work_dir=/tmp/$USER/mesos/master 
> --log_dir=/tmp/$USER/mesos/master/log 
> --zk=zk://${zk_username}:${zk_password}@${zk_addr}/mesos --quorum=1
> {noformat}
> {noformat}
> echo "zk://${zk_username}:${zk_password}@${zk_addr}/mesos" > 
> /tmp/${USER}/mesos/zk_config.txt
> ./bin/mesos-master.sh --work_dir=/tmp/$USER/mesos/master 
> --log_dir=/tmp/$USER/mesos/master/log --zk=/tmp/${USER}/mesos/zk_config.txt
> {noformat}
> both the logs and the output of the {{/flags}} endpoint will show the 
> resolved flag values, including the ZooKeeper password, i.e.:
> {noformat}
> I0108 10:12:50.387522 28579 master.cpp:458] Flags at startup: 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="false" --authenticate_frameworks="false" 
> --authenticate_http_frameworks="false" --authenticate_http_readonly="false" 
> --authenticate_http_readwrite="false" --authenticators="crammd5" 
> --authorizers="local" --filter_gpu_resources="true" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --http_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --log_dir="/tmp/user/mesos/master/log" --logbufsecs="0" 
> --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
> --quorum="1" --recovery_agent_removal_limit="100%" 
> --registry="replicated_log" --registry_fetch_timeout="1mins" 
> --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" 
> --registry_max_agent_count="102400" --registry_store_timeout="20secs" 
> --registry_strict="false" --require_agent_domain="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/home/user/mesos/build/../src/webui" 
> --work_dir="/tmp/user/mesos/master" 
> --zk="zk://user@passwd:127.0.0.1:2181/mesos" --zk_session_timeout="10secs"
> {noformat}
> {noformat}
> HTTP/1.1 200 OK
> Content-Encoding: gzip
> Content-Length: 591
> Content-Type: application/json
> Date: Mon, 08 Jan 2018 15:12:53 GMT
> {
> "flags": {
> "agent_ping_timeout": "15secs",
> "agent_reregister_timeout": "10mins",
> "allocation_interval": "1secs",
> "allocator": "HierarchicalDRF",
> "authenticate_agents": "false",
> "authenticate_frameworks": "false",
> "authenticate_http_frameworks": "false",
> "authenticate_http_readonly": "false",
> "authenticate_http_readwrite": "false",
> "authenticators": "crammd5",
> "authorizers": "local",
> "filter_gpu_resources": "true",
> "framework_sorter": "drf",
> "help": "false",
> "hostname_lookup": "true",
> "http_authenticators": "basic",
> "initialize_driver_logging": "true",
> "log_auto_initialize": "true",
> "log_dir": "/tmp/user/mesos/master/log",
> "logbufsecs": "0",
> "logging_level": "INFO",
> "max_agent_ping_timeouts": "5",
> "max_completed_frameworks": "50",
> "max_completed_tasks_per_framework": "1000",
> "max_unreachable_tasks_per_framework": "1000",
> "port": "5050",
> "quiet": "false",
> "quorum": "1",
> "recovery_agent_removal_limit": "100%",
> "registry": "replicated_log",
> "registry_fetch_timeout": "1mins",
> "registry_gc_interval": "15mins",
> "registry_max_agent_age": "2weeks",
> "registry_max_agent_count": "102400",
> "registry_store_timeout": "20secs",
> "registry_strict": "false",
> "require_agent_domain": "false",
> "root_submissions": "true",
>   

[jira] [Comment Edited] (MESOS-8413) Zookeeper configuration passwords are shown in clear text

2018-01-08 Thread Alexander Rojas (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16316503#comment-16316503
 ] 

Alexander Rojas edited comment on MESOS-8413 at 1/8/18 3:46 PM:


After doing some research, I can conclude this is a bug. The other flags that 
store passwords ({{\-\-credentials}} in the master, and {{\-\-credential}} and 
{{\-\-credentials}} in the agent) all behave the same way: if a file is given, 
the file path is shown instead of the contents of the file.

I see no reason why {{\-\-zk}} should behave differently.


was (Author: arojas):
After doing some research, I can conclude this is a bug. The other flags which 
store passwords ({{--credentials}} in master and {{--credential}} and 
{{--credentials}} in the agent) all of them have the behavior where, if a file 
is given, the file path will be shown instead of the contents of the file.

I see no reason why {{--zk}} should behave differently.

> Zookeeper configuration passwords are shown in clear text
> -
>
> Key: MESOS-8413
> URL: https://issues.apache.org/jira/browse/MESOS-8413
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.4.1
>Reporter: Alexander Rojas
>  Labels: mesosphere, security
>
> No matter how one configures Mesos, whether by passing the ZooKeeper flag on 
> the command line or via a file, as follows:
> {noformat}
> ./bin/mesos-master.sh --work_dir=/tmp/$USER/mesos/master 
> --log_dir=/tmp/$USER/mesos/master/log 
> --zk=zk://${zk_username}:${zk_password}@${zk_addr}/mesos --quorum=1
> {noformat}
> {noformat}
> echo "zk://${zk_username}:${zk_password}@${zk_addr}/mesos" > 
> /tmp/${USER}/mesos/zk_config.txt
> ./bin/mesos-master.sh --work_dir=/tmp/$USER/mesos/master 
> --log_dir=/tmp/$USER/mesos/master/log --zk=/tmp/${USER}/mesos/zk_config.txt
> {noformat}
> both the logs and the output of the {{/flags}} endpoint will show the 
> resolved flag values, including the ZooKeeper password, i.e.:
> {noformat}
> I0108 10:12:50.387522 28579 master.cpp:458] Flags at startup: 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="false" --authenticate_frameworks="false" 
> --authenticate_http_frameworks="false" --authenticate_http_readonly="false" 
> --authenticate_http_readwrite="false" --authenticators="crammd5" 
> --authorizers="local" --filter_gpu_resources="true" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --http_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --log_dir="/tmp/user/mesos/master/log" --logbufsecs="0" 
> --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
> --quorum="1" --recovery_agent_removal_limit="100%" 
> --registry="replicated_log" --registry_fetch_timeout="1mins" 
> --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" 
> --registry_max_agent_count="102400" --registry_store_timeout="20secs" 
> --registry_strict="false" --require_agent_domain="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/home/user/mesos/build/../src/webui" 
> --work_dir="/tmp/user/mesos/master" 
> --zk="zk://user@passwd:127.0.0.1:2181/mesos" --zk_session_timeout="10secs"
> {noformat}
> {noformat}
> HTTP/1.1 200 OK
> Content-Encoding: gzip
> Content-Length: 591
> Content-Type: application/json
> Date: Mon, 08 Jan 2018 15:12:53 GMT
> {
> "flags": {
> "agent_ping_timeout": "15secs",
> "agent_reregister_timeout": "10mins",
> "allocation_interval": "1secs",
> "allocator": "HierarchicalDRF",
> "authenticate_agents": "false",
> "authenticate_frameworks": "false",
> "authenticate_http_frameworks": "false",
> "authenticate_http_readonly": "false",
> "authenticate_http_readwrite": "false",
> "authenticators": "crammd5",
> "authorizers": "local",
> "filter_gpu_resources": "true",
> "framework_sorter": "drf",
> "help": "false",
> "hostname_lookup": "true",
> "http_authenticators": "basic",
> "initialize_driver_logging": "true",
> "log_auto_initialize": "true",
> "log_dir": "/tmp/user/mesos/master/log",
> "logbufsecs": "0",
> "logging_level": "INFO",
> "max_agent_ping_timeouts": "5",
> "max_completed_frameworks": "50",
> "max_completed_tasks_per_framework": "1000",
> "max_unreachable_tasks_per_framework": "1000",
> "port": "5050",
> "quiet": "false",
> "quorum": "1",
>   

[jira] [Created] (MESOS-8413) Zookeeper configuration passwords are shown in clear text

2018-01-08 Thread Alexander Rojas (JIRA)
Alexander Rojas created MESOS-8413:
--

 Summary: Zookeeper configuration passwords are shown in clear text
 Key: MESOS-8413
 URL: https://issues.apache.org/jira/browse/MESOS-8413
 Project: Mesos
  Issue Type: Bug
  Components: master
Affects Versions: 1.4.1
Reporter: Alexander Rojas


No matter how one configures Mesos, whether by passing the ZooKeeper flag on 
the command line or via a file, as follows:

{noformat}
./bin/mesos-master.sh --work_dir=/tmp/$USER/mesos/master 
--log_dir=/tmp/$USER/mesos/master/log 
--zk=zk://${zk_username}:${zk_password}@${zk_addr}/mesos --quorum=1
{noformat}

{noformat}
echo "zk://${zk_username}:${zk_password}@${zk_addr}/mesos" > 
/tmp/${USER}/mesos/zk_config.txt
./bin/mesos-master.sh --work_dir=/tmp/$USER/mesos/master 
--log_dir=/tmp/$USER/mesos/master/log --zk=/tmp/${USER}/mesos/zk_config.txt
{noformat}

both the logs and the output of the {{/flags}} endpoint will show the 
resolved flag values, including the ZooKeeper password, i.e.:

{noformat}
I0108 10:12:50.387522 28579 master.cpp:458] Flags at startup: 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate_agents="false" --authenticate_frameworks="false" 
--authenticate_http_frameworks="false" --authenticate_http_readonly="false" 
--authenticate_http_readwrite="false" --authenticators="crammd5" 
--authorizers="local" --filter_gpu_resources="true" --framework_sorter="drf" 
--help="false" --hostname_lookup="true" --http_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--log_dir="/tmp/user/mesos/master/log" --logbufsecs="0" --logging_level="INFO" 
--max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
--max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
--quorum="1" --recovery_agent_removal_limit="100%" --registry="replicated_log" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="20secs" --registry_strict="false" 
--require_agent_domain="false" --root_submissions="true" --user_sorter="drf" 
--version="false" --webui_dir="/home/user/mesos/build/../src/webui" 
--work_dir="/tmp/user/mesos/master" 
--zk="zk://user@passwd:127.0.0.1:2181/mesos" --zk_session_timeout="10secs"
{noformat}

{noformat}
HTTP/1.1 200 OK
Content-Encoding: gzip
Content-Length: 591
Content-Type: application/json
Date: Mon, 08 Jan 2018 15:12:53 GMT

{
"flags": {
"agent_ping_timeout": "15secs",
"agent_reregister_timeout": "10mins",
"allocation_interval": "1secs",
"allocator": "HierarchicalDRF",
"authenticate_agents": "false",
"authenticate_frameworks": "false",
"authenticate_http_frameworks": "false",
"authenticate_http_readonly": "false",
"authenticate_http_readwrite": "false",
"authenticators": "crammd5",
"authorizers": "local",
"filter_gpu_resources": "true",
"framework_sorter": "drf",
"help": "false",
"hostname_lookup": "true",
"http_authenticators": "basic",
"initialize_driver_logging": "true",
"log_auto_initialize": "true",
"log_dir": "/tmp/user/mesos/master/log",
"logbufsecs": "0",
"logging_level": "INFO",
"max_agent_ping_timeouts": "5",
"max_completed_frameworks": "50",
"max_completed_tasks_per_framework": "1000",
"max_unreachable_tasks_per_framework": "1000",
"port": "5050",
"quiet": "false",
"quorum": "1",
"recovery_agent_removal_limit": "100%",
"registry": "replicated_log",
"registry_fetch_timeout": "1mins",
"registry_gc_interval": "15mins",
"registry_max_agent_age": "2weeks",
"registry_max_agent_count": "102400",
"registry_store_timeout": "20secs",
"registry_strict": "false",
"require_agent_domain": "false",
"root_submissions": "true",
"user_sorter": "drf",
"version": "false",
"webui_dir": "/home/user/mesos/build/../src/webui",
"work_dir": "/tmp/user/mesos/master",
"zk": "zk://user@passwd:127.0.0.1:2181/mesos",
"zk_session_timeout": "10secs"
}
}
{noformat}

This leaves no effective way to prevent the passwords from being exposed to 
anyone who can access the logs or the {{/flags}} endpoint, regardless of which 
configuration method is used.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8412) Example frameworks should support passing options via both command line flags and environment variables

2018-01-08 Thread Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-8412:
---

 Summary: Example frameworks should support passing options via 
both command line flags and environment variables
 Key: MESOS-8412
 URL: https://issues.apache.org/jira/browse/MESOS-8412
 Project: Mesos
  Issue Type: Bug
  Components: documentation, test
Affects Versions: 1.5.0
Reporter: Benjamin Bannier


Some options to the Mesos example frameworks can be passed as command line 
flags while others can only be passed via environment variables.

This is inconsistent with how, e.g., {{mesos-master}} or {{mesos-agent}} are 
configured, where either path can be used to pass a value to the executable.

We should update the example frameworks so that every option can be set 
either via the command line or via the process environment.
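
As an illustration of the pattern the example frameworks could follow, here 
is a minimal C++ sketch that resolves an option from either a command line 
flag or an environment variable. The names used ({{--master}}, 
{{MESOS_MASTER}}, {{resolveOption}}) are assumptions for the example, not the 
actual stout flags implementation:

{code}
#include <cstdlib>
#include <iostream>
#include <optional>
#include <string>

// Resolve an option from "--<flag>=<value>" on the command line, falling
// back to an environment variable if the flag is absent.
std::optional<std::string> resolveOption(
    int argc,
    char** argv,
    const std::string& flag,       // e.g. "--master="
    const char* envVariable)       // e.g. "MESOS_MASTER"
{
  // 1. A command line flag takes precedence.
  for (int i = 1; i < argc; i++) {
    const std::string arg = argv[i];
    if (arg.compare(0, flag.size(), flag) == 0) {
      return arg.substr(flag.size());
    }
  }

  // 2. Otherwise fall back to the process environment.
  if (const char* value = std::getenv(envVariable)) {
    return std::string(value);
  }

  return std::nullopt;
}

int main(int argc, char** argv)
{
  std::optional<std::string> master =
    resolveOption(argc, argv, "--master=", "MESOS_MASTER");

  std::cout << "master: " << master.value_or("<unset>") << std::endl;
  return 0;
}
{code}

With this pattern, {{./example-framework --master=127.0.0.1:5050}} and 
{{MESOS_MASTER=127.0.0.1:5050 ./example-framework}} resolve to the same value.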



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)