[jira] [Commented] (MESOS-907) Add Kerberos Authentication support

2016-12-08 Thread haijiang.chen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15734385#comment-15734385
 ] 

haijiang.chen commented on MESOS-907:
-

Is there any progress on fixing this issue?

> Add Kerberos Authentication support
> ---
>
> Key: MESOS-907
> URL: https://issues.apache.org/jira/browse/MESOS-907
> Project: Mesos
>  Issue Type: Story
>  Components: general
>Reporter: Adam B
>Assignee: Tim Anderegg
>  Labels: security, twitter
>
> MESOS-704 added basic authentication support using CRAM-MD5 through SASL. Now 
> we should integrate Kerberos authentication using GSS-API, which is already 
> supported by SASL. Kerberos is a widely-used industry standard authentication 
> service, and integration with Mesos will make it easier for customers to 
> integrate their existing security process with Mesos.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6082) Add scheduler Call and Event based metrics to the master.

2016-12-08 Thread Abhishek Dasgupta (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15734306#comment-15734306
 ] 

Abhishek Dasgupta commented on MESOS-6082:
--

PRs:
https://reviews.apache.org/r/54572/
https://reviews.apache.org/r/54573/
https://reviews.apache.org/r/54574/

> Add scheduler Call and Event based metrics to the master.
> -
>
> Key: MESOS-6082
> URL: https://issues.apache.org/jira/browse/MESOS-6082
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Benjamin Mahler
>Assignee: Abhishek Dasgupta
>
> Currently, the master only has metrics for the old-style messages, and these 
> are unfortunately re-used for calls:
> {code}
>   // Messages from schedulers.
>   process::metrics::Counter messages_register_framework;
>   process::metrics::Counter messages_reregister_framework;
>   process::metrics::Counter messages_unregister_framework;
>   process::metrics::Counter messages_deactivate_framework;
>   process::metrics::Counter messages_kill_task;
>   process::metrics::Counter messages_status_update_acknowledgement;
>   process::metrics::Counter messages_resource_request;
>   process::metrics::Counter messages_launch_tasks;
>   process::metrics::Counter messages_decline_offers;
>   process::metrics::Counter messages_revive_offers;
>   process::metrics::Counter messages_suppress_offers;
>   process::metrics::Counter messages_reconcile_tasks;
>   process::metrics::Counter messages_framework_to_executor;
> {code}
> Now that we've introduced the Call/Event based API, we should have metrics 
> that reflect this. For example:
> {code}
> {
>   scheduler/calls: 100
>   scheduler/calls/decline: 90,
>   scheduler/calls/accept: 10,
>   scheduler/calls/accept/operations/create: 1,
>   scheduler/calls/accept/operations/destroy: 0,
>   scheduler/calls/accept/operations/launch: 4,
>   scheduler/calls/accept/operations/launch_group: 2,
>   scheduler/calls/accept/operations/reserve: 1,
>   scheduler/calls/accept/operations/unreserve: 0,
>   scheduler/calls/kill: 0,
>   // etc
> }
> {code}
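
A hedged sketch of what the per-Call counters could look like in the master, 
using libprocess's metrics primitives (the struct and exact metric keys below 
are illustrative, not final naming):

{code}
#include <process/metrics/counter.hpp>
#include <process/metrics/metrics.hpp>

// Illustrative only: one Counter per scheduler Call type, registered so
// the values show up in the master's /metrics/snapshot endpoint.
struct SchedulerCallMetrics
{
  process::metrics::Counter calls{"scheduler/calls"};
  process::metrics::Counter calls_decline{"scheduler/calls/decline"};
  process::metrics::Counter calls_accept{"scheduler/calls/accept"};
  process::metrics::Counter calls_kill{"scheduler/calls/kill"};

  SchedulerCallMetrics()
  {
    process::metrics::add(calls);
    process::metrics::add(calls_decline);
    process::metrics::add(calls_accept);
    process::metrics::add(calls_kill);
  }

  ~SchedulerCallMetrics()
  {
    process::metrics::remove(calls);
    process::metrics::remove(calls_decline);
    process::metrics::remove(calls_accept);
    process::metrics::remove(calls_kill);
  }
};
{code}

On receipt of a Call the master would then bump the matching counter, e.g. 
{{++metrics.calls; ++metrics.calls_decline;}}.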



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4217) Mesos sandbox UI doesn't follow symlinks

2016-12-08 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15734292#comment-15734292
 ] 

haosdent commented on MESOS-4217:
-

Yep, for security reasons, we do not show those entries if they are outside the 
exposed scope.

> Mesos sandbox UI doesn't follow symlinks
> 
>
> Key: MESOS-4217
> URL: https://issues.apache.org/jira/browse/MESOS-4217
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Mohit Soni
>Priority: Minor
>
> Current Mesos sandbox UI doesn't follow symlinks. Right now this prevents a 
> user to browse a persistent volume, which is symlinked inside the sandbox 
> directory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6766) HealthChecker launches a helper binary during each health check attempt.

2016-12-08 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-6766:
--

 Summary: HealthChecker launches a helper binary during each health 
check attempt.
 Key: MESOS-6766
 URL: https://issues.apache.org/jira/browse/MESOS-6766
 Project: Mesos
  Issue Type: Improvement
Affects Versions: 1.1.0, 1.2.0
Reporter: Alexander Rukletsov


Currently the HealthChecker library launches a helper binary ({{curl}} for 
{{HTTP}}, {{mesos-tcp-connect}} for {{TCP}}, the command itself for 
{{COMMAND}}) each time it attempts a health check. While this is unavoidable for 
{{COMMAND}} health checks, for {{HTTP}} and {{TCP}} we can do better. We 
probably can't send requests or try TCP handshakes directly from the library, 
i.e., the executor process, because the appropriate namespaces must be entered 
first, but we can launch the binary only once and keep it running alongside the 
task process for as long as the task is running.
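
A minimal sketch of the "launch once" idea for {{TCP}} checks, assuming the 
prober below is the long-lived helper that already entered the appropriate 
namespaces at startup (everything here is illustrative, not the actual 
{{mesos-tcp-connect}}): it performs one handshake per request read from stdin 
instead of being forked per attempt:

{code}
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cstdio>
#include <cstdlib>

int main()
{
  char line[64];

  // One probe per line of input: each line carries a port number.
  while (fgets(line, sizeof(line), stdin) != nullptr) {
    int port = atoi(line);

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr = {};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

    bool ok = connect(fd, (sockaddr*) &addr, sizeof(addr)) == 0;
    close(fd);

    // Report the handshake result back to the health check library.
    fputs(ok ? "OK\n" : "FAIL\n", stdout);
    fflush(stdout);
  }

  return 0;
}
{code}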



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6765) Consider making the Resources wrapper "copy-on-write" to improve performance.

2016-12-08 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-6765:
---
Labels: performance  (was: )

> Consider making the Resources wrapper "copy-on-write" to improve performance.
> -
>
> Key: MESOS-6765
> URL: https://issues.apache.org/jira/browse/MESOS-6765
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Mahler
>  Labels: performance
>
> Resources currently directly stores the underlying resource objects:
> {code}
> class Resources
> {
>   ...
>   std::vector<Resource> resources;
> };
> {code}
> What this means is that copying of Resources (which occurs frequently) is 
> expensive since copying a {{Resource}} object is relatively heavy-weight.
> One strategy, in MESOS-4770, is to avoid protobuf in favor of C++ types (i.e. 
> replace {{Value::Scalar}}, {{Value::Set}}, and {{Value::Ranges}} with C++ 
> equivalents). However, metadata like reservations, disk info, etc, is still 
> fairly expensive to copy even if avoiding protobufs.
> An approach to reduce copying would be to only copy the resource objects upon 
> writing, when there are multiple references to the resource object. If there 
> is a single reference to the resource object we could safely mutate it 
> without copying. E.g.
> {code}
> class Resources
> {
>   ...
>   std::vector<std::shared_ptr<Resource>> resources;
> };
>
> // Mutation function:
> void Resources::mutate(size_t index)
> {
>   // Copy if there are multiple references.
>   if (resources[index].use_count() > 1) {
>     resources[index] = copy(resources[index]);
>   }
>
>   // Mutate safely.
>   resources[index]->some_mutation();
> }
> {code}
> On the other hand, this introduces an additional level of pointer chasing, so 
> we would need to weigh the two approaches.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6765) Consider making the Resources wrapper "copy-on-write" to improve performance.

2016-12-08 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-6765:
--

 Summary: Consider making the Resources wrapper "copy-on-write" to 
improve performance.
 Key: MESOS-6765
 URL: https://issues.apache.org/jira/browse/MESOS-6765
 Project: Mesos
  Issue Type: Improvement
Reporter: Benjamin Mahler


Resources currently directly stores the underlying resource objects:

{code}
class Resources
{
  ...
  std::vector<Resource> resources;
};
{code}

What this means is that copying of Resources (which occurs frequently) is 
expensive since copying a {{Resource}} object is relatively heavy-weight.

One strategy, in MESOS-4770, is to avoid protobuf in favor of C++ types (i.e. 
replace {{Value::Scalar}}, {{Value::Set}}, and {{Value::Ranges}} with C++ 
equivalents). However, metadata like reservations, disk info, etc, is still 
fairly expensive to copy even if avoiding protobufs.

An approach to reduce copying would be to only copy the resource objects upon 
writing, when there are multiple references to the resource object. If there is 
a single reference to the resource object we could safely mutate it without 
copying. E.g.

{code}
class Resources
{
  ...
  std::vector<std::shared_ptr<Resource>> resources;
};

// Mutation function:
void Resources::mutate(size_t index)
{
  // Copy if there are multiple references.
  if (resources[index].use_count() > 1) {
    resources[index] = copy(resources[index]);
  }

  // Mutate safely.
  resources[index]->some_mutation();
}
{code}

On the other hand, this introduces an additional level of pointer chasing, so we 
would need to weigh the two approaches.
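
A self-contained sketch of the copy-on-write idea using {{std::shared_ptr}} (an 
assumption; the real implementation might use a different ref-counted wrapper):

{code}
#include <cstddef>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

// Illustrative stand-in for the protobuf Resource message.
struct Resource
{
  std::string name;
  double scalar;
};

class CopyOnWriteResources
{
public:
  // The implicitly generated copy constructor only copies shared_ptrs,
  // never the Resource objects themselves.

  void add(const Resource& r)
  {
    resources.push_back(std::make_shared<Resource>(r));
  }

  void setScalar(std::size_t index, double value)
  {
    // Copy-on-write: clone only when the entry is shared.
    if (resources[index].use_count() > 1) {
      resources[index] = std::make_shared<Resource>(*resources[index]);
    }
    resources[index]->scalar = value;
  }

  double scalar(std::size_t index) const { return resources[index]->scalar; }

private:
  std::vector<std::shared_ptr<Resource>> resources;
};

int main()
{
  CopyOnWriteResources a;
  a.add({"cpus", 4.0});

  CopyOnWriteResources b = a;  // Cheap copy: entries are shared.
  b.setScalar(0, 2.0);         // Triggers a clone; `a` is unaffected.

  std::cout << a.scalar(0) << " " << b.scalar(0) << std::endl;  // "4 2"
  return 0;
}
{code}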



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6759) IOSwitchboardServerTest.AttachOutput has CHECK failure if run it multiple times.

2016-12-08 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15733977#comment-15733977
 ] 

Joseph Wu commented on MESOS-6759:
--

That sounds very similar to the reasoning behind this code in libprocess's 
finalize logic:
https://github.com/apache/mesos/blob/1d8d5c20709f97e6893e156be75057e34cbd97a9/3rdparty/libprocess/src/process.cpp#L1260-L1268

> IOSwitchboardServerTest.AttachOutput has CHECK failure if run it multiple 
> times.
> 
>
> Key: MESOS-6759
> URL: https://issues.apache.org/jira/browse/MESOS-6759
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jie Yu
>
> I can easily repro this issue on my dev CentOS 7 box with the following command:
> {noformat}
> GLOG_v=1 bin/mesos-tests.sh 
> --gtest_filter=IOSwitchboardServerTest.AttachOutput --verbose --gtest_repeat=2
> 
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from IOSwitchboardServerTest
> [ RUN  ] IOSwitchboardServerTest.AttachOutput
> I1208 10:46:31.574084 41813 poll_socket.cpp:209] Socket error while sending: 
> Broken pipe
> /home/jie/workspace/mesos/src/tests/containerizer/io_switchboard_tests.cpp:265:
>  Failure
> (response).failure(): Disconnected
> /home/jie/workspace/mesos/src/tests/containerizer/io_switchboard_tests.cpp:266:
>  Failure
> (response).failure(): Disconnected
> F1208 10:46:31.574919 41751 future.hpp:1137] Check failed: !isFailed() 
> Future::get() but state == FAILED: Disconnected
> *** Check failure stack trace: ***
> @ 0x7fc3f35a633a  google::LogMessage::Fail()
> @ 0x7fc3f35a6299  google::LogMessage::SendToLog()
> @ 0x7fc3f35a5caa  google::LogMessage::Flush()
> @ 0x7fc3f35a89de  google::LogMessageFatal::~LogMessageFatal()
> @   0xb6a352  process::Future<>::get()
> @  0x1a050fe  
> mesos::internal::tests::IOSwitchboardServerTest_AttachOutput_Test::TestBody()
> @  0x1c54ce2  
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @  0x1c4fe00  
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @  0x1c31491  testing::Test::Run()
> @  0x1c31c14  testing::TestInfo::Run()
> @  0x1c3225a  testing::TestCase::Run()
> @  0x1c38b34  testing::internal::UnitTestImpl::RunAllTests()
> @  0x1c55907  
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @  0x1c50948  
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @  0x1c3787a  testing::UnitTest::Run()
> @  0x11cc653  RUN_ALL_TESTS()
> @  0x11cc209  main
> @ 0x7fc3ecb61b15  __libc_start_main
> @   0xab5e89  (unknown)
> Aborted (core dumped)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-6759) IOSwitchboardServerTest.AttachOutput has CHECK failure if run it multiple times.

2016-12-08 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15733955#comment-15733955
 ] 

Jie Yu edited comment on MESOS-6759 at 12/9/16 1:32 AM:


Current hypothesis:

1) We don't discard the socket.accept() future if the io switchboard server 
terminates, which is a bug.
2) After discarding the future returned by socket.accept() in finalize, the 
tests passed.
3) I suspect that Linux has a bug when there are multiple threads listening on 
different domain sockets: `accept` on one domain socket might accidentally 
accept a connection trying to connect to the other one. Also, io::poll on both 
sockets will return (which is not correct). (PS: I cannot repro this bug on OSX)


was (Author: jieyu):
Current hypothesis:

1) We don't discard the socket.accept() future if the io switchboard server 
terminates, which is a bug.
2) After discarding the future returned by socket.accept() in finalize, the 
tests passed.
3) I suspect that Linux has a bug when there are multiple threads listening on 
different domain sockets: `accept` on one domain socket might accidentally 
accept a connection trying to connect to the other one. Also, io::poll on both 
sockets will return (which is not correct).

> IOSwitchboardServerTest.AttachOutput has CHECK failure if run it multiple 
> times.
> 
>
> Key: MESOS-6759
> URL: https://issues.apache.org/jira/browse/MESOS-6759
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jie Yu
>
> I can easily repro this issue on my dev CentOS 7 box with the following command:
> {noformat}
> GLOG_v=1 bin/mesos-tests.sh 
> --gtest_filter=IOSwitchboardServerTest.AttachOutput --verbose --gtest_repeat=2
> 
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from IOSwitchboardServerTest
> [ RUN  ] IOSwitchboardServerTest.AttachOutput
> I1208 10:46:31.574084 41813 poll_socket.cpp:209] Socket error while sending: 
> Broken pipe
> /home/jie/workspace/mesos/src/tests/containerizer/io_switchboard_tests.cpp:265:
>  Failure
> (response).failure(): Disconnected
> /home/jie/workspace/mesos/src/tests/containerizer/io_switchboard_tests.cpp:266:
>  Failure
> (response).failure(): Disconnected
> F1208 10:46:31.574919 41751 future.hpp:1137] Check failed: !isFailed() 
> Future::get() but state == FAILED: Disconnected
> *** Check failure stack trace: ***
> @ 0x7fc3f35a633a  google::LogMessage::Fail()
> @ 0x7fc3f35a6299  google::LogMessage::SendToLog()
> @ 0x7fc3f35a5caa  google::LogMessage::Flush()
> @ 0x7fc3f35a89de  google::LogMessageFatal::~LogMessageFatal()
> @   0xb6a352  process::Future<>::get()
> @  0x1a050fe  
> mesos::internal::tests::IOSwitchboardServerTest_AttachOutput_Test::TestBody()
> @  0x1c54ce2  
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @  0x1c4fe00  
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @  0x1c31491  testing::Test::Run()
> @  0x1c31c14  testing::TestInfo::Run()
> @  0x1c3225a  testing::TestCase::Run()
> @  0x1c38b34  testing::internal::UnitTestImpl::RunAllTests()
> @  0x1c55907  
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @  0x1c50948  
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @  0x1c3787a  testing::UnitTest::Run()
> @  0x11cc653  RUN_ALL_TESTS()
> @  0x11cc209  main
> @ 0x7fc3ecb61b15  __libc_start_main
> @   0xab5e89  (unknown)
> Aborted (core dumped)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6759) IOSwitchboardServerTest.AttachOutput has CHECK failure if run it multiple times.

2016-12-08 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15733955#comment-15733955
 ] 

Jie Yu commented on MESOS-6759:
---

Current hypothesis:

1) We don't discard the socket.accept() future if the io switchboard server 
terminates, which is a bug.
2) After discarding the future returned by socket.accept() in finalize, the 
tests passed.
3) I suspect that Linux has a bug when there are multiple threads listening on 
different domain sockets: `accept` on one domain socket might accidentally 
accept a connection trying to connect to the other one. Also, io::poll on both 
sockets will return (which is not correct).
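
A hedged sketch of the fix for (1), assuming the pending accept future is kept 
around so that finalize can discard it (names illustrative; the actual 
switchboard code differs):

{code}
// In the switchboard server: remember the in-flight accept() so that
// finalize() can abandon it if the server terminates first.
process::Future<process::network::unix::Socket> accepted = socket.accept();

// In finalize():
accepted.discard();  // Addresses hypothesis (1) above.
{code}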

> IOSwitchboardServerTest.AttachOutput has CHECK failure if run it multiple 
> times.
> 
>
> Key: MESOS-6759
> URL: https://issues.apache.org/jira/browse/MESOS-6759
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jie Yu
>
> I can easily repro this issue on my dev CentOS 7 box with the following command:
> {noformat}
> GLOG_v=1 bin/mesos-tests.sh 
> --gtest_filter=IOSwitchboardServerTest.AttachOutput --verbose --gtest_repeat=2
> 
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from IOSwitchboardServerTest
> [ RUN  ] IOSwitchboardServerTest.AttachOutput
> I1208 10:46:31.574084 41813 poll_socket.cpp:209] Socket error while sending: 
> Broken pipe
> /home/jie/workspace/mesos/src/tests/containerizer/io_switchboard_tests.cpp:265:
>  Failure
> (response).failure(): Disconnected
> /home/jie/workspace/mesos/src/tests/containerizer/io_switchboard_tests.cpp:266:
>  Failure
> (response).failure(): Disconnected
> F1208 10:46:31.574919 41751 future.hpp:1137] Check failed: !isFailed() 
> Future::get() but state == FAILED: Disconnected
> *** Check failure stack trace: ***
> @ 0x7fc3f35a633a  google::LogMessage::Fail()
> @ 0x7fc3f35a6299  google::LogMessage::SendToLog()
> @ 0x7fc3f35a5caa  google::LogMessage::Flush()
> @ 0x7fc3f35a89de  google::LogMessageFatal::~LogMessageFatal()
> @   0xb6a352  process::Future<>::get()
> @  0x1a050fe  
> mesos::internal::tests::IOSwitchboardServerTest_AttachOutput_Test::TestBody()
> @  0x1c54ce2  
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @  0x1c4fe00  
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @  0x1c31491  testing::Test::Run()
> @  0x1c31c14  testing::TestInfo::Run()
> @  0x1c3225a  testing::TestCase::Run()
> @  0x1c38b34  testing::internal::UnitTestImpl::RunAllTests()
> @  0x1c55907  
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @  0x1c50948  
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @  0x1c3787a  testing::UnitTest::Run()
> @  0x11cc653  RUN_ALL_TESTS()
> @  0x11cc209  main
> @ 0x7fc3ecb61b15  __libc_start_main
> @   0xab5e89  (unknown)
> Aborted (core dumped)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1718) Command executor can overcommit the agent.

2016-12-08 Thread Klaus Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15733949#comment-15733949
 ] 

Klaus Ma commented on MESOS-1718:
-

[~alexr], anyway, I think an exception is always bad :). It's better for us to 
align the CLI executor with the others.

> Command executor can overcommit the agent.
> --
>
> Key: MESOS-1718
> URL: https://issues.apache.org/jira/browse/MESOS-1718
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Benjamin Mahler
>Priority: Critical
>
> Currently we give a small amount of resources to the command executor, in 
> addition to resources used by the command task:
> https://github.com/apache/mesos/blob/0.20.0-rc1/src/slave/slave.cpp#L2448
> {code: title=}
> ExecutorInfo Slave::getExecutorInfo(
> const FrameworkID& frameworkId,
> const TaskInfo& task)
> {
>   ...
> // Add an allowance for the command executor. This does lead to a
> // small overcommit of resources.
> executor.mutable_resources()->MergeFrom(
> Resources::parse(
>   "cpus:" + stringify(DEFAULT_EXECUTOR_CPUS) + ";" +
>   "mem:" + stringify(DEFAULT_EXECUTOR_MEM.megabytes())).get());
>   ...
> }
> {code}
> This leads to an overcommit of the slave. Ideally, for command tasks we can 
> "transfer" all of the task resources to the executor at the slave / isolation 
> level.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6764) Add a grace period for terminating the I/O switchboard server.

2016-12-08 Thread Jie Yu (JIRA)
Jie Yu created MESOS-6764:
-

 Summary: Add a grace period for terminating the I/O switchboard 
server.
 Key: MESOS-6764
 URL: https://issues.apache.org/jira/browse/MESOS-6764
 Project: Mesos
  Issue Type: Task
Reporter: Jie Yu
Assignee: Jie Yu


We want to give the I/O switchboard server a grace period to wait for the 
connection from the containerizer. This is for the case where the container 
itself is short-lived (e.g., a DEBUG container does an 'ls' and exits). In that 
case, we still want a subsequent attach-output call to get the output from 
that container.
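
A hedged sketch of the grace-period idea using libprocess's {{Future::after}} 
(the {{waitForConnection}} helper and {{Connection}} type are hypothetical 
placeholders for the switchboard's accept logic):

{code}
#include <process/future.hpp>

#include <stout/duration.hpp>

struct Connection {};  // Hypothetical placeholder type.

// Hypothetical: completes once the containerizer connects (e.g., wraps the
// switchboard's socket accept).
process::Future<Connection> waitForConnection();

process::Future<Connection> connectWithGrace(const Duration& gracePeriod)
{
  return waitForConnection()
    .after(gracePeriod,
           [](process::Future<Connection> f) -> process::Future<Connection> {
      f.discard();  // Stop waiting once the grace period elapses.
      return process::Failure(
          "No containerizer connection within the grace period");
    });
}
{code}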



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6763) Add heartbeats to both input/output connections in IOSwitchboard

2016-12-08 Thread Kevin Klues (JIRA)
Kevin Klues created MESOS-6763:
--

 Summary: Add heartbeats to both input/output connections in 
IOSwitchboard
 Key: MESOS-6763
 URL: https://issues.apache.org/jira/browse/MESOS-6763
 Project: Mesos
  Issue Type: Improvement
Reporter: Kevin Klues
Assignee: Kevin Klues


Some networks will kill idle connections if no data is transferred over them 
within a set amount of time. For example, with AWS's Elastic Load Balancer 
(ELB), the default idle timeout before killing a connection is only 60s! Because 
of this, we need a way to send application-level heartbeats to keep these 
connections alive for any long-running LAUNCH_NESTED_CONTAINER_SESSION, 
ATTACH_CONTAINER_INPUT, and ATTACH_CONTAINER_OUTPUT calls. 

We should serve these heartbeats from the IOSwitchboard server rather than the 
agent handlers since the agent essentially acts as a proxy and the heartbeats 
should originate from the actual server being communicated with.
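
A minimal sketch of the idea, assuming a blocking writer and a hypothetical 
{{encodeHeartbeat()}} that produces whatever record the stream's framing treats 
as a no-op (the real switchboard framing differs):

{code}
#include <chrono>
#include <ostream>
#include <string>
#include <thread>

// Hypothetical: a record the client will parse and then ignore.
std::string encodeHeartbeat()
{
  return "\n";  // Placeholder framing, not the actual protocol.
}

void heartbeatLoop(std::ostream& connection)
{
  // Stay well under aggressive idle timeouts such as ELB's 60s default.
  const std::chrono::seconds interval(30);

  while (connection.good()) {
    connection << encodeHeartbeat() << std::flush;
    std::this_thread::sleep_for(interval);
  }
}
{code}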



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4217) Mesos sandbox UI doesn't follow symlinks

2016-12-08 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15733932#comment-15733932
 ] 

Joseph Wu commented on MESOS-4217:
--

The agent (more specifically, the {{Files}} actor: 
https://github.com/apache/mesos/blob/1.1.x/src/files/files.hpp#L67-L73 ) keeps 
a list of directories that have been "exposed", and will only allow access to 
those directories.
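
A hedged sketch of the containment check such a whitelist implies (illustrative 
only, not the actual {{Files}} actor code): canonicalize the requested path 
first (resolving symlinks, e.g., via realpath), then require it to stay under 
an attached root:

{code}
#include <string>

// True if `resolved` equals `root` or lies strictly under it. `resolved`
// is assumed to be fully canonicalized, so a symlink escaping the root
// fails the check.
bool isExposed(const std::string& resolved, const std::string& root)
{
  return resolved == root ||
         (resolved.size() > root.size() &&
          resolved.compare(0, root.size(), root) == 0 &&
          resolved[root.size()] == '/');
}
{code}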

> Mesos sandbox UI doesn't follow symlinks
> 
>
> Key: MESOS-4217
> URL: https://issues.apache.org/jira/browse/MESOS-4217
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Mohit Soni
>Priority: Minor
>
> Current Mesos sandbox UI doesn't follow symlinks. Right now this prevents a 
> user to browse a persistent volume, which is symlinked inside the sandbox 
> directory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-4217) Mesos sandbox UI doesn't follow symlinks

2016-12-08 Thread Tobias Pfeiffer (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15733898#comment-15733898
 ] 

Tobias Pfeiffer edited comment on MESOS-4217 at 12/9/16 1:01 AM:
-

I am not sure if the original author had the same issue, but at least for me 
the problem still exists when the symlink points to a directory *outside* the 
sandbox.

When ssh-ing to the node that executes the task:

{noformat}
$ cd /var/lib/mesos/slaves/.../frameworks/.../executors/.../runs/..
$ ll
total 36K
lrwxrwxrwx 1 1000 users   45 12月  9 09:52 results -> /var/lib/otherdir
-rw-r--r-- 1 1000 users 5.5K 12月  9 09:53 stderr
-rw-r--r-- 1 1000 users 2.8K 12月  9 09:53 stdout
$ ll results/
total 824K
-rw-r--r-- 1 1000 1000  31K 12月  9 09:52 0.bin
-rw-r--r-- 1 1000 1000  42K 12月  9 09:53 1000.bin
-rw-r--r-- 1 1000 1000  31K 12月  9 09:53 1200.bin
...
{noformat}

However, in the sandbox UI, while the "results" symlink is displayed with the 
directory icon, clicking the link shows only an empty list (as if it were an 
empty directory). I don't know whether this is intended or necessary for 
security reasons, but in that case wouldn't an error message be preferable to an 
empty listing?


was (Author: tgpfeiffer):
I am not sure if the original author had the same issue, but at least for me 
the problem still exists when the symlink points to a directory *outside* the 
sandbox.

When ssh-ing to the node that executes the task:
$ cd /var/lib/mesos/slaves/.../frameworks/.../executors/.../runs/..
$ ll
total 36K
lrwxrwxrwx 1 1000 users   45 12月  9 09:52 results -> /var/lib/otherdir
-rw-r--r-- 1 1000 users 5.5K 12月  9 09:53 stderr
-rw-r--r-- 1 1000 users 2.8K 12月  9 09:53 stdout
$ ll results/
total 824K
-rw-r--r-- 1 1000 1000  31K 12月  9 09:52 0.bin
-rw-r--r-- 1 1000 1000  42K 12月  9 09:53 1000.bin
-rw-r--r-- 1 1000 1000  31K 12月  9 09:53 1200.bin
...

However, in the sandbox UI, while the "results" symlink is displayed with the 
directory icon, clicking the link shows only an empty list (as if it were an 
empty directory). I don't know whether this is intended or necessary for 
security reasons, but in that case wouldn't an error message be preferable to an 
empty listing?

> Mesos sandbox UI doesn't follow symlinks
> 
>
> Key: MESOS-4217
> URL: https://issues.apache.org/jira/browse/MESOS-4217
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Mohit Soni
>Priority: Minor
>
> Current Mesos sandbox UI doesn't follow symlinks. Right now this prevents a 
> user to browse a persistent volume, which is symlinked inside the sandbox 
> directory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4217) Mesos sandbox UI doesn't follow symlinks

2016-12-08 Thread Tobias Pfeiffer (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15733898#comment-15733898
 ] 

Tobias Pfeiffer commented on MESOS-4217:


I am not sure if the original author had the same issue, but at least for me 
the problem still exists when the symlink points to a directory *outside* the 
sandbox.

When ssh-ing to the node that executes the task:

{noformat}
$ cd /var/lib/mesos/slaves/.../frameworks/.../executors/.../runs/..
$ ll
total 36K
lrwxrwxrwx 1 1000 users   45 12月  9 09:52 results -> /var/lib/otherdir
-rw-r--r-- 1 1000 users 5.5K 12月  9 09:53 stderr
-rw-r--r-- 1 1000 users 2.8K 12月  9 09:53 stdout
$ ll results/
total 824K
-rw-r--r-- 1 1000 1000  31K 12月  9 09:52 0.bin
-rw-r--r-- 1 1000 1000  42K 12月  9 09:53 1000.bin
-rw-r--r-- 1 1000 1000  31K 12月  9 09:53 1200.bin
...
{noformat}

However, in the sandbox UI, while the "results" symlink is displayed with the 
directory icon, clicking the link shows only an empty list (as if it were an 
empty directory). I don't know whether this is intended or necessary for 
security reasons, but in that case wouldn't an error message be preferable to an 
empty listing?

> Mesos sandbox UI doesn't follow symlinks
> 
>
> Key: MESOS-4217
> URL: https://issues.apache.org/jira/browse/MESOS-4217
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Mohit Soni
>Priority: Minor
>
> Current Mesos sandbox UI doesn't follow symlinks. Right now this prevents a 
> user to browse a persistent volume, which is symlinked inside the sandbox 
> directory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1718) Command executor can overcommit the slave.

2016-12-08 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-1718:
---
Priority: Critical  (was: Major)

> Command executor can overcommit the slave.
> --
>
> Key: MESOS-1718
> URL: https://issues.apache.org/jira/browse/MESOS-1718
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Benjamin Mahler
>Priority: Critical
>
> Currently we give a small amount of resources to the command executor, in 
> addition to resources used by the command task:
> https://github.com/apache/mesos/blob/0.20.0-rc1/src/slave/slave.cpp#L2448
> {code: title=}
> ExecutorInfo Slave::getExecutorInfo(
> const FrameworkID& frameworkId,
> const TaskInfo& task)
> {
>   ...
> // Add an allowance for the command executor. This does lead to a
> // small overcommit of resources.
> executor.mutable_resources()->MergeFrom(
> Resources::parse(
>   "cpus:" + stringify(DEFAULT_EXECUTOR_CPUS) + ";" +
>   "mem:" + stringify(DEFAULT_EXECUTOR_MEM.megabytes())).get());
>   ...
> }
> {code}
> This leads to an overcommit of the slave. Ideally, for command tasks we can 
> "transfer" all of the task resources to the executor at the slave / isolation 
> level.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1718) Command executor can overcommit the agent.

2016-12-08 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-1718:
---
Summary: Command executor can overcommit the agent.  (was: Command executor 
can overcommit the slave.)

> Command executor can overcommit the agent.
> --
>
> Key: MESOS-1718
> URL: https://issues.apache.org/jira/browse/MESOS-1718
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Benjamin Mahler
>Priority: Critical
>
> Currently we give a small amount of resources to the command executor, in 
> addition to resources used by the command task:
> https://github.com/apache/mesos/blob/0.20.0-rc1/src/slave/slave.cpp#L2448
> {code: title=}
> ExecutorInfo Slave::getExecutorInfo(
> const FrameworkID& frameworkId,
> const TaskInfo& task)
> {
>   ...
> // Add an allowance for the command executor. This does lead to a
> // small overcommit of resources.
> executor.mutable_resources()->MergeFrom(
> Resources::parse(
>   "cpus:" + stringify(DEFAULT_EXECUTOR_CPUS) + ";" +
>   "mem:" + stringify(DEFAULT_EXECUTOR_MEM.megabytes())).get());
>   ...
> }
> {code}
> This leads to an overcommit of the slave. Ideally, for command tasks we can 
> "transfer" all of the task resources to the executor at the slave / isolation 
> level.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1718) Command executor can overcommit the slave.

2016-12-08 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15733469#comment-15733469
 ] 

Alexander Rukletsov commented on MESOS-1718:


If {{--cgroups_enable_cfs}} is enabled, the overcommit due to the unaccounted 
executor CPU resources may effectively defeat the CFS limits, depending on the 
ratio of the task's CPU to the executor's.

Consider the following scenario. A framework attempts to run numerous small, 
computationally intensive command-based tasks with a {{0.1}} CPU requirement 
each. Since the allocator does not account for the extra executor resources, it 
is possible to schedule {{10}} such tasks per physical CPU on one agent. 
However, those extra resources do increase the CFS quota, which is effectively 
{{0.2 × CFS period}} per task. I would argue this can be surprising and 
undesirable in some cases.
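
A worked version of that arithmetic, assuming the kernel's default CFS period 
of 100ms and a {{DEFAULT_EXECUTOR_CPUS}} of 0.1 (both assumptions for the sake 
of the example):

{code}
// Assumed defaults: cfs_period_us = 100000 (100ms), DEFAULT_EXECUTOR_CPUS = 0.1.
const double CFS_PERIOD_MS = 100.0;
const double TASK_CPUS     = 0.1;
const double EXECUTOR_CPUS = 0.1;

// CFS quota granted per command task (task + executor allowance):
const double QUOTA_MS = (TASK_CPUS + EXECUTOR_CPUS) * CFS_PERIOD_MS;  // 20ms

// Ten such tasks packed onto one physical CPU receive 10 x 20ms = 200ms of
// quota per 100ms period: a 2x CPU overcommit despite CFS being enabled.
{code}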

> Command executor can overcommit the slave.
> --
>
> Key: MESOS-1718
> URL: https://issues.apache.org/jira/browse/MESOS-1718
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Benjamin Mahler
>
> Currently we give a small amount of resources to the command executor, in 
> addition to resources used by the command task:
> https://github.com/apache/mesos/blob/0.20.0-rc1/src/slave/slave.cpp#L2448
> {code: title=}
> ExecutorInfo Slave::getExecutorInfo(
> const FrameworkID& frameworkId,
> const TaskInfo& task)
> {
>   ...
> // Add an allowance for the command executor. This does lead to a
> // small overcommit of resources.
> executor.mutable_resources()->MergeFrom(
> Resources::parse(
>   "cpus:" + stringify(DEFAULT_EXECUTOR_CPUS) + ";" +
>   "mem:" + stringify(DEFAULT_EXECUTOR_MEM.megabytes())).get());
>   ...
> }
> {code}
> This leads to an overcommit of the slave. Ideally, for command tasks we can 
> "transfer" all of the task resources to the executor at the slave / isolation 
> level.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5849) Agent sandboxes on Windows surpass the 260 character path length limit

2016-12-08 Thread Alex Clemmer (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Clemmer updated MESOS-5849:

Shepherd: Joseph Wu

> Agent sandboxes on Windows surpass the 260 character path length limit
> --
>
> Key: MESOS-5849
> URL: https://issues.apache.org/jira/browse/MESOS-5849
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
> Environment: Windows Server 2012, Windows Server 2016 RC
>Reporter: Lokendra Malik
>Assignee: Alex Clemmer
>Priority: Blocker
>  Labels: microsoft, tech-debt, windows
> Attachments: Pasted image at 2016_07_14 09_02 PM.png, mesoscrash.jpg
>
>
> When I tried to deploy an application on a mesos-agent (Windows), the moment 
> the application was deployed the Mesos agent service on the Windows node 
> crashed, and in the logs I can see the error:
> I0714 07:20:09.788785  5640 containerizer.cpp:781] Starting container 
> '031878d5-32fa-41ed-8b23-d0d91fe34f05' for executor 
> 'windemo.10cc3e54-49ce-11e6-a2a2-08002786cbbf' of framework 
> '5c83c39f-75a0-4f38-9e47-633767b47976-'
> F0714 07:20:09.797576  5480 slave.cpp:6174] 
> CHECK_SOME(state::checkpoint(path, t)): Failed to create directory 
> 'E:\agentlogs\meta\slaves\803264d5-8f2d-46bb-8019-de0f9565c971-S5\frameworks\5c83c39f-75a0-4f38-9e47-633767b47976-\executors\windemo.10cc3e54-49ce-11e6-a2a2-08002786cbbf\runs\031878d5-32fa-41ed-8b23-d0d91fe34f05\tasks\windemo.10cc3e54-49ce-11e6-a2a2-08002786cbbf':
>  No such file or directory
> We debugged the issue and found that the file name reached the maximum file 
> path length: 
> E:\agentlogs\meta\slaves\803264d5-8f2d-46bb-8019-de0f9565c971-S5\frameworks\5c83c39f-75a0-4f38-9e47-633767b47976-\executors\windemo.10cc3e54-49ce-11e6-a2a2-08002786cbbf\runs\031878d5-32fa-41ed-8b23-d0d91fe34f05\tasks\windemo.10cc3e54-49ce-11e6-a2a2-08002786cbbf
> I think the path length limit on Windows is 260 characters, which was exceeded 
> here and crashed the agent service, while this works fine for Linux Mesos 
> agents; we may have to shorten the output of the current UUID.toString() method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5849) Agent sandboxes on Windows surpass the 260 character path length limit

2016-12-08 Thread Alex Clemmer (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Clemmer updated MESOS-5849:

Priority: Blocker  (was: Major)

> Agent sandboxes on Windows surpass the 260 character path length limit
> --
>
> Key: MESOS-5849
> URL: https://issues.apache.org/jira/browse/MESOS-5849
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
> Environment: Windows Server 2012, Windows Server 2016 RC
>Reporter: Lokendra Malik
>Assignee: Alex Clemmer
>Priority: Blocker
>  Labels: microsoft, tech-debt, windows
> Attachments: Pasted image at 2016_07_14 09_02 PM.png, mesoscrash.jpg
>
>
> When I tried to deploy an application on a mesos-agent (Windows), the moment 
> the application was deployed the Mesos agent service on the Windows node 
> crashed, and in the logs I can see the error:
> I0714 07:20:09.788785  5640 containerizer.cpp:781] Starting container 
> '031878d5-32fa-41ed-8b23-d0d91fe34f05' for executor 
> 'windemo.10cc3e54-49ce-11e6-a2a2-08002786cbbf' of framework 
> '5c83c39f-75a0-4f38-9e47-633767b47976-'
> F0714 07:20:09.797576  5480 slave.cpp:6174] 
> CHECK_SOME(state::checkpoint(path, t)): Failed to create directory 
> 'E:\agentlogs\meta\slaves\803264d5-8f2d-46bb-8019-de0f9565c971-S5\frameworks\5c83c39f-75a0-4f38-9e47-633767b47976-\executors\windemo.10cc3e54-49ce-11e6-a2a2-08002786cbbf\runs\031878d5-32fa-41ed-8b23-d0d91fe34f05\tasks\windemo.10cc3e54-49ce-11e6-a2a2-08002786cbbf':
>  No such file or directory
> We debugged the issue and found that the file name reached the maximum file 
> path length: 
> E:\agentlogs\meta\slaves\803264d5-8f2d-46bb-8019-de0f9565c971-S5\frameworks\5c83c39f-75a0-4f38-9e47-633767b47976-\executors\windemo.10cc3e54-49ce-11e6-a2a2-08002786cbbf\runs\031878d5-32fa-41ed-8b23-d0d91fe34f05\tasks\windemo.10cc3e54-49ce-11e6-a2a2-08002786cbbf
> I think the path length limit on Windows is 260 characters, which was exceeded 
> here and crashed the agent service, while this works fine for Linux Mesos 
> agents; we may have to shorten the output of the current UUID.toString() method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-5849) Agent sandboxes on Windows surpass the 260 character path length limit

2016-12-08 Thread Alex Clemmer (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Clemmer reassigned MESOS-5849:
---

Assignee: Alex Clemmer  (was: Daniel Pravat)

> Agent sandboxes on Windows surpass the 260 character path length limit
> --
>
> Key: MESOS-5849
> URL: https://issues.apache.org/jira/browse/MESOS-5849
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
> Environment: Windows Server 2012, Windows Server 2016 RC
>Reporter: Lokendra Malik
>Assignee: Alex Clemmer
>  Labels: microsoft, tech-debt, windows
> Attachments: Pasted image at 2016_07_14 09_02 PM.png, mesoscrash.jpg
>
>
> When I tried to deploy an application on a mesos-agent (Windows), the moment 
> the application was deployed the Mesos agent service on the Windows node 
> crashed, and in the logs I can see the error:
> I0714 07:20:09.788785  5640 containerizer.cpp:781] Starting container 
> '031878d5-32fa-41ed-8b23-d0d91fe34f05' for executor 
> 'windemo.10cc3e54-49ce-11e6-a2a2-08002786cbbf' of framework 
> '5c83c39f-75a0-4f38-9e47-633767b47976-'
> F0714 07:20:09.797576  5480 slave.cpp:6174] 
> CHECK_SOME(state::checkpoint(path, t)): Failed to create directory 
> 'E:\agentlogs\meta\slaves\803264d5-8f2d-46bb-8019-de0f9565c971-S5\frameworks\5c83c39f-75a0-4f38-9e47-633767b47976-\executors\windemo.10cc3e54-49ce-11e6-a2a2-08002786cbbf\runs\031878d5-32fa-41ed-8b23-d0d91fe34f05\tasks\windemo.10cc3e54-49ce-11e6-a2a2-08002786cbbf':
>  No such file or directory
> We debugged the issue and found that the file name reached the maximum file 
> path length: 
> E:\agentlogs\meta\slaves\803264d5-8f2d-46bb-8019-de0f9565c971-S5\frameworks\5c83c39f-75a0-4f38-9e47-633767b47976-\executors\windemo.10cc3e54-49ce-11e6-a2a2-08002786cbbf\runs\031878d5-32fa-41ed-8b23-d0d91fe34f05\tasks\windemo.10cc3e54-49ce-11e6-a2a2-08002786cbbf
> I think the path length limit on Windows is 260 characters, which was exceeded 
> here and crashed the agent service, while this works fine for Linux Mesos 
> agents; we may have to shorten the output of the current UUID.toString() method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6762) Update release notes for multi-role changes

2016-12-08 Thread Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-6762:
---

 Summary: Update release notes for multi-role changes
 Key: MESOS-6762
 URL: https://issues.apache.org/jira/browse/MESOS-6762
 Project: Mesos
  Issue Type: Task
Reporter: Benjamin Bannier


When adding support for multi-role frameworks we should call out a number of 
potential issues in the changelog/release notes.

This ticket collects potential pitfalls.

h6. Changes in master and agent endpoints

When rendering the {{FrameworkInfo}} of multi-role enabled frameworks in master 
or agent endpoints, the {{role}} field will not be supported anymore; the 
{{roles}} field should be used instead. Any tooling that parses endpoint 
information and relies on the {{role}} field needs to be updated before 
multi-role enabled frameworks can be run in the cluster.
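
An illustrative before/after of the endpoint rendering (hypothetical excerpts, 
not exact {{/state}} output):

{code}
// Before: single-role framework.
{ "frameworks": [ { "id": "...", "role": "analytics" } ] }

// After: multi-role capable framework; tooling must read "roles" instead.
{ "frameworks": [ { "id": "...", "roles": ["analytics", "batch"] } ] }
{code}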



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6717) Add Windows support to agent test harness

2016-12-08 Thread Alex Clemmer (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Clemmer updated MESOS-6717:

Priority: Blocker  (was: Major)

> Add Windows support to agent test harness
> -
>
> Key: MESOS-6717
> URL: https://issues.apache.org/jira/browse/MESOS-6717
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Alex Clemmer
>Assignee: Alex Clemmer
>Priority: Blocker
>  Labels: microsoft, windows-mvp
>
> Of particular interest in `src/tests/CMakeLists.txt` is supporting enough of 
> the following that we can successfully run the agent tests:
> TEST_HELPER_SRC
> MESOS_TESTS_UTILS_SRC



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6761) Implement `os::user` on Windows

2016-12-08 Thread Alex Clemmer (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Clemmer updated MESOS-6761:

Priority: Blocker  (was: Critical)

> Implement `os::user` on Windows
> ---
>
> Key: MESOS-6761
> URL: https://issues.apache.org/jira/browse/MESOS-6761
> Project: Mesos
>  Issue Type: Bug
>  Components: stout
>Reporter: Alex Clemmer
>Assignee: Daniel Pravat
>Priority: Blocker
>  Labels: microsoft, stout
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6761) Implement `os::user` on Windows

2016-12-08 Thread Alex Clemmer (JIRA)
Alex Clemmer created MESOS-6761:
---

 Summary: Implement `os::user` on Windows
 Key: MESOS-6761
 URL: https://issues.apache.org/jira/browse/MESOS-6761
 Project: Mesos
  Issue Type: Bug
  Components: stout
Reporter: Alex Clemmer
Assignee: Daniel Pravat






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6698) Port `command_executor_tests.cpp`

2016-12-08 Thread Alex Clemmer (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Clemmer updated MESOS-6698:

Shepherd: Joseph Wu

> Port `command_executor_tests.cpp`
> -
>
> Key: MESOS-6698
> URL: https://issues.apache.org/jira/browse/MESOS-6698
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Alex Clemmer
>Assignee: Alex Clemmer
>  Labels: microsoft, windows-mvp
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6701) Port `recordio_tests.cpp`

2016-12-08 Thread Alex Clemmer (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Clemmer updated MESOS-6701:

Shepherd: Joseph Wu

> Port `recordio_tests.cpp`
> -
>
> Key: MESOS-6701
> URL: https://issues.apache.org/jira/browse/MESOS-6701
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Alex Clemmer
>Assignee: Alex Clemmer
>  Labels: microsoft, windows-mvp
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6676) Always re-link with scheduler during re-registration

2016-12-08 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-6676:
---
Target Version/s: 1.1.1, 1.2.0, 1.0.3

> Always re-link with scheduler during re-registration
> 
>
> Key: MESOS-6676
> URL: https://issues.apache.org/jira/browse/MESOS-6676
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Neil Conway
>Assignee: Neil Conway
>  Labels: mesosphere
>
> Scenario:
> # Framework registers with master using a non-zero {{failover_timeout}} and 
> is assigned a FrameworkID.
> # The master sees an {{ExitedEvent}} for the master->scheduler link. This 
> could happen due to some transient network error, e.g., 1-way partition. The 
> master sends a {{FrameworkErrorMessage}} to the framework. The master marks 
> the framework as disconnected, but keeps the {{Framework*}} for it around in 
> {{frameworks.registered}}.
> # The framework doesn't receive the {{FrameworkErrorMessage}} because it is 
> dropped by the network.
> # The scheduler might receive an {{ExitedEvent}} for the scheduler -> master 
> link, but it ignores this anyway (see MESOS-887).
> # The scheduler sees a new-master-detected event and re-registers with the 
> master. It does _not_ set the {{force}} flag. This means we follow [this 
> code 
> path|https://github.com/apache/mesos/blob/a6bab9015cd63121081495b8291635f386b95a92/src/master/master.cpp#L2771]
>  in the master, which does _not_ relink with the scheduler.
> The result is that scheduler re-registration succeeds, but the master -> 
> scheduler link is never re-established.
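
A hedged sketch of the likely fix (illustrative pseudo-libprocess; whether a 
forced-reconnect variant of {{link}} exists is an assumption here): on 
re-registration the master would re-establish the link unconditionally rather 
than only on the {{force}} path:

{code}
// On framework re-registration, drop any stale persistent socket to the
// scheduler and create a fresh one, regardless of the `force` flag.
link(framework->pid, RemoteConnection::RECONNECT);  // hypothetical overload
{code}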



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1791) Introduce Master / Offer Resource Reservations aka Quota.

2016-12-08 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-1791:
--
Fix Version/s: 1.0.0  (was: 1.2.0)

> Introduce Master / Offer Resource Reservations aka Quota.
> -
>
> Key: MESOS-1791
> URL: https://issues.apache.org/jira/browse/MESOS-1791
> Project: Mesos
>  Issue Type: Epic
>  Components: allocation, master, replicated log
>Reporter: Tom Arnfeld
>Assignee: Alexander Rukletsov
>  Labels: mesosphere
> Fix For: 1.0.0
>
>
> Currently Mesos supports the ability to reserve resources (for a given role) 
> on a per-slave basis, as introduced in MESOS-505. This allows you to almost 
> statically partition off a set of resources on a set of machines, to 
> guarantee certain types of frameworks get some resources.
> This is very useful, though it is also very useful to be able to control 
> these reservations through the master (instead of per-slave) for when I don't 
> care which nodes I get on, as long as I get X cpu and Y RAM, or Z sets of 
> (X,Y).
> I'm not sure what structure this could take, but apparently it has already 
> been discussed. Would this be a CLI flag? Could there be a (authenticated) 
> web interface to control these reservations?
> Follow up epic: MESOS-6514.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6760) Make the scheduler heartbeat interval configurable

2016-12-08 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-6760:
-

 Summary: Make the scheduler heartbeat interval configurable
 Key: MESOS-6760
 URL: https://issues.apache.org/jira/browse/MESOS-6760
 Project: Mesos
  Issue Type: Improvement
Reporter: Anand Mazumdar


Currently, the heartbeats sent by the master to the scheduler are hard-coded to 
a constant default of 15 seconds. We should think about making this value 
configurable, either as a master flag or by letting the scheduler pick an 
appropriate value via the {{Subscribe}} call. 

This might be useful for clusters where the default value is too frequent, or 
where an even smaller value is wanted (a rarer case).
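
A hedged sketch of the master-flag variant using stout's flags API (the flag 
name, member, and default below are illustrative, not final):

{code}
// In master::Flags (hypothetical flag):
Duration scheduler_heartbeat_interval;

// In the Flags constructor:
add(&Flags::scheduler_heartbeat_interval,
    "scheduler_heartbeat_interval",
    "Interval between master->scheduler heartbeat events.",
    Seconds(15));  // Today's hard-coded default.
{code}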



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6759) IOSwitchboardServerTest.AttachOutput has CHECK failure if run it multiple time.

2016-12-08 Thread Jie Yu (JIRA)
Jie Yu created MESOS-6759:
-

 Summary: IOSwitchboardServerTest.AttachOutput has CHECK failure if 
run it multiple time.
 Key: MESOS-6759
 URL: https://issues.apache.org/jira/browse/MESOS-6759
 Project: Mesos
  Issue Type: Bug
Reporter: Jie Yu


I can easily repro this issue on my dev CentOS 7 box with the following command:
{noformat}
GLOG_v=1 bin/mesos-tests.sh --gtest_filter=IOSwitchboardServerTest.AttachOutput 
--verbose --gtest_repeat=2

[==] Running 1 test from 1 test case.
[--] Global test environment set-up.
[--] 1 test from IOSwitchboardServerTest
[ RUN  ] IOSwitchboardServerTest.AttachOutput
I1208 10:46:31.574084 41813 poll_socket.cpp:209] Socket error while sending: 
Broken pipe
/home/jie/workspace/mesos/src/tests/containerizer/io_switchboard_tests.cpp:265: 
Failure
(response).failure(): Disconnected
/home/jie/workspace/mesos/src/tests/containerizer/io_switchboard_tests.cpp:266: 
Failure
(response).failure(): Disconnected
F1208 10:46:31.574919 41751 future.hpp:1137] Check failed: !isFailed() 
Future::get() but state == FAILED: Disconnected
*** Check failure stack trace: ***
@ 0x7fc3f35a633a  google::LogMessage::Fail()
@ 0x7fc3f35a6299  google::LogMessage::SendToLog()
@ 0x7fc3f35a5caa  google::LogMessage::Flush()
@ 0x7fc3f35a89de  google::LogMessageFatal::~LogMessageFatal()
@   0xb6a352  process::Future<>::get()
@  0x1a050fe  
mesos::internal::tests::IOSwitchboardServerTest_AttachOutput_Test::TestBody()
@  0x1c54ce2  
testing::internal::HandleSehExceptionsInMethodIfSupported<>()
@  0x1c4fe00  
testing::internal::HandleExceptionsInMethodIfSupported<>()
@  0x1c31491  testing::Test::Run()
@  0x1c31c14  testing::TestInfo::Run()
@  0x1c3225a  testing::TestCase::Run()
@  0x1c38b34  testing::internal::UnitTestImpl::RunAllTests()
@  0x1c55907  
testing::internal::HandleSehExceptionsInMethodIfSupported<>()
@  0x1c50948  
testing::internal::HandleExceptionsInMethodIfSupported<>()
@  0x1c3787a  testing::UnitTest::Run()
@  0x11cc653  RUN_ALL_TESTS()
@  0x11cc209  main
@ 0x7fc3ecb61b15  __libc_start_main
@   0xab5e89  (unknown)
Aborted (core dumped)
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6759) IOSwitchboardServerTest.AttachOutput has CHECK failure if run it multiple times.

2016-12-08 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-6759:
--
Summary: IOSwitchboardServerTest.AttachOutput has CHECK failure if run it 
multiple times.  (was: IOSwitchboardServerTest.AttachOutput has CHECK failure 
if run it multiple time.)

> IOSwitchboardServerTest.AttachOutput has CHECK failure if run it multiple 
> times.
> 
>
> Key: MESOS-6759
> URL: https://issues.apache.org/jira/browse/MESOS-6759
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jie Yu
>
> I can easily repro this issue on my dev CentOS 7 box with the following command:
> {noformat}
> GLOG_v=1 bin/mesos-tests.sh 
> --gtest_filter=IOSwitchboardServerTest.AttachOutput --verbose --gtest_repeat=2
> 
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from IOSwitchboardServerTest
> [ RUN  ] IOSwitchboardServerTest.AttachOutput
> I1208 10:46:31.574084 41813 poll_socket.cpp:209] Socket error while sending: 
> Broken pipe
> /home/jie/workspace/mesos/src/tests/containerizer/io_switchboard_tests.cpp:265:
>  Failure
> (response).failure(): Disconnected
> /home/jie/workspace/mesos/src/tests/containerizer/io_switchboard_tests.cpp:266:
>  Failure
> (response).failure(): Disconnected
> F1208 10:46:31.574919 41751 future.hpp:1137] Check failed: !isFailed() 
> Future::get() but state == FAILED: Disconnected
> *** Check failure stack trace: ***
> @ 0x7fc3f35a633a  google::LogMessage::Fail()
> @ 0x7fc3f35a6299  google::LogMessage::SendToLog()
> @ 0x7fc3f35a5caa  google::LogMessage::Flush()
> @ 0x7fc3f35a89de  google::LogMessageFatal::~LogMessageFatal()
> @   0xb6a352  process::Future<>::get()
> @  0x1a050fe  
> mesos::internal::tests::IOSwitchboardServerTest_AttachOutput_Test::TestBody()
> @  0x1c54ce2  
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @  0x1c4fe00  
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @  0x1c31491  testing::Test::Run()
> @  0x1c31c14  testing::TestInfo::Run()
> @  0x1c3225a  testing::TestCase::Run()
> @  0x1c38b34  testing::internal::UnitTestImpl::RunAllTests()
> @  0x1c55907  
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @  0x1c50948  
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @  0x1c3787a  testing::UnitTest::Run()
> @  0x11cc653  RUN_ALL_TESTS()
> @  0x11cc209  main
> @ 0x7fc3ecb61b15  __libc_start_main
> @   0xab5e89  (unknown)
> Aborted (core dumped)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5902) CMake should generate protobuf definitions for Java

2016-12-08 Thread Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15732977#comment-15732977
 ] 

Srinivas commented on MESOS-5902:
-

Posted 2 of the 9 intended patches for review:
https://reviews.apache.org/r/50414
https://reviews.apache.org/r/50415

> CMake should generate protobuf definitions for Java
> ---
>
> Key: MESOS-5902
> URL: https://issues.apache.org/jira/browse/MESOS-5902
> Project: Mesos
>  Issue Type: Improvement
>  Components: build
> Environment: CMake
>Reporter: Srinivas
>Assignee: Srinivas
>  Labels: microsoft
>
> Currently the Java protobuf bindings require the Java protobuf library to 
> generate and compile the sources. We should build protobuf-java-2.6.1.jar from 
> the protobuf sources, just as we build the Mesos protobufs for C++.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (MESOS-6476) Build a Mock HTTP Server that implements the new Debugging API calls

2016-12-08 Thread Steven Locke (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Locke updated MESOS-6476:

Comment: was deleted

(was: Work is taking place in this repo.)

> Build a Mock HTTP Server that implements the new Debugging API calls
> 
>
> Key: MESOS-6476
> URL: https://issues.apache.org/jira/browse/MESOS-6476
> Project: Mesos
>  Issue Type: Task
>Reporter: Kevin Klues
>Assignee: Steven Locke
>  Labels: debugging, mesosphere
>
> The mock server should simply launch a process to run whatever command is 
> passed to it, rather than attempt to launch an actual nested container in 
> mesos. However, it should do everything necessary to deal with attaching a 
> {{pty}}  / redirecting {{stdin/stdout/stderr}} properly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6741) Authorize v1 SET_LOGGING_LEVEL call

2016-12-08 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-6741:
--
Sprint: Mesosphere Sprint 48

> Authorize v1 SET_LOGGING_LEVEL call
> ---
>
> Key: MESOS-6741
> URL: https://issues.apache.org/jira/browse/MESOS-6741
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, security
>Reporter: Adam B
>Assignee: Alexander Rojas
>Priority: Minor
>  Labels: security
>
> We need to add authz to this call to prevent unauthorized users from cranking 
> the log level way up to take down an agent/master.
> In the v0 API, we protected the /logging/toggle endpoint with a 
> "coarse-grained" GET_ENDPOINT_WITH_PATH ACL, but that cannot be reused 
> (directly) in the v1 API.
> We could add an analogous coarse-grained V1_CALL_WITH_ACTION ACL, but we're 
> probably better off just adding a trivial SET_LOG_LEVEL Authorization::Action 
> and ACL.
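
A hedged sketch of what the fine-grained check could look like in the call 
handler, assuming the proposed new {{SET_LOG_LEVEL}} action exists (names 
illustrative):

{code}
// Build an authorization request for the proposed action and consult the
// configured authorizer before applying the new log level.
authorization::Request request;
request.set_action(authorization::SET_LOG_LEVEL);  // proposed new action

if (principal.isSome()) {
  request.mutable_subject()->set_value(principal.get());
}

process::Future<bool> approved = authorizer.get()->authorized(request);
{code}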



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6740) Authorize v1 GET_FLAGS call

2016-12-08 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-6740:
--
Sprint: Mesosphere Sprint 48

> Authorize v1 GET_FLAGS call
> ---
>
> Key: MESOS-6740
> URL: https://issues.apache.org/jira/browse/MESOS-6740
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, security
>Reporter: Adam B
>Assignee: Alexander Rojas
>  Labels: security
>
> We already have a VIEW_FLAGS ACL that we use for /flags and the flags part of 
> /state. Let's add authz to the v1 GET_FLAGS API call (on agent and master) 
> and reuse that ACL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6504) Use 'geteuid()' for the root privileges check.

2016-12-08 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-6504:
--
Sprint: Mesosphere Sprint 47, Mesosphere Sprint 48  (was: Mesosphere Sprint 
47)

> Use 'geteuid()' for the root privileges check.
> --
>
> Key: MESOS-6504
> URL: https://issues.apache.org/jira/browse/MESOS-6504
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>  Labels: backend, isolator, mesosphere, user
>
> Currently, parts of the code in Mesos check for root privileges by comparing 
> os::user() to "root", which is not sufficient, since that compares the real 
> user. When the mesos binary is installed with 'setuid root', the real user is 
> unchanged, so the check fails even though the process has the privileges it 
> needs to execute.
> We should check the effective user id instead in our code. 
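> A sketch of the proposed check (illustrative, not the actual patch):
> {code}
> #include <unistd.h>
>
> // geteuid() reflects setuid bits; comparing os::user() (the real
> // user) to "root" does not.
> bool isRoot()
> {
>   return geteuid() == 0;
> }
> {code}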



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6665) io::redirect might cause stack overflow.

2016-12-08 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-6665:
--
Sprint: Mesosphere Sprint 48

> io::redirect might cause stack overflow.
> 
>
> Key: MESOS-6665
> URL: https://issues.apache.org/jira/browse/MESOS-6665
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jie Yu
>Assignee: Benjamin Hindman
>
> Can reproduce this on macOS sierra:
> {noformat}
> [--] 6 tests from IOTest
> [ RUN  ] IOTest.Poll
> [   OK ] IOTest.Poll (0 ms)
> [ RUN  ] IOTest.Read
> [   OK ] IOTest.Read (3 ms)
> [ RUN  ] IOTest.BufferedRead
> [   OK ] IOTest.BufferedRead (5 ms)
> [ RUN  ] IOTest.Write
> [   OK ] IOTest.Write (1 ms)
> [ RUN  ] IOTest.Redirect
> make[6]: *** [check-local] Illegal instruction: 4
> make[5]: *** [check-am] Error 2
> make[4]: *** [check-recursive] Error 1
> make[3]: *** [check] Error 2
> make[2]: *** [check-recursive] Error 1
> make[1]: *** [check] Error 2
> make: *** [check-recursive] Error 1
> (reverse-i-search)`k': make check -j3
> Jies-MacBook-Pro:build jie$ lldb 3rdparty/libprocess/libprocess-tests
> (lldb) target create "3rdparty/libprocess/libprocess-tests"
> Current executable set to '3rdparty/libprocess/libprocess-tests' (x86_64).
> (lldb) run --gtest_filter=IOTest.Redirect
> Process 26064 launched: 
> '/Users/jie/workspace/dist/mesos/build/3rdparty/libprocess/libprocess-tests' 
> (x86_64)
> Note: Google Test filter = IOTest.Redirect
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from IOTest
> [ RUN  ] IOTest.Redirect
> Process 26064 stopped
> * thread #2: tid = 0x152c5c, 0x7fffd6d463e0 
> libsystem_malloc.dylib`szone_malloc_should_clear + 78, stop reason = 
> EXC_BAD_ACCESS (code=2, address=0x7eb16ff8)
> frame #0: 0x7fffd6d463e0 
> libsystem_malloc.dylib`szone_malloc_should_clear + 78
> libsystem_malloc.dylib`szone_malloc_should_clear:
> ->  0x7fffd6d463e0 <+78>: movq   %rax, -0x78(%rbp)
> 0x7fffd6d463e4 <+82>: movq   0x10f0(%r12), %r13
> 0x7fffd6d463ec <+90>: leaq   (%rax,%rax,4), %r14
> 0x7fffd6d463f0 <+94>: shlq   $0x9, %r14
> (lldb) bt
> .
> frame #2794: 0x7fffd6ddb221 libsystem_pthread.dylib`thread_start + 13
> {noformat}
> Changing the test to redirect just 1KB of data hides the issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6664) Force cleanup of IOSwitchboard server if it does not terminate after the container terminates.

2016-12-08 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-6664:
--
Sprint: Mesosphere Sprint 48

> Force cleanup of IOSwitchboard server if it does not terminate after the 
> container terminates.
> --
>
> Key: MESOS-6664
> URL: https://issues.apache.org/jira/browse/MESOS-6664
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Kevin Klues
>
> In the normal case, the IOSwitchboard server terminates after the container 
> terminates. However, we should be more defensive and always clean up the 
> IOSwitchboard server if it does not terminate within a reasonable grace 
> period. 
> The reason for the grace period is to allow the IOSwitchboard server to 
> finish redirecting the stdout/stderr to the logger.
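> An illustrative sketch of the intended behavior (plain POSIX, not the actual 
> libprocess code; names are hypothetical):
> {code}
> #include <csignal>
> #include <sys/wait.h>
> #include <unistd.h>
>
> // Wait for the switchboard process to exit on its own so it can
> // finish redirecting stdout/stderr, then force-kill it once the
> // grace period expires.
> void reapWithGracePeriod(pid_t switchboard, int gracePeriodSecs)
> {
>   for (int elapsed = 0; elapsed < gracePeriodSecs; ++elapsed) {
>     if (waitpid(switchboard, nullptr, WNOHANG) == switchboard) {
>       return;  // Exited within the grace period.
>     }
>     sleep(1);
>   }
>
>   kill(switchboard, SIGKILL);          // Force cleanup.
>   waitpid(switchboard, nullptr, 0);    // Reap the killed process.
> }
> {code}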



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6475) Mesos Container Attach/Exec Unit Tests

2016-12-08 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-6475:
--
Sprint: Mesosphere Sprint 48

> Mesos Container Attach/Exec Unit Tests
> --
>
> Key: MESOS-6475
> URL: https://issues.apache.org/jira/browse/MESOS-6475
> Project: Mesos
>  Issue Type: Task
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>  Labels: debugging, mesosphere
>
> Ideally, all unit tests should be written as the individual tasks that make 
> up this Epic are completed. However, sometimes this doesn't happen as 
> planned. 
> This ticket should not be closed and the Epic should not be considered 
> complete until all unit tests for all components have been written.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5966) Add libprocess HTTP tests with SSL support

2016-12-08 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-5966:
--
Sprint: Mesosphere Sprint 40, Mesosphere Sprint 41, Mesosphere Sprint 42, 
Mesosphere Sprint 44, Mesosphere Sprint 45, Mesosphere Sprint 46, Mesosphere 
Sprint 47, Mesosphere Sprint 48  (was: Mesosphere Sprint 40, Mesosphere Sprint 
41, Mesosphere Sprint 42, Mesosphere Sprint 44, Mesosphere Sprint 45, 
Mesosphere Sprint 46, Mesosphere Sprint 47)

> Add libprocess HTTP tests with SSL support
> --
>
> Key: MESOS-5966
> URL: https://issues.apache.org/jira/browse/MESOS-5966
> Project: Mesos
>  Issue Type: Task
>Reporter: Greg Mann
>Assignee: Greg Mann
>  Labels: mesosphere
>
> Libprocess contains SSL unit tests which test our SSL support using simple 
> sockets. We should add tests which also make use of libprocess's various HTTP 
> classes and helpers in a variety of SSL configurations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6476) Build a Mock HTTP Server that implements the new Debugging API calls

2016-12-08 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-6476:
--
Sprint: Mesosphere Sprint 46, Mesosphere Sprint 47, Mesosphere Sprint 48  
(was: Mesosphere Sprint 46, Mesosphere Sprint 47)

> Build a Mock HTTP Server that implements the new Debugging API calls
> 
>
> Key: MESOS-6476
> URL: https://issues.apache.org/jira/browse/MESOS-6476
> Project: Mesos
>  Issue Type: Task
>Reporter: Kevin Klues
>Assignee: Steven Locke
>  Labels: debugging, mesosphere
>
> The mock server should simply launch a process to run whatever command is 
> passed to it, rather than attempt to launch an actual nested container in 
> Mesos. However, it should do everything necessary to deal with attaching a 
> {{pty}} / redirecting {{stdin/stdout/stderr}} properly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6184) Health checks should use a general mechanism to enter namespaces of the task.

2016-12-08 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-6184:
--
Sprint: Mesosphere Sprint 44, Mesosphere Sprint 46, Mesosphere Sprint 47, 
Mesosphere Sprint 48  (was: Mesosphere Sprint 44, Mesosphere Sprint 46, 
Mesosphere Sprint 47)

> Health checks should use a general mechanism to enter namespaces of the task.
> -
>
> Key: MESOS-6184
> URL: https://issues.apache.org/jira/browse/MESOS-6184
> Project: Mesos
>  Issue Type: Improvement
>Reporter: haosdent
>Assignee: haosdent
>Priority: Blocker
>  Labels: health-check, mesosphere
>
> To perform health checks for tasks, we need to enter the corresponding 
> namespaces of the container. For now, the health check uses a custom clone 
> to implement this:
> {code}
>   return process::defaultClone([=]() -> int {
> if (taskPid.isSome()) {
>   foreach (const string& ns, namespaces) {
> Try<Nothing> setns = ns::setns(taskPid.get(), ns);
> if (setns.isError()) {
>   ...
> }
>   }
> }
> return func();
>   });
> {code}
> After the childHooks patches are merged, we could change the health check to 
> use childHooks to call {{setns}} and make {{process::defaultClone}} private 
> again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3753) Test the HTTP Scheduler library with SSL enabled

2016-12-08 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-3753:
--
Sprint: Mesosphere Sprint 39, Mesosphere Sprint 40, Mesosphere Sprint 41, 
Mesosphere Sprint 42, Mesosphere Sprint 44, Mesosphere Sprint 45, Mesosphere 
Sprint 46, Mesosphere Sprint 47, Mesosphere Sprint 48  (was: Mesosphere Sprint 
39, Mesosphere Sprint 40, Mesosphere Sprint 41, Mesosphere Sprint 42, 
Mesosphere Sprint 44, Mesosphere Sprint 45, Mesosphere Sprint 46, Mesosphere 
Sprint 47)

> Test the HTTP Scheduler library with SSL enabled
> 
>
> Key: MESOS-3753
> URL: https://issues.apache.org/jira/browse/MESOS-3753
> Project: Mesos
>  Issue Type: Story
>  Components: framework, HTTP API, test
>Reporter: Joseph Wu
>Assignee: Greg Mann
>  Labels: mesosphere, security
>
> Currently, the HTTP Scheduler library does not support SSL-enabled Mesos. 
> (You can manually test this by spinning up an SSL-enabled master and 
> attempting to run the event-call framework example against it.)
> We need to add tests that check the HTTP Scheduler library against 
> SSL-enabled Mesos:
> * with downgrade support,
> * with required framework/client-side certificates,
> * with/without verification of certificates (master-side),
> * with/without verification of certificates (framework-side),
> * with a custom certificate authority (CA)
> These options should be controlled by the same environment variables found on 
> the [SSL user doc|http://mesos.apache.org/documentation/latest/ssl/].
> Note: This issue will be broken down into smaller sub-issues as bugs/problems 
> are discovered.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5931) Support auto backend in Unified Containerizer.

2016-12-08 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-5931:
--
Sprint: Mesosphere Sprint 41, Mesosphere Sprint 42, Mesosphere Sprint 47, 
Mesosphere Sprint 48  (was: Mesosphere Sprint 41, Mesosphere Sprint 42, 
Mesosphere Sprint 47)

> Support auto backend in Unified Containerizer.
> --
>
> Key: MESOS-5931
> URL: https://issues.apache.org/jira/browse/MESOS-5931
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>  Labels: backend, containerizer, mesosphere
>
> Currently in the Unified Containerizer, the copy backend is selected by 
> default. This is not ideal, especially for production environments: it can 
> take a long time to copy a huge container image from the store to the 
> provisioner.
> Ideally, we should support an `auto backend`, which would automatically 
> select the best backend for the image provisioner if the user does not 
> specify one via the agent flag.
> We should have a logic design first in this ticket, to determine how we want 
> to choose the right backend (e.g., overlayfs or aufs should be preferred if 
> available from the kernel).
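> One possible selection policy, as an illustrative sketch only (the actual 
> logic design is what this ticket should settle):
> {code}
> #include <fstream>
> #include <string>
>
> // Prefer overlayfs, then aufs, falling back to the copy backend.
> std::string autoSelectBackend()
> {
>   std::ifstream filesystems("/proc/filesystems");
>   std::string line;
>   bool overlay = false;
>   bool aufs = false;
>
>   while (std::getline(filesystems, line)) {
>     if (line.find("overlay") != std::string::npos) { overlay = true; }
>     if (line.find("aufs") != std::string::npos) { aufs = true; }
>   }
>
>   if (overlay) { return "overlay"; }
>   if (aufs) { return "aufs"; }
>   return "copy";  // Always available, but slow for huge images.
> }
> {code}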



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6335) Add user doc for task group tasks

2016-12-08 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-6335:
--
Sprint: Mesosphere Sprint 44, Mesosphere Sprint 45, Mesosphere Sprint 46, 
Mesosphere Sprint 47, Mesosphere Sprint 48  (was: Mesosphere Sprint 44, 
Mesosphere Sprint 45, Mesosphere Sprint 46, Mesosphere Sprint 47)

> Add user doc for task group tasks
> -
>
> Key: MESOS-6335
> URL: https://issues.apache.org/jira/browse/MESOS-6335
> Project: Mesos
>  Issue Type: Documentation
>Reporter: Vinod Kone
>Assignee: Gilbert Song
> Fix For: 1.2.0
>
>
> Committed some basic documentation. So moving this to pods-improvements epic 
> and targeting this for 1.2.0. I would like this to track the more 
> comprehensive documentation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6477) Build a standalone python client for connecting to our Mock HTTP Server that implements the new Debug APIs

2016-12-08 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-6477:
--
Sprint: Mesosphere Sprint 46, Mesosphere Sprint 47, Mesosphere Sprint 48  
(was: Mesosphere Sprint 46, Mesosphere Sprint 47)

> Build a standalone python client for connecting to our Mock HTTP Server that 
> implements the new Debug APIs
> --
>
> Key: MESOS-6477
> URL: https://issues.apache.org/jira/browse/MESOS-6477
> Project: Mesos
>  Issue Type: Task
>Reporter: Kevin Klues
>Assignee: Steven Locke
>  Labels: debugging, mesosphere
>
> This client prototype should have a similar CLI to what we eventually want to 
> build into the Mesos or DC/OS CLI.
> {noformat}
> Streaming HTTP Client
> Usage:
>   client task exec [--tty] [--interactive] <command> [<args>...]
>   client task attach [--tty] [--interactive] <container-id>
> Options:
>   --tty  Allocate a tty on the server before
>  attaching to the container.
>   --interactive  Connect the stdin of the client to
>  the stdin of the container.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6001) Aufs backend cannot support the image with numerous layers.

2016-12-08 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-6001:
--
Sprint: Mesosphere Sprint 47, Mesosphere Sprint 48  (was: Mesosphere Sprint 
47)

> Aufs backend cannot support the image with numerous layers.
> ---
>
> Key: MESOS-6001
> URL: https://issues.apache.org/jira/browse/MESOS-6001
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 14, Ubuntu 12
> Or any other os with aufs module
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>  Labels: aufs, backend, containerizer
>
> This issue was exposed in the unit test 
> `ROOT_CURL_INTERNET_DockerDefaultEntryptRegistryPuller` by manually 
> specifying the `bind` backend. Most likely, mounting aufs with its 
> layer-specific options is limited by the mount option string length.
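> For illustration, why the layer count matters (a sketch, assuming the aufs 
> branch options are built roughly like this; the failure log follows below):
> {code}
> #include <string>
> #include <vector>
>
> // Every layer appends another "=ro:" branch to the aufs mount data,
> // and mount(2) option data is limited to one page (4096 bytes), so a
> // rootfs with enough long store paths overflows the limit.
> std::string aufsOptions(const std::vector<std::string>& layers)
> {
>   std::string options = "br=";
>   for (const std::string& layer : layers) {
>     options += layer + "=ro:";
>   }
>   return options;
>   // ::mount("none", rootfs, "aufs", MS_RDONLY, options.c_str())
>   // fails once `options` exceeds the page-size limit.
> }
> {code}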
> {noformat}
> [20:13:07] :   [Step 10/10] [ RUN  ] 
> DockerRuntimeIsolatorTest.ROOT_CURL_INTERNET_DockerDefaultEntryptRegistryPuller
> [20:13:07]W:   [Step 10/10] I0805 20:13:07.615844 23416 cluster.cpp:155] 
> Creating default 'local' authorizer
> [20:13:07]W:   [Step 10/10] I0805 20:13:07.624106 23416 leveldb.cpp:174] 
> Opened db in 8.148813ms
> [20:13:07]W:   [Step 10/10] I0805 20:13:07.627252 23416 leveldb.cpp:181] 
> Compacted db in 3.126629ms
> [20:13:07]W:   [Step 10/10] I0805 20:13:07.627275 23416 leveldb.cpp:196] 
> Created db iterator in 4410ns
> [20:13:07]W:   [Step 10/10] I0805 20:13:07.627282 23416 leveldb.cpp:202] 
> Seeked to beginning of db in 763ns
> [20:13:07]W:   [Step 10/10] I0805 20:13:07.627287 23416 leveldb.cpp:271] 
> Iterated through 0 keys in the db in 491ns
> [20:13:07]W:   [Step 10/10] I0805 20:13:07.627301 23416 replica.cpp:776] 
> Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
> [20:13:07]W:   [Step 10/10] I0805 20:13:07.627563 23434 recover.cpp:451] 
> Starting replica recovery
> [20:13:07]W:   [Step 10/10] I0805 20:13:07.627800 23437 recover.cpp:477] 
> Replica is in EMPTY status
> [20:13:07]W:   [Step 10/10] I0805 20:13:07.628113 23431 replica.cpp:673] 
> Replica in EMPTY status received a broadcasted recover request from 
> __req_res__(5852)@172.30.2.138:44256
> [20:13:07]W:   [Step 10/10] I0805 20:13:07.628243 23430 recover.cpp:197] 
> Received a recover response from a replica in EMPTY status
> [20:13:07]W:   [Step 10/10] I0805 20:13:07.628365 23437 recover.cpp:568] 
> Updating replica status to STARTING
> [20:13:07]W:   [Step 10/10] I0805 20:13:07.628744 23432 master.cpp:375] 
> Master dd755a55-0dd1-4d2d-9a49-812a666015cb (ip-172-30-2-138.mesosphere.io) 
> started on 172.30.2.138:44256
> [20:13:07]W:   [Step 10/10] I0805 20:13:07.628758 23432 master.cpp:377] Flags 
> at startup: --acls="" --agent_ping_timeout="15secs" 
> --agent_reregister_timeout="10mins" --allocation_interval="1secs" 
> --allocator="HierarchicalDRF" --authenticate_agents="true" 
> --authenticate_frameworks="true" --authenticate_http_frameworks="true" 
> --authenticate_http_readonly="true" --authenticate_http_readwrite="true" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/OZHDIQ/credentials" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --http_authenticators="basic" 
> --http_framework_authenticators="basic" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="replicated_log" 
> --registry_fetch_timeout="1mins" --registry_store_timeout="100secs" 
> --registry_strict="true" --root_submissions="true" --user_sorter="drf" 
> --version="false" --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/tmp/OZHDIQ/master" --zk_session_timeout="10secs"
> [20:13:07]W:   [Step 10/10] I0805 20:13:07.628893 23432 master.cpp:427] 
> Master only allowing authenticated frameworks to register
> [20:13:07]W:   [Step 10/10] I0805 20:13:07.628900 23432 master.cpp:441] 
> Master only allowing authenticated agents to register
> [20:13:07]W:   [Step 10/10] I0805 20:13:07.628902 23432 master.cpp:454] 
> Master only allowing authenticated HTTP frameworks to register
> [20:13:07]W:   [Step 10/10] I0805 20:13:07.628906 23432 credentials.hpp:37] 
> Loading credentials for authentication from '/tmp/OZHDIQ/credentials'
> [20:13:07]W:   [Step 10/10] I0805 20:13:07.628999 23432 master.cpp:499] Using 
> default 'crammd5' authenticator
> [20:13:07]W:   [Step 10/10] I0805 20:13:07.629041 23432 http.cpp:883] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-readonly'
> [20:13:07]W:   [Step 10/10] I0805 20:13:07.629114 23432 http.cpp:883] Using 
> default 'basic' HTTP authenticator for 

[jira] [Updated] (MESOS-6193) Make the docker/volume isolator nesting aware.

2016-12-08 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-6193:
--
Sprint: Mesosphere Sprint 44, Mesosphere Sprint 45, Mesosphere Sprint 46, 
Mesosphere Sprint 47, Mesosphere Sprint 48  (was: Mesosphere Sprint 44, 
Mesosphere Sprint 45, Mesosphere Sprint 46, Mesosphere Sprint 47)

> Make the docker/volume isolator nesting aware.
> --
>
> Key: MESOS-6193
> URL: https://issues.apache.org/jira/browse/MESOS-6193
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Gilbert Song
>  Labels: isolator, mesosphere
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6653) Overlayfs backend may fail to mount the rootfs if both container image and image volume are specified.

2016-12-08 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-6653:
--
Sprint: Mesosphere Sprint 47, Mesosphere Sprint 48  (was: Mesosphere Sprint 
47)

> Overlayfs backend may fail to mount the rootfs if both container image and 
> image volume are specified.
> --
>
> Key: MESOS-6653
> URL: https://issues.apache.org/jira/browse/MESOS-6653
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>  Labels: backend, containerizer, overlayfs
>
> Building on MESOS-6000, we use a symlink to shorten the overlayfs mount 
> arguments. However, if more than one image needs to be provisioned (e.g., a 
> container image is specified while image volumes are specified for the same 
> container), creating the symlink .../backends/overlay/links fails because it 
> already exists.
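> An illustrative sketch of the failure mode and one possible fix (hypothetical 
> names, not the actual provisioner code):
> {code}
> #include <cerrno>
> #include <string>
> #include <unistd.h>
>
> // Demonstrates the collision: the link path is per-backend, so the
> // second image provisioned for the same container hits EEXIST.
> void provisionLink(const std::string& target, const std::string& linkPath)
> {
>   if (::symlink(target.c_str(), linkPath.c_str()) == -1 &&
>       errno == EEXIST) {
>     // A previous image already created .../backends/overlay/links.
>     // One possible fix (sketch): make the link unique per rootfs,
>     // e.g. .../backends/overlay/links/<rootfs-id>.
>   }
> }
> {code}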
> Here is a simple log when we hard code overlayfs as our default backend:
> {noformat}
> [07:02:45] :   [Step 10/10] [ RUN  ] 
> Nesting/VolumeImageIsolatorTest.ROOT_ImageInVolumeWithRootFilesystem/0
> [07:02:46] :   [Step 10/10] I1127 07:02:46.416021  2919 
> containerizer.cpp:207] Using isolation: 
> filesystem/linux,volume/image,docker/runtime,network/cni
> [07:02:46] :   [Step 10/10] I1127 07:02:46.419312  2919 
> linux_launcher.cpp:150] Using /sys/fs/cgroup/freezer as the freezer hierarchy 
> for the Linux launcher
> [07:02:46] :   [Step 10/10] E1127 07:02:46.425336  2919 shell.hpp:107] 
> Command 'hadoop version 2>&1' failed; this is the output:
> [07:02:46] :   [Step 10/10] sh: 1: hadoop: not found
> [07:02:46] :   [Step 10/10] I1127 07:02:46.425379  2919 fetcher.cpp:69] 
> Skipping URI fetcher plugin 'hadoop' as it could not be created: Failed to 
> create HDFS client: Failed to execute 'hadoop version 2>&1'; the command was 
> either not found or exited with a non-zero exit status: 127
> [07:02:46] :   [Step 10/10] I1127 07:02:46.425452  2919 local_puller.cpp:94] 
> Creating local puller with docker registry '/tmp/R6OUei/registry'
> [07:02:46] :   [Step 10/10] I1127 07:02:46.427258  2934 
> containerizer.cpp:956] Starting container 
> 9af6c98a-d9f7-4c89-a5ed-fc7ae2fa1330 for executor 'test_executor' of 
> framework 
> [07:02:46] :   [Step 10/10] I1127 07:02:46.427592  2938 
> metadata_manager.cpp:167] Looking for image 'test_image_rootfs'
> [07:02:46] :   [Step 10/10] I1127 07:02:46.427774  2936 local_puller.cpp:147] 
> Untarring image 'test_image_rootfs' from 
> '/tmp/R6OUei/registry/test_image_rootfs.tar' to 
> '/tmp/R6OUei/store/staging/9krDz2'
> [07:02:46] :   [Step 10/10] I1127 07:02:46.512070  2933 local_puller.cpp:167] 
> The repositories JSON file for image 'test_image_rootfs' is 
> '{"test_image_rootfs":{"latest":"815b809d588c80fd6ddf4d6ac244ad1c01ae4cbe0f91cc7480e306671ee9c346"}}'
> [07:02:46] :   [Step 10/10] I1127 07:02:46.512279  2933 local_puller.cpp:295] 
> Extracting layer tar ball 
> '/tmp/R6OUei/store/staging/9krDz2/815b809d588c80fd6ddf4d6ac244ad1c01ae4cbe0f91cc7480e306671ee9c346/layer.tar
>  to rootfs 
> '/tmp/R6OUei/store/staging/9krDz2/815b809d588c80fd6ddf4d6ac244ad1c01ae4cbe0f91cc7480e306671ee9c346/rootfs'
> [07:02:46] :   [Step 10/10] I1127 07:02:46.617442  2937 
> metadata_manager.cpp:155] Successfully cached image 'test_image_rootfs'
> [07:02:46] :   [Step 10/10] I1127 07:02:46.617908  2938 provisioner.cpp:286] 
> Image layers: 1
> [07:02:46] :   [Step 10/10] I1127 07:02:46.617925  2938 provisioner.cpp:296] 
> Should hit here
> [07:02:46] :   [Step 10/10] I1127 07:02:46.617949  2938 provisioner.cpp:315] 
> : bind
> [07:02:46] :   [Step 10/10] I1127 07:02:46.617959  2938 provisioner.cpp:315] 
> : overlay
> [07:02:46] :   [Step 10/10] I1127 07:02:46.617967  2938 provisioner.cpp:315] 
> : copy
> [07:02:46] :   [Step 10/10] I1127 07:02:46.617974  2938 provisioner.cpp:318] 
> Provisioning image rootfs 
> '/mnt/teamcity/temp/buildTmp/Nesting_VolumeImageIsolatorTest_ROOT_ImageInVolumeWithRootFilesystem_0_1fMo0c/provisioner/containers/9af6c98a-d9f7-4c89-a5ed-fc7ae2fa1330/backends/overlay/rootfses/c71e83d2-5dbe-4eb7-a2fc-b8cc826771f7'
>  for container 9af6c98a-d9f7-4c89-a5ed-fc7ae2fa1330 using overlay backend
> [07:02:46] :   [Step 10/10] I1127 07:02:46.618408  2936 overlay.cpp:175] 
> Created symlink 
> '/mnt/teamcity/temp/buildTmp/Nesting_VolumeImageIsolatorTest_ROOT_ImageInVolumeWithRootFilesystem_0_1fMo0c/provisioner/containers/9af6c98a-d9f7-4c89-a5ed-fc7ae2fa1330/backends/overlay/links'
>  -> '/tmp/DQ3blT'
> [07:02:46] :   [Step 10/10] I1127 07:02:46.618472  2936 overlay.cpp:203] 
> Provisioning image rootfs with overlayfs: 
> 

[jira] [Updated] (MESOS-6739) Authorize v1 GET_CONTAINERS call

2016-12-08 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-6739:
--
Sprint: Mesosphere Sprint 48

> Authorize v1 GET_CONTAINERS call
> 
>
> Key: MESOS-6739
> URL: https://issues.apache.org/jira/browse/MESOS-6739
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, security
>Reporter: Adam B
>Assignee: Alexander Rojas
>Priority: Critical
>  Labels: security
>
> We need some kind of authorization for GET_CONTAINERS.
> a. Coarse-grained like we already did for /containers. With this you could 
> say that Alice can GET_CONTAINERS for any/all containers on the cluster, but 
> Bob cannot see any containers' info.
> b. Fine-grained authz like we have for /state and /tasks. With this you could 
> say that Alice can GET_CONTAINERS and see filtered results where user=alice, 
> but Bob can only see filtered results where user=bob. It would be nice to 
> port this to /containers as well if/when we add it.
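> A purely hypothetical ACL shape for option (b), mirroring the fine-grained 
> view_tasks ACL (nothing like this exists for containers yet):
> {code}
> {
>   "view_containers": [
>     {
>       "principals": { "values": ["alice"] },
>       "users": { "values": ["alice"] }
>     }
>   ]
> }
> {code}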



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6654) Duplicate image layer ids may make the backend failed to mount rootfs.

2016-12-08 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-6654:
--
Sprint: Mesosphere Sprint 47, Mesosphere Sprint 48  (was: Mesosphere Sprint 
47)

> Duplicate image layer ids may make the backend failed to mount rootfs.
> --
>
> Key: MESOS-6654
> URL: https://issues.apache.org/jira/browse/MESOS-6654
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>  Labels: aufs, backend, containerizer
>
> Some images (e.g., 'mesosphere/inky') may contain duplicate layer ids in the 
> manifest, which may leave some backends (e.g., the 'aufs' backend) unable to 
> mount the rootfs. We should make sure that each layer path returned in 
> 'ImageInfo' is unique.
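> An illustrative sketch of the deduplication (not the actual patch):
> {code}
> #include <set>
> #include <string>
> #include <vector>
>
> // Keep each layer path only once while preserving layer order, so
> // backends like aufs can mount the rootfs.
> std::vector<std::string> deduplicate(const std::vector<std::string>& layers)
> {
>   std::vector<std::string> result;
>   std::set<std::string> seen;
>
>   for (const std::string& layer : layers) {
>     if (seen.insert(layer).second) {
>       result.push_back(layer);
>     }
>   }
>
>   return result;
> }
> {code}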
> Here is an example manifest from 'mesosphere/inky':
> {noformat}
> [20:13:08]W:   [Step 10/10]"name": "mesosphere/inky",
> [20:13:08]W:   [Step 10/10]"tag": "latest",
> [20:13:08]W:   [Step 10/10]"architecture": "amd64",
> [20:13:08]W:   [Step 10/10]"fsLayers": [
> [20:13:08]W:   [Step 10/10]   {
> [20:13:08]W:   [Step 10/10]  "blobSum": 
> "sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4"
> [20:13:08]W:   [Step 10/10]   },
> [20:13:08]W:   [Step 10/10]   {
> [20:13:08]W:   [Step 10/10]  "blobSum": 
> "sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4"
> [20:13:08]W:   [Step 10/10]   },
> [20:13:08]W:   [Step 10/10]   {
> [20:13:08]W:   [Step 10/10]  "blobSum": 
> "sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4"
> [20:13:08]W:   [Step 10/10]   },
> [20:13:08]W:   [Step 10/10]   {
> [20:13:08]W:   [Step 10/10]  "blobSum": 
> "sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4"
> [20:13:08]W:   [Step 10/10]   },
> [20:13:08]W:   [Step 10/10]   {
> [20:13:08]W:   [Step 10/10]  "blobSum": 
> "sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4"
> [20:13:08]W:   [Step 10/10]   },
> [20:13:08]W:   [Step 10/10]   {
> [20:13:08]W:   [Step 10/10]  "blobSum": 
> "sha256:1db09adb5ddd7f1a07b6d585a7db747a51c7bd17418d47e91f901bdf420abd66"
> [20:13:08]W:   [Step 10/10]   },
> [20:13:08]W:   [Step 10/10]   {
> [20:13:08]W:   [Step 10/10]  "blobSum": 
> "sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4"
> [20:13:08]W:   [Step 10/10]   },
> [20:13:08]W:   [Step 10/10]   {
> [20:13:08]W:   [Step 10/10]  "blobSum": 
> "sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4"
> [20:13:08]W:   [Step 10/10]   }
> [20:13:08]W:   [Step 10/10]],
> [20:13:08]W:   [Step 10/10]"history": [
> [20:13:08]W:   [Step 10/10]   {
> [20:13:08]W:   [Step 10/10]  "v1Compatibility": 
> "{\"id\":\"e28617c6dd2169bfe2b10017dfaa04bd7183ff840c4f78ebe73fca2a89effeb6\",\"parent\":\"be4ce2753831b8952a5b797cf45b2230e1befead6f5db0630bcb24a5f554255e\",\"created\":\"2014-08-15T00:31:36.407713553Z\",\"container\":\"5d55401ff99c7508c9d546926b711c78e3ccb36e39a848024b623b2aef4c2c06\",\"container_config\":{\"Hostname\":\"f7d939e68b5a\",\"Domainname\":\"\",\"User\":\"\",\"AttachStdin\":false,\"AttachStdout\":false,\"AttachStderr\":false,\"PortSpecs\":null,\"ExposedPorts\":null,\"Tty\":false,\"OpenStdin\":false,\"StdinOnce\":false,\"Env\":[\"HOME=/\",\"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin\"],\"Cmd\":[\"/bin/sh\",\"-c\",\"#(nop)
>  ENTRYPOINT 
> [echo]\"],\"Image\":\"be4ce2753831b8952a5b797cf45b2230e1befead6f5db0630bcb24a5f554255e\",\"Volumes\":null,\"VolumeDriver\":\"\",\"WorkingDir\":\"\",\"Entrypoint\":[\"echo\"],\"NetworkDisabled\":false,\"MacAddress\":\"\",\"OnBuild\":[],\"Labels\":null},\"docker_version\":\"1.1.2\",\"author\":\"supp...@mesosphere.io\",\"config\":{\"Hostname\":\"f7d939e68b5a\",\"Domainname\":\"\",\"User\":\"\",\"AttachStdin\":false,\"AttachStdout\":false,\"AttachStderr\":false,\"PortSpecs\":null,\"ExposedPorts\":null,\"Tty\":false,\"OpenStdin\":false,\"StdinOnce\":false,\"Env\":[\"HOME=/\",\"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin\"],\"Cmd\":[\"inky\"],\"Image\":\"be4ce2753831b8952a5b797cf45b2230e1befead6f5db0630bcb24a5f554255e\",\"Volumes\":null,\"VolumeDriver\":\"\",\"WorkingDir\":\"\",\"Entrypoint\":[\"echo\"],\"NetworkDisabled\":false,\"MacAddress\":\"\",\"OnBuild\":[],\"Labels\":null},\"architecture\":\"amd64\",\"os\":\"linux\",\"Size\":0}\n"
> [20:13:08]W:   [Step 10/10]   },
> [20:13:08]W:   [Step 10/10]   {
> [20:13:08]W:   [Step 10/10]  "v1Compatibility": 
> 

[jira] [Updated] (MESOS-6348) Allow `network/cni` isolator unit-tests to run with CNI plugins

2016-12-08 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-6348:
--
Sprint: Mesosphere Sprint 44, Mesosphere Sprint 45, Mesosphere Sprint 46, 
Mesosphere Sprint 47, Mesosphere Sprint 48  (was: Mesosphere Sprint 44, 
Mesosphere Sprint 45, Mesosphere Sprint 46, Mesosphere Sprint 47)

> Allow `network/cni` isolator unit-tests to run with CNI plugins 
> 
>
> Key: MESOS-6348
> URL: https://issues.apache.org/jira/browse/MESOS-6348
> Project: Mesos
>  Issue Type: Task
>Reporter: Avinash Sridharan
>Assignee: Avinash Sridharan
>  Labels: mesosphere
>
> Currently, we don't have any infrastructure that allows CNI plugins to be 
> used in the `network/cni` isolator unit tests. This forces us to mock CNI 
> plugins that don't use new network namespaces, leading to a very restrictive 
> form of unit test. 
> Especially for the port-mapper plugin: to test its DNAT functionality, it 
> would be very useful to run the containers in a separate network namespace, 
> which requires an actual CNI plugin.
> The proposal is to introduce a test filter called CNIPLUGIN that gets set 
> when the CNI_PATH env var is set. Tests using the CNIPLUGIN filter can then 
> use actual CNI plugins in their tests.
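> A sketch of how such a filter could be gated (assumed shape, not the final 
> patch):
> {code}
> #include <cstdlib>
>
> // CNI_PATH points at real CNI plugin binaries; without it the tests
> // fall back to mocked plugins in the host network namespace.
> bool cniPluginTestsEnabled()
> {
>   return std::getenv("CNI_PATH") != nullptr;
> }
> {code}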



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6305) Add authorization support for nested container calls

2016-12-08 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-6305:
--
Sprint: Mesosphere Sprint 47, Mesosphere Sprint 48  (was: Mesosphere Sprint 
47)

> Add authorization support for nested container calls
> 
>
> Key: MESOS-6305
> URL: https://issues.apache.org/jira/browse/MESOS-6305
> Project: Mesos
>  Issue Type: Improvement
>  Components: security
>Reporter: Galen Pewtherer
>Assignee: Alexander Rojas
>
> We need to authorize {LAUNCH, KILL, WAIT}_NESTED_CONTAINER API calls.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6291) Add unit tests for nested container case for filesystem/linux isolator.

2016-12-08 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-6291:
--
Sprint: Mesosphere Sprint 44, Mesosphere Sprint 45, Mesosphere Sprint 46, 
Mesosphere Sprint 47, Mesosphere Sprint 48  (was: Mesosphere Sprint 44, 
Mesosphere Sprint 45, Mesosphere Sprint 46, Mesosphere Sprint 47)

> Add unit tests for nested container case for filesystem/linux isolator.
> ---
>
> Key: MESOS-6291
> URL: https://issues.apache.org/jira/browse/MESOS-6291
> Project: Mesos
>  Issue Type: Improvement
>  Components: isolation
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>  Labels: isolator, mesosphere
>
> Parameterize the existing tests so that they all work for both top-level 
> containers and nested containers.
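> An illustrative gtest parameterization (the test and class names here are 
> hypothetical):
> {code}
> #include <gtest/gtest.h>
>
> // Run each test once for a top-level and once for a nested container.
> class FilesystemLinuxIsolatorTest
>   : public ::testing::TestWithParam<bool> {};  // true => nested.
>
> TEST_P(FilesystemLinuxIsolatorTest, ROOT_Example)
> {
>   const bool nested = GetParam();
>   // ... launch either a top-level or a nested container here ...
>   SUCCEED() << (nested ? "nested" : "top-level");
> }
>
> INSTANTIATE_TEST_CASE_P(
>     Nesting, FilesystemLinuxIsolatorTest, ::testing::Bool());
> {code}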



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-6740) Authorize v1 GET_FLAGS call

2016-12-08 Thread Alexander Rojas (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rojas reassigned MESOS-6740:
--

Assignee: Alexander Rojas

> Authorize v1 GET_FLAGS call
> ---
>
> Key: MESOS-6740
> URL: https://issues.apache.org/jira/browse/MESOS-6740
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, security
>Reporter: Adam B
>Assignee: Alexander Rojas
>  Labels: security
>
> We already have a VIEW_FLAGS ACL that we use for /flags and the flags part of 
> /state. Let's add authz to the v1 GET_FLAGS API call (on agent and master) 
> and reuse that ACL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-6670) Authz for Agent v1 operator API

2016-12-08 Thread Alexander Rojas (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rojas reassigned MESOS-6670:
--

Assignee: Alexander Rojas

> Authz for Agent v1 operator API
> ---
>
> Key: MESOS-6670
> URL: https://issues.apache.org/jira/browse/MESOS-6670
> Project: Mesos
>  Issue Type: Epic
>  Components: security
>Reporter: Adam B
>Assignee: Alexander Rojas
>  Labels: security
> Fix For: 1.2.0
>
>
> Of the agent's current v1 operator Calls,
> - Some don't need authz:
> GET_HEALTH = 1
> GET_VERSION = 3;
> GET_METRICS = 4;
> GET_LOGGING_LEVEL = 5;
> - Some already have authz:
> LIST_FILES = 7;
> READ_FILE = 8;
> LAUNCH_NESTED_CONTAINER = 14;
> WAIT_NESTED_CONTAINER = 15;
> KILL_NESTED_CONTAINER = 16;
> - Some probably have authz (filtering), but we need to test/verify
> GET_STATE = 9;
> GET_FRAMEWORKS = 11;
> GET_EXECUTORS = 12;
> GET_TASKS = 13;
> - Some don't have authz, but need it
> GET_FLAGS = 2;
> SET_LOGGING_LEVEL = 6;
> GET_CONTAINERS = 10;
> - Some are brand new, and their authz is covered by MESOS-6474
> LAUNCH_NESTED_CONTAINER_SESSION = 17;
> ATTACH_CONTAINER_INPUT = 18;
> ATTACH_CONTAINER_OUTPUT = 19;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-6739) Authorize v1 GET_CONTAINERS call

2016-12-08 Thread Alexander Rojas (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rojas reassigned MESOS-6739:
--

Assignee: Alexander Rojas

> Authorize v1 GET_CONTAINERS call
> 
>
> Key: MESOS-6739
> URL: https://issues.apache.org/jira/browse/MESOS-6739
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, security
>Reporter: Adam B
>Assignee: Alexander Rojas
>Priority: Critical
>  Labels: security
>
> We need some kind of authorization for GET_CONTAINERS.
> a. Coarse-grained like we already did for /containers. With this you could 
> say that Alice can GET_CONTAINERS for any/all containers on the cluster, but 
> Bob cannot see any containers' info.
> b. Fine-grained authz like we have for /state and /tasks. With this you could 
> say that Alice can GET_CONTAINERS and see filtered results where user=alice, 
> but Bob can only see filtered results where user=bob. It would be nice to 
> port this to /containers as well if/when we add it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-6741) Authorize v1 SET_LOGGING_LEVEL call

2016-12-08 Thread Alexander Rojas (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rojas reassigned MESOS-6741:
--

Assignee: Alexander Rojas

> Authorize v1 SET_LOGGING_LEVEL call
> ---
>
> Key: MESOS-6741
> URL: https://issues.apache.org/jira/browse/MESOS-6741
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, security
>Reporter: Adam B
>Assignee: Alexander Rojas
>Priority: Minor
>  Labels: security
>
> We need to add authz to this call to prevent unauthorized users from cranking 
> the log level way up to take down an agent/master.
> In the v0 API, we protected the /logging/toggle endpoint with a 
> "coarse-grained" GET_ENDPOINT_WITH_PATH ACL, but that cannot be reused 
> (directly) in the v1 API.
> We could add an analogous coarse-grained V1_CALL_WITH_ACTION ACL, but we're 
> probably better off just adding a trivial SET_LOG_LEVEL Authorization::Action 
> and ACL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6758) Support 'Basic' auth docker private registry on Unified Containerizer.

2016-12-08 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-6758:
---

 Summary: Support 'Basic' auth docker private registry on Unified 
Containerizer.
 Key: MESOS-6758
 URL: https://issues.apache.org/jira/browse/MESOS-6758
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: Gilbert Song
Assignee: Gilbert Song


Currently, the Unified Containerizer only supports private docker registries 
with 'Bearer' authorization (a token is needed from the auth server). We 
should support 'Basic' auth registries as well.
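For illustration, the difference on the wire (standard HTTP semantics, not 
Mesos-specific code):

{noformat}
# 'Basic' auth (proposed): credentials go directly in the header.
Authorization: Basic base64(<username>:<password>)

# 'Bearer' auth (currently supported): a token is first fetched from
# the auth server, then sent with each registry request.
Authorization: Bearer <token>
{noformat}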



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6623) Re-enable tests impacted by request streaming support

2016-12-08 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6623:
--
Sprint:   (was: Mesosphere Sprint 47)

> Re-enable tests impacted by request streaming support
> -
>
> Key: MESOS-6623
> URL: https://issues.apache.org/jira/browse/MESOS-6623
> Project: Mesos
>  Issue Type: Bug
>Reporter: Anand Mazumdar
>Assignee: Anand Mazumdar
>Priority: Blocker
>  Labels: mesosphere
>
> We added support for HTTP request streaming in libprocess as part of 
> MESOS-6466. However, this broke a few tests that relied on HTTP request 
> filtering since the handlers no longer have access to the body of the request 
> when {{visit()}} is invoked. We would need to revisit how we do HTTP request 
> filtering and then re-enable these tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-6745) MesosContainerizer/DefaultExecutorTest.KillTask/0 is flaky

2016-12-08 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier reassigned MESOS-6745:
---

Assignee: Benjamin Bannier

> MesosContainerizer/DefaultExecutorTest.KillTask/0 is flaky
> --
>
> Key: MESOS-6745
> URL: https://issues.apache.org/jira/browse/MESOS-6745
> Project: Mesos
>  Issue Type: Bug
> Environment: Recent Arch Linux VM
>Reporter: Neil Conway
>Assignee: Benjamin Bannier
>  Labels: mesosphere
>
> This repros consistently for me (< 20 test iterations), using {{master}} as 
> of {{ab79d58c9df0ffb8ad35f6662541e7a5c3ea4a80}}. Test log:
> {noformat}
> [--] 1 test from MesosContainerizer/DefaultExecutorTest
> [ RUN  ] MesosContainerizer/DefaultExecutorTest.KillTask/0
> I1208 03:32:34.943745 29285 cluster.cpp:160] Creating default 'local' 
> authorizer
> I1208 03:32:34.944695 29285 replica.cpp:776] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I1208 03:32:34.945287 29306 recover.cpp:451] Starting replica recovery
> I1208 03:32:34.945431 29306 recover.cpp:477] Replica is in EMPTY status
> I1208 03:32:34.946542 29300 replica.cpp:673] Replica in EMPTY status received 
> a broadcasted recover request from __req_res__(127)@10.0.2.15:36807
> I1208 03:32:34.946768 29301 recover.cpp:197] Received a recover response from 
> a replica in EMPTY status
> I1208 03:32:34.947377 29299 recover.cpp:568] Updating replica status to 
> STARTING
> I1208 03:32:34.947746 29306 replica.cpp:320] Persisted replica status to 
> STARTING
> I1208 03:32:34.947887 29306 recover.cpp:477] Replica is in STARTING status
> I1208 03:32:34.948559 29306 replica.cpp:673] Replica in STARTING status 
> received a broadcasted recover request from __req_res__(128)@10.0.2.15:36807
> I1208 03:32:34.948771 29299 recover.cpp:197] Received a recover response from 
> a replica in STARTING status
> I1208 03:32:34.949097 29302 recover.cpp:568] Updating replica status to VOTING
> I1208 03:32:34.949385 29306 replica.cpp:320] Persisted replica status to 
> VOTING
> I1208 03:32:34.949467 29306 recover.cpp:582] Successfully joined the Paxos 
> group
> I1208 03:32:34.971436 29301 master.cpp:380] Master 
> 67de7bda-9b5b-4fe9-aede-390ec9ca7290 (archlinux.vagrant.vm) started on 
> 10.0.2.15:36807
> I1208 03:32:34.971519 29301 master.cpp:382] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/8oMk6W/credentials" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --quiet="false" --recovery_agent_removal_limit="100%" 
> --registry="replicated_log" --registry_fetch_timeout="1mins" 
> --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" 
> --registry_max_agent_count="102400" --registry_store_timeout="100secs" 
> --registry_strict="false" --root_submissions="true" --user_sorter="drf" 
> --version="false" --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/tmp/8oMk6W/master" --zk_session_timeout="10secs"
> I1208 03:32:34.971824 29301 master.cpp:432] Master only allowing 
> authenticated frameworks to register
> I1208 03:32:34.971832 29301 master.cpp:446] Master only allowing 
> authenticated agents to register
> I1208 03:32:34.971837 29301 master.cpp:459] Master only allowing 
> authenticated HTTP frameworks to register
> I1208 03:32:34.971842 29301 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/8oMk6W/credentials'
> I1208 03:32:34.972051 29301 master.cpp:504] Using default 'crammd5' 
> authenticator
> I1208 03:32:34.972198 29301 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I1208 03:32:34.972327 29301 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I1208 03:32:34.972436 29301 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I1208 03:32:34.972561 29301 master.cpp:584] Authorization enabled
> I1208 03:32:34.974555 29300 master.cpp:2043] Elected as the leading master!
> I1208 03:32:34.974586 29300 master.cpp:1566] Recovering from registrar
> I1208 03:32:34.975244 29306 log.cpp:553] Attempting to start the writer
> 

[jira] [Assigned] (MESOS-6744) DefaultExecutorTest.KillTaskGroupOnTaskFailure is flaky

2016-12-08 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier reassigned MESOS-6744:
---

Assignee: Benjamin Bannier

> DefaultExecutorTest.KillTaskGroupOnTaskFailure is flaky
> ---
>
> Key: MESOS-6744
> URL: https://issues.apache.org/jira/browse/MESOS-6744
> Project: Mesos
>  Issue Type: Bug
> Environment: Recent Arch Linux VM, amd64.
>Reporter: Neil Conway
>Assignee: Benjamin Bannier
>  Labels: mesosphere
>
> This repros consistently for me (~10 test iterations or fewer). Test log:
> {noformat}
> [ RUN  ] DefaultExecutorTest.KillTaskGroupOnTaskFailure
> I1208 03:26:47.461477 28632 cluster.cpp:160] Creating default 'local' 
> authorizer
> I1208 03:26:47.462673 28632 replica.cpp:776] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I1208 03:26:47.463248 28650 recover.cpp:451] Starting replica recovery
> I1208 03:26:47.463537 28650 recover.cpp:477] Replica is in EMPTY status
> I1208 03:26:47.476333 28651 replica.cpp:673] Replica in EMPTY status received 
> a broadcasted recover request from __req_res__(64)@10.0.2.15:46643
> I1208 03:26:47.476618 28650 recover.cpp:197] Received a recover response from 
> a replica in EMPTY status
> I1208 03:26:47.477242 28649 recover.cpp:568] Updating replica status to 
> STARTING
> I1208 03:26:47.477496 28649 replica.cpp:320] Persisted replica status to 
> STARTING
> I1208 03:26:47.477607 28649 recover.cpp:477] Replica is in STARTING status
> I1208 03:26:47.478910 28653 replica.cpp:673] Replica in STARTING status 
> received a broadcasted recover request from __req_res__(65)@10.0.2.15:46643
> I1208 03:26:47.479385 28651 recover.cpp:197] Received a recover response from 
> a replica in STARTING status
> I1208 03:26:47.479717 28647 recover.cpp:568] Updating replica status to VOTING
> I1208 03:26:47.479996 28648 replica.cpp:320] Persisted replica status to 
> VOTING
> I1208 03:26:47.480077 28648 recover.cpp:582] Successfully joined the Paxos 
> group
> I1208 03:26:47.763380 28651 master.cpp:380] Master 
> 0bcb0250-4cf5-4209-92fe-ce260518b50f (archlinux.vagrant.vm) started on 
> 10.0.2.15:46643
> I1208 03:26:47.763463 28651 master.cpp:382] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/7lpy50/credentials" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --quiet="false" --recovery_agent_removal_limit="100%" 
> --registry="replicated_log" --registry_fetch_timeout="1mins" 
> --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" 
> --registry_max_agent_count="102400" --registry_store_timeout="100secs" 
> --registry_strict="false" --root_submissions="true" --user_sorter="drf" 
> --version="false" --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/tmp/7lpy50/master" --zk_session_timeout="10secs"
> I1208 03:26:47.764010 28651 master.cpp:432] Master only allowing 
> authenticated frameworks to register
> I1208 03:26:47.764070 28651 master.cpp:446] Master only allowing 
> authenticated agents to register
> I1208 03:26:47.764076 28651 master.cpp:459] Master only allowing 
> authenticated HTTP frameworks to register
> I1208 03:26:47.764081 28651 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/7lpy50/credentials'
> I1208 03:26:47.764482 28651 master.cpp:504] Using default 'crammd5' 
> authenticator
> I1208 03:26:47.764659 28651 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I1208 03:26:47.764981 28651 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I1208 03:26:47.765136 28651 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I1208 03:26:47.765231 28651 master.cpp:584] Authorization enabled
> I1208 03:26:47.768061 28651 master.cpp:2043] Elected as the leading master!
> I1208 03:26:47.768097 28651 master.cpp:1566] Recovering from registrar
> I1208 03:26:47.768766 28648 log.cpp:553] Attempting to start the writer
> I1208 03:26:47.769899 28653 replica.cpp:493] Replica received implicit 
> promise request from __req_res__(66)@10.0.2.15:46643 with 

[jira] [Commented] (MESOS-6745) MesosContainerizer/DefaultExecutorTest.KillTask/0 is flaky

2016-12-08 Thread Benjamin Bannier (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15732327#comment-15732327
 ] 

Benjamin Bannier commented on MESOS-6745:
-

The issue here seems to be that the scheduler sends status update 
acknowledgements not exactly in the order in which the agent originally sent 
the updates.
{code}
E1208 03:32:35.358795 29303 slave.cpp:3018] Failed to handle status update 
acknowledgement (UUID: aed3ed28-1943-44c3-a8b6-40be41ffc20b) for task 
93d62044-e146-4b70-9648-221b72cfaad7 of framework 
67de7bda-9b5b-4fe9-aede-390ec9ca7290-: Duplicate acknowledgement
{code}

(The error string {{Duplicate acknowledgment}} is incorrect here; the only 
issue was that we were not waiting for this exact acknowledgement).

[~anandmazumdar] [~vinodkone]: Should the status update manager allow 
acknowledging in any order? The documentation does not make a strong point 
about ordering, and it would be relatively easy to, e.g., subscribe two 
schedulers to the same stream, which would lead to issues like this in 
production scenarios.
{code}
# ACKNOWLEDGE

Sent by the scheduler to acknowledge a status update. Note that with the new 
API, schedulers are responsible for explicitly acknowledging the receipt of 
status updates that have “status.uuid()” set. These status updates are reliably 
retried until they are acknowledged by the scheduler. The scheduler must not 
acknowledge status updates that do not have “status.uuid()” set as they are not 
retried. “uuid” is raw bytes encoded in Base64.
{code}

> MesosContainerizer/DefaultExecutorTest.KillTask/0 is flaky
> --
>
> Key: MESOS-6745
> URL: https://issues.apache.org/jira/browse/MESOS-6745
> Project: Mesos
>  Issue Type: Bug
> Environment: Recent Arch Linux VM
>Reporter: Neil Conway
>  Labels: mesosphere
>
> This repros consistently for me (< 20 test iterations), using {{master}} as 
> of {{ab79d58c9df0ffb8ad35f6662541e7a5c3ea4a80}}. Test log:
> {noformat}
> [--] 1 test from MesosContainerizer/DefaultExecutorTest
> [ RUN  ] MesosContainerizer/DefaultExecutorTest.KillTask/0
> I1208 03:32:34.943745 29285 cluster.cpp:160] Creating default 'local' 
> authorizer
> I1208 03:32:34.944695 29285 replica.cpp:776] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I1208 03:32:34.945287 29306 recover.cpp:451] Starting replica recovery
> I1208 03:32:34.945431 29306 recover.cpp:477] Replica is in EMPTY status
> I1208 03:32:34.946542 29300 replica.cpp:673] Replica in EMPTY status received 
> a broadcasted recover request from __req_res__(127)@10.0.2.15:36807
> I1208 03:32:34.946768 29301 recover.cpp:197] Received a recover response from 
> a replica in EMPTY status
> I1208 03:32:34.947377 29299 recover.cpp:568] Updating replica status to 
> STARTING
> I1208 03:32:34.947746 29306 replica.cpp:320] Persisted replica status to 
> STARTING
> I1208 03:32:34.947887 29306 recover.cpp:477] Replica is in STARTING status
> I1208 03:32:34.948559 29306 replica.cpp:673] Replica in STARTING status 
> received a broadcasted recover request from __req_res__(128)@10.0.2.15:36807
> I1208 03:32:34.948771 29299 recover.cpp:197] Received a recover response from 
> a replica in STARTING status
> I1208 03:32:34.949097 29302 recover.cpp:568] Updating replica status to VOTING
> I1208 03:32:34.949385 29306 replica.cpp:320] Persisted replica status to 
> VOTING
> I1208 03:32:34.949467 29306 recover.cpp:582] Successfully joined the Paxos 
> group
> I1208 03:32:34.971436 29301 master.cpp:380] Master 
> 67de7bda-9b5b-4fe9-aede-390ec9ca7290 (archlinux.vagrant.vm) started on 
> 10.0.2.15:36807
> I1208 03:32:34.971519 29301 master.cpp:382] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/8oMk6W/credentials" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --quiet="false" --recovery_agent_removal_limit="100%" 
> --registry="replicated_log" --registry_fetch_timeout="1mins" 
> --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" 
> --registry_max_agent_count="102400" --registry_store_timeout="100secs" 
> --registry_strict="false" --root_submissions="true" 

[jira] [Comment Edited] (MESOS-6744) DefaultExecutorTest.KillTaskGroupOnTaskFailure is flaky

2016-12-08 Thread Benjamin Bannier (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15732332#comment-15732332
 ] 

Benjamin Bannier edited comment on MESOS-6744 at 12/8/16 2:20 PM:
--

This issue is caused by the same pattern as MESOS-6745, see [this 
comment|https://issues.apache.org/jira/browse/MESOS-6745?focusedCommentId=15732327=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15732327]
 over there.


was (Author: bbannier):
This issue is caused by the same pattern as MESOS-6745, see [this 
comment](https://issues.apache.org/jira/browse/MESOS-6745?focusedCommentId=15732327=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15732327)
 over there.

> DefaultExecutorTest.KillTaskGroupOnTaskFailure is flaky
> ---
>
> Key: MESOS-6744
> URL: https://issues.apache.org/jira/browse/MESOS-6744
> Project: Mesos
>  Issue Type: Bug
> Environment: Recent Arch Linux VM, amd64.
>Reporter: Neil Conway
>  Labels: mesosphere
>
> This repros consistently for me (~10 test iterations or fewer). Test log:
> {noformat}
> [ RUN  ] DefaultExecutorTest.KillTaskGroupOnTaskFailure
> I1208 03:26:47.461477 28632 cluster.cpp:160] Creating default 'local' 
> authorizer
> I1208 03:26:47.462673 28632 replica.cpp:776] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I1208 03:26:47.463248 28650 recover.cpp:451] Starting replica recovery
> I1208 03:26:47.463537 28650 recover.cpp:477] Replica is in EMPTY status
> I1208 03:26:47.476333 28651 replica.cpp:673] Replica in EMPTY status received 
> a broadcasted recover request from __req_res__(64)@10.0.2.15:46643
> I1208 03:26:47.476618 28650 recover.cpp:197] Received a recover response from 
> a replica in EMPTY status
> I1208 03:26:47.477242 28649 recover.cpp:568] Updating replica status to 
> STARTING
> I1208 03:26:47.477496 28649 replica.cpp:320] Persisted replica status to 
> STARTING
> I1208 03:26:47.477607 28649 recover.cpp:477] Replica is in STARTING status
> I1208 03:26:47.478910 28653 replica.cpp:673] Replica in STARTING status 
> received a broadcasted recover request from __req_res__(65)@10.0.2.15:46643
> I1208 03:26:47.479385 28651 recover.cpp:197] Received a recover response from 
> a replica in STARTING status
> I1208 03:26:47.479717 28647 recover.cpp:568] Updating replica status to VOTING
> I1208 03:26:47.479996 28648 replica.cpp:320] Persisted replica status to 
> VOTING
> I1208 03:26:47.480077 28648 recover.cpp:582] Successfully joined the Paxos 
> group
> I1208 03:26:47.763380 28651 master.cpp:380] Master 
> 0bcb0250-4cf5-4209-92fe-ce260518b50f (archlinux.vagrant.vm) started on 
> 10.0.2.15:46643
> I1208 03:26:47.763463 28651 master.cpp:382] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/7lpy50/credentials" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --quiet="false" --recovery_agent_removal_limit="100%" 
> --registry="replicated_log" --registry_fetch_timeout="1mins" 
> --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" 
> --registry_max_agent_count="102400" --registry_store_timeout="100secs" 
> --registry_strict="false" --root_submissions="true" --user_sorter="drf" 
> --version="false" --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/tmp/7lpy50/master" --zk_session_timeout="10secs"
> I1208 03:26:47.764010 28651 master.cpp:432] Master only allowing 
> authenticated frameworks to register
> I1208 03:26:47.764070 28651 master.cpp:446] Master only allowing 
> authenticated agents to register
> I1208 03:26:47.764076 28651 master.cpp:459] Master only allowing 
> authenticated HTTP frameworks to register
> I1208 03:26:47.764081 28651 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/7lpy50/credentials'
> I1208 03:26:47.764482 28651 master.cpp:504] Using default 'crammd5' 
> authenticator
> I1208 03:26:47.764659 28651 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I1208 03:26:47.764981 28651 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I1208 
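
As an aside on the replica recovery lines in the log above: the replicated-log replica reports moving from EMPTY through STARTING to VOTING, at which point it has joined the Paxos group. A minimal sketch of that progression, assuming only what the log itself shows (this is not the Mesos implementation):

{code}
// Simplified sketch of the recovery progression visible in the log:
// EMPTY -> STARTING -> VOTING, with each status persisted along the way.
#include <cassert>
#include <iostream>

enum class ReplicaStatus { EMPTY, STARTING, VOTING };

// Advance one recovery step; VOTING is terminal.
ReplicaStatus advance(ReplicaStatus status) {
  switch (status) {
    case ReplicaStatus::EMPTY:    return ReplicaStatus::STARTING;
    case ReplicaStatus::STARTING: return ReplicaStatus::VOTING;
    case ReplicaStatus::VOTING:   return ReplicaStatus::VOTING;
  }
  return status;  // Unreachable; keeps compilers quiet.
}

int main() {
  ReplicaStatus status = ReplicaStatus::EMPTY;
  status = advance(status);  // Persisted replica status to STARTING.
  status = advance(status);  // Persisted replica status to VOTING.
  assert(status == ReplicaStatus::VOTING);
  std::cout << "Successfully joined the Paxos group\n";
  return 0;
}
{code}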

[jira] [Commented] (MESOS-6744) DefaultExecutorTest.KillTaskGroupOnTaskFailure is flaky

2016-12-08 Thread Benjamin Bannier (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15732332#comment-15732332
 ] 

Benjamin Bannier commented on MESOS-6744:
-

This issue is caused by the same pattern as MESOS-6745, see [this 
comment](https://issues.apache.org/jira/browse/MESOS-6745?focusedCommentId=15732327&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15732327)
 over there.

> DefaultExecutorTest.KillTaskGroupOnTaskFailure is flaky
> ---
>
> Key: MESOS-6744
> URL: https://issues.apache.org/jira/browse/MESOS-6744
> Project: Mesos
>  Issue Type: Bug
> Environment: Recent Arch Linux VM, amd64.
>Reporter: Neil Conway
>  Labels: mesosphere
>
> This repros consistently for me (~10 test iterations or fewer). Test log:
> {noformat}
> [ RUN  ] DefaultExecutorTest.KillTaskGroupOnTaskFailure
> I1208 03:26:47.461477 28632 cluster.cpp:160] Creating default 'local' 
> authorizer
> I1208 03:26:47.462673 28632 replica.cpp:776] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I1208 03:26:47.463248 28650 recover.cpp:451] Starting replica recovery
> I1208 03:26:47.463537 28650 recover.cpp:477] Replica is in EMPTY status
> I1208 03:26:47.476333 28651 replica.cpp:673] Replica in EMPTY status received 
> a broadcasted recover request from __req_res__(64)@10.0.2.15:46643
> I1208 03:26:47.476618 28650 recover.cpp:197] Received a recover response from 
> a replica in EMPTY status
> I1208 03:26:47.477242 28649 recover.cpp:568] Updating replica status to 
> STARTING
> I1208 03:26:47.477496 28649 replica.cpp:320] Persisted replica status to 
> STARTING
> I1208 03:26:47.477607 28649 recover.cpp:477] Replica is in STARTING status
> I1208 03:26:47.478910 28653 replica.cpp:673] Replica in STARTING status 
> received a broadcasted recover request from __req_res__(65)@10.0.2.15:46643
> I1208 03:26:47.479385 28651 recover.cpp:197] Received a recover response from 
> a replica in STARTING status
> I1208 03:26:47.479717 28647 recover.cpp:568] Updating replica status to VOTING
> I1208 03:26:47.479996 28648 replica.cpp:320] Persisted replica status to 
> VOTING
> I1208 03:26:47.480077 28648 recover.cpp:582] Successfully joined the Paxos 
> group
> I1208 03:26:47.763380 28651 master.cpp:380] Master 
> 0bcb0250-4cf5-4209-92fe-ce260518b50f (archlinux.vagrant.vm) started on 
> 10.0.2.15:46643
> I1208 03:26:47.763463 28651 master.cpp:382] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/7lpy50/credentials" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --quiet="false" --recovery_agent_removal_limit="100%" 
> --registry="replicated_log" --registry_fetch_timeout="1mins" 
> --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" 
> --registry_max_agent_count="102400" --registry_store_timeout="100secs" 
> --registry_strict="false" --root_submissions="true" --user_sorter="drf" 
> --version="false" --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/tmp/7lpy50/master" --zk_session_timeout="10secs"
> I1208 03:26:47.764010 28651 master.cpp:432] Master only allowing 
> authenticated frameworks to register
> I1208 03:26:47.764070 28651 master.cpp:446] Master only allowing 
> authenticated agents to register
> I1208 03:26:47.764076 28651 master.cpp:459] Master only allowing 
> authenticated HTTP frameworks to register
> I1208 03:26:47.764081 28651 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/7lpy50/credentials'
> I1208 03:26:47.764482 28651 master.cpp:504] Using default 'crammd5' 
> authenticator
> I1208 03:26:47.764659 28651 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I1208 03:26:47.764981 28651 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I1208 03:26:47.765136 28651 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I1208 03:26:47.765231 28651 master.cpp:584] Authorization enabled
> I1208 03:26:47.768061 28651 master.cpp:2043] Elected as the leading master!
> I1208 03:26:47.768097 28651 master.cpp:1566] Recovering from registrar