[jira] [Updated] (MESOS-8080) The default executor does not propagate missing task exit status correctly.

2017-11-03 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach updated MESOS-8080:
---
Fix Version/s: 1.4.1

> The default executor does not propagate missing task exit status correctly.
> ---
>
> Key: MESOS-8080
> URL: https://issues.apache.org/jira/browse/MESOS-8080
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
> Fix For: 1.2.3, 1.3.2, 1.4.1, 1.5.0
>
>
> The default executor does not handle a missing nested container
> exit status correctly. It assumes the protobuf accessor returns an
> Option rather than explicitly checking whether the `exit_status`
> field is present in the message.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-6816) Allows frameworks to overwrite system environment variables

2017-11-03 Thread Andrew Schwartzmeyer (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238361#comment-16238361
 ] 

Andrew Schwartzmeyer commented on MESOS-6816:
-

Fixed, need to post reviews.

> Allows frameworks to overwrite system environment variables
> ---
>
> Key: MESOS-6816
> URL: https://issues.apache.org/jira/browse/MESOS-6816
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Daniel Pravat
>Assignee: Andrew Schwartzmeyer
>Priority: Minor
>  Labels: microsoft, windows
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> If the framework specifies an environment variable block, the Mesos agent 
> code overwrites its variables with those already present in the system 
> environment block. For example, even if the framework specifies a variable 
> named `ComSpec`, the value observed in the agent will be the one configured 
> in the `ComSpec` system environment.





[jira] [Commented] (MESOS-5932) Replicated log's dependency on leveldb prevents it from being used on Windows

2017-11-03 Thread Andrew Schwartzmeyer (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238357#comment-16238357
 ] 

Andrew Schwartzmeyer commented on MESOS-5932:
-

These issues share the same blocker, and as noted on MESOS-5820, leveldb is now 
being officially ported to Windows.

> Replicated log's dependency on leveldb prevents it from being used on Windows
> -
>
> Key: MESOS-5932
> URL: https://issues.apache.org/jira/browse/MESOS-5932
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Alex Clemmer
>Assignee: Andrew Schwartzmeyer
>Priority: Major
>  Labels: agent, master, mesosphere
>
> The replicated log (in src/log/) depends on leveldb to store and persist data 
> in the replicas.
> This dependency is well-contained within replica.cpp, but until it is 
> abstracted out, it nonetheless prevents the master from being built on 
> Windows, which in turn prevents the agent tests from being built and run on 
> Windows.
> Preliminary investigation shows that we will probably want to split this work 
> into 2 parts:
> * Temporarily remove the ability of the master to use the replicated log on 
> Windows (in master/main.cpp). This should involve 1 conditional where we 
> instantiate a `Log::Log`. This should be enough for us to light up the agent 
> tests.
> * Add leveldb Windows support to Mesos. This involves: adding CMake files to 
> build leveldb source, and adding Windows-specific `port_*` files that will 
> map the platform-specific constructs of leveldb to Windows. We can take hints 
> from leveldown and other projects, which add their own `port_*` files that 
> suit their purposes (namely, running leveldb, in node, on Windows). NOTE: the 
> leveldb community explicitly calls out in its documentation that it is not 
> interested in non-POSIX changes, so it is likely that this will never be 
> inducted into the mainline leveldb codebase.





[jira] [Assigned] (MESOS-6117) TCP health checks are not supported on Windows.

2017-11-03 Thread Andrew Schwartzmeyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Schwartzmeyer reassigned MESOS-6117:
---

Assignee: John Kordich  (was: Andrew Schwartzmeyer)

> TCP health checks are not supported on Windows.
> ---
>
> Key: MESOS-6117
> URL: https://issues.apache.org/jira/browse/MESOS-6117
> Project: Mesos
>  Issue Type: Task
>  Components: executor
>Reporter: Alexander Rukletsov
>Assignee: John Kordich
>Priority: Major
>  Labels: check, health-check, mesosphere, windows
>
> Currently, TCP health checks are only available on Linux. Windows support 
> should be added to maintain feature parity.





[jira] [Assigned] (MESOS-6714) Port `slave_tests.cpp`

2017-11-03 Thread Andrew Schwartzmeyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Schwartzmeyer reassigned MESOS-6714:
---

Assignee: Andrew Schwartzmeyer  (was: Li Li)

> Port `slave_tests.cpp`
> --
>
> Key: MESOS-6714
> URL: https://issues.apache.org/jira/browse/MESOS-6714
> Project: Mesos
>  Issue Type: Task
>  Components: agent
>Reporter: Alex Clemmer
>Assignee: Andrew Schwartzmeyer
>Priority: Major
>  Labels: microsoft, windows-mvp
>






[jira] [Assigned] (MESOS-6709) Port `health_check_tests.cpp`

2017-11-03 Thread Andrew Schwartzmeyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Schwartzmeyer reassigned MESOS-6709:
---

Assignee: John Kordich  (was: Andrew Schwartzmeyer)

> Port `health_check_tests.cpp`
> -
>
> Key: MESOS-6709
> URL: https://issues.apache.org/jira/browse/MESOS-6709
> Project: Mesos
>  Issue Type: Task
>  Components: agent
>Reporter: Alex Clemmer
>Assignee: John Kordich
>Priority: Major
>  Labels: microsoft, windows, windows-mvp
>






[jira] [Assigned] (MESOS-6712) Port `slave_authorization_tests.cpp`

2017-11-03 Thread Andrew Schwartzmeyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Schwartzmeyer reassigned MESOS-6712:
---

Assignee: John Kordich  (was: Andrew Schwartzmeyer)

> Port `slave_authorization_tests.cpp`
> 
>
> Key: MESOS-6712
> URL: https://issues.apache.org/jira/browse/MESOS-6712
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Alex Clemmer
>Assignee: John Kordich
>Priority: Major
>  Labels: microsoft, windows-mvp
>






[jira] [Assigned] (MESOS-5932) Replicated log's dependency on leveldb prevents it from being used on Windows

2017-11-03 Thread Andrew Schwartzmeyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Schwartzmeyer reassigned MESOS-5932:
---

Assignee: Andrew Schwartzmeyer

> Replicated log's dependency on leveldb prevents it from being used on Windows
> -
>
> Key: MESOS-5932
> URL: https://issues.apache.org/jira/browse/MESOS-5932
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Alex Clemmer
>Assignee: Andrew Schwartzmeyer
>Priority: Major
>  Labels: agent, master, mesosphere
>
> The replicated log (in src/log/) depends on leveldb to store and persist data 
> in the replicas.
> This dependency is well-contained within replica.cpp, but until it is 
> abstracted out, it nonetheless prevents the master from being built on 
> Windows, which in turn prevents the agent tests from being built and run on 
> Windows.
> Preliminary investigation shows that we will probably want to split this work 
> into 2 parts:
> * Temporarily remove the ability of the master to use the replicated log on 
> Windows (in master/main.cpp). This should involve 1 conditional where we 
> instantiate a `Log::Log`. This should be enough for us to light up the agent 
> tests.
> * Add leveldb Windows support to Mesos. This involves: adding CMake files to 
> build leveldb source, and adding Windows-specific `port_*` files that will 
> map the platform-specific constructs of leveldb to Windows. We can take hints 
> from leveldown and other projects, which add their own `port_*` files that 
> suit their purposes (namely, running leveldb, in node, on Windows). NOTE: the 
> leveldb community explicitly calls out in its documentation that it is not 
> interested in non-POSIX changes, so it is likely that this will never be 
> inducted into the mainline leveldb codebase.





[jira] [Commented] (MESOS-8098) Benchmark Master failover performance

2017-11-03 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238131#comment-16238131
 ] 

Yan Xu commented on MESOS-8098:
---

{noformat:title=}
commit ac0fa281472c2ba891f7bd0837fbd728ace73039
Author: Jiang Yan Xu 
Date:   Wed Oct 18 01:53:11 2017 -0700

Added a benchmark for agent reregistration during master failover.

Review: https://reviews.apache.org/r/63174
{noformat}

> Benchmark Master failover performance
> -
>
> Key: MESOS-8098
> URL: https://issues.apache.org/jira/browse/MESOS-8098
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Yan Xu
>Assignee: Yan Xu
>Priority: Major
> Attachments: withoutperfpatches.perf.svg, withperfpatches.perf.svg
>
>
> Master failover performance often sheds light on the master's performance in 
> general, as failover is often when the master experiences the highest load. 
> Ways we can benchmark the failover include measuring the time it takes for 
> all agents to reregister, for all frameworks to resubscribe, or to fully 
> reconcile.





[jira] [Updated] (MESOS-8098) Benchmark Master failover performance

2017-11-03 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu updated MESOS-8098:
--
Attachment: withoutperfpatches.perf.svg
withperfpatches.perf.svg

Attaching two flame graphs comparing the benchmark running against the two 
versions below:

withperfpatches.perf.svg: 
https://github.com/apache/mesos/commit/41193181d6b75eeecae2729bf98007d9318e351a 
(close to the HEAD when the benchmark was created).

vs. 

withoutperfpatches.perf.svg: 
https://github.com/apache/mesos/commit/d9c90bf1d9c8b3a7dcc47be0cb773efff57cfb9d 
(before https://issues.apache.org/jira/browse/MESOS-7713 was merged)

I captured the perf data by invoking gdb-mesos-tests.sh, setting two 
breakpoints on the two {{cout}} lines (immediately before and after the bulk 
reregistration), running, and coordinating {{perf record}} with the 
breakpoints so that it only captures the process behavior in between.

However, I couldn't find much useful info in the resulting graphs. Perhaps 
someone can help me take a look? /cc [~bmahler] [~ipronin] [~dzhuk]

> Benchmark Master failover performance
> -
>
> Key: MESOS-8098
> URL: https://issues.apache.org/jira/browse/MESOS-8098
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Yan Xu
>Assignee: Yan Xu
>Priority: Major
> Attachments: withoutperfpatches.perf.svg, withperfpatches.perf.svg
>
>
> Master failover performance often sheds light on the master's performance in 
> general, as failover is often when the master experiences the highest load. 
> Ways we can benchmark the failover include measuring the time it takes for 
> all agents to reregister, for all frameworks to resubscribe, or to fully 
> reconcile.





[jira] [Updated] (MESOS-8080) The default executor does not propagate missing task exit status correctly.

2017-11-03 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach updated MESOS-8080:
---
Fix Version/s: 1.3.2

> The default executor does not propagate missing task exit status correctly.
> ---
>
> Key: MESOS-8080
> URL: https://issues.apache.org/jira/browse/MESOS-8080
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
> Fix For: 1.2.3, 1.3.2, 1.5.0
>
>
> The default executor does not handle a missing nested container
> exit status correctly. It assumes the protobuf accessor returns an
> Option rather than explicitly checking whether the `exit_status`
> field is present in the message.





[jira] [Commented] (MESOS-7506) Multiple tests leave orphan containers.

2017-11-03 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238045#comment-16238045
 ] 

Alexander Rukletsov commented on MESOS-7506:


There are at least two different paths that lead to orphaned containers. Andrei 
described one above. Another one is still to be fully investigated, but in 
short, an executor appears to have exited and {{containerizer->wait()}} is 
triggered on its container id, but the container is not removed from the 
containerizer's internal {{containers_}} collection.

> Multiple tests leave orphan containers.
> ---
>
> Key: MESOS-7506
> URL: https://issues.apache.org/jira/browse/MESOS-7506
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: containerizer, flaky-test, mesosphere
>
> I've observed a number of flaky tests that leave orphan containers upon 
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}





[jira] [Updated] (MESOS-8080) The default executor does not propagate missing task exit status correctly.

2017-11-03 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach updated MESOS-8080:
---
Fix Version/s: 1.2.3

> The default executor does not propagate missing task exit status correctly.
> ---
>
> Key: MESOS-8080
> URL: https://issues.apache.org/jira/browse/MESOS-8080
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
> Fix For: 1.2.3, 1.5.0
>
>
> The default executor does not handle a missing nested container
> exit status correctly. It assumes the protobuf accessor returns an
> Option rather than explicitly checking whether the `exit_status`
> field is present in the message.





[jira] [Updated] (MESOS-7378) Build failure with glibc 2.12.

2017-11-03 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach updated MESOS-7378:
---
Fix Version/s: 1.2.3

> Build failure with glibc 2.12.
> --
>
> Key: MESOS-7378
> URL: https://issues.apache.org/jira/browse/MESOS-7378
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.3.0
>Reporter: James Peach
>Assignee: Neil Conway
>Priority: Blocker
> Fix For: 1.2.3, 1.3.0, 1.4.0
>
>
> {noformat}
> 03:46:16 - ./.libs/libmesos.so: undefined reference to 
> `gnu_dev_minor(unsigned long long)'
> 03:46:16 - ./.libs/libmesos.so: undefined reference to 
> `gnu_dev_major(unsigned long long)'
> {noformat}
> This is caused by the change in MESOS-7365.
> Including {{sys/sysmacros.h}} directly works on modern systems, but on our 
> older version of glibc, the {{sys/sysmacros.h}} header does not contain C++ 
> decls. This means that the inline symbols get C++ name mangling applied and 
> they don't get found at link time.
> {noformat}
> [vagrant@mesos ~]$ cat /etc/redhat-release
> CentOS release 6.8 (Final)
> [vagrant@mesos ~]$ rpm -qa | grep glibc
> glibc-common-2.12-1.192.el6.x86_64
> glibc-devel-2.12-1.192.el6.x86_64
> glibc-2.12-1.192.el6.x86_64
> glibc-headers-2.12-1.192.el6.x86_64
> {noformat}





[jira] [Commented] (MESOS-8170) Propagate exit status 127 for errors after fork

2017-11-03 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237988#comment-16237988
 ] 

James Peach commented on MESOS-8170:


/cc [~benjaminhindman] [~jieyu]

> Propagate exit status 127 for errors after fork
> ---
>
> Key: MESOS-8170
> URL: https://issues.apache.org/jira/browse/MESOS-8170
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: James Peach
>
> There's no consistent methodology in libprocess or Mesos for propagating 
> errors that happen between fork and exec. For 
> [posix_spawn|http://pubs.opengroup.org/onlinepubs/009695399/functions/posix_spawn.html],
> the POSIX standard designates an exit code of 127 to identify this case. We 
> should adopt the same convention.
> {quote}
> The 8 bits of child process exit status that are guaranteed by IEEE Std 
> 1003.1-2001 to be accessible to the waiting parent process are insufficient 
> to disambiguate a spawn error from any other kind of error that may be 
> returned by an arbitrary process image. No other bits of the exit status are 
> required to be visible in stat_val, so these macros could not be strictly 
> implemented at the library level. Reserving an exit status of 127 for such 
> spawn errors is consistent with the use of this value by system() and popen() 
> to signal failures in these operations that occur after the function has 
> returned but before a shell is able to execute. The exit status of 127 does 
> not uniquely identify this class of error, nor does it provide any detailed 
> information on the nature of the failure. Note that a kernel implementation 
> of posix_spawn() or posix_spawnp() is permitted (and encouraged) to return 
> any possible error as the function value, thus providing more detailed 
> failure information to the parent process.
> Thus, no special macros are available to isolate asynchronous posix_spawn() 
> or posix_spawnp() errors. Instead, errors detected by the posix_spawn() or 
> posix_spawnp() operations in the context of the child process before the new 
> process image executes are reported by setting the child's exit status to 
> 127. The calling process may use the WIFEXITED and WEXITSTATUS macros on the 
> stat_val stored by the wait() or waitpid() functions to detect spawn failures 
> to the extent that other status values with which the child process image may 
> exit (before the parent can conclusively determine that the child process 
> image has begun execution) are distinct from exit status 127.
> {quote}





[jira] [Updated] (MESOS-8170) Propagate exit status 127 for errors after fork

2017-11-03 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach updated MESOS-8170:
---
Description: 
There's no consistent methodology in libprocess or Mesos for propagating 
errors that happen between fork and exec. For 
[posix_spawn|http://pubs.opengroup.org/onlinepubs/009695399/functions/posix_spawn.html],
the POSIX standard designates an exit code of 127 to identify this case. We 
should adopt the same convention.

{quote}
The 8 bits of child process exit status that are guaranteed by IEEE Std 
1003.1-2001 to be accessible to the waiting parent process are insufficient to 
disambiguate a spawn error from any other kind of error that may be returned by 
an arbitrary process image. No other bits of the exit status are required to be 
visible in stat_val, so these macros could not be strictly implemented at the 
library level. Reserving an exit status of 127 for such spawn errors is 
consistent with the use of this value by system() and popen() to signal 
failures in these operations that occur after the function has returned but 
before a shell is able to execute. The exit status of 127 does not uniquely 
identify this class of error, nor does it provide any detailed information on 
the nature of the failure. Note that a kernel implementation of posix_spawn() 
or posix_spawnp() is permitted (and encouraged) to return any possible error as 
the function value, thus providing more detailed failure information to the 
parent process.

Thus, no special macros are available to isolate asynchronous posix_spawn() or 
posix_spawnp() errors. Instead, errors detected by the posix_spawn() or 
posix_spawnp() operations in the context of the child process before the new 
process image executes are reported by setting the child's exit status to 127. 
The calling process may use the WIFEXITED and WEXITSTATUS macros on the 
stat_val stored by the wait() or waitpid() functions to detect spawn failures 
to the extent that other status values with which the child process image may 
exit (before the parent can conclusively determine that the child process image 
has begun execution) are distinct from exit status 127.
{quote}



> Propagate exit status 127 for errors after fork
> ---
>
> Key: MESOS-8170
> URL: https://issues.apache.org/jira/browse/MESOS-8170
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: James Peach
>
> There's no consistent methodology in libprocess or Mesos for propagating 
> errors that happen between fork and exec. For 
> [posix_spawn|http://pubs.opengroup.org/onlinepubs/009695399/functions/posix_spawn.html],
> the POSIX standard designates an exit code of 127 to identify this case. We 
> should adopt the same convention.
> {quote}
> The 8 bits of child process exit status that are guaranteed by IEEE Std 
> 1003.1-2001 to be accessible to the waiting parent process are insufficient 
> to disambiguate a spawn error from any other kind of error that may be 
> returned by an arbitrary process image. No other bits of the exit status are 
> required to be visible in stat_val, so these macros could not be strictly 
> implemented at the library level. Reserving an exit status of 127 for such 
> spawn errors is consistent with the use of this value by system() and popen() 
> to signal failures in these operations that occur after the function has 
> returned but before a shell is able to execute. The exit status of 127 does 
> not uniquely identify this class of error, nor does it provide any detailed 
> information on the nature of the failure. Note that a kernel implementation 
> of posix_spawn() or posix_spawnp() is permitted (and encouraged) to return 
> any possible error as the function value, thus providing more detailed 
> failure information to the parent process.
> Thus, no special macros are available to isolate asynchronous posix_spawn() 
> or posix_spawnp() errors. Instead, errors detected by the posix_spawn() or 
> posix_spawnp() operations in the context of the child process before the new 
> process image executes are reported by setting the child's exit status to 
> 127. The calling process may use the WIFEXITED and WEXITSTATUS macros on the 
> stat_val stored by the wait() or waitpid() functions to detect spawn failures 
> to the extent that other status values with which the child process image may 
> exit (before the parent can conclusively determine that the child process 
> image has begun execution) are distinct from exit status 127.
> {quote}





[jira] [Created] (MESOS-8170) Propagate exit status 127 for errors after fork

2017-11-03 Thread James Peach (JIRA)
James Peach created MESOS-8170:
--

 Summary: Propagate exit status 127 for errors after fork
 Key: MESOS-8170
 URL: https://issues.apache.org/jira/browse/MESOS-8170
 Project: Mesos
  Issue Type: Bug
  Components: libprocess
Reporter: James Peach








[jira] [Updated] (MESOS-6214) Containerizers assume caller will call 'destroy' if 'launch' fails.

2017-11-03 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-6214:
--
Fix Version/s: (was: 1.2.0)

> Containerizers assume caller will call 'destroy' if 'launch' fails.
> ---
>
> Key: MESOS-6214
> URL: https://issues.apache.org/jira/browse/MESOS-6214
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Benjamin Mahler
>Assignee: Kevin Klues
>Priority: Major
>  Labels: tech-debt
>
> The planned API for nested containers is to allow launching, waiting (for 
> termination), and killing (currently only SIGKILL) of the nested container. 
> Note that this API provides no mechanism for "cleaning up" the container 
> because it will implicitly do so once the container terminates.
> However, the containerizer currently assumes that the caller will call 
> destroy if the launch fails. In order to implement the agent's API for 
> managing nested containers, we will have to set up a failure continuation to 
> call destroy to ensure the cleanup occurs correctly.
> Ideally, the API of the containerizer would not require the caller to call 
> destroy after a launch failure; given that the launch did not succeed, it 
> seems counter-intuitive for the responsibility of cleanup to be on the 
> caller. In addition, in the container termination case, the containerizer 
> will implicitly clean up (so this seems inconsistent as well).





[jira] [Commented] (MESOS-7069) The linux filesystem isolator should set mode and ownership for host volumes.

2017-11-03 Thread R.B. Boyer (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237741#comment-16237741
 ] 

R.B. Boyer commented on MESOS-7069:
---

Did 5187 make it into 1.2.2? If so, then it's still broken and this ticket
is relevant.




> The linux filesystem isolator should set mode and ownership for host volumes.
> -
>
> Key: MESOS-7069
> URL: https://issues.apache.org/jira/browse/MESOS-7069
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Ilya Pronin
>Priority: Major
>  Labels: filesystem, linux, volumes
>
> If the host path is a relative path, the linux filesystem isolator should set 
> the mode and ownership of this host volume so that a non-root user can write 
> to the volume. Note that this is the case of sharing the host filesystem 
> (without a rootfs).





[jira] [Assigned] (MESOS-8018) Allow framework to opt-in to forward executor's JWT token to the tasks

2017-11-03 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-8018:
--

Assignee: (was: James Peach)

> Allow framework to opt-in to forward executor's JWT token to the tasks
> --
>
> Key: MESOS-8018
> URL: https://issues.apache.org/jira/browse/MESOS-8018
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Zhitao Li
>Priority: Major
>
> The nested container API is an awesome feature and has enabled a lot of 
> interesting use cases. A pattern we have seen multiple times is that a task 
> (often the only one) launched by the default executor wants to further create 
> containers nested behind itself (or the executor) to run some different 
> workload.
> Because the entire request is 1) completely local to the executor container, 
> and 2) okay to be bounded within the executor's lifecycle, we'd like to allow 
> the task to use the Mesos agent API directly to create these nested 
> containers. However, this creates a problem when we want to enable HTTP 
> executor authentication, because the JWT auth tokens are only available to 
> the executor, so the task's API request will be rejected.
> Requiring the framework owner to fork or create a custom executor simply for 
> this purpose also seems a bit too heavy.
> My proposal is to allow the framework to opt in with some field so that the 
> launched task will receive certain environment variables from the default 
> executor, allowing the task to "act upon" the executor. One idea is to add a 
> new field that allows certain environment variables to be forwarded from the 
> executor to the task.





[jira] [Commented] (MESOS-8169) master validation incorrectly rejects slaves, buggy executorID checking

2017-11-03 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237738#comment-16237738
 ] 

James DeFelice commented on MESOS-8169:
---

/cc [~jamespeach]

> master validation incorrectly rejects slaves, buggy executorID checking
> ---
>
> Key: MESOS-8169
> URL: https://issues.apache.org/jira/browse/MESOS-8169
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.4.0
>Reporter: James DeFelice
>Priority: Major
>  Labels: mesosphere
>
> proposed fix: https://github.com/apache/mesos/pull/248
> I observed this in my environment, where I had two frameworks that used the 
> same ExecutorID and then triggered a master failover. The master refuses to 
> reregister the slave because it's not considering the owning-framework of the 
> ExecutorID when computing ExecutorID uniqueness, and concludes (incorrectly) 
> that there's an erroneous duplicate executor ID:
> {code}
> W1103 00:33:42.509891 19638 master.cpp:6008] Dropping re-registration of 
> agent at slave(1)@10.2.0.7:5051 because it sent an invalid re-registration: 
> Executor has a duplicate ExecutorID 'default'
> {code}
> (yes, "default" is probably a terrible name for an ExecutorID - that's a 
> separate discussion!)
> /cc [~neilc]





[jira] [Commented] (MESOS-8169) master validation incorrectly rejects slaves, buggy executorID checking

2017-11-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237734#comment-16237734
 ] 

ASF GitHub Bot commented on MESOS-8169:
---

Github user jdef commented on the issue:

https://github.com/apache/mesos/pull/248
  
https://issues.apache.org/jira/browse/MESOS-8169


> master validation incorrectly rejects slaves, buggy executorID checking
> ---
>
> Key: MESOS-8169
> URL: https://issues.apache.org/jira/browse/MESOS-8169
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.4.0
>Reporter: James DeFelice
>Priority: Major
>  Labels: mesosphere
>
> proposed fix: https://github.com/apache/mesos/pull/248
> I observed this in my environment, where I had two frameworks that used the 
> same ExecutorID and then triggered a master failover. The master refuses to 
> re-register the slave because it does not consider the owning framework of the 
> ExecutorID when computing ExecutorID uniqueness, and concludes (incorrectly) 
> that there is an erroneous duplicate executor ID:
> {code}
> W1103 00:33:42.509891 19638 master.cpp:6008] Dropping re-registration of 
> agent at slave(1)@10.2.0.7:5051 because it sent an invalid re-registration: 
> Executor has a duplicate ExecutorID 'default'
> {code}
> (yes, "default" is probably a terrible name for an ExecutorID - that's a 
> separate discussion!)
> /cc [~neilc]





[jira] [Created] (MESOS-8169) master validation incorrectly rejects slaves, buggy executorID checking

2017-11-03 Thread James DeFelice (JIRA)
James DeFelice created MESOS-8169:
-

 Summary: master validation incorrectly rejects slaves, buggy 
executorID checking
 Key: MESOS-8169
 URL: https://issues.apache.org/jira/browse/MESOS-8169
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.4.0
Reporter: James DeFelice
Priority: Major


proposed fix: https://github.com/apache/mesos/pull/248

I observed this in my environment, where I had two frameworks that used the 
same ExecutorID and then triggered a master failover. The master refuses to 
re-register the slave because it does not consider the owning framework of the 
ExecutorID when computing ExecutorID uniqueness, and concludes (incorrectly) 
that there is an erroneous duplicate executor ID:

{code}
W1103 00:33:42.509891 19638 master.cpp:6008] Dropping re-registration of agent 
at slave(1)@10.2.0.7:5051 because it sent an invalid re-registration: Executor 
has a duplicate ExecutorID 'default'
{code}

(yes, "default" is probably a terrible name for an ExecutorID - that's a 
separate discussion!)

/cc [~neilc]





[jira] [Commented] (MESOS-7069) The linux filesystem isolator should set mode and ownership for host volumes.

2017-11-03 Thread Julien Pepy (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237560#comment-16237560
 ] 

Julien Pepy commented on MESOS-7069:


Hi, what is the status of this ticket? The review has been stalled for six 
months, and it looks to me like MESOS-5187 has fixed the issue.

> The linux filesystem isolator should set mode and ownership for host volumes.
> -
>
> Key: MESOS-7069
> URL: https://issues.apache.org/jira/browse/MESOS-7069
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Ilya Pronin
>Priority: Major
>  Labels: filesystem, linux, volumes
>
> If the host path is a relative path, the linux filesystem isolator should set 
> the mode and ownership for this host volume so that a non-root user can 
> write to the volume. Note that this is the case of sharing the host 
> filesystem (without rootfs).





[jira] [Updated] (MESOS-8093) Some tests miss subscribed event because expectation is set after event fires.

2017-11-03 Thread Armand Grillet (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Armand Grillet updated MESOS-8093:
--
Description: 
Tests
{noformat}
CgroupsIsolatorTest.ROOT_CGROUPS_LimitSwap
DefaultExecutorCniTest.ROOT_VerifyContainerIP
DockerRuntimeIsolatorTest.ROOT_INTERNET_CURL_NestedSimpleCommand
DockerRuntimeIsolatorTest.ROOT_NestedDockerDefaultCmdLocalPuller
DockerRuntimeIsolatorTest.ROOT_NestedDockerDefaultEntryptLocalPuller
{noformat}
all have the same problem. They initiate a scheduler subscribe call in reaction 
to the {{connected}} event. However, the expectation for the {{subscribed}} event 
is created _afterwards_, which might lead to an uninteresting mock function call 
for {{subscribed}} followed by a failure to wait for {{subscribed}}; see the 
attached log excerpt for more details. The problematic code is here: 
https://github.com/apache/mesos/blob/1c51c98638bb9ea0e8ec6a3f284b33d6c1a4e8ef/src/tests/containerizer/runtime_isolator_tests.cpp#L593-L615

A possible solution is to wait for {{subscribed}} only, without {{connected}}, 
setting the expectation before a connection is attempted; see 
https://github.com/apache/mesos/blob/1c51c98638bb9ea0e8ec6a3f284b33d6c1a4e8ef/src/tests/default_executor_tests.cpp#L139-L159.

  was:
Tests
{noformat}
CgroupsIsolatorTest.ROOT_CGROUPS_LimitSwap
DefaultExecutorCniTest.ROOT_VerifyContainerIP
DockerRuntimeIsolatorTest.ROOT_INTERNET_CURL_NestedSimpleCommand
DockerRuntimeIsolatorTest.ROOT_NestedDockerDefaultCmdLocalPuller
DockerRuntimeIsolatorTest.ROOT_NestedDockerDefaultEntryptLocalPuller
{noformat}
all have the same problem. They initiate a scheduler subscribe call in reaction 
to {{connected}} event. However, an expectation for {{subscribed}} event is 
created _afterwards_, which might lead to an uninteresting mock function call 
for {{subscribed}} followed by a failure to wait for {{subscribed}}, see 
attached log excerpt for more details. Problematic code is here: 
https://github.com/apache/mesos/blob/1c51c98638bb9ea0e8ec6a3f284b33d6c1a4e8ef/src/tests/containerizer/runtime_isolator_tests.cpp#L593-L615

A possible solution is to await for {{subscribed}} only, without {{connected}}, 
setting un the expectation before a connection is attempted, see 
https://github.com/apache/mesos/blob/1c51c98638bb9ea0e8ec6a3f284b33d6c1a4e8ef/src/tests/default_executor_tests.cpp#L139-L159.


> Some tests miss subscribed event because expectation is set after event fires.
> --
>
> Key: MESOS-8093
> URL: https://issues.apache.org/jira/browse/MESOS-8093
> Project: Mesos
>  Issue Type: Bug
>  Components: scheduler driver, test
>Reporter: Alexander Rukletsov
>Assignee: Armand Grillet
>Priority: Major
>  Labels: flaky-test, mesosphere
> Attachments: ROOT_INTERNET_CURL_NestedSimpleCommand-badrun-excerpt.txt
>
>
> Tests
> {noformat}
> CgroupsIsolatorTest.ROOT_CGROUPS_LimitSwap
> DefaultExecutorCniTest.ROOT_VerifyContainerIP
> DockerRuntimeIsolatorTest.ROOT_INTERNET_CURL_NestedSimpleCommand
> DockerRuntimeIsolatorTest.ROOT_NestedDockerDefaultCmdLocalPuller
> DockerRuntimeIsolatorTest.ROOT_NestedDockerDefaultEntryptLocalPuller
> {noformat}
> all have the same problem. They initiate a scheduler subscribe call in 
> reaction to the {{connected}} event. However, the expectation for the 
> {{subscribed}} event is created _afterwards_, which might lead to an 
> uninteresting mock function call for {{subscribed}} followed by a failure to 
> wait for {{subscribed}}; see the attached log excerpt for more details. The 
> problematic code is here: 
> https://github.com/apache/mesos/blob/1c51c98638bb9ea0e8ec6a3f284b33d6c1a4e8ef/src/tests/containerizer/runtime_isolator_tests.cpp#L593-L615
> A possible solution is to wait for {{subscribed}} only, without 
> {{connected}}, setting the expectation before a connection is attempted, see 
> https://github.com/apache/mesos/blob/1c51c98638bb9ea0e8ec6a3f284b33d6c1a4e8ef/src/tests/default_executor_tests.cpp#L139-L159.


