[jira] [Updated] (MESOS-8080) The default executor does not propagate missing task exit status correctly.
[ https://issues.apache.org/jira/browse/MESOS-8080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach updated MESOS-8080: --- Fix Version/s: 1.4.1 > The default executor does not propagate missing task exit status correctly. > --- > > Key: MESOS-8080 > URL: https://issues.apache.org/jira/browse/MESOS-8080 > Project: Mesos > Issue Type: Bug > Components: executor >Reporter: James Peach >Assignee: James Peach >Priority: Major > Fix For: 1.2.3, 1.3.2, 1.4.1, 1.5.0 > > > The default executor is not handling a missing nested container > exit status correctly. It is assuming the protobuf accessor was > returning an Option rather than explicitly checking whether the > `exit_status` field was present in the message. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
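The presence check described in MESOS-8080 can be sketched as follows. This is a minimal illustration, not the actual patch: `WaitResult` and `describeExit` are hypothetical names, and the class is a hand-written stand-in for a generated proto2 message with an optional `exit_status` field. The point is that the value accessor silently returns the default when the field is unset, so the caller has to ask `has_exit_status()` first.

```cpp
#include <string>

// Hypothetical stand-in for a generated proto2 message with an
// `optional int32 exit_status` field. Generated classes expose the same
// pair of accessors: `has_exit_status()` and `exit_status()`.
class WaitResult
{
public:
  bool has_exit_status() const { return set_; }

  // Like a proto2 accessor, this returns the default (0) when unset.
  int exit_status() const { return set_ ? value_ : 0; }

  void set_exit_status(int value) { set_ = true; value_ = value; }

private:
  bool set_ = false;
  int value_ = 0;
};

// Describes the termination, distinguishing "no status reported" from a
// genuine exit status of 0.
std::string describeExit(const WaitResult& result)
{
  if (!result.has_exit_status()) {
    return "exit status unknown";
  }

  return "exited with status " + std::to_string(result.exit_status());
}
```

Without the explicit check, a missing status would be indistinguishable from a clean exit with status 0.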
[jira] [Commented] (MESOS-6816) Allows frameworks to overwrite system environment variables
[ https://issues.apache.org/jira/browse/MESOS-6816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238361#comment-16238361 ] Andrew Schwartzmeyer commented on MESOS-6816: - Fixed, need to post reviews. > Allows frameworks to overwrite system environment variables > --- > > Key: MESOS-6816 > URL: https://issues.apache.org/jira/browse/MESOS-6816 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Daniel Pravat >Assignee: Andrew Schwartzmeyer >Priority: Minor > Labels: microsoft, windows > Original Estimate: 24h > Remaining Estimate: 24h > > If the framework specifies an environment variable block, the Mesos agent > code overwrites its variables with those already present in the system > environment block. For example, even if the framework specifies a variable > named `ComSpec`, the value observed in the agent will be the one configured > for `ComSpec` in the system environment.
[jira] [Commented] (MESOS-5932) Replicated log's dependency on leveldb prevents it from being used on Windows
[ https://issues.apache.org/jira/browse/MESOS-5932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238357#comment-16238357 ] Andrew Schwartzmeyer commented on MESOS-5932: - These issues share the same blocker, and as noted on MESOS-5820, leveldb is now being officially ported to Windows. > Replicated log's dependency on leveldb prevents it from being used on Windows > - > > Key: MESOS-5932 > URL: https://issues.apache.org/jira/browse/MESOS-5932 > Project: Mesos > Issue Type: Task > Components: master >Reporter: Alex Clemmer >Assignee: Andrew Schwartzmeyer >Priority: Major > Labels: agent, master, mesosphere > > The replicated log (in src/log/) depends on leveldb to store and persist data > in the replicas. > This dependency is well-contained within replica.cpp, but until it is > abstracted out, it nonetheless prevents the master from being built on > Windows, which in turn prevents the agent tests from being built and run on > Windows. > Preliminary investigation shows that we will probably want to split this work > into 2 parts: > * Temporarily remove the ability of the master to use the replicated log on > Windows (in master/main.cpp). This should involve 1 conditional where we > instantiate a `Log::Log`. This should be enough for us to light up the agent > tests. > * Add leveldb Windows support to Mesos. This involves: adding CMake files to > build leveldb source, and adding Windows-specific `port_*` files that will > map the platform-specific constructs of leveldb to Windows. We can take hints > from leveldown and other projects, which add their own `port_*` files that > suit their purposes (namely, running leveldb, in node, on Windows). NOTE: the > leveldb community explicitly calls out in its documentation that it is not > interested in non-POSIX changes, so it is likely that this will never be > inducted into the mainline leveldb codebase. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (MESOS-6117) TCP health checks are not supported on Windows.
[ https://issues.apache.org/jira/browse/MESOS-6117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Schwartzmeyer reassigned MESOS-6117: --- Assignee: John Kordich (was: Andrew Schwartzmeyer) > TCP health checks are not supported on Windows. > --- > > Key: MESOS-6117 > URL: https://issues.apache.org/jira/browse/MESOS-6117 > Project: Mesos > Issue Type: Task > Components: executor >Reporter: Alexander Rukletsov >Assignee: John Kordich >Priority: Major > Labels: check, health-check, mesosphere, windows > > Currently, TCP health check is only available on Linux. Windows support > should be added to maintain feature parity. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (MESOS-6714) Port `slave_tests.cpp`
[ https://issues.apache.org/jira/browse/MESOS-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Schwartzmeyer reassigned MESOS-6714: --- Assignee: Andrew Schwartzmeyer (was: Li Li) > Port `slave_tests.cpp` > -- > > Key: MESOS-6714 > URL: https://issues.apache.org/jira/browse/MESOS-6714 > Project: Mesos > Issue Type: Task > Components: agent >Reporter: Alex Clemmer >Assignee: Andrew Schwartzmeyer >Priority: Major > Labels: microsoft, windows-mvp > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (MESOS-6709) Port `health_check_tests.cpp`
[ https://issues.apache.org/jira/browse/MESOS-6709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Schwartzmeyer reassigned MESOS-6709: --- Assignee: John Kordich (was: Andrew Schwartzmeyer) > Port `health_check_tests.cpp` > - > > Key: MESOS-6709 > URL: https://issues.apache.org/jira/browse/MESOS-6709 > Project: Mesos > Issue Type: Task > Components: agent >Reporter: Alex Clemmer >Assignee: John Kordich >Priority: Major > Labels: microsoft, windows, windows-mvp > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (MESOS-6712) Port `slave_authorization_tests.cpp`
[ https://issues.apache.org/jira/browse/MESOS-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Schwartzmeyer reassigned MESOS-6712: --- Assignee: John Kordich (was: Andrew Schwartzmeyer) > Port `slave_authorization_tests.cpp` > > > Key: MESOS-6712 > URL: https://issues.apache.org/jira/browse/MESOS-6712 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Alex Clemmer >Assignee: John Kordich >Priority: Major > Labels: microsoft, windows-mvp > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (MESOS-5932) Replicated log's dependency on leveldb prevents it from being used on Windows
[ https://issues.apache.org/jira/browse/MESOS-5932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Schwartzmeyer reassigned MESOS-5932: --- Assignee: Andrew Schwartzmeyer > Replicated log's dependency on leveldb prevents it from being used on Windows > - > > Key: MESOS-5932 > URL: https://issues.apache.org/jira/browse/MESOS-5932 > Project: Mesos > Issue Type: Task > Components: master >Reporter: Alex Clemmer >Assignee: Andrew Schwartzmeyer >Priority: Major > Labels: agent, master, mesosphere
[jira] [Commented] (MESOS-8098) Benchmark Master failover performance
[ https://issues.apache.org/jira/browse/MESOS-8098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238131#comment-16238131 ] Yan Xu commented on MESOS-8098: --- {noformat:title=} commit ac0fa281472c2ba891f7bd0837fbd728ace73039 Author: Jiang Yan Xu Date: Wed Oct 18 01:53:11 2017 -0700 Added a benchmark for agent reregistration during master failover. Review: https://reviews.apache.org/r/63174 {noformat} > Benchmark Master failover performance > - > > Key: MESOS-8098 > URL: https://issues.apache.org/jira/browse/MESOS-8098 > Project: Mesos > Issue Type: Task > Components: master >Reporter: Yan Xu >Assignee: Yan Xu >Priority: Major > Attachments: withoutperfpatches.perf.svg, withperfpatches.perf.svg > > > Master failover performance often sheds light on the master's performance in > general as it's often the time the master experiences the highest load. Ways > we can benchmark the failover include the time it takes for all agents to > reregister, all frameworks to resubscribe or fully reconcile.
[jira] [Updated] (MESOS-8098) Benchmark Master failover performance
[ https://issues.apache.org/jira/browse/MESOS-8098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Xu updated MESOS-8098: -- Attachment: withoutperfpatches.perf.svg withperfpatches.perf.svg Attaching two flame graphs comparing the benchmark running against the two versions below: withperfpatches.perf.svg: https://github.com/apache/mesos/commit/41193181d6b75eeecae2729bf98007d9318e351a (close to the HEAD when the benchmark was created). vs. withoutperfpatches.perf.svg: https://github.com/apache/mesos/commit/d9c90bf1d9c8b3a7dcc47be0cb773efff57cfb9d (before https://issues.apache.org/jira/browse/MESOS-7713 was merged) The perf data was captured with me invoking gdb-mesos-tests.sh -> setting two break points on the two {{cout}} lines (right before and after the bulk reregistration) -> run -> coordinate {{perf record}} with the break points so it only captures the process behavior in between. However I couldn't find much useful info from the resulting graphs. Perhaps someone can help me take a look? /cc [~bmahler] [~ipronin] [~dzhuk]? > Benchmark Master failover performance > - > > Key: MESOS-8098 > URL: https://issues.apache.org/jira/browse/MESOS-8098 > Project: Mesos > Issue Type: Task > Components: master >Reporter: Yan Xu >Assignee: Yan Xu >Priority: Major > Attachments: withoutperfpatches.perf.svg, withperfpatches.perf.svg > > > Master failover performance often sheds light on the master's performance in > general as it's often the time the master experiences the highest load. Ways > we can benchmark the failover include the time it takes for all agents to > reregister, all frameworks to resubscribe or fully reconcile. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-8080) The default executor does not propagate missing task exit status correctly.
[ https://issues.apache.org/jira/browse/MESOS-8080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach updated MESOS-8080: --- Fix Version/s: 1.3.2 > The default executor does not propagate missing task exit status correctly. > --- > > Key: MESOS-8080 > URL: https://issues.apache.org/jira/browse/MESOS-8080 > Project: Mesos > Issue Type: Bug > Components: executor >Reporter: James Peach >Assignee: James Peach >Priority: Major > Fix For: 1.2.3, 1.3.2, 1.5.0
[jira] [Commented] (MESOS-7506) Multiple tests leave orphan containers.
[ https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238045#comment-16238045 ] Alexander Rukletsov commented on MESOS-7506: There are at least two different paths that lead to orphaned containers. Andrei described one above. Another one is still to be fully investigated, but in short, an executor appears to have exited and {{containerizer->wait()}} is triggered on its container id, but the container is not removed from the containerizer's internal {{containers_}} collection. > Multiple tests leave orphan containers. > --- > > Key: MESOS-7506 > URL: https://issues.apache.org/jira/browse/MESOS-7506 > Project: Mesos > Issue Type: Bug > Components: containerization > Environment: Ubuntu 16.04 > Fedora 23 > other Linux distros >Reporter: Alexander Rukletsov >Assignee: Andrei Budnik >Priority: Major > Labels: containerizer, flaky-test, mesosphere > > I've observed a number of flaky tests that leave orphan containers upon > cleanup. A typical log looks like this: > {noformat} > ../../src/tests/cluster.cpp:580: Failure > Value of: containers->empty() > Actual: false > Expected: true > Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 } > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-8080) The default executor does not propagate missing task exit status correctly.
[ https://issues.apache.org/jira/browse/MESOS-8080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach updated MESOS-8080: --- Fix Version/s: 1.2.3 > The default executor does not propagate missing task exit status correctly. > --- > > Key: MESOS-8080 > URL: https://issues.apache.org/jira/browse/MESOS-8080 > Project: Mesos > Issue Type: Bug > Components: executor >Reporter: James Peach >Assignee: James Peach >Priority: Major > Fix For: 1.2.3, 1.5.0
[jira] [Updated] (MESOS-7378) Build failure with glibc 2.12.
[ https://issues.apache.org/jira/browse/MESOS-7378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach updated MESOS-7378: --- Fix Version/s: 1.2.3 > Build failure with glibc 2.12. > -- > > Key: MESOS-7378 > URL: https://issues.apache.org/jira/browse/MESOS-7378 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: 1.3.0 >Reporter: James Peach >Assignee: Neil Conway >Priority: Blocker > Fix For: 1.2.3, 1.3.0, 1.4.0 > > > {noformat} > 03:46:16 - ./.libs/libmesos.so: undefined reference to > `gnu_dev_minor(unsigned long long)' > 03:46:16 - ./.libs/libmesos.so: undefined reference to > `gnu_dev_major(unsigned long long)' > {noformat} > This is caused by the change in MESOS-7365. > Including {{}} directly works on modern systems, but on our > older version of glibc, the {{}} header does not contain C++ > decls. This means that the inline symbols get C++ name mangling applied and > they are not found at link time. > {noformat} > [vagrant@mesos ~]$ cat /etc/redhat-release > CentOS release 6.8 (Final) > [vagrant@mesos ~]$ rpm -qa | grep glibc > glibc-common-2.12-1.192.el6.x86_64 > glibc-devel-2.12-1.192.el6.x86_64 > glibc-2.12-1.192.el6.x86_64 > glibc-headers-2.12-1.192.el6.x86_64 > {noformat}
[jira] [Commented] (MESOS-8170) Propagate exit status 127 for errors after fork
[ https://issues.apache.org/jira/browse/MESOS-8170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237988#comment-16237988 ] James Peach commented on MESOS-8170: /cc [~benjaminhindman] [~jieyu] > Propagate exit status 127 for errors after fork > --- > > Key: MESOS-8170 > URL: https://issues.apache.org/jira/browse/MESOS-8170 > Project: Mesos > Issue Type: Bug > Components: libprocess >Reporter: James Peach > > There's no consistent methodology in libprocess or Mesos for propagating > errors that happen between fork and exec. For > [posix_spawn|http://pubs.opengroup.org/onlinepubs/009695399/functions/posix_spawn.html], > the POSIX standard designated an exit code of 127 to identify this case. We > should adopt the same convention. > {quote} > The 8 bits of child process exit status that are guaranteed by IEEE Std > 1003.1-2001 to be accessible to the waiting parent process are insufficient > to disambiguate a spawn error from any other kind of error that may be > returned by an arbitrary process image. No other bits of the exit status are > required to be visible in stat_val, so these macros could not be strictly > implemented at the library level. Reserving an exit status of 127 for such > spawn errors is consistent with the use of this value by system() and popen() > to signal failures in these operations that occur after the function has > returned but before a shell is able to execute. The exit status of 127 does > not uniquely identify this class of error, nor does it provide any detailed > information on the nature of the failure. Note that a kernel implementation > of posix_spawn() or posix_spawnp() is permitted (and encouraged) to return > any possible error as the function value, thus providing more detailed > failure information to the parent process. > Thus, no special macros are available to isolate asynchronous posix_spawn() > or posix_spawnp() errors. 
Instead, errors detected by the posix_spawn() or > posix_spawnp() operations in the context of the child process before the new > process image executes are reported by setting the child's exit status to > 127. The calling process may use the WIFEXITED and WEXITSTATUS macros on the > stat_val stored by the wait() or waitpid() functions to detect spawn failures > to the extent that other status values with which the child process image may > exit (before the parent can conclusively determine that the child process > image has begun execution) are distinct from exit status 127. > {quote} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
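The convention proposed in MESOS-8170 can be sketched with plain POSIX calls. This is an illustration under stated assumptions, not libprocess code: `spawnAndWait` and `SPAWN_FAILURE_STATUS` are hypothetical names. The child exits with 127 when `exec` fails, and the parent recovers that value with `WIFEXITED` and `WEXITSTATUS`, as the quoted rationale describes.

```cpp
#include <cassert>
#include <cerrno>

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Exit status POSIX reserves for failures between fork and exec.
// (The constant name is an assumption made for this sketch.)
constexpr int SPAWN_FAILURE_STATUS = 127;

// Forks, tries to exec `path` with no arguments, and returns the child's
// exit status. Returns -1 if fork or waitpid fails, or if the child was
// terminated by a signal.
int spawnAndWait(const char* path)
{
  assert(path != nullptr);

  pid_t pid = ::fork();
  if (pid == -1) {
    return -1;
  }

  if (pid == 0) {
    // Child: restrict ourselves to async-signal-safe calls.
    ::execl(path, path, static_cast<char*>(nullptr));

    // exec returns only on failure; report it with the reserved status.
    ::_exit(SPAWN_FAILURE_STATUS);
  }

  // Parent: wait for the child, retrying if interrupted by a signal.
  int status = 0;
  while (::waitpid(pid, &status, 0) == -1) {
    if (errno != EINTR) {
      return -1;
    }
  }

  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

As the quoted text notes, 127 does not uniquely identify a spawn failure: a program that starts successfully and then exits with 127 itself produces the same value in the parent.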
[jira] [Updated] (MESOS-8170) Propagate exit status 127 for errors after fork
[ https://issues.apache.org/jira/browse/MESOS-8170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach updated MESOS-8170: --- Description: There's no consistent methodology in libprocess or Mesos for propagating errors that happen between fork and exec. For [posix_spawn|http://pubs.opengroup.org/onlinepubs/009695399/functions/posix_spawn.html], the POSIX standard designated an exit code of 127 to identify this case. We should adopt the same convention. {quote} The 8 bits of child process exit status that are guaranteed by IEEE Std 1003.1-2001 to be accessible to the waiting parent process are insufficient to disambiguate a spawn error from any other kind of error that may be returned by an arbitrary process image. No other bits of the exit status are required to be visible in stat_val, so these macros could not be strictly implemented at the library level. Reserving an exit status of 127 for such spawn errors is consistent with the use of this value by system() and popen() to signal failures in these operations that occur after the function has returned but before a shell is able to execute. The exit status of 127 does not uniquely identify this class of error, nor does it provide any detailed information on the nature of the failure. Note that a kernel implementation of posix_spawn() or posix_spawnp() is permitted (and encouraged) to return any possible error as the function value, thus providing more detailed failure information to the parent process. Thus, no special macros are available to isolate asynchronous posix_spawn() or posix_spawnp() errors. Instead, errors detected by the posix_spawn() or posix_spawnp() operations in the context of the child process before the new process image executes are reported by setting the child's exit status to 127. The calling process may use the WIFEXITED and WEXITSTATUS macros on the stat_val stored by the wait() or waitpid() functions to detect spawn failures to the extent that other status values with which the child process image may exit (before the parent can conclusively determine that the child process image has begun execution) are distinct from exit status 127. {quote} > Propagate exit status 127 for errors after fork > --- > > Key: MESOS-8170 > URL: https://issues.apache.org/jira/browse/MESOS-8170 > Project: Mesos > Issue Type: Bug > Components: libprocess >Reporter: James Peach
[jira] [Created] (MESOS-8170) Propagate exit status 127 for errors after fork
James Peach created MESOS-8170: -- Summary: Propagate exit status 127 for errors after fork Key: MESOS-8170 URL: https://issues.apache.org/jira/browse/MESOS-8170 Project: Mesos Issue Type: Bug Components: libprocess Reporter: James Peach -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-6214) Containerizers assume caller will call 'destroy' if 'launch' fails.
[ https://issues.apache.org/jira/browse/MESOS-6214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-6214: -- Fix Version/s: (was: 1.2.0) > Containerizers assume caller will call 'destroy' if 'launch' fails. > --- > > Key: MESOS-6214 > URL: https://issues.apache.org/jira/browse/MESOS-6214 > Project: Mesos > Issue Type: Task > Components: containerization >Reporter: Benjamin Mahler >Assignee: Kevin Klues >Priority: Major > Labels: tech-debt > > The planned API for nested containers is to allow launching, waiting (for > termination), and killing (currently only SIGKILL) of the nested container. > Note that this API provides no mechanism for "cleaning up" the container > because it will implicitly do so once the container terminates. > However, the containerizer currently assumes that the caller will call > destroy if the launch fails. In order to implement the agent's API for > managing nested containers, we will have to set up a failure continuation to > call destroy to ensure the cleanup occurs correctly. > Ideally, the API of the containerizer does not require the caller to call > destroy after a launch failure, given that the launch did not succeed it > seems counter-intuitive for the responsibility of clean up to be on the > caller. In addition, in the container termination case, the containerizer > will implicitly clean up (so this seems inconsistent as well). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-7069) The linux filesystem isolator should set mode and ownership for host volumes.
[ https://issues.apache.org/jira/browse/MESOS-7069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237741#comment-16237741 ] R.B. Boyer commented on MESOS-7069: --- Did 5187 make it into 1.2.2? If so then it's still broken and this ticket is relevant. > The linux filesystem isolator should set mode and ownership for host volumes. > - > > Key: MESOS-7069 > URL: https://issues.apache.org/jira/browse/MESOS-7069 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Gilbert Song >Assignee: Ilya Pronin >Priority: Major > Labels: filesystem, linux, volumes > > If the host path is a relative path, the linux filesystem isolator should set > the mode and ownership for this host volume since it allows non-root users to > write to the volume. Note that this is the case of sharing the host > filesystem (without rootfs).
[jira] [Assigned] (MESOS-8018) Allow framework to opt-in to forward executor's JWT token to the tasks
[ https://issues.apache.org/jira/browse/MESOS-8018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach reassigned MESOS-8018: -- Assignee: (was: James Peach) > Allow framework to opt-in to forward executor's JWT token to the tasks > -- > > Key: MESOS-8018 > URL: https://issues.apache.org/jira/browse/MESOS-8018 > Project: Mesos > Issue Type: Improvement >Reporter: Zhitao Li >Priority: Major > > The nested container API is an awesome feature and has enabled a lot of > interesting use cases. A pattern we have seen multiple times is that a task > (often the only one) launched by the default executor wants to create further > containers nested behind itself (or the executor) to run some different > workload. > Because the entire request is 1) completely local to the executor container > and 2) okay to be bounded within the executor's lifecycle, we'd like to allow > the task to use the Mesos agent API directly to create these nested > containers. > However, this creates a problem when we want to enable HTTP executor > authentication, because the JWT auth tokens are only available to the > executor, so the task's API request will be rejected. > Requiring the framework owner to fork or create a custom executor simply for > this purpose also seems a bit too heavy. > My proposal is to allow a framework to opt in with some field so that the > launched task will receive certain environment variables from the default > executor, so the task can "act upon" the executor. One idea is to add a new > field to allow certain environment variables to be forwarded from the > executor to the task.
[jira] [Commented] (MESOS-8169) master validation incorrectly rejects slaves, buggy executorID checking
[ https://issues.apache.org/jira/browse/MESOS-8169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237738#comment-16237738 ] James DeFelice commented on MESOS-8169: --- /cc [~jamespeach] > master validation incorrectly rejects slaves, buggy executorID checking > --- > > Key: MESOS-8169 > URL: https://issues.apache.org/jira/browse/MESOS-8169 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.4.0 >Reporter: James DeFelice >Priority: Major > Labels: mesosphere > > proposed fix: https://github.com/apache/mesos/pull/248 > I observed this in my environment, where I had two frameworks that used the > same ExecutorID and then triggered a master failover. The master refuses to > reregister the slave because it's not considering the owning-framework of the > ExecutorID when computing ExecutorID uniqueness, and concludes (incorrectly) > that there's an erroneous duplicate executor ID: > {code} > W1103 00:33:42.509891 19638 master.cpp:6008] Dropping re-registration of > agent at slave(1)@10.2.0.7:5051 because it sent an invalid re-registration: > Executor has a duplicate ExecutorID 'default' > {code} > (yes, "default" is probably a terrible name for an ExecutorID - that's a > separate discussion!) > /cc [~neilc] -- This message was sent by Atlassian JIRA (v6.4.14#64029)
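The uniqueness rule described in MESOS-8169 can be sketched as below. This is only an illustration of the idea, not the code in the linked pull request: `ExecutorEntry` and `hasDuplicateExecutor` are hypothetical names, and plain strings stand in for the FrameworkID and ExecutorID types. The check treats an executor as a duplicate only when both the owning framework and the executor ID match.

```cpp
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified view of an executor reported by an agent.
struct ExecutorEntry
{
  std::string frameworkId;
  std::string executorId;
};

// Returns true if any (frameworkId, executorId) pair appears more than
// once. Two frameworks may reuse the same executor ID, e.g. 'default'.
bool hasDuplicateExecutor(const std::vector<ExecutorEntry>& executors)
{
  std::set<std::pair<std::string, std::string>> seen;

  for (const ExecutorEntry& executor : executors) {
    // insert() reports via `.second` whether the key was newly added.
    if (!seen.insert(
            std::make_pair(executor.frameworkId, executor.executorId))
           .second) {
      return true;
    }
  }

  return false;
}
```

Keying the set on the executor ID alone would reproduce the reported behavior: two frameworks that both use 'default' would be flagged as a duplicate.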
[jira] [Commented] (MESOS-8169) master validation incorrectly rejects slaves, buggy executorID checking
[ https://issues.apache.org/jira/browse/MESOS-8169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237734#comment-16237734 ] ASF GitHub Bot commented on MESOS-8169: --- Github user jdef commented on the issue: https://github.com/apache/mesos/pull/248 https://issues.apache.org/jira/browse/MESOS-8169 > master validation incorrectly rejects slaves, buggy executorID checking > --- > > Key: MESOS-8169 > URL: https://issues.apache.org/jira/browse/MESOS-8169 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.4.0 >Reporter: James DeFelice >Priority: Major > Labels: mesosphere
[jira] [Created] (MESOS-8169) master validation incorrectly rejects slaves, buggy executorID checking
James DeFelice created MESOS-8169: - Summary: master validation incorrectly rejects slaves, buggy executorID checking Key: MESOS-8169 URL: https://issues.apache.org/jira/browse/MESOS-8169 Project: Mesos Issue Type: Bug Affects Versions: 1.4.0 Reporter: James DeFelice Priority: Major proposed fix: https://github.com/apache/mesos/pull/248 I observed this in my environment, where I had two frameworks that used the same ExecutorID and then triggered a master failover. The master refuses to reregister the slave because it's not considering the owning-framework of the ExecutorID when computing ExecutorID uniqueness, and concludes (incorrectly) that there's an erroneous duplicate executor ID: {code} W1103 00:33:42.509891 19638 master.cpp:6008] Dropping re-registration of agent at slave(1)@10.2.0.7:5051 because it sent an invalid re-registration: Executor has a duplicate ExecutorID 'default' {code} (yes, "default" is probably a terrible name for an ExecutorID - that's a separate discussion!) /cc [~neilc] -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-7069) The linux filesystem isolator should set mode and ownership for host volumes.
[ https://issues.apache.org/jira/browse/MESOS-7069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237560#comment-16237560 ] Julien Pepy commented on MESOS-7069: Hi, what is the status of this ticket? The review has been stalled for 6 months, and it looks to me like MESOS-5187 has fixed the issue. > The linux filesystem isolator should set mode and ownership for host volumes. > - > > Key: MESOS-7069 > URL: https://issues.apache.org/jira/browse/MESOS-7069 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Gilbert Song >Assignee: Ilya Pronin >Priority: Major > Labels: filesystem, linux, volumes > > If the host path is a relative path, the linux filesystem isolator should set > the mode and ownership for this host volume since it allows non-root users to > write to the volume. Note that this is the case of sharing the host > filesystem (without rootfs). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-8093) Some tests miss subscribed event because expectation is set after event fires.
[ https://issues.apache.org/jira/browse/MESOS-8093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Armand Grillet updated MESOS-8093: -- Description: Tests {noformat} CgroupsIsolatorTest.ROOT_CGROUPS_LimitSwap DefaultExecutorCniTest.ROOT_VerifyContainerIP DockerRuntimeIsolatorTest.ROOT_INTERNET_CURL_NestedSimpleCommand DockerRuntimeIsolatorTest.ROOT_NestedDockerDefaultCmdLocalPuller DockerRuntimeIsolatorTest.ROOT_NestedDockerDefaultEntryptLocalPuller {noformat} all have the same problem. They initiate a scheduler subscribe call in reaction to {{connected}} event. However, an expectation for {{subscribed}} event is created _afterwards_, which might lead to an uninteresting mock function call for {{subscribed}} followed by a failure to wait for {{subscribed}}, see attached log excerpt for more details. Problematic code is here: https://github.com/apache/mesos/blob/1c51c98638bb9ea0e8ec6a3f284b33d6c1a4e8ef/src/tests/containerizer/runtime_isolator_tests.cpp#L593-L615 A possible solution is to await for {{subscribed}} only, without {{connected}}, setting the expectation before a connection is attempted, see https://github.com/apache/mesos/blob/1c51c98638bb9ea0e8ec6a3f284b33d6c1a4e8ef/src/tests/default_executor_tests.cpp#L139-L159. was: Tests {noformat} CgroupsIsolatorTest.ROOT_CGROUPS_LimitSwap DefaultExecutorCniTest.ROOT_VerifyContainerIP DockerRuntimeIsolatorTest.ROOT_INTERNET_CURL_NestedSimpleCommand DockerRuntimeIsolatorTest.ROOT_NestedDockerDefaultCmdLocalPuller DockerRuntimeIsolatorTest.ROOT_NestedDockerDefaultEntryptLocalPuller {noformat} all have the same problem. They initiate a scheduler subscribe call in reaction to {{connected}} event. However, an expectation for {{subscribed}} event is created _afterwards_, which might lead to an uninteresting mock function call for {{subscribed}} followed by a failure to wait for {{subscribed}}, see attached log excerpt for more details. 
Problematic code is here: https://github.com/apache/mesos/blob/1c51c98638bb9ea0e8ec6a3f284b33d6c1a4e8ef/src/tests/containerizer/runtime_isolator_tests.cpp#L593-L615 A possible solution is to await for {{subscribed}} only, without {{connected}}, setting un the expectation before a connection is attempted, see https://github.com/apache/mesos/blob/1c51c98638bb9ea0e8ec6a3f284b33d6c1a4e8ef/src/tests/default_executor_tests.cpp#L139-L159. > Some tests miss subscribed event because expectation is set after event fires. > -- > > Key: MESOS-8093 > URL: https://issues.apache.org/jira/browse/MESOS-8093 > Project: Mesos > Issue Type: Bug > Components: scheduler driver, test >Reporter: Alexander Rukletsov >Assignee: Armand Grillet >Priority: Major > Labels: flaky-test, mesosphere > Attachments: ROOT_INTERNET_CURL_NestedSimpleCommand-badrun-excerpt.txt > > > Tests > {noformat} > CgroupsIsolatorTest.ROOT_CGROUPS_LimitSwap > DefaultExecutorCniTest.ROOT_VerifyContainerIP > DockerRuntimeIsolatorTest.ROOT_INTERNET_CURL_NestedSimpleCommand > DockerRuntimeIsolatorTest.ROOT_NestedDockerDefaultCmdLocalPuller > DockerRuntimeIsolatorTest.ROOT_NestedDockerDefaultEntryptLocalPuller > {noformat} > all have the same problem. They initiate a scheduler subscribe call in > reaction to {{connected}} event. However, an expectation for {{subscribed}} > event is created _afterwards_, which might lead to an uninteresting mock > function call for {{subscribed}} followed by a failure to wait for > {{subscribed}}, see attached log excerpt for more details. Problematic code > is here: > https://github.com/apache/mesos/blob/1c51c98638bb9ea0e8ec6a3f284b33d6c1a4e8ef/src/tests/containerizer/runtime_isolator_tests.cpp#L593-L615 > A possible solution is to await for {{subscribed}} only, without > {{connected}}, setting the expectation before a connection is attempted, see > https://github.com/apache/mesos/blob/1c51c98638bb9ea0e8ec6a3f284b33d6c1a4e8ef/src/tests/default_executor_tests.cpp#L139-L159. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029)
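The ordering problem described above can be reproduced in miniature without gmock. The sketch below is a deterministic, single-threaded model, not the Mesos test code: `MiniMock`, `racyOrdering`, and `safeOrdering` are hypothetical names, and the `subscribed` event is assumed to arrive immediately after the subscribe call, which is the worst case for the flaky tests.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical miniature mock: an event reaches a handler only if an
// expectation was registered before the event fired. Otherwise it is
// recorded as an "uninteresting" call and dropped.
class MiniMock {
public:
  void expect(const std::string& event, std::function<void()> action) {
    expectations_[event] = std::move(action);
  }

  void fire(const std::string& event) {
    auto it = expectations_.find(event);
    if (it == expectations_.end()) {
      uninteresting_.push_back(event);
      return;
    }
    it->second();
  }

  const std::vector<std::string>& uninteresting() const {
    return uninteresting_;
  }

private:
  std::map<std::string, std::function<void()>> expectations_;
  std::vector<std::string> uninteresting_;
};

// Racy ordering: subscribe in reaction to `connected`, and only afterwards
// register the expectation for `subscribed`. The event is already gone.
bool racyOrdering() {
  MiniMock mock;
  bool subscribed = false;
  mock.expect("connected", [&] { mock.fire("subscribed"); });
  mock.fire("connected");
  mock.expect("subscribed", [&] { subscribed = true; });  // too late
  return subscribed;
}

// Safe ordering: the `subscribed` expectation exists before any connection
// is attempted, so the event cannot be missed.
bool safeOrdering() {
  MiniMock mock;
  bool subscribed = false;
  mock.expect("subscribed", [&] { subscribed = true; });
  mock.expect("connected", [&] { mock.fire("subscribed"); });
  mock.fire("connected");
  return subscribed;
}
```

In this model `racyOrdering()` never observes `subscribed`, which corresponds to the test timing out while waiting, whereas `safeOrdering()` does. In the real tests the race is timing-dependent, which is why the failures are intermittent rather than constant.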