[jira] [Created] (MESOS-6344) Allow `network/cni` isolator to take a search path for CNI plugins instead of single directory
Avinash Sridharan created MESOS-6344: Summary: Allow `network/cni` isolator to take a search path for CNI plugins instead of single directory Key: MESOS-6344 URL: https://issues.apache.org/jira/browse/MESOS-6344 Project: Mesos Issue Type: Task Components: containerization Reporter: Avinash Sridharan Assignee: Avinash Sridharan Currently the `network/cni` isolator expects a single directory to be specified with the `--network_cni_plugins_dir` flag. This is limiting because it forces the operator to put all the CNI plugins in the same directory. With the Mesos port-mapper CNI plugin this would also imply that the operator would have to move this plugin from the Mesos installation directory to the directory specified in `--network_cni_plugins_dir`. To simplify the operator's experience it would make sense for the `--network_cni_plugins_dir` flag to take a set of directories instead of a single directory. The `network/cni` isolator can then search this set of directories to find the CNI plugin. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
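The proposed lookup could work like the sketch below: treat the flag value as a colon-separated search path and return the first executable match. This is a minimal Python sketch; the function name, the colon-separated format, and the executability check are assumptions for illustration, not the actual `network/cni` isolator code.

```python
import os

def find_cni_plugin(plugin_name, plugins_search_path):
    """Search a colon-separated list of directories for a CNI plugin
    binary and return the first executable match, or None."""
    for directory in plugins_search_path.split(":"):
        candidate = os.path.join(directory, plugin_name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None
```

With such a search path, the Mesos port-mapper plugin could stay under the Mesos installation directory while third-party plugins live elsewhere.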
[jira] [Created] (MESOS-6343) Documentation Error: Default Executor does not implicitly construct resources
Joris Van Remoortere created MESOS-6343: --- Summary: Documentation Error: Default Executor does not implicitly construct resources Key: MESOS-6343 URL: https://issues.apache.org/jira/browse/MESOS-6343 Project: Mesos Issue Type: Documentation Reporter: Joris Van Remoortere Priority: Blocker https://github.com/apache/mesos/blob/d16f53d5a9e15d1d9533739a8c052bc546ec3262/include/mesos/v1/mesos.proto#L544-L546 This probably got carried forward from early design discussions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6342) Not able to access TaskInfo's Data field from Tasks launched by CmdExecutor
Nima Vaziri created MESOS-6342: -- Summary: Not able to access TaskInfo's Data field from Tasks launched by CmdExecutor Key: MESOS-6342 URL: https://issues.apache.org/jira/browse/MESOS-6342 Project: Mesos Issue Type: Bug Reporter: Nima Vaziri There is some config data being put in a TaskInfo's Data field on the Scheduler's side. This data is of arbitrary size (on the order of hundreds of KB), so it might be possible to dump it into a file on the executor's side when it is large. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
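The workaround suggested above could look roughly like this: spill the blob to a sandbox file when it exceeds a size threshold, and hand the task a path instead. A hypothetical helper; the name, threshold, and file name are illustrative, not part of Mesos.

```python
import os

def materialize_task_data(data, sandbox_dir, threshold=64 * 1024):
    """Return small blobs inline; spill large ones to a file in the
    sandbox and return the file path instead."""
    if len(data) <= threshold:
        return data
    path = os.path.join(sandbox_dir, "task_data.bin")
    with open(path, "wb") as f:
        f.write(data)
    return path
```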
[jira] [Commented] (MESOS-5879) cgroups/net_cls isolator causing agent recovery issues
[ https://issues.apache.org/jira/browse/MESOS-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15556798#comment-15556798 ] Avinash Sridharan commented on MESOS-5879: -- @hasodent I am assuming we can close this once we fix MESOS-6035? > cgroups/net_cls isolator causing agent recovery issues > -- > > Key: MESOS-5879 > URL: https://issues.apache.org/jira/browse/MESOS-5879 > Project: Mesos > Issue Type: Bug > Components: cgroups, isolation, slave >Reporter: Silas Snider >Assignee: Avinash Sridharan > Labels: mesosphere > > We run with 'cgroups/net_cls' in our isolator list, and when we restart any > agent process in a cluster running an experimental custom isolator as well, > the agents are unable to recover from checkpoint, because net_cls reports > that unknown orphan containers have duplicate net_cls handles. > While this is a problem that needs to be solved (probably by fixing our > custom isolator), it's also a problem that the net_cls isolator fails > recovery just for duplicate handles in cgroups that it is literally about to > unconditionally destroy during recovery. Can this be fixed? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6323) 'mesos-containerizer launch' should inherit agent environment variables.
[ https://issues.apache.org/jira/browse/MESOS-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-6323: -- Target Version/s: 1.1.0 > 'mesos-containerizer launch' should inherit agent environment variables. > > > Key: MESOS-6323 > URL: https://issues.apache.org/jira/browse/MESOS-6323 > Project: Mesos > Issue Type: Bug >Reporter: Jie Yu >Priority: Critical > > If some dynamic libraries that the agent depends on are stored in a > non-standard location and the operator starts the agent using > LD_LIBRARY_PATH, then when we fork/exec the 'mesos-containerizer launch' > helper we need to make sure it inherits the agent's environment variables. > Otherwise, it might throw linking errors. This makes sense because it is a > Mesos-controlled process. > However, when the helper actually fork/execs the user container (or > executor), we need to make sure to strip the agent environment variables. > The tricky case is the default executor and command executor. These two are > controlled by Mesos as well, so we also want them to have the agent > environment variables. We need to somehow distinguish this from the custom > executor case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
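The distinction described above can be sketched as follows. This is illustrative Python, not the actual 'mesos-containerizer launch' logic; which variables count as "agent" environment, and the function name, are assumptions.

```python
def launch_environment(agent_env, declared_env, mesos_controlled):
    """Mesos-controlled processes (the launch helper, default/command
    executors) inherit the agent's environment, so variables such as
    LD_LIBRARY_PATH keep resolving shared libraries; user containers
    and custom executors see only their declared environment."""
    if mesos_controlled:
        env = dict(agent_env)
        env.update(declared_env)  # explicit settings win over inherited ones
        return env
    return dict(declared_env)
```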
[jira] [Commented] (MESOS-6106) Validate the host ports which container wants to expose to are from the resources assigned to it
[ https://issues.apache.org/jira/browse/MESOS-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15556791#comment-15556791 ] Avinash Sridharan commented on MESOS-6106: -- [~qianzhang] had a discussion with [~jieyu] and he wanted to land the port-mapper CNI plugin in 1.1.0, which is probably a week away. Wanted to check if we can get this done in that time frame. Going ahead and marking the Target version as 1.1.0 for the time being so that it shows up on the dashboard. > Validate the host ports which container wants to expose to are from the > resources assigned to it > > > Key: MESOS-6106 > URL: https://issues.apache.org/jira/browse/MESOS-6106 > Project: Mesos > Issue Type: Task > Components: isolation >Reporter: Qian Zhang >Assignee: Qian Zhang > > In CNI isolator, we need to validate the host ports which container wants to > expose to ({{NetworkInfo.PortMapping.host_port}}) are from the resources > assigned to it (i.e., from the resource offer used by framework to launch > container), so that we can ensure container will not expose to an arbitrary > host port. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6106) Validate the host ports which container wants to expose to are from the resources assigned to it
[ https://issues.apache.org/jira/browse/MESOS-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Avinash Sridharan updated MESOS-6106: - Target Version/s: 1.1.0 > Validate the host ports which container wants to expose to are from the > resources assigned to it > > > Key: MESOS-6106 > URL: https://issues.apache.org/jira/browse/MESOS-6106 > Project: Mesos > Issue Type: Task > Components: isolation >Reporter: Qian Zhang >Assignee: Qian Zhang > > In CNI isolator, we need to validate the host ports which container wants to > expose to ({{NetworkInfo.PortMapping.host_port}}) are from the resources > assigned to it (i.e., from the resource offer used by framework to launch > container), so that we can ensure container will not expose to an arbitrary > host port. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6017) Introduce `PortMapping` protobuf.
[ https://issues.apache.org/jira/browse/MESOS-6017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Avinash Sridharan updated MESOS-6017: - Target Version/s: 1.1.0 > Introduce `PortMapping` protobuf. > - > > Key: MESOS-6017 > URL: https://issues.apache.org/jira/browse/MESOS-6017 > Project: Mesos > Issue Type: Task > Components: containerization > Environment: Linux >Reporter: Avinash Sridharan >Assignee: Avinash Sridharan > Labels: mesosphere > Fix For: 1.1.0 > > > Currently we have a `PortMapping` message defined for `DockerInfo`. This can > be used only by the `DockerContainerizer`. We need to introduce a new > Protobuf message in `NetworkInfo` which will allow frameworks to specify port > mapping when using CNI with the `MesosContainerizer`. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6022) unit-test for the port mapper plugin
[ https://issues.apache.org/jira/browse/MESOS-6022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Avinash Sridharan updated MESOS-6022: - Target Version/s: 1.1.0 > unit-test for the port mapper plugin > > > Key: MESOS-6022 > URL: https://issues.apache.org/jira/browse/MESOS-6022 > Project: Mesos > Issue Type: Task > Components: containerization > Environment: Linux >Reporter: Avinash Sridharan >Assignee: Avinash Sridharan > Labels: mesosphere > > Write unit-tests for the port mapper plugin. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6023) Create a binary for the port-mapper plugin
[ https://issues.apache.org/jira/browse/MESOS-6023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Avinash Sridharan updated MESOS-6023: - Target Version/s: 1.1.0 Fix Version/s: 1.1.0 > Create a binary for the port-mapper plugin > -- > > Key: MESOS-6023 > URL: https://issues.apache.org/jira/browse/MESOS-6023 > Project: Mesos > Issue Type: Task > Components: containerization > Environment: Linux >Reporter: Avinash Sridharan >Assignee: Avinash Sridharan > Fix For: 1.1.0 > > > The CNI port mapper plugin needs to be a separate binary that will be invoked > by the `network/cni` isolator as a CNI plugin. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6040) Add a CMake build for `mesos-port-mapper`
[ https://issues.apache.org/jira/browse/MESOS-6040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Avinash Sridharan updated MESOS-6040: - Target Version/s: 1.1.0 > Add a CMake build for `mesos-port-mapper` > - > > Key: MESOS-6040 > URL: https://issues.apache.org/jira/browse/MESOS-6040 > Project: Mesos > Issue Type: Task > Components: containerization >Reporter: Avinash Sridharan >Assignee: Avinash Sridharan > Labels: mesosphere > > Once the port-mapper binary compiles with GNU make, we need to modify the > CMake to build the port-mapper binary as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6282) CNI isolator should print plugin's stderr
[ https://issues.apache.org/jira/browse/MESOS-6282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Avinash Sridharan updated MESOS-6282: - Target Version/s: 1.1.0 > CNI isolator should print plugin's stderr > - > > Key: MESOS-6282 > URL: https://issues.apache.org/jira/browse/MESOS-6282 > Project: Mesos > Issue Type: Improvement > Components: containerization, isolation, network >Reporter: Dan Osborne >Assignee: Avinash Sridharan > > It's quite difficult for both Operators and CNI plugin developers to diagnose > CNI plugin errors in production or in test when the only error information > available is the stdout error string returned by the plugin (assuming it > even managed to print correctly formatted text to stdout). > Many CNI plugins print logging information to stderr, [as per the CNI > spec|https://github.com/containernetworking/cni/blob/master/SPEC.md#result]: > bq. In addition, stderr can be used for unstructured output such as logs. > Therefore, I propose the Mesos CNI Isolator capture the CNI plugin's stderr > output and log it to the Agent Logs, for easier diagnosis. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6337) Nested containers getting killed before network isolation can be applied to them.
[ https://issues.apache.org/jira/browse/MESOS-6337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15556678#comment-15556678 ] Avinash Sridharan commented on MESOS-6337: -- I looked into this issue, and it turns out it's a duplicate of https://issues.apache.org/jira/browse/MESOS-6323 Looking at the stderr of the failed nested containers, I saw the following error message: mesos-containerizer: error while loading shared libraries: libssl.so.1.0.0: cannot open shared object file: No such file or directory So it's a problem of the containers not inheriting the right environment variables. > Nested containers getting killed before network isolation can be applied to > them. > - > > Key: MESOS-6337 > URL: https://issues.apache.org/jira/browse/MESOS-6337 > Project: Mesos > Issue Type: Bug > Components: containerization > Environment: Linux >Reporter: Avinash Sridharan >Assignee: Gilbert Song > Labels: mesosphere > > Seeing this odd behavior in one of our clusters: > ``` > http.cpp:1948] Failed to launch nested container > cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e: > Collect failed: Failed to seed container > cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e: > Collect failed: Failed to setup hostname and network files: Failed to enter > the mount namespace of pid 21591: Pid 21591 does not exist > Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.894485 > 31531 containerizer.cpp:1931] Destroying container > cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e in > ISOLATING state > Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.894439 > 31531 containerizer.cpp:2300] Container > cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e has > exited > Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.854456 > 31534 systemd.cpp:96] Assigned child process '21591' to > 'mesos_executors.slice' > Oct 07 02:05:55 
ip-10-10-0-207 mesos-agent[31520]: W1007 02:05:55.831861 > 21580 process.cpp:882] Failed SSL connections will be downgraded to a non-SSL > socket > Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: NOTE: Set > LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate verification > Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.831526 > 21580 openssl.cpp:432] Will only verify peer certificate if presented! > Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: NOTE: Set > LIBPROCESS_SSL_VERIFY_CERT=1 to enable peer certificate verification > Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.831521 > 21580 openssl.cpp:426] Will not verify peer certificate! > Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.831511 > 21580 openssl.cpp:421] CA directory path unspecified! NOTE: Set CA directory > path with LIBPROCESS_SSL_CA_DIR= > Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: W1007 02:05:55.831405 > 21580 openssl.cpp:399] Failed SSL connections will be downgraded to a non-SSL > socket > Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: WARNING: Logging before > InitGoogleLogging() is written to STDERR > Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: W1007 02:05:55.828413 > 21581 process.cpp:882] Failed SSL connections will be downgraded to a non-SSL > socket > Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: NOTE: Set > LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate verification > ``` > The above log is "reverse" chronological order, so please read it bottom up. 
> The relevant log is: > ``` > http.cpp:1948] Failed to launch nested container > cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e: > Collect failed: Failed to seed container > cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e: > Collect failed: Failed to setup hostname and network files: Failed to enter > the mount namespace of pid 21591: Pid 21591 does not exist > ``` > Looks like the nested container failed to launch because the `isolate` call > to the `network/cni` isolator failed. Seems like when the isolator received > the `isolate` call the PID for the nested container has already exited and it > couldn't enter its mount namespace to setup the network files. > The odd thing here is that the nested container would have been frozen, and > hence was not running, so not sure what killed the nested container. My > suspicion falls on systemd, since I also see this log message: > ``` > Oct 07 18:02:31 ip-10-10-0-207 mesos-agent[31520]: I1007 18:02:31.473656 > 31532 systemd.cpp:96] Assigned child process '1596' to 'mesos_executors.slice' > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6337) Nested containers getting killed before network isolation can be applied to them.
[ https://issues.apache.org/jira/browse/MESOS-6337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-6337: -- Fix Version/s: (was: 1.1.0) > Nested containers getting killed before network isolation can be applied to > them. > - > > Key: MESOS-6337 > URL: https://issues.apache.org/jira/browse/MESOS-6337 > Project: Mesos > Issue Type: Bug > Components: containerization > Environment: Linux >Reporter: Avinash Sridharan >Assignee: Gilbert Song > Labels: mesosphere > > Seeing this odd behavior in one of our clusters: > ``` > http.cpp:1948] Failed to launch nested container > cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e: > Collect failed: Failed to seed container > cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e: > Collect failed: Failed to setup hostname and network files: Failed to enter > the mount namespace of pid 21591: Pid 21591 does not exist > Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.894485 > 31531 containerizer.cpp:1931] Destroying container > cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e in > ISOLATING state > Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.894439 > 31531 containerizer.cpp:2300] Container > cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e has > exited > Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.854456 > 31534 systemd.cpp:96] Assigned child process '21591' to > 'mesos_executors.slice' > Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: W1007 02:05:55.831861 > 21580 process.cpp:882] Failed SSL connections will be downgraded to a non-SSL > socket > Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: NOTE: Set > LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate verification > Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.831526 > 21580 openssl.cpp:432] Will only verify peer certificate if presented! 
> Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: NOTE: Set > LIBPROCESS_SSL_VERIFY_CERT=1 to enable peer certificate verification > Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.831521 > 21580 openssl.cpp:426] Will not verify peer certificate! > Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.831511 > 21580 openssl.cpp:421] CA directory path unspecified! NOTE: Set CA directory > path with LIBPROCESS_SSL_CA_DIR= > Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: W1007 02:05:55.831405 > 21580 openssl.cpp:399] Failed SSL connections will be downgraded to a non-SSL > socket > Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: WARNING: Logging before > InitGoogleLogging() is written to STDERR > Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: W1007 02:05:55.828413 > 21581 process.cpp:882] Failed SSL connections will be downgraded to a non-SSL > socket > Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: NOTE: Set > LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate verification > ``` > The above log is "reverse" chronological order, so please read it bottom up. > The relevant log is: > ``` > http.cpp:1948] Failed to launch nested container > cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e: > Collect failed: Failed to seed container > cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e: > Collect failed: Failed to setup hostname and network files: Failed to enter > the mount namespace of pid 21591: Pid 21591 does not exist > ``` > Looks like the nested container failed to launch because the `isolate` call > to the `network/cni` isolator failed. Seems like when the isolator received > the `isolate` call the PID for the nested container has already exited and it > couldn't enter its mount namespace to setup the network files. > The odd thing here is that the nested container would have been frozen, and > hence was not running, so not sure what killed the nested container. 
My > suspicion falls on systemd, since I also see this log message: > ``` > Oct 07 18:02:31 ip-10-10-0-207 mesos-agent[31520]: I1007 18:02:31.473656 > 31532 systemd.cpp:96] Assigned child process '1596' to 'mesos_executors.slice' > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6118) Agent would crash with docker container tasks due to host mount table read.
[ https://issues.apache.org/jira/browse/MESOS-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-6118: -- Shepherd: Jie Yu > Agent would crash with docker container tasks due to host mount table read. > --- > > Key: MESOS-6118 > URL: https://issues.apache.org/jira/browse/MESOS-6118 > Project: Mesos > Issue Type: Bug > Components: slave >Affects Versions: 1.0.1 > Environment: Build: 2016-08-26 23:06:27 by centos > Version: 1.0.1 > Git tag: 1.0.1 > Git SHA: 3611eb0b7eea8d144e9b2e840e0ba16f2f659ee3 > systemd version `219` detected > Inializing systemd state > Created systemd slice: `/run/systemd/system/mesos_executors.slice` > Started systemd slice `mesos_executors.slice` > Using isolation: posix/cpu,posix/mem,filesystem/posix,network/cni > Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher > Linux ip-10-254-192-40 3.10.0-327.28.3.el7.x86_64 #1 SMP Thu Aug 18 19:05:49 > UTC 2016 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Jamie Briant >Assignee: Kevin Klues >Priority: Blocker > Labels: linux, slave > Attachments: crashlogfull.log, cycle2.log, cycle3.log, cycle5.log, > cycle6.log, slave-crash.log > > > I have a framework which schedules thousands of short running (a few seconds > to a few minutes) of tasks, over a period of several minutes. In 1.0.1, the > slave process will crash every few minutes (with systemd restarting it). > Crash is: > Sep 01 20:52:23 ip-10-254-192-99 mesos-slave: F0901 20:52:23.905678 1232 > fs.cpp:140] Check failed: !visitedParents.contains(parentId) > Sep 01 20:52:23 ip-10-254-192-99 mesos-slave: *** Check failure stack trace: > *** > Version 1.0.0 works without this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6308) CHECK failure in DRF sorter.
[ https://issues.apache.org/jira/browse/MESOS-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15556594#comment-15556594 ] Guangya Liu commented on MESOS-6308: Thanks [~bbannier] , I reproduced this issue again after running almost 1 hour and found it failed as follows when adding metrics: {code} F1007 18:22:39.125012 255385600 sorter.cpp:458] Check failed: contains(name) *** Check failure stack trace: *** @0x108b7afda google::LogMessage::Fail() @0x108b79f67 google::LogMessage::SendToLog() @0x108b7ac8a google::LogMessage::Flush() @0x108b81af8 google::LogMessageFatal::~LogMessageFatal() @0x108b7b415 google::LogMessageFatal::~LogMessageFatal() @0x106bcd4d5 mesos::internal::master::allocator::DRFSorter::calculateShare() @0x106bc710e mesos::internal::master::allocator::Metrics::add()::$_0::operator()() @0x106bca6e2 _ZZN7process8internal8DispatchIdEclIRKZN5mesos8internal6master9allocator7Metrics3addERKNSt3__112basic_stringIcNS9_11char_traitsIcEENS9_9allocatorIcE3$_0EENS_6FutureIdEERKNS_4UPIDEOT_ENKUlPNS_11ProcessBaseEE_clEST_ @0x106bca6a0 _ZNSt3__128__invoke_void_return_wrapperIvE6__callIJRZN7process8internal8DispatchIdEclIRKZN5mesos8internal6master9allocator7Metrics3addERKNS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcE3$_0EENS3_6FutureIdEERKNS3_4UPIDEOT_EUlPNS3_11ProcessBaseEE_SW_EEEvDpOT_ @0x106bca34c _ZNSt3__110__function6__funcIZN7process8internal8DispatchIdEclIRKZN5mesos8internal6master9allocator7Metrics3addERKNS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcE3$_0EENS2_6FutureIdEERKNS2_4UPIDEOT_EUlPNS2_11ProcessBaseEE_NSF_ISW_EEFvSV_EEclEOSV_ @0x108a598df std::__1::function<>::operator()() @0x108a2a30f process::ProcessBase::visit() @0x108a8df9e process::DispatchEvent::visit() @0x100c65c51 process::ProcessBase::serve() @0x108a26fe1 process::ProcessManager::resume() @0x108a32ad6 process::ProcessManager::init_threads()::$_1::operator()() @0x108a32779 
_ZNSt3__114__thread_proxyINS_5tupleIJZN7process14ProcessManager12init_threadsEvE3$_1EPvS6_ @ 0x7fff957a405a _pthread_body @ 0x7fff957a3fd7 _pthread_start @ 0x7fff957a13ed thread_start E1007 18:23:06.083991 317579264 process.cpp:2154] Failed to shutdown socket with fd 15: Socket is not connected Abort trap: 6 {code} Will check more on whether there are cases where we can add metrics for a non-existent client. [~bbannier], please share your comments if any. Thanks. > CHECK failure in DRF sorter. > > > Key: MESOS-6308 > URL: https://issues.apache.org/jira/browse/MESOS-6308 > Project: Mesos > Issue Type: Bug >Reporter: Jie Yu >Assignee: Guangya Liu > > Saw this CHECK failed in our internal CI: > https://github.com/apache/mesos/blob/master/src/master/allocator/sorter/drf/sorter.cpp#L450 > {noformat} > [03:08:28] : [Step 10/10] [ RUN ] PartitionTest.DisconnectedFramework > [03:08:28]W: [Step 10/10] I1004 03:08:28.200443 577 cluster.cpp:158] > Creating default 'local' authorizer > [03:08:28]W: [Step 10/10] I1004 03:08:28.206408 577 leveldb.cpp:174] > Opened db in 5.827159ms > [03:08:28]W: [Step 10/10] I1004 03:08:28.208127 577 leveldb.cpp:181] > Compacted db in 1.697508ms > [03:08:28]W: [Step 10/10] I1004 03:08:28.208150 577 leveldb.cpp:196] > Created db iterator in 5756ns > [03:08:28]W: [Step 10/10] I1004 03:08:28.208160 577 leveldb.cpp:202] > Seeked to beginning of db in 1483ns > [03:08:28]W: [Step 10/10] I1004 03:08:28.208168 577 leveldb.cpp:271] > Iterated through 0 keys in the db in 1101ns > [03:08:28]W: [Step 10/10] I1004 03:08:28.208184 577 replica.cpp:776] > Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned > [03:08:28]W: [Step 10/10] I1004 03:08:28.208452 591 recover.cpp:451] > Starting replica recovery > [03:08:28]W: [Step 10/10] I1004 03:08:28.208664 596 recover.cpp:477] > Replica is in EMPTY status > [03:08:28]W: [Step 10/10] I1004 03:08:28.209079 591 replica.cpp:673] > Replica in EMPTY status received a broadcasted recover request from > 
__req_res__(3666)@172.30.2.234:37300 > [03:08:28]W: [Step 10/10] I1004 03:08:28.209203 593 recover.cpp:197] > Received a recover response from a replica in EMPTY status > [03:08:28]W: [Step 10/10] I1004 03:08:28.209394 598 recover.cpp:568] > Updating replica status to STARTING > [03:08:28]W: [Step 10/10] I1004 03:08:28.209473 598 master.cpp:380] > Master dd11d4ad-2087-4324-99ef-873e83ff09a1 (ip-172-30-2-234.mesosphere.io) > started on 172.30.2.234:37300 > [03:08:28]W: [Step 10/10] I1004 03:08:28.209489 598 master.cpp:382] Flags > at startup: --acls=""
[jira] [Updated] (MESOS-6341) Improve environment variable setting for executors, tasks and nested containers.
[ https://issues.apache.org/jira/browse/MESOS-6341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-6341: -- Component/s: slave containerization > Improve environment variable setting for executors, tasks and nested > containers. > > > Key: MESOS-6341 > URL: https://issues.apache.org/jira/browse/MESOS-6341 > Project: Mesos > Issue Type: Epic > Components: containerization, slave >Reporter: Jie Yu > > This is an umbrella ticket to track all the environment variable related > tickets in Mesos that need to be solved. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3740) LIBPROCESS_IP not passed to Docker containers
[ https://issues.apache.org/jira/browse/MESOS-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-3740: -- Story Points: (was: 3) > LIBPROCESS_IP not passed to Docker containers > - > > Key: MESOS-3740 > URL: https://issues.apache.org/jira/browse/MESOS-3740 > Project: Mesos > Issue Type: Bug > Components: containerization, docker >Affects Versions: 0.25.0 > Environment: Mesos 0.24.1 >Reporter: Cody Maloney > Labels: mesosphere > > Docker containers aren't currently passed all the same environment variables > that Mesos Containerizer tasks are. See: > https://github.com/apache/mesos/blob/master/src/slave/containerizer/containerizer.cpp#L254 > for all the environment variables explicitly set for mesos containers. > While some of them don't necessarily make sense for docker containers, when > the docker has inside of it a libprocess process (A mesos framework > scheduler) and is using {{--net=host}} the task needs to have LIBPROCESS_IP > set, otherwise the same sort of problems that happen because of MESOS-3553 can > happen (libprocess will try to guess the machine's IP address with likely bad > results in a number of operating environments). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6341) Improve environment variable setting for executors, tasks and nested containers.
Jie Yu created MESOS-6341: - Summary: Improve environment variable setting for executors, tasks and nested containers. Key: MESOS-6341 URL: https://issues.apache.org/jira/browse/MESOS-6341 Project: Mesos Issue Type: Epic Reporter: Jie Yu This is an umbrella ticket to track all the environment variable related tickets in Mesos that need to be solved. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6340) Set HOME for Mesos tasks
[ https://issues.apache.org/jira/browse/MESOS-6340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15556516#comment-15556516 ] Zameer Manji commented on MESOS-6340: - Thermos (Aurora's executor) works around this issue by setting {{$HOME}} to the cwd {{$WORK_DIR}} and/or using {{$MESOS_SANDBOX}} when it is set. I think [~joshua.cohen] can confirm or deny this. Personally, if $HOME could default to those values that would be fantastic. Executors can do their own customization if needed, but setting something would be better than nothing. > Set HOME for Mesos tasks > > > Key: MESOS-6340 > URL: https://issues.apache.org/jira/browse/MESOS-6340 > Project: Mesos > Issue Type: Bug > Components: containerization, slave >Reporter: Cody Maloney >Assignee: Jie Yu > > Quite a few programs assume {{$HOME}} points to a user-editable data file > directory. > One example is Python, which tries to look up $HOME to find user-installed > packages, and if that fails it tries to look up the user in the passwd > database which often goes badly (The container is running under the `nobody` > user): > {code} > if i == 1: > if 'HOME' not in os.environ: > import pwd > userhome = pwd.getpwuid(os.getuid()).pw_dir > else: > userhome = os.environ['HOME'] > {code} > Just setting HOME by default to WORK_DIR would enable more software to work > correctly out of the box. Software which needs to specialize / change it (or > schedulers with specific preferences) should still be able to set it > arbitrarily, and anything a scheduler explicitly sets should override the > default value of $WORK_DIR -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6340) Set HOME for Mesos tasks
Cody Maloney created MESOS-6340: --- Summary: Set HOME for Mesos tasks Key: MESOS-6340 URL: https://issues.apache.org/jira/browse/MESOS-6340 Project: Mesos Issue Type: Bug Components: containerization, slave Reporter: Cody Maloney Assignee: Jie Yu Quite a few programs assume {{$HOME}} points to a user-editable data file directory. One example is Python, which tries to look up $HOME to find user-installed packages, and if that fails it tries to look up the user in the passwd database, which often goes badly (the container is running under the `nobody` user): {code} if i == 1: if 'HOME' not in os.environ: import pwd userhome = pwd.getpwuid(os.getuid()).pw_dir else: userhome = os.environ['HOME'] {code} Just setting HOME by default to WORK_DIR would enable more software to work correctly out of the box. Software which needs to specialize or change it (or schedulers with specific preferences) should still be able to set it arbitrarily, and anything a scheduler explicitly sets should override the default value of $WORK_DIR -- This message was sent by Atlassian JIRA (v6.3.4#6332)
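The proposed defaulting is simple to express: anything the scheduler sets wins, and HOME falls back to the sandbox work directory otherwise. A sketch only; the helper name is hypothetical, not a Mesos API.

```python
def task_environment(scheduler_env, work_dir):
    """Build the task environment, defaulting HOME to the sandbox
    work dir unless the scheduler explicitly set it."""
    env = dict(scheduler_env)
    env.setdefault("HOME", work_dir)
    return env
```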
[jira] [Updated] (MESOS-5578) Support static address allocation in CNI
[ https://issues.apache.org/jira/browse/MESOS-5578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Avinash Sridharan updated MESOS-5578: - Affects Version/s: (was: 1.0.0) > Support static address allocation in CNI > > > Key: MESOS-5578 > URL: https://issues.apache.org/jira/browse/MESOS-5578 > Project: Mesos > Issue Type: Task > Components: containerization > Environment: Linux > Reporter: Avinash Sridharan > Assignee: Avinash Sridharan > Labels: mesosphere > > Currently, a framework can't specify a static IP address for the container > when using the network/cni isolator. > The `ipaddress` field in the `NetworkInfo` protobuf was designed for this > specific purpose, but since the CNI spec does not specify a means to allocate > an IP address to the container, the `network/cni` isolator cannot honor this > field even when it is filled in by the framework. > Creating this ticket to act as a placeholder to track this limitation. As > and when the CNI spec allows us to specify a static IP address for the > container, we can resolve this ticket. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6339) Support docker registry that requires basic auth.
Jie Yu created MESOS-6339: - Summary: Support docker registry that requires basic auth. Key: MESOS-6339 URL: https://issues.apache.org/jira/browse/MESOS-6339 Project: Mesos Issue Type: Improvement Reporter: Jie Yu Currently, we assume Bearer auth (in the Mesos containerizer) because it's what Docker Hub uses. We also need to support Basic auth for private registries that people deploy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
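The two schemes differ only in how the {{Authorization}} header is built. A sketch of both header forms (illustrative only; the actual containerizer work is C++, and the credentials here are made up):

```python
import base64

def basic_auth_header(user, password):
    # RFC 7617 Basic auth: "Basic" + base64 of "user:password".
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return {"Authorization": f"Basic {token}"}

def bearer_auth_header(token):
    # What Docker Hub uses: a token obtained from its auth server.
    return {"Authorization": f"Bearer {token}"}
```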
[jira] [Updated] (MESOS-6239) Fix warnings and errors produced by new hardened CXXFLAGS
[ https://issues.apache.org/jira/browse/MESOS-6239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Wood updated MESOS-6239: -- Description: Most of the new warnings/errors come from libprocess/stout as there were never any CXXFLAGS propagated to them. https://reviews.apache.org/r/52647/ was:Most of the new warnings/errors come from libprocess/stout as there were never any CXXFLAGS propagated to them. > Fix warnings and errors produced by new hardened CXXFLAGS > - > > Key: MESOS-6239 > URL: https://issues.apache.org/jira/browse/MESOS-6239 > Project: Mesos > Issue Type: Improvement >Reporter: Aaron Wood >Assignee: Aaron Wood >Priority: Minor > Labels: c++, clang, gcc, libprocess, security, stout > > Most of the new warnings/errors come from libprocess/stout as there were > never any CXXFLAGS propagated to them. > https://reviews.apache.org/r/52647/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6229) Default to using hardened compilation flags
[ https://issues.apache.org/jira/browse/MESOS-6229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Wood updated MESOS-6229: -- Description: Provide a default set of hardened compilation flags to help protect against overflows and other attacks. Apply to libprocess and stout as well. Current set of flags that were discussed on slack to implement: -Wformat-security -Wstack-protector -fstack-protector-strong (-fstack-protector-all might be overkill, it could be more effective to use this. Requires gcc >= 4.9 which should be reasonable) -pie -fPIE -fPIC -D_FORTIFY_SOURCE=2 -Wl,-z,relro,-z,now (currently not a part of the patch) -fno-omit-frame-pointer https://reviews.apache.org/r/52645/ was: Provide a default set of hardened compilation flags to help protect against overflows and other attacks. Apply to libprocess and stout as well. Current set of flags that were discussed on slack to implement: -Wformat-security -Wstack-protector -fstack-protector-strong (-fstack-protector-all might be overkill, it could be more effective to use this. Requires gcc >= 4.9) -pie -fPIE -fPIC -D_FORTIFY_SOURCE=2 -Wl,-z,relro,-z,now (currently not a part of the patch) -fno-omit-frame-pointer https://reviews.apache.org/r/52645/ > Default to using hardened compilation flags > --- > > Key: MESOS-6229 > URL: https://issues.apache.org/jira/browse/MESOS-6229 > Project: Mesos > Issue Type: Improvement >Reporter: Aaron Wood >Assignee: Aaron Wood >Priority: Minor > Labels: c++, clang, gcc, security > > Provide a default set of hardened compilation flags to help protect against > overflows and other attacks. Apply to libprocess and stout as well. Current > set of flags that were discussed on slack to implement: > -Wformat-security > -Wstack-protector > -fstack-protector-strong (-fstack-protector-all might be overkill, it could > be more effective to use this. 
Requires gcc >= 4.9 which should be reasonable) > -pie > -fPIE > -fPIC > -D_FORTIFY_SOURCE=2 > -Wl,-z,relro,-z,now (currently not a part of the patch) > -fno-omit-frame-pointer > https://reviews.apache.org/r/52645/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
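Since -fstack-protector-strong needs gcc >= 4.9, the flag set has to be assembled conditionally. A sketch of that selection logic (a Python stand-in for what a configure-time check would do; the function names are hypothetical):

```python
def supports_stack_protector_strong(gcc_version):
    # gcc >= 4.9 is required for -fstack-protector-strong (per this ticket).
    major, minor = (int(x) for x in gcc_version.split(".")[:2])
    return (major, minor) >= (4, 9)

def hardened_flags(gcc_version):
    # Baseline flags from the ticket; -Wl,-z,relro,-z,now is omitted since
    # it was noted as not yet part of the patch.
    flags = ["-Wformat-security", "-Wstack-protector", "-pie", "-fPIE",
             "-fPIC", "-D_FORTIFY_SOURCE=2", "-fno-omit-frame-pointer"]
    if supports_stack_protector_strong(gcc_version):
        flags.append("-fstack-protector-strong")
    return flags
```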
[jira] [Updated] (MESOS-6229) Default to using hardened compilation flags
[ https://issues.apache.org/jira/browse/MESOS-6229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Wood updated MESOS-6229: -- Description: Provide a default set of hardened compilation flags to help protect against overflows and other attacks. Apply to libprocess and stout as well. Current set of flags that were discussed on slack to implement: -Wformat-security -Wstack-protector -fstack-protector-strong (-fstack-protector-all might be overkill, it could be more effective to use this. Requires gcc >= 4.9) -pie -fPIE -fPIC -D_FORTIFY_SOURCE=2 -Wl,-z,relro,-z,now (currently not a part of the patch) -fno-omit-frame-pointer https://reviews.apache.org/r/52645/ was: Provide a default set of hardened compilation flags to help protect against overflows and other attacks. Apply to libprocess and stout as well. Current set of flags that were discussed on slack to implement: -Wformat-security -Wstack-protector -fstack-protector-strong (-fstack-protector-all might be overkill, it could be more effective to use this. Requires gcc >= 4.9) -pie -fPIE -D_FORTIFY_SOURCE=2 -Wl,-z,relro,-z,now (currently not a part of the patch) -fno-omit-frame-pointer https://reviews.apache.org/r/52645/ > Default to using hardened compilation flags > --- > > Key: MESOS-6229 > URL: https://issues.apache.org/jira/browse/MESOS-6229 > Project: Mesos > Issue Type: Improvement >Reporter: Aaron Wood >Assignee: Aaron Wood >Priority: Minor > Labels: c++, clang, gcc, security > > Provide a default set of hardened compilation flags to help protect against > overflows and other attacks. Apply to libprocess and stout as well. Current > set of flags that were discussed on slack to implement: > -Wformat-security > -Wstack-protector > -fstack-protector-strong (-fstack-protector-all might be overkill, it could > be more effective to use this. 
Requires gcc >= 4.9) > -pie > -fPIE > -fPIC > -D_FORTIFY_SOURCE=2 > -Wl,-z,relro,-z,now (currently not a part of the patch) > -fno-omit-frame-pointer > https://reviews.apache.org/r/52645/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6229) Default to using hardened compilation flags
[ https://issues.apache.org/jira/browse/MESOS-6229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Wood updated MESOS-6229: -- Description: Provide a default set of hardened compilation flags to help protect against overflows and other attacks. Apply to libprocess and stout as well. Current set of flags that were discussed on slack to implement: -Wformat-security -Wstack-protector -fstack-protector-strong (-fstack-protector-all might be overkill, it could be more effective to use this. Requires gcc >= 4.9) -pie -fPIE -D_FORTIFY_SOURCE=2 -O2 (possibly -O3 for greater optimizations, up for discussion) -Wl,-z,relro,-z,now (currently not a part of the patch) -fno-omit-frame-pointer https://reviews.apache.org/r/52645/ was: Provide a default set of hardened compilation flags to help protect against overflows and other attacks. Apply to libprocess and stout as well. Current set of flags that were discussed on slack to implement: -Wformat-security -Wstack-protector -fstack-protector-all -pie -fPIE -D_FORTIFY_SOURCE=2 -O2 (possibly -O3 for greater optimizations, up for discussion) -Wl,-z,relro,-z,now -fno-omit-frame-pointer -fstack-protector-strong (-fstack-protector-all might be overkill, it could be more effective to use this. Requires gcc >= 4.9) > Default to using hardened compilation flags > --- > > Key: MESOS-6229 > URL: https://issues.apache.org/jira/browse/MESOS-6229 > Project: Mesos > Issue Type: Improvement >Reporter: Aaron Wood >Assignee: Aaron Wood >Priority: Minor > Labels: c++, clang, gcc, security > > Provide a default set of hardened compilation flags to help protect against > overflows and other attacks. Apply to libprocess and stout as well. Current > set of flags that were discussed on slack to implement: > -Wformat-security > -Wstack-protector > -fstack-protector-strong (-fstack-protector-all might be overkill, it could > be more effective to use this. 
Requires gcc >= 4.9) > -pie > -fPIE > -D_FORTIFY_SOURCE=2 > -O2 (possibly -O3 for greater optimizations, up for discussion) > -Wl,-z,relro,-z,now (currently not a part of the patch) > -fno-omit-frame-pointer > https://reviews.apache.org/r/52645/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6283) Fix the Web UI allowing access to the task sandbox for nested containers.
[ https://issues.apache.org/jira/browse/MESOS-6283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-6283: -- Target Version/s: 1.1.0 Priority: Blocker (was: Major) Fix Version/s: (was: 1.1.0) > Fix the Web UI allowing access to the task sandbox for nested containers. > - > > Key: MESOS-6283 > URL: https://issues.apache.org/jira/browse/MESOS-6283 > Project: Mesos > Issue Type: Bug > Components: webui >Reporter: Anand Mazumdar >Assignee: haosdent >Priority: Blocker > Labels: mesosphere > Attachments: sandbox.gif > > > Currently, the sandbox button for a child task is broken on the WebUI. It > does nothing and dies with an error that the executor for this task cannot be > found. We need to fix the WebUI to follow the symlink "tasks/taskId" and > display the task sandbox to the users. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6338) Support --revocable_cpu_low_priority flag for docker containerizer
[ https://issues.apache.org/jira/browse/MESOS-6338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15556038#comment-15556038 ] Jie Yu commented on MESOS-6338: --- Sounds good. Keep in mind that the docker containerizer will receive less support than the mesos containerizer in the future, and new features (e.g., pods, GPUs) will typically land in the mesos containerizer first. > Support --revocable_cpu_low_priority flag for docker containerizer > -- > > Key: MESOS-6338 > URL: https://issues.apache.org/jira/browse/MESOS-6338 > Project: Mesos > Issue Type: Improvement > Components: containerization > Reporter: Kunal Thakar > > The mesos containerizer supports setting lower shares for revocable tasks by > passing --revocable_cpu_low_priority to the mesos agent. This flag is only > supported for the mesos containerizer, but I don't see a reason why the behavior > can't be replicated for the docker containerizer. > On setting the flag, CPU shares assigned to revocable tasks are lower than for > normal tasks > (https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/isolators/cgroups/subsystems/cpu.cpp#L83). > This does not happen in the docker containerizer > (https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1517), > but it can be easily replicated there. > I can send a patch if this is acceptable. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
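The linked cpu.cpp computes cpu.shares from the task's CPUs, pricing revocable CPUs lower when the flag is set. A rough Python sketch of that math (the constants mirror my reading of the mesos cgroups cpu subsystem but should be treated as illustrative, not authoritative):

```python
CPU_SHARES_PER_CPU = 1024            # shares granted per regular CPU
CPU_SHARES_PER_CPU_REVOCABLE = 10    # per revocable CPU when low-priority is on
MIN_CPU_SHARES = 2                   # kernel-imposed minimum for cpu.shares

def cpu_shares(cpus, revocable_cpus, low_priority):
    # Revocable CPUs are priced at a much lower rate only when the
    # --revocable_cpu_low_priority flag is set.
    per_revocable = CPU_SHARES_PER_CPU_REVOCABLE if low_priority else CPU_SHARES_PER_CPU
    shares = int(cpus * CPU_SHARES_PER_CPU + revocable_cpus * per_revocable)
    return max(shares, MIN_CPU_SHARES)
```

With the flag on, two revocable CPUs get only 20 shares instead of 2048, which is what keeps revocable tasks from competing with regular ones; the docker containerizer currently always charges the full rate.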
[jira] [Updated] (MESOS-6317) Race in master/allocator when updating oversubscribed resources of an agent.
[ https://issues.apache.org/jira/browse/MESOS-6317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-6317: --- Summary: Race in master/allocator when updating oversubscribed resources of an agent. (was: Race in master update slave.) > Race in master/allocator when updating oversubscribed resources of an agent. > > > Key: MESOS-6317 > URL: https://issues.apache.org/jira/browse/MESOS-6317 > Project: Mesos > Issue Type: Bug > Reporter: Guangya Liu > Assignee: Guangya Liu > Fix For: 1.1.0 > > > Currently, when {{updateSlave}} runs in the master, it will first rescind offers and > then call {{updateSlave}} in the allocator, but there is a race here: a batch > allocation might be inserted between the two. In this case, the order will be > rescind offer -> batch allocation -> update slave. This order causes > issues when the oversubscribed resources are decreased. > Suppose the oversubscribed resources were decreased from 2 to 1. After the > offers are rescinded, the batch allocation will allocate the old 2 > oversubscribed resources again, and then update slave will set the total > oversubscribed resources to 1. This leaves the agent host overcommitted for a > while, because tasks can still use 2 oversubscribed resources rather than 1; > once the tasks using the 2 oversubscribed resources finish, everything goes > back to normal. > So we should adjust the order of rescind offer and updateSlave in the master > to avoid resource overcommit. > If we update the slave first and then rescind offers, the order will be update slave > -> batch allocation -> rescind offer, and this order has no problem when > decreasing resources. 
Suppose the oversubscribed resources were decreased > from 2 to 1. Then update slave will update the total oversubscribed resources to > 1 directly, the batch allocation will not allocate any oversubscribed > resources since more is already allocated than the total oversubscribed resources, > and rescind offer will rescind all offers using oversubscribed resources. > This will not lead the agent host to be overcommitted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5139) ProvisionerDockerLocalStoreTest.LocalStoreTestWithTar is flaky
[ https://issues.apache.org/jira/browse/MESOS-5139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gilbert Song updated MESOS-5139: Assignee: (was: Gilbert Song) > ProvisionerDockerLocalStoreTest.LocalStoreTestWithTar is flaky > -- > > Key: MESOS-5139 > URL: https://issues.apache.org/jira/browse/MESOS-5139 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.28.0 > Environment: Ubuntu14.04 >Reporter: Vinod Kone > Labels: mesosphere > > Found this on ASF CI while testing 0.28.1-rc2 > {code} > [ RUN ] ProvisionerDockerLocalStoreTest.LocalStoreTestWithTar > E0406 18:29:30.870481 520 shell.hpp:93] Command 'hadoop version 2>&1' > failed; this is the output: > sh: 1: hadoop: not found > E0406 18:29:30.870576 520 fetcher.cpp:59] Failed to create URI fetcher > plugin 'hadoop': Failed to create HDFS client: Failed to execute 'hadoop > version 2>&1'; the command was either not found or exited with a non-zero > exit status: 127 > I0406 18:29:30.871052 520 local_puller.cpp:90] Creating local puller with > docker registry '/tmp/3l8ZBv/images' > I0406 18:29:30.873325 539 metadata_manager.cpp:159] Looking for image 'abc' > I0406 18:29:30.874438 539 local_puller.cpp:142] Untarring image 'abc' from > '/tmp/3l8ZBv/images/abc.tar' to '/tmp/3l8ZBv/store/staging/5tw8bD' > I0406 18:29:30.901916 547 local_puller.cpp:162] The repositories JSON file > for image 'abc' is '{"abc":{"latest":"456"}}' > I0406 18:29:30.902304 547 local_puller.cpp:290] Extracting layer tar ball > '/tmp/3l8ZBv/store/staging/5tw8bD/123/layer.tar to rootfs > '/tmp/3l8ZBv/store/staging/5tw8bD/123/rootfs' > I0406 18:29:30.909144 547 local_puller.cpp:290] Extracting layer tar ball > '/tmp/3l8ZBv/store/staging/5tw8bD/456/layer.tar to rootfs > '/tmp/3l8ZBv/store/staging/5tw8bD/456/rootfs' > ../../src/tests/containerizer/provisioner_docker_tests.cpp:183: Failure > (imageInfo).failure(): Collect failed: Subprocess 'tar, tar, -x, -f, > /tmp/3l8ZBv/store/staging/5tw8bD/456/layer.tar, -C, > 
/tmp/3l8ZBv/store/staging/5tw8bD/456/rootfs' failed: tar: This does not look > like a tar archive > tar: Exiting with failure status due to previous errors > [ FAILED ] ProvisionerDockerLocalStoreTest.LocalStoreTestWithTar (243 ms) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6207) Python bindings fail to build with custom SVN installation path
[ https://issues.apache.org/jira/browse/MESOS-6207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1969#comment-1969 ] Ilya Pronin commented on MESOS-6207: Thanks! Strange, on my RB profile page all three fields (first / last name and email) are filled in. But the "Keep profile information private" checkbox was checked. Could that cause the problem? > Python bindings fail to build with custom SVN installation path > --- > > Key: MESOS-6207 > URL: https://issues.apache.org/jira/browse/MESOS-6207 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: 1.0.1 >Reporter: Ilya Pronin >Assignee: Ilya Pronin >Priority: Trivial > Fix For: 1.1.0 > > > In {{src/Makefile.am}} {{PYTHON_LDFLAGS}} variable is used while building > Python bindings. This variable picks {{LDFLAGS}} during configuration phase > before we check for custom SVN installation path and misses > {{-L$\{with_svn\}/lib}} flag. That causes a link error on systems with > uncommon SVN installation path. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6250) Ensure valid task state before connecting with framework on master failover
[ https://issues.apache.org/jira/browse/MESOS-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1947#comment-1947 ] Joseph Wu commented on MESOS-6250: -- This, along with a variety of other partition scenarios, is tracked in this epic: https://issues.apache.org/jira/browse/MESOS-5344 > Ensure valid task state before connecting with framework on master failover > --- > > Key: MESOS-6250 > URL: https://issues.apache.org/jira/browse/MESOS-6250 > Project: Mesos > Issue Type: Bug > Affects Versions: 0.28.0, 0.28.1, 1.0.1 > Environment: OS X 10.11.6 > Reporter: Markus Jura > Priority: Minor > > During a Mesos master failover, the master re-registers with its slaves to > receive the current state of the running tasks. It also reconnects to a > framework. > The documentation recommends that a framework perform an explicit > task reconciliation when the Mesos master re-registers: > http://mesos.apache.org/documentation/latest/reconciliation/ > When allowing a reconciliation by a framework, the Mesos master should guarantee > that its task state is valid, i.e. the same as on the slaves. Otherwise, > Mesos can reply with status updates of state {{TASK_LOST}} even if the task is > still running on the slave. > Now, on Mesos master failover, Mesos does not guarantee that it first > re-registers with its slaves before it re-connects to a framework. So it can > occur that the framework connects before Mesos has finished or even started the > re-registration with the slaves. When the framework then sends reconciliation > requests directly after re-registration, Mesos will reply with status > updates where the task state is wrong ({{TASK_LOST}} instead of > {{TASK_RUNNING}}). > For a reconciliation request, Mesos should guarantee that the task state is > consistent with the slaves before it replies with a status update. 
> Another possibility would be that Mesos sends a message to the framework once > it has re-registered with the slaves so that the framework then starts the > reconciliation. So far a framework can only delay the reconciliation for a > certain amount of time. But it does not know how long the delay should be > because Mesos is not notifying the framework when the task state is > consistent again. > *Log: Mesos master - connecting with framework before re-registering with > slaves* > {code:bash} > I0926 12:39:42.006933 4284416 detector.cpp:152] Detected a new leader: > (id='92') > I0926 12:39:42.007242 1064960 group.cpp:706] Trying to get > '/mesos/json.info_92' in ZooKeeper > I0926 12:39:42.008129 4284416 zookeeper.cpp:259] A new leading master > (UPID=master@127.0.0.1:5049) is detected > I0926 12:39:42.008304 4284416 master.cpp:1847] The newly elected leader is > master@127.0.0.1:5049 with id 96178e81-8371-48af-ba5e-c79d16c27fab > I0926 12:39:42.008332 4284416 master.cpp:1860] Elected as the leading master! 
> I0926 12:39:42.008349 4284416 master.cpp:1547] Recovering from registrar > I0926 12:39:42.008488 3211264 registrar.cpp:332] Recovering registrar > I0926 12:39:42.015935 4284416 registrar.cpp:365] Successfully fetched the > registry (0B) in 7.426816ms > I0926 12:39:42.015985 4284416 registrar.cpp:464] Applied 1 operations in > 11us; attempting to update the 'registry' > I0926 12:39:42.021425 4284416 registrar.cpp:509] Successfully updated the > 'registry' in 5.426176ms > I0926 12:39:42.021462 4284416 registrar.cpp:395] Successfully recovered > registrar > I0926 12:39:42.021581 528384 master.cpp:1655] Recovered 0 agents from the > Registry (118B) ; allowing 10mins for agents to re-register > I0926 12:39:42.299598 3747840 master.cpp:2424] Received SUBSCRIBE call for > framework 'conductr' at > scheduler-65610031-d679-49e5-b7bd-6068500d4674@192.168.2.106:65290 > I0926 12:39:42.299697 3747840 master.cpp:2500] Subscribing framework conductr > with checkpointing disabled and capabilities [ ] > I0926 12:39:42.300122 2674688 hierarchical.cpp:271] Added framework conductr > I0926 12:39:42.954983 1601536 master.cpp:4787] Re-registering agent > b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 at slave(1)@127.0.0.1:5051 (127.0.0.1) > I0926 12:39:42.955189 1064960 registrar.cpp:464] Applied 1 operations in > 60us; attempting to update the 'registry' > I0926 12:39:42.955893 1064960 registrar.cpp:509] Successfully updated the > 'registry' in 649984ns > I0926 12:39:42.956224 4284416 master.cpp:7447] Adding task > c69df81e-35f4-4c2e-863b-4e9d5ae2e850 with resources mem(*):0 on agent > b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 (127.0.0.1) > I0926 12:39:42.956704 4284416 master.cpp:4872] Re-registered agent > b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 at slave(1)@127.0.0.1:5051 > (127.0.0.1) with cpus(*):8; mem(*):15360; disk(*):470832; >
[jira] [Created] (MESOS-6338) Support --revocable_cpu_low_priority flag for docker containerizer
Kunal Thakar created MESOS-6338: --- Summary: Support --revocable_cpu_low_priority flag for docker containerizer Key: MESOS-6338 URL: https://issues.apache.org/jira/browse/MESOS-6338 Project: Mesos Issue Type: Improvement Components: containerization Reporter: Kunal Thakar The mesos containerizer supports setting lower shares for revocable tasks by passing --revocable_cpu_low_priority to the mesos agent. This flag is only supported for mesos containerizer, but I don't see a reason why the behavior can't be replicated for the docker containerizer. On setting the flag, CPU shares assigned to revocable tasks are lower than normal tasks (https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/isolators/cgroups/subsystems/cpu.cpp#L83). This does not happen in the docker containerizer (https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1517), but it can be easily replicated there. I can send a patch if this is acceptable. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2723) The mesos-execute tool does not support zk:// master URLs
[ https://issues.apache.org/jira/browse/MESOS-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1894#comment-1894 ] Joseph Wu commented on MESOS-2723: -- The existing review is not quite correct (and has been discarded due to inactivity). The fix should be to: 1) Make {{--master}} a required flag (i.e., change {{Option<string> master}} to {{string master}}). 2) Remove all custom (unnecessary) validation for {{flags.master}}. > The mesos-execute tool does not support zk:// master URLs > - > > Key: MESOS-2723 > URL: https://issues.apache.org/jira/browse/MESOS-2723 > Project: Mesos > Issue Type: Bug > Affects Versions: 0.22.1 > Reporter: Tom Arnfeld > Labels: newbie > > It appears that the {{mesos-execute}} command line tool does its own PID > validation of the {{--master}} param, which prevents it from supporting > clusters managed with ZooKeeper. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
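Rather than hand-rolling PID parsing, the tool could accept every master form that the master detector itself understands. A sketch of such a check (a Python stand-in for the C++ flag handling; the function is hypothetical):

```python
import re

def is_valid_master(master):
    """Accept host:port, zk://.../path, or file:///path master values."""
    if master.startswith("zk://") or master.startswith("file://"):
        return True
    # Bare host:port -- the only form the old custom PID validation accepted.
    return re.fullmatch(r"[^:/\s]+:\d+", master) is not None
```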
[jira] [Updated] (MESOS-2723) The mesos-execute tool does not support zk:// master URLs
[ https://issues.apache.org/jira/browse/MESOS-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Wu updated MESOS-2723: - Assignee: (was: Tom Arnfeld) Story Points: 1 Labels: newbie (was: ) > The mesos-execute tool does not support zk:// master URLs > - > > Key: MESOS-2723 > URL: https://issues.apache.org/jira/browse/MESOS-2723 > Project: Mesos > Issue Type: Bug > Affects Versions: 0.22.1 > Reporter: Tom Arnfeld > Labels: newbie > > It appears that the {{mesos-execute}} command line tool does its own PID > validation of the {{--master}} param, which prevents it from supporting > clusters managed with ZooKeeper. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6337) Nested containers getting killed before network isolation can be applied to them.
Avinash Sridharan created MESOS-6337: Summary: Nested containers getting killed before network isolation can be applied to them. Key: MESOS-6337 URL: https://issues.apache.org/jira/browse/MESOS-6337 Project: Mesos Issue Type: Bug Components: containerization Environment: Linux Reporter: Avinash Sridharan Assignee: Gilbert Song Fix For: 1.1.0 Seeing this odd behavior in one of our clusters: ``` http.cpp:1948] Failed to launch nested container cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e: Collect failed: Failed to seed container cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e: Collect failed: Failed to setup hostname and network files: Failed to enter the mount namespace of pid 21591: Pid 21591 does not exist Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.894485 31531 containerizer.cpp:1931] Destroying container cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e in ISOLATING state Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.894439 31531 containerizer.cpp:2300] Container cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e has exited Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.854456 31534 systemd.cpp:96] Assigned child process '21591' to 'mesos_executors.slice' Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: W1007 02:05:55.831861 21580 process.cpp:882] Failed SSL connections will be downgraded to a non-SSL socket Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate verification Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.831526 21580 openssl.cpp:432] Will only verify peer certificate if presented! 
Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: NOTE: Set LIBPROCESS_SSL_VERIFY_CERT=1 to enable peer certificate verification Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.831521 21580 openssl.cpp:426] Will not verify peer certificate! Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: I1007 02:05:55.831511 21580 openssl.cpp:421] CA directory path unspecified! NOTE: Set CA directory path with LIBPROCESS_SSL_CA_DIR= Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: W1007 02:05:55.831405 21580 openssl.cpp:399] Failed SSL connections will be downgraded to a non-SSL socket Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: WARNING: Logging before InitGoogleLogging() is written to STDERR Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: W1007 02:05:55.828413 21581 process.cpp:882] Failed SSL connections will be downgraded to a non-SSL socket Oct 07 02:05:55 ip-10-10-0-207 mesos-agent[31520]: NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate verification ``` The above log is "reverse" chronological order, so please read it bottom up. The relevant log is: ``` http.cpp:1948] Failed to launch nested container cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e: Collect failed: Failed to seed container cb92634b-42b3-40f3-94f7-609f89a362bc.46d884e4-d0eb-4572-be1d-24414df7cb2e: Collect failed: Failed to setup hostname and network files: Failed to enter the mount namespace of pid 21591: Pid 21591 does not exist ``` Looks like the nested container failed to launch because the `isolate` call to the `network/cni` isolator failed. Seems like when the isolator received the `isolate` call the PID for the nested container has already exited and it couldn't enter its mount namespace to setup the network files. The odd thing here is that the nested container would have been frozen, and hence was not running, so not sure what killed the nested container. 
My suspicion falls on systemd, since I also see this log message: ``` Oct 07 18:02:31 ip-10-10-0-207 mesos-agent[31520]: I1007 18:02:31.473656 31532 systemd.cpp:96] Assigned child process '1596' to 'mesos_executors.slice' ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6142) Frameworks may RESERVE for an arbitrary role.
[ https://issues.apache.org/jira/browse/MESOS-6142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1785#comment-1785 ] Gastón Kleiman commented on MESOS-6142: --- Patch: https://reviews.apache.org/r/52642/ > Frameworks may RESERVE for an arbitrary role. > - > > Key: MESOS-6142 > URL: https://issues.apache.org/jira/browse/MESOS-6142 > Project: Mesos > Issue Type: Bug > Components: allocation, master >Affects Versions: 1.0.0 >Reporter: Alexander Rukletsov >Assignee: Gastón Kleiman >Priority: Blocker > Labels: mesosphere, reservations > > The master does not validate that resources from a reservation request have > the same role the framework is registered with. As a result, frameworks may > reserve resources for arbitrary roles. > I've modified the role in [the {{ReserveThenUnreserve}} > test|https://github.com/apache/mesos/blob/bca600cf5602ed8227d91af9f73d689da14ad786/src/tests/reservation_tests.cpp#L117] > to "yoyo" and observed the following in the test's log: > {noformat} > I0908 18:35:43.379122 2138112 master.cpp:3362] Processing ACCEPT call for > offers: [ dfaf67e6-7c1c-4988-b427-c49842cb7bb7-O0 ] on agent > dfaf67e6-7c1c-4988-b427-c49842cb7bb7-S0 at slave(1)@10.200.181.237:60116 > (alexr.railnet.train) for framework dfaf67e6-7c1c-4988-b427-c49842cb7bb7- > (default) at > scheduler-ca12a660-9f08-49de-be4e-d452aa3aa6da@10.200.181.237:60116 > I0908 18:35:43.379170 2138112 master.cpp:3022] Authorizing principal > 'test-principal' to reserve resources 'cpus(yoyo, test-principal):1; > mem(yoyo, test-principal):512' > I0908 18:35:43.379678 2138112 master.cpp:3642] Applying RESERVE operation for > resources cpus(yoyo, test-principal):1; mem(yoyo, test-principal):512 from > framework dfaf67e6-7c1c-4988-b427-c49842cb7bb7- (default) at > scheduler-ca12a660-9f08-49de-be4e-d452aa3aa6da@10.200.181.237:60116 to agent > dfaf67e6-7c1c-4988-b427-c49842cb7bb7-S0 at slave(1)@10.200.181.237:60116 > (alexr.railnet.train) > I0908 18:35:43.379767 2138112 
master.cpp:7341] Sending checkpointed resources > cpus(yoyo, test-principal):1; mem(yoyo, test-principal):512 to agent > dfaf67e6-7c1c-4988-b427-c49842cb7bb7-S0 at slave(1)@10.200.181.237:60116 > (alexr.railnet.train) > I0908 18:35:43.380273 3211264 slave.cpp:2497] Updated checkpointed resources > from to cpus(yoyo, test-principal):1; mem(yoyo, test-principal):512 > I0908 18:35:43.380574 2674688 hierarchical.cpp:760] Updated allocation of > framework dfaf67e6-7c1c-4988-b427-c49842cb7bb7- on agent > dfaf67e6-7c1c-4988-b427-c49842cb7bb7-S0 from cpus(*):1; mem(*):512; > disk(*):470841; ports(*):[31000-32000] to ports(*):[31000-32000]; cpus(yoyo, > test-principal):1; disk(*):470841; mem(yoyo, test-principal):512 with RESERVE > operation > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
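The missing check is that every resource in a RESERVE operation carries the framework's own role. A minimal sketch of that validation (a Python stand-in for the C++ master validation; the error message text is hypothetical):

```python
def validate_reserve(framework_role, resources):
    """Return an error string, or None when the reservation is valid.

    Each resource is a (name, role) pair; a real check would also verify
    the reservation principal against the framework/operator principal.
    """
    for name, role in resources:
        if role != framework_role:
            return (f"Resource '{name}' is reserved for role '{role}', but "
                    f"the framework is registered with role '{framework_role}'")
    return None
```

Under this check the "yoyo" reservation from the test log above would be rejected instead of applied.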
[jira] [Updated] (MESOS-6335) Add user doc for task group tasks
[ https://issues.apache.org/jira/browse/MESOS-6335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-6335: -- Shepherd: Vinod Kone > Add user doc for task group tasks > - > > Key: MESOS-6335 > URL: https://issues.apache.org/jira/browse/MESOS-6335 > Project: Mesos > Issue Type: Documentation >Reporter: Vinod Kone > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-6336) SlaveTest.KillTaskGroupBetweenRunTaskParts is flaky
[ https://issues.apache.org/jira/browse/MESOS-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1621#comment-1621 ] Greg Mann edited comment on MESOS-6336 at 10/7/16 4:57 PM: --- Here's a partial log from the ASF CI as well, from 10 days ago. This one was CentOS 7: {code} I0927 06:49:21.610502 30001 http.cpp:883] Using default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' I0927 06:49:21.610563 30003 recover.cpp:568] Updating replica status to VOTING I0927 06:49:21.610743 30001 http.cpp:883] Using default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' I0927 06:49:21.610916 30001 master.cpp:584] Authorization enabled I0927 06:49:21.611145 30011 hierarchical.cpp:149] Initialized hierarchical allocator process I0927 06:49:21.611171 30013 whitelist_watcher.cpp:77] No whitelist given I0927 06:49:21.611275 30009 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 414250ns I0927 06:49:21.611301 30009 replica.cpp:320] Persisted replica status to VOTING I0927 06:49:21.611450 30008 recover.cpp:582] Successfully joined the Paxos group I0927 06:49:21.611651 30008 recover.cpp:466] Recover process terminated I0927 06:49:21.613910 30012 master.cpp:2013] Elected as the leading master! 
I0927 06:49:21.613943 30012 master.cpp:1560] Recovering from registrar I0927 06:49:21.614099 30013 registrar.cpp:329] Recovering registrar I0927 06:49:21.614842 30012 log.cpp:553] Attempting to start the writer I0927 06:49:21.616055 30014 replica.cpp:493] Replica received implicit promise request from __req_res__(6052)@172.17.0.2:49598 with proposal 1 I0927 06:49:21.616436 30014 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 345420ns I0927 06:49:21.616459 30014 replica.cpp:342] Persisted promised to 1 I0927 06:49:21.616914 30006 coordinator.cpp:238] Coordinator attempting to fill missing positions I0927 06:49:21.618098 30006 replica.cpp:388] Replica received explicit promise request from __req_res__(6053)@172.17.0.2:49598 for position 0 with proposal 2 I0927 06:49:21.618446 30006 leveldb.cpp:341] Persisting action (8 bytes) to leveldb took 305036ns I0927 06:49:21.618474 30006 replica.cpp:708] Persisted action NOP at position 0 I0927 06:49:21.619513 30012 replica.cpp:537] Replica received write request for position 0 from __req_res__(6054)@172.17.0.2:49598 I0927 06:49:21.619604 30012 leveldb.cpp:436] Reading position from leveldb took 55504ns I0927 06:49:21.619915 30012 leveldb.cpp:341] Persisting action (14 bytes) to leveldb took 262919ns I0927 06:49:21.619941 30012 replica.cpp:708] Persisted action NOP at position 0 I0927 06:49:21.620503 30016 replica.cpp:691] Replica received learned notice for position 0 from @0.0.0.0:0 I0927 06:49:21.620851 30016 leveldb.cpp:341] Persisting action (16 bytes) to leveldb took 313765ns I0927 06:49:21.620878 30016 replica.cpp:708] Persisted action NOP at position 0 I0927 06:49:21.621417 30014 log.cpp:569] Writer started with ending position 0 I0927 06:49:21.622566 30013 leveldb.cpp:436] Reading position from leveldb took 28375ns I0927 06:49:21.623528 30005 registrar.cpp:362] Successfully fetched the registry (0B) in 9.373952ms I0927 06:49:21.623668 30005 registrar.cpp:461] Applied 1 operations in 25023ns; attempting 
to update the registry I0927 06:49:21.624490 30012 log.cpp:577] Attempting to append 168 bytes to the log I0927 06:49:21.624620 30004 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 1 I0927 06:49:21.625282 30007 replica.cpp:537] Replica received write request for position 1 from __req_res__(6055)@172.17.0.2:49598 I0927 06:49:21.625720 30007 leveldb.cpp:341] Persisting action (187 bytes) to leveldb took 396032ns I0927 06:49:21.625746 30007 replica.cpp:708] Persisted action APPEND at position 1 I0927 06:49:21.626509 30012 replica.cpp:691] Replica received learned notice for position 1 from @0.0.0.0:0 I0927 06:49:21.626986 30012 leveldb.cpp:341] Persisting action (189 bytes) to leveldb took 328126ns I0927 06:49:21.627027 30012 replica.cpp:708] Persisted action APPEND at position 1 I0927 06:49:21.628249 30014 registrar.cpp:506] Successfully updated the registry in 4.504832ms I0927 06:49:21.628463 30016 log.cpp:596] Attempting to truncate the log to 1 I0927 06:49:21.628484 30014 registrar.cpp:392] Successfully recovered registrar I0927 06:49:21.628619 30005 coordinator.cpp:348] Coordinator attempting to write TRUNCATE action at position 2 I0927 06:49:21.629341 30010 master.cpp:1676] Recovered 0 agents from the registry (129B); allowing 10mins for agents to re-register I0927 06:49:21.629361 30007 hierarchical.cpp:176] Skipping recovery of hierarchical allocator: nothing to recover I0927 06:49:21.629873 30004 replica.cpp:537] Replica received write request for position 2 from __req_res__(6056)@172.17.0.2:49598 I0927 06:49:21.630329 30004 leveldb.cpp:341] Persisting action (16 bytes) to
[jira] [Created] (MESOS-6336) SlaveTest.KillTaskGroupBetweenRunTaskParts is flaky
Greg Mann created MESOS-6336: Summary: SlaveTest.KillTaskGroupBetweenRunTaskParts is flaky Key: MESOS-6336 URL: https://issues.apache.org/jira/browse/MESOS-6336 Project: Mesos Issue Type: Bug Components: slave Reporter: Greg Mann The test {{SlaveTest.KillTaskGroupBetweenRunTaskParts}} sometimes segfaults during the agent's {{finalize()}} method. This was observed on our internal CI, on Fedora with libev, without SSL: {code} [ RUN ] SlaveTest.KillTaskGroupBetweenRunTaskParts I1007 14:12:57.973811 28630 cluster.cpp:158] Creating default 'local' authorizer I1007 14:12:57.982128 28630 leveldb.cpp:174] Opened db in 8.195028ms I1007 14:12:57.982599 28630 leveldb.cpp:181] Compacted db in 446238ns I1007 14:12:57.982616 28630 leveldb.cpp:196] Created db iterator in 3650ns I1007 14:12:57.982622 28630 leveldb.cpp:202] Seeked to beginning of db in 451ns I1007 14:12:57.982627 28630 leveldb.cpp:271] Iterated through 0 keys in the db in 352ns I1007 14:12:57.982638 28630 replica.cpp:776] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned I1007 14:12:57.983024 28645 recover.cpp:451] Starting replica recovery I1007 14:12:57.983127 28651 recover.cpp:477] Replica is in EMPTY status I1007 14:12:57.983459 28644 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from __req_res__(6234)@172.30.2.161:38776 I1007 14:12:57.983543 28651 recover.cpp:197] Received a recover response from a replica in EMPTY status I1007 14:12:57.983680 28650 recover.cpp:568] Updating replica status to STARTING I1007 14:12:57.983990 28648 master.cpp:380] Master 76d4d55f-dcc6-4033-85d9-7ec97ef353cb (ip-172-30-2-161.ec2.internal.mesosphere.io) started on 172.30.2.161:38776 I1007 14:12:57.984007 28648 master.cpp:382] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" 
--authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/rVbcaO/credentials" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --quiet="false" --recovery_agent_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --root_submissions="true" --user_sorter="drf" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/rVbcaO/master" --zk_session_timeout="10secs" I1007 14:12:57.984127 28648 master.cpp:432] Master only allowing authenticated frameworks to register I1007 14:12:57.984134 28648 master.cpp:446] Master only allowing authenticated agents to register I1007 14:12:57.984139 28648 master.cpp:459] Master only allowing authenticated HTTP frameworks to register I1007 14:12:57.984143 28648 credentials.hpp:37] Loading credentials for authentication from '/tmp/rVbcaO/credentials' I1007 14:12:57.988487 28648 master.cpp:504] Using default 'crammd5' authenticator I1007 14:12:57.988530 28648 http.cpp:883] Using default 'basic' HTTP authenticator for realm 'mesos-master-readonly' I1007 14:12:57.988585 28648 http.cpp:883] Using default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' I1007 14:12:57.988648 28648 http.cpp:883] Using default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' I1007 14:12:57.988680 28648 master.cpp:584] Authorization enabled I1007 14:12:57.988757 28650 whitelist_watcher.cpp:77] No whitelist given I1007 14:12:57.988772 28646 
hierarchical.cpp:149] Initialized hierarchical allocator process I1007 14:12:57.988917 28651 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 5.186917ms I1007 14:12:57.988934 28651 replica.cpp:320] Persisted replica status to STARTING I1007 14:12:57.989045 28651 recover.cpp:477] Replica is in STARTING status I1007 14:12:57.989316 28648 master.cpp:2013] Elected as the leading master! I1007 14:12:57.989331 28648 master.cpp:1560] Recovering from registrar I1007 14:12:57.989408 28649 replica.cpp:673] Replica in STARTING status received a broadcasted recover request from __req_res__(6235)@172.30.2.161:38776 I1007 14:12:57.989423 28648 registrar.cpp:329] Recovering registrar I1007 14:12:57.989792 28647 recover.cpp:197] Received a recover response from a replica in STARTING status I1007 14:12:57.989956 28650 recover.cpp:568]
[jira] [Updated] (MESOS-6322) Agent fails to kill empty parent container
[ https://issues.apache.org/jira/browse/MESOS-6322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6322: -- Sprint: Mesosphere Sprint 44 Story Points: 3 > Agent fails to kill empty parent container > -- > > Key: MESOS-6322 > URL: https://issues.apache.org/jira/browse/MESOS-6322 > Project: Mesos > Issue Type: Bug >Reporter: Greg Mann >Assignee: Anand Mazumdar >Priority: Blocker > Labels: mesosphere > > I launched a pod using Marathon, which led to the launching of a task group > on a Mesos agent. The pod spec was flawed, so this led to Marathon repeatedly > re-launching multiple instances of the task group. After this went on for a > few minutes, I told Marathon to scale the app to 0 instances, which sends > {{TASK_KILLED}} for one task in each task group. After this, the Mesos agent > reports {{TASK_KILLED}} status updates for all 3 tasks in the pod, but > hitting the {{/containers}} endpoint on the agent reveals that the executor > container for this task group is still running. 
> Here is the task group launching on the agent: > {code} > slave.cpp:1696] Launching task group containing tasks [ > test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask1, > test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask2, > test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.clientTask ] for > framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834- > {code} > and here is the executor container starting: > {code} > mesos-agent[2994]: I1006 20:23:27.407563 3094 containerizer.cpp:965] > Starting container bf38ff09-3da1-487a-8926-1f4cc88bce32 for executor > 'instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601' of framework > 42838ca8-8d6b-475b-9b3b-59f3cd0d6834- > {code} > and here is the output showing the {{TASK_KILLED}} updates for one task group: > {code} > mesos-agent[2994]: I1006 20:23:28.728224 3097 slave.cpp:2283] Asked to kill > task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask1 of > framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834- > mesos-agent[2994]: W1006 20:23:28.728304 3097 slave.cpp:2364] Transitioning > the state of task > test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask1 of > framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834- to TASK_KILLED because > the executor is not registered > mesos-agent[2994]: I1006 20:23:28.728659 3097 slave.cpp:3609] Handling > status update TASK_KILLED (UUID: 1cb8197a-7829-4a05-9cb1-14ba97519c42) for > task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask1 of > framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834- from @0.0.0.0:0 > mesos-agent[2994]: I1006 20:23:28.728817 3097 slave.cpp:3609] Handling > status update TASK_KILLED (UUID: e377e9fb-6466-4ce5-b32a-37d840b9f87c) for > task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask2 of > framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834- from @0.0.0.0:0 > mesos-agent[2994]: I1006 20:23:28.728912 3097 slave.cpp:3609] Handling > status update TASK_KILLED (UUID: 
24d44b25-ea52-43a1-afdb-6c04389879d2) for > task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.clientTask of > framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834- from @0.0.0.0:0 > {code} > however, if we grep the log for the executor's ID, the last line mentioning > it is: > {code} > slave.cpp:3080] Creating a marker file for HTTP based executor > 'instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601' of framework > 42838ca8-8d6b-475b-9b3b-59f3cd0d6834- (via HTTP) at path > '/var/lib/mesos/slave/meta/slaves/42838ca8-8d6b-475b-9b3b-59f3cd0d6834-S0/frameworks/42838ca8-8d6b-475b-9b3b-59f3cd0d6834-/executors/instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601/runs/bf38ff09-3da1-487a-8926-1f4cc88bce32/http.marker' > {code} > so it seems the executor never exited. If we hit the agent's {{/containers}} > endpoint, we get output which includes this executor container: > {code} > { > "container_id": "bf38ff09-3da1-487a-8926-1f4cc88bce32", > "executor_id": "instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601", > "executor_name": "", > "framework_id": "42838ca8-8d6b-475b-9b3b-59f3cd0d6834-", > "source": "", > "statistics": { > "cpus_limit": 0.1, > "cpus_nr_periods": 17, > "cpus_nr_throttled": 11, > "cpus_system_time_secs": 0.02, > "cpus_throttled_time_secs": 0.784142448, > "cpus_user_time_secs": 0.09, > "disk_limit_bytes": 10485760, > "disk_used_bytes": 20480, > "mem_anon_bytes": 11337728, > "mem_cache_bytes": 0, > "mem_critical_pressure_counter": 0, > "mem_file_bytes": 0, > "mem_limit_bytes": 33554432, > "mem_low_pressure_counter": 0, >
[jira] [Commented] (MESOS-5275) Add capabilities support for unified containerizer.
[ https://issues.apache.org/jira/browse/MESOS-5275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1543#comment-1543 ] Jie Yu commented on MESOS-5275: --- commit 4ea9651aabd01f623f2578d2823271488d924c5b Author: Benjamin Bannier Date: Wed Oct 5 21:44:04 2016 -0700 Created an isolator for Linux capabilities. Review: https://reviews.apache.org/r/50271/ commit f6a25360053fc38e843129cc7e1f9fe4cf757ecd Author: Benjamin Bannier Date: Wed Oct 5 21:35:40 2016 -0700 Reorganized includes in containerizer. Review: https://reviews.apache.org/r/52081/ commit e7d1f53621a09da47ee7dc5d6fcd6326cb72792d Author: Benjamin Bannier Date: Wed Oct 5 21:28:12 2016 -0700 Added `ping` to test linux rootfs. Review: https://reviews.apache.org/r/51931/ commit 5e3648c871f8008d8e11390b2ccba86c59d82f70 Author: Benjamin Bannier Date: Wed Oct 5 20:55:42 2016 -0700 Introduced Linux capabilities support for Mesos executor. This change introduces Linux capability-based security to the Mesos executor. A new flag `capabilities` is introduced to optionally specify the capabilities tasks launched by the Mesos executor are allowed to use. Review: https://reviews.apache.org/r/51930/ > Add capabilities support for unified containerizer. > --- > > Key: MESOS-5275 > URL: https://issues.apache.org/jira/browse/MESOS-5275 > Project: Mesos > Issue Type: Task > Components: containerization >Reporter: Jojy Varghese >Assignee: Benjamin Bannier > Labels: mesosphere > Fix For: 1.1.0 > > > Add capabilities support for unified containerizer. > Requirements: > 1. Use the mesos capabilities API. > 2. Frameworks should be able to add capability requests for containers. > 3. Agents should be able to add maximum allowed capabilities for all containers > launched. > Design document: > https://docs.google.com/document/d/1YiTift8TQla2vq3upQr7K-riQ_pQ-FKOCOsysQJROGc/edit#heading=h.rgfwelqrskmd -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot
[ https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1500#comment-1500 ] Megha commented on MESOS-6223: -- This JIRA came out as a prerequisite for supporting task restart after a reboot. There are definitely use cases where you would need a persistent agent ID, because resources like persistent volumes are not tied to the lifecycle of the ephemeral agent and exist even after the agent is gone. But in order to support task restart on the rebooted host, we need the previous agent ID or session ID (from MESOS-5368) to recover, figure out which tasks to restart, and eventually restart them. So I believe agent or session recovery after a reboot is needed. Recovery being short-circuited after a reboot is an optimization, given that no tasks/executors are running after the agent's host reboots; this will change with MESOS-3545. > Allow agents to re-register post a host reboot > -- > > Key: MESOS-6223 > URL: https://issues.apache.org/jira/browse/MESOS-6223 > Project: Mesos > Issue Type: Improvement > Components: slave >Reporter: Megha > > Agent doesn't recover its state post a host reboot; it registers with the > master and gets a new SlaveID. With partition awareness, the agents are now > allowed to re-register after they have been marked Unreachable. The executors > are anyway terminated on the agent when it reboots so there is no harm in > letting the agent keep its SlaveID, re-register with the master and reconcile > the lost executors. This is a pre-requisite for supporting > persistent/restartable tasks in mesos (MESOS-3545). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6327) Large docker images make the mesos containerizer crash with: Too many levels of symbolic links
[ https://issues.apache.org/jira/browse/MESOS-6327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1358#comment-1358 ] Gilbert Song commented on MESOS-6327: - [~a-nldisr] Thanks for reporting this issue. Currently, Mesos selects the `copy` backend for the unified containerizer by default. However, for better performance with large images (or images with many layers), we would recommend using the `overlay` or `aufs` backend. Supporting automatic backend selection by default is being considered in MESOS-5931. We need to fix this issue in the copy backend. Could you please test whether you are still blocked when using the `overlay` backend? Hopefully that resolves your issue. > Large docker images make the mesos containerizer crash with: Too many levels > of symbolic links > -- > > Key: MESOS-6327 > URL: https://issues.apache.org/jira/browse/MESOS-6327 > Project: Mesos > Issue Type: Bug > Components: containerization, docker >Affects Versions: 1.0.0, 1.0.1 > Environment: centos 7.2 (1511), ubuntu 14.04 (trusty). Replicated in > the Apache Aurora vagrant image >Reporter: Rogier Dikkes >Priority: Critical > > When deploying Mesos containers with large (6G+, 60+ layers) Docker images > the task crashes with the error: > Mesos agent logs: > E1007 08:40:12.954227 8117 slave.cpp:3976] Container > 'a1d759ae-5bc6-4c4e-ac03-717fbb8e5da4' for executor > 'thermos-www-data-devel-hello_docker_image-0-d42d2af6-6b44-4b2b-be95-e1ba93a6b365' > of framework dfc91a86-84b9-4539-a7be-4ace7b7b44a1- failed to start: Collect failed: > Collect failed: Failed to copy layer: cp: cannot stat > ‘/var/lib/mesos/provisioner/containers/a1d759ae-5bc6-4c4e-ac03-717fbb8e5da4/backends/copy/rootfses/5f328f72-25d4-4a26-ac83-8d30bbc44e97/usr/share/zoneinfo/right/Asia/Urumqi’: > Too many levels of symbolic links > ... (complete pastebin: http://pastebin.com/umZ4Q5d1 ) > How to replicate: > Start the aurora vagrant image. Adjust the > /etc/mesos-slave/executor_registration_timeout to 5 mins. 
Adjust the file > /vagrant/examples/jobs/hello_docker_image.aurora to start a large Docker > image instead of the example. (You can use anldisr/jupyter:0.4, which I created as a > test image; it is based on the Jupyter notebook stacks.) Create the job and > watch it fail after x number of minutes. > The mesos sandbox is empty. > Aurora errors I see: > 28 minutes ago - FAILED : Failed to launch container: Collect failed: Collect > failed: Failed to copy layer: cp: cannot stat > ‘/var/lib/mesos/provisioner/containers/93420a36-0e0c-4f04-b401-74c426c25686/backends/copy/rootfses/6e185a51-7174-4b0d-a305-42b634eb91bb/usr/share/zoneinfo/right/Asia/Urumqi’: > Too many levels of symbolic links cp: cannot stat ... > Too many levels of symbolic links ; Container destroyed while provisioning > images > (complete pastebin: http://pastebin.com/uecHYD5J ) > To rule out the image i started this and more images as a normal Docker > container. This works without issues. > Mesos flags related configured: > -appc_store_dir > /tmp/mesos/images/appc > -containerizers > docker,mesos > -executor_registration_timeout > 5mins > -image_providers > appc,docker > -image_provisioner_backend > copy > -isolation > filesystem/linux,docker/runtime > Affected Mesos versions tested: 1.0.1 & 1.0.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
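As the comment above suggests, switching away from the `copy` backend is a one-flag change on the agent. A sketch based on the flags already listed in this report (the master address and work directory are illustrative):

```shell
# Use the overlay backend instead of copy for image provisioning.
# Requires a kernel with overlayfs support (mainlined in Linux 3.18).
mesos-agent --master=zk://localhost:2181/mesos \
            --work_dir=/var/lib/mesos \
            --containerizers=docker,mesos \
            --image_providers=appc,docker \
            --image_provisioner_backend=overlay \
            --isolation=filesystem/linux,docker/runtime
```

The `overlay` backend mounts the image layers rather than copying them file by file, which sidesteps the symlink traversal that makes `copy` fail here.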
[jira] [Created] (MESOS-6335) Add user doc for task group tasks
Vinod Kone created MESOS-6335: - Summary: Add user doc for task group tasks Key: MESOS-6335 URL: https://issues.apache.org/jira/browse/MESOS-6335 Project: Mesos Issue Type: Documentation Reporter: Vinod Kone -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6333) Don't send TASK_LOST when removing a framework from an agent
Neil Conway created MESOS-6333: -- Summary: Don't send TASK_LOST when removing a framework from an agent Key: MESOS-6333 URL: https://issues.apache.org/jira/browse/MESOS-6333 Project: Mesos Issue Type: Improvement Components: master Reporter: Neil Conway Assignee: Neil Conway Update this code:
{code}
// Remove pointers to framework's tasks in slaves, and send status
// updates.
// NOTE: A copy is needed because removeTask modifies slave->tasks.
foreachvalue (Task* task, utils::copy(slave->tasks[framework->id()])) {
  // Remove tasks that belong to this framework.
  if (task->framework_id() == framework->id()) {
    // A framework might not actually exist because the master failed
    // over and the framework hasn't reconnected yet. For more info
    // please see the comments in 'removeFramework(Framework*)'.
    const StatusUpdate& update = protobuf::createStatusUpdate(
        task->framework_id(),
        task->slave_id(),
        task->task_id(),
        TASK_LOST,
        TaskStatus::SOURCE_MASTER,
        None(),
        "Slave " + slave->info.hostname() + " disconnected",
        TaskStatus::REASON_SLAVE_DISCONNECTED,
        (task->has_executor_id()
           ? Option<ExecutorID>(task->executor_id())
           : None()));

    updateTask(task, update);
    removeTask(task);

    forward(update, UPID(), framework);
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6332) Don't send TASK_LOST in the agent
Neil Conway created MESOS-6332: -- Summary: Don't send TASK_LOST in the agent Key: MESOS-6332 URL: https://issues.apache.org/jira/browse/MESOS-6332 Project: Mesos Issue Type: Improvement Components: slave Reporter: Neil Conway Assignee: Neil Conway The agent sends {{TASK_LOST}} to handle various error situations. For partition-aware frameworks, we should not send {{TASK_LOST}} -- we should send a more specific {{TaskState}}, depending on the exact circumstances. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6331) Don't send TASK_LOST when accepting offers in a disconnected scheduler
Neil Conway created MESOS-6331: -- Summary: Don't send TASK_LOST when accepting offers in a disconnected scheduler Key: MESOS-6331 URL: https://issues.apache.org/jira/browse/MESOS-6331 Project: Mesos Issue Type: Improvement Components: scheduler driver Reporter: Neil Conway Assignee: Neil Conway Update this to send TASK_DROPPED for partition-aware frameworks:
{code}
if (!connected) {
  VLOG(1) << "Ignoring accept offers message as master is disconnected";

  // NOTE: Reply to the framework with TASK_LOST messages for each
  // task launch. See details from notes in launchTasks.
  foreach (const Offer::Operation& operation, operations) {
    if (operation.type() != Offer::Operation::LAUNCH) {
      continue;
    }

    foreach (const TaskInfo& task, operation.launch().task_infos()) {
      StatusUpdate update = protobuf::createStatusUpdate(
          framework.id(),
          None(),
          task.task_id(),
          TASK_LOST,
          TaskStatus::SOURCE_MASTER,
          None(),
          "Master disconnected",
          TaskStatus::REASON_MASTER_DISCONNECTED);

      statusUpdate(UPID(), update, UPID());
    }
  }
  return;
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6330) Send TASK_UNKNOWN for tasks on unknown agents
Neil Conway created MESOS-6330: -- Summary: Send TASK_UNKNOWN for tasks on unknown agents Key: MESOS-6330 URL: https://issues.apache.org/jira/browse/MESOS-6330 Project: Mesos Issue Type: Improvement Components: master Reporter: Neil Conway Assignee: Neil Conway In Mesos <= 1.0, we send {{TASK_LOST}} for explicit reconciliation requests for tasks running on agents the master has never heard about. For partition-aware frameworks in Mesos >= 1.1, we should instead send TASK_UNKNOWN in this situation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6329) Send TASK_DROPPED for task launch errors
Neil Conway created MESOS-6329: -- Summary: Send TASK_DROPPED for task launch errors Key: MESOS-6329 URL: https://issues.apache.org/jira/browse/MESOS-6329 Project: Mesos Issue Type: Improvement Components: master Reporter: Neil Conway Assignee: Neil Conway In Mesos <= 1.0, we send {{TASK_LOST}} for task launch attempts that fail due to a transient error (e.g., a concurrent dynamic reservation that consumes the resources the task launch was trying to use). For PARTITION_AWARE frameworks, we should instead send TASK_DROPPED in this case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
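Taken together, MESOS-6329, MESOS-6330, and MESOS-6331 replace the catch-all {{TASK_LOST}} with more specific states for PARTITION_AWARE frameworks. A compact sketch of the intended mapping — the enums and helper here are illustrative; the real {{TaskState}} lives in mesos.proto:

```cpp
// Illustrative enums; Mesos' real TaskState is defined in mesos.proto.
enum class TaskState { TASK_LOST, TASK_DROPPED, TASK_UNKNOWN };

enum class Circumstance {
  MASTER_DISCONNECTED_AT_LAUNCH,  // MESOS-6331: driver drops the launch.
  RECONCILE_UNKNOWN_AGENT,        // MESOS-6330: agent never seen by master.
  TRANSIENT_LAUNCH_ERROR          // MESOS-6329: e.g. concurrent reservation.
};

// Legacy frameworks keep TASK_LOST; PARTITION_AWARE ones get the more
// specific state named by each ticket.
TaskState stateFor(Circumstance c, bool partitionAware) {
  if (!partitionAware) {
    return TaskState::TASK_LOST;
  }
  switch (c) {
    case Circumstance::MASTER_DISCONNECTED_AT_LAUNCH:
    case Circumstance::TRANSIENT_LAUNCH_ERROR:
      return TaskState::TASK_DROPPED;
    case Circumstance::RECONCILE_UNKNOWN_AGENT:
      return TaskState::TASK_UNKNOWN;
  }
  return TaskState::TASK_LOST;  // Unreachable; placates the compiler.
}
```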
[jira] [Created] (MESOS-6328) Make initialization of openssl eager
Benjamin Bannier created MESOS-6328: --- Summary: Make initialization of openssl eager Key: MESOS-6328 URL: https://issues.apache.org/jira/browse/MESOS-6328 Project: Mesos Issue Type: Bug Components: security Reporter: Benjamin Bannier Priority: Minor Currently openssl is initialized lazily: {{openssl::initialize}} is only called when the first ssl socket is created with {{LibeventSSLSocketImpl::create}}. It should instead be possible to call it eagerly, in the spots where {{process::initialize}} is called. This was brought up during https://reviews.apache.org/r/52154/. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6216) LibeventSSLSocketImpl::create is not safe to call concurrently with os::getenv
[ https://issues.apache.org/jira/browse/MESOS-6216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15554910#comment-15554910 ] Till Toenshoff commented on MESOS-6216: --- Today. > LibeventSSLSocketImpl::create is not safe to call concurrently with os::getenv > -- > > Key: MESOS-6216 > URL: https://issues.apache.org/jira/browse/MESOS-6216 > Project: Mesos > Issue Type: Bug > Components: security >Reporter: Benjamin Bannier >Assignee: Benjamin Bannier > Labels: mesosphere > Attachments: build.log > > > {{LibeventSSLSocketImpl::create}} is called whenever a potentially > ssl-enabled socket is created. It in turn calls {{openssl::initialize}} which > calls a function {{reinitialize}} using {{os::setenv}}. Here {{os::setenv}} > is used to set up SSL-related libprocess environment variables > {{LIBPROCESS_SSL_*}}. > Since {{os::setenv}} is not thread-safe, just like the {{::setenv}} it wraps, > calling functions like {{os::getenv}} (or via {{os::environment}}) > concurrently with the first invocation of {{LibeventSSLSocketImpl::create}} > performs unsynchronized r/w access to the same data structure in the runtime. > We usually perform most setup of the environment before we start the > libprocess runtime with {{process::initialize}} from a {{main}} function, see > e.g., {{src/slave/main.cpp}} or {{src/master/main.cpp}} and others. It > appears that we should move the setup of libprocess' SSL environment > variables to a similar spot. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6207) Python bindings fail to build with custom SVN installation path
[ https://issues.apache.org/jira/browse/MESOS-6207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15554765#comment-15554765 ] Till Toenshoff commented on MESOS-6207: --- [~ipronin] your reviewboard profile seems to be incomplete, causing your patch to not have an author attribute set. I manually fixed that for this patch, but you might want to fix it permanently in your ReviewBoard account. The missing email address seems to be the root cause. > Python bindings fail to build with custom SVN installation path > --- > > Key: MESOS-6207 > URL: https://issues.apache.org/jira/browse/MESOS-6207 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: 1.0.1 >Reporter: Ilya Pronin >Assignee: Ilya Pronin >Priority: Trivial > Fix For: 1.1.0 > > > In {{src/Makefile.am}} the {{PYTHON_LDFLAGS}} variable is used while building > the Python bindings. This variable picks up {{LDFLAGS}} during the configuration phase, > before we check for a custom SVN installation path, and so misses the > {{-L$\{with_svn\}/lib}} flag. That causes a link error on systems with an > uncommon SVN installation path. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6321) CHECK failure in HierarchicalAllocatorTest.NoDoubleAccounting
[ https://issues.apache.org/jira/browse/MESOS-6321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-6321: --- Shepherd: Michael Park Sprint: Mesosphere Sprint 44 Story Points: 1 Target Version/s: 1.1.0 > CHECK failure in HierarchicalAllocatorTest.NoDoubleAccounting > - > > Key: MESOS-6321 > URL: https://issues.apache.org/jira/browse/MESOS-6321 > Project: Mesos > Issue Type: Bug >Reporter: Neil Conway >Assignee: Alexander Rukletsov > Labels: mesosphere > > Observed in internal CI: > {noformat} > [15:52:21] : [Step 10/10] [ RUN ] > HierarchicalAllocatorTest.NoDoubleAccounting > [15:52:21]W: [Step 10/10] I1006 15:52:21.813817 23713 > hierarchical.cpp:275] Added framework framework1 > [15:52:21]W: [Step 10/10] I1006 15:52:21.814100 23713 > hierarchical.cpp:1694] No allocations performed > [15:52:21]W: [Step 10/10] I1006 15:52:21.814102 23712 process.cpp:3377] > Handling HTTP event for process 'metrics' with path: '/metrics/snapshot' > [15:52:21]W: [Step 10/10] I1006 15:52:21.814121 23713 > hierarchical.cpp:1789] No inverse offers to send out! > [15:52:21]W: [Step 10/10] I1006 15:52:21.814146 23713 > hierarchical.cpp:1286] Performed allocation for 0 agents in 52445ns > [15:52:21]W: [Step 10/10] I1006 15:52:21.814206 23713 > hierarchical.cpp:485] Added agent agent1 (agent1) with cpus(*):1 (allocated: > cpus(*):1) > [15:52:21]W: [Step 10/10] I1006 15:52:21.814237 23713 > hierarchical.cpp:1694] No allocations performed > [15:52:21]W: [Step 10/10] I1006 15:52:21.814247 23713 > hierarchical.cpp:1789] No inverse offers to send out! 
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814259 23713 > hierarchical.cpp:1309] Performed allocation for agent agent1 in 33887ns > [15:52:21]W: [Step 10/10] I1006 15:52:21.814294 23713 > hierarchical.cpp:485] Added agent agent2 (agent2) with cpus(*):1 (allocated: > cpus(*):1) > [15:52:21]W: [Step 10/10] I1006 15:52:21.814332 23713 > hierarchical.cpp:1694] No allocations performed > [15:52:21]W: [Step 10/10] I1006 15:52:21.814342 23713 > hierarchical.cpp:1789] No inverse offers to send out! > [15:52:21]W: [Step 10/10] I1006 15:52:21.814349 23713 > hierarchical.cpp:1309] Performed allocation for agent agent2 in 42682ns > [15:52:21]W: [Step 10/10] I1006 15:52:21.814417 23713 > hierarchical.cpp:275] Added framework framework2 > [15:52:21]W: [Step 10/10] I1006 15:52:21.814445 23713 > hierarchical.cpp:1694] No allocations performed > [15:52:21]W: [Step 10/10] I1006 15:52:21.814455 23713 > hierarchical.cpp:1789] No inverse offers to send out! > [15:52:21]W: [Step 10/10] I1006 15:52:21.814469 23713 > hierarchical.cpp:1286] Performed allocation for 2 agents in 37976ns > [15:52:21]W: [Step 10/10] F1006 15:52:21.824954 23692 json.hpp:334] Check > failed: 'boost::get(this)' Must be non NULL > [15:52:21]W: [Step 10/10] *** Check failure stack trace: *** > [15:52:21]W: [Step 10/10] @ 0x7fe953bbd71d > google::LogMessage::Fail() > [15:52:21]W: [Step 10/10] @ 0x7fe953bbf55d > google::LogMessage::SendToLog() > [15:52:21]W: [Step 10/10] @ 0x7fe953bbd30c > google::LogMessage::Flush() > [15:52:21]W: [Step 10/10] @ 0x7fe953bbfe59 > google::LogMessageFatal::~LogMessageFatal() > [15:52:21]W: [Step 10/10] @ 0x7cc903 JSON::Value::as<>() > [15:52:21]W: [Step 10/10] @ 0x8b633c > mesos::internal::tests::HierarchicalAllocatorTest_NoDoubleAccounting_Test::TestBody() > [15:52:21]W: [Step 10/10] @ 0x129ce23 > testing::internal::HandleExceptionsInMethodIfSupported<>() > [15:52:21]W: [Step 10/10] @ 0x1292f07 testing::Test::Run() > [15:52:21]W: [Step 10/10] @ 0x1292fae > testing::TestInfo::Run() > 
[15:52:21]W: [Step 10/10] @ 0x12930b5 > testing::TestCase::Run() > [15:52:21]W: [Step 10/10] @ 0x1293368 > testing::internal::UnitTestImpl::RunAllTests() > [15:52:21]W: [Step 10/10] @ 0x1293624 > testing::UnitTest::Run() > [15:52:21]W: [Step 10/10] @ 0x507254 main > [15:52:21]W: [Step 10/10] @ 0x7fe95122876d (unknown) > [15:52:21]W: [Step 10/10] @ 0x51e341 (unknown) > [15:52:21]W: [Step 10/10] Aborted (core dumped) > [15:52:21]W: [Step 10/10] Process exited with code 134 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6321) CHECK failure in HierarchicalAllocatorTest.NoDoubleAccounting
[ https://issues.apache.org/jira/browse/MESOS-6321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15554654#comment-15554654 ] Alexander Rukletsov commented on MESOS-6321: Good run should look like this: {noformat} [ RUN ] HierarchicalAllocatorTest.NoDoubleAccounting I1007 11:29:37.357229 3211264 hierarchical.cpp:149] Initialized hierarchical allocator process I1007 11:29:37.357724 1601536 hierarchical.cpp:275] Added framework framework1 I1007 11:29:37.357810 1601536 hierarchical.cpp:1694] No allocations performed I1007 11:29:37.357842 1601536 hierarchical.cpp:1789] No inverse offers to send out! I1007 11:29:37.357875 1601536 hierarchical.cpp:1286] Performed allocation for 0 agents in 127us I1007 11:29:37.358070 1601536 hierarchical.cpp:485] Added agent agent1 (agent1) with cpus(*):1 (allocated: cpus(*):1) I1007 11:29:37.358151 1601536 hierarchical.cpp:1694] No allocations performed I1007 11:29:37.358165 1601536 hierarchical.cpp:1789] No inverse offers to send out! I1007 11:29:37.358182 1601536 hierarchical.cpp:1309] Performed allocation for agent agent1 in 87us I1007 11:29:37.358243 1601536 hierarchical.cpp:485] Added agent agent2 (agent2) with cpus(*):1 (allocated: cpus(*):1) I1007 11:29:37.358337 1601536 hierarchical.cpp:1694] No allocations performed I1007 11:29:37.358361 1601536 hierarchical.cpp:1789] No inverse offers to send out! I1007 11:29:37.358373 1601536 hierarchical.cpp:1309] Performed allocation for agent agent2 in 102us I1007 11:29:37.358554 1601536 hierarchical.cpp:275] Added framework framework2 I1007 11:29:37.358619 1601536 hierarchical.cpp:1694] No allocations performed I1007 11:29:37.358649 1601536 hierarchical.cpp:1789] No inverse offers to send out! 
I1007 11:29:37.358662 1601536 hierarchical.cpp:1286] Performed allocation for 2 agents in 95us I1007 11:29:37.358786 1064960 process.cpp:3377] Handling HTTP event for process 'metrics' with path: '/metrics/snapshot' [ OK ] HierarchicalAllocatorTest.NoDoubleAccounting (18 ms) {noformat} The test failed because allocation events are processed after the metrics event, meaning metrics do not contain information we are looking for. The fix would be to make sure allocation events are processed *before* querying metrics. > CHECK failure in HierarchicalAllocatorTest.NoDoubleAccounting > - > > Key: MESOS-6321 > URL: https://issues.apache.org/jira/browse/MESOS-6321 > Project: Mesos > Issue Type: Bug >Reporter: Neil Conway >Assignee: Alexander Rukletsov > Labels: mesosphere > > Observed in internal CI: > {noformat} > [15:52:21] : [Step 10/10] [ RUN ] > HierarchicalAllocatorTest.NoDoubleAccounting > [15:52:21]W: [Step 10/10] I1006 15:52:21.813817 23713 > hierarchical.cpp:275] Added framework framework1 > [15:52:21]W: [Step 10/10] I1006 15:52:21.814100 23713 > hierarchical.cpp:1694] No allocations performed > [15:52:21]W: [Step 10/10] I1006 15:52:21.814102 23712 process.cpp:3377] > Handling HTTP event for process 'metrics' with path: '/metrics/snapshot' > [15:52:21]W: [Step 10/10] I1006 15:52:21.814121 23713 > hierarchical.cpp:1789] No inverse offers to send out! > [15:52:21]W: [Step 10/10] I1006 15:52:21.814146 23713 > hierarchical.cpp:1286] Performed allocation for 0 agents in 52445ns > [15:52:21]W: [Step 10/10] I1006 15:52:21.814206 23713 > hierarchical.cpp:485] Added agent agent1 (agent1) with cpus(*):1 (allocated: > cpus(*):1) > [15:52:21]W: [Step 10/10] I1006 15:52:21.814237 23713 > hierarchical.cpp:1694] No allocations performed > [15:52:21]W: [Step 10/10] I1006 15:52:21.814247 23713 > hierarchical.cpp:1789] No inverse offers to send out! 
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814259 23713 > hierarchical.cpp:1309] Performed allocation for agent agent1 in 33887ns > [15:52:21]W: [Step 10/10] I1006 15:52:21.814294 23713 > hierarchical.cpp:485] Added agent agent2 (agent2) with cpus(*):1 (allocated: > cpus(*):1) > [15:52:21]W: [Step 10/10] I1006 15:52:21.814332 23713 > hierarchical.cpp:1694] No allocations performed > [15:52:21]W: [Step 10/10] I1006 15:52:21.814342 23713 > hierarchical.cpp:1789] No inverse offers to send out! > [15:52:21]W: [Step 10/10] I1006 15:52:21.814349 23713 > hierarchical.cpp:1309] Performed allocation for agent agent2 in 42682ns > [15:52:21]W: [Step 10/10] I1006 15:52:21.814417 23713 > hierarchical.cpp:275] Added framework framework2 > [15:52:21]W: [Step 10/10] I1006 15:52:21.814445 23713 > hierarchical.cpp:1694] No allocations performed > [15:52:21]W: [Step 10/10] I1006 15:52:21.814455 23713 > hierarchical.cpp:1789] No inverse offers to send out! > [15:52:21]W: [Step 10/10] I1006 15:52:21.814469 23713 > hierarchical.cpp:1286] Performed allocation for 2 agents in 37976ns >
[jira] [Commented] (MESOS-2723) The mesos-execute tool does not support zk:// master URLs
[ https://issues.apache.org/jira/browse/MESOS-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15554589#comment-15554589 ] Christian Parpart commented on MESOS-2723: -- Hey, I was just expecting the --master flag to support zk URLs, too, so I ended up in this ticket. Can we bump the review again, somehow? Best regards, Christian. > The mesos-execute tool does not support zk:// master URLs > - > > Key: MESOS-2723 > URL: https://issues.apache.org/jira/browse/MESOS-2723 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.22.1 >Reporter: Tom Arnfeld >Assignee: Tom Arnfeld > > It appears that the {{mesos-execute}} command line tool does its own PID > validation of the {{--master}} param, which prevents it from supporting > clusters managed with ZooKeeper. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6327) Large docker images make the mesos containerizer crash with: Too many levels of symbolic links
Rogier Dikkes created MESOS-6327: Summary: Large docker images make the mesos containerizer crash with: Too many levels of symbolic links Key: MESOS-6327 URL: https://issues.apache.org/jira/browse/MESOS-6327 Project: Mesos Issue Type: Bug Components: containerization, docker Affects Versions: 1.0.1, 1.0.0 Environment: centos 7.2 (1511), ubuntu 14.04 (trusty). Replicated in the Apache Aurora vagrant image Reporter: Rogier Dikkes Priority: Critical When deploying Mesos containers with large (6G+, 60+ layers) Docker images, the task crashes with the error: Mesos agent logs: E1007 08:40:12.954227 8117 slave.cpp:3976] Container 'a1d759ae-5bc6-4c4e-ac03-717fbb8e5da4' for executor 'thermos-www-data-devel-hello_docker_image-0-d42d2af6-6b44-4b2b-be95-e1ba93a6b365' of framework dfc91a86-84b9-4539-a7be-4ace7b7b44a1- failed to start: Collect failed: Collect failed: Failed to copy layer: cp: cannot stat ‘/var/lib/mesos/provisioner/containers/a1d759ae-5bc6-4c4e-ac03-717fbb8e5da4/backends/copy/rootfses/5f328f72-25d4-4a26-ac83-8d30bbc44e97/usr/share/zoneinfo/right/Asia/Urumqi’: Too many levels of symbolic links ... (complete pastebin: http://pastebin.com/umZ4Q5d1 ) How to replicate: Start the Aurora vagrant image. Adjust the /etc/mesos-slave/executor_registration_timeout to 5 mins. Adjust the file /vagrant/examples/jobs/hello_docker_image.aurora to start a large Docker image instead of the example (you can use anldisr/jupyter:0.4, a test image I created based on the Jupyter notebook stacks). Create the job and watch it fail after a number of minutes. The mesos sandbox is empty. Aurora errors I see: 28 minutes ago - FAILED : Failed to launch container: Collect failed: Collect failed: Failed to copy layer: cp: cannot stat ‘/var/lib/mesos/provisioner/containers/93420a36-0e0c-4f04-b401-74c426c25686/backends/copy/rootfses/6e185a51-7174-4b0d-a305-42b634eb91bb/usr/share/zoneinfo/right/Asia/Urumqi’: Too many levels of symbolic links cp: cannot stat ... 
Too many levels of symbolic links ; Container destroyed while provisioning images (complete pastebin: http://pastebin.com/uecHYD5J ) To rule out the image, I started this and other images as normal Docker containers; this works without issues. Related Mesos flags configured: -appc_store_dir /tmp/mesos/images/appc -containerizers docker,mesos -executor_registration_timeout 5mins -image_providers appc,docker -image_provisioner_backend copy -isolation filesystem/linux,docker/runtime Affected Mesos versions tested: 1.0.1 & 1.0.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6326) Build failed on Mac
Klaus Ma created MESOS-6326: --- Summary: Build failed on Mac Key: MESOS-6326 URL: https://issues.apache.org/jira/browse/MESOS-6326 Project: Mesos Issue Type: Bug Affects Versions: 1.0.1 Reporter: Klaus Ma Priority: Minor Building Mesos 1.0.1 failed on Mac: {{uname -a}}: Darwin Klauss-MacBook-Pro.local 16.0.0 Darwin Kernel Version 16.0.0: Mon Aug 29 17:56:20 PDT 2016; root:xnu-3789.1.32~3/RELEASE_X86_64 x86_64 {code} In file included from ../../src/appc/spec.cpp:19: In file included from ../../3rdparty/stout/include/stout/protobuf.hpp:31: In file included from ../3rdparty/protobuf-2.6.1/src/google/protobuf/repeated_field.h:58: In file included from ../3rdparty/protobuf-2.6.1/src/google/protobuf/generated_message_util.h:44: In file included from ../3rdparty/protobuf-2.6.1/src/google/protobuf/stubs/once.h:81: In file included from ../3rdparty/protobuf-2.6.1/src/google/protobuf/stubs/atomicops.h:184: ../3rdparty/protobuf-2.6.1/src/google/protobuf/stubs/atomicops_internals_macosx.h:164:10: error: 'OSAtomicAdd64Barrier' is deprecated: first deprecated in macOS 10.12 - Use std::atomic_fetch_add() from <atomic> instead [-Werror,-Wdeprecated-declarations] return OSAtomicAdd64Barrier(increment, ^ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.12.sdk/usr/include/libkern/OSAtomicDeprecated.h:247:9: note: 'OSAtomicAdd64Barrier' has been explicitly marked deprecated here int64_t OSAtomicAdd64Barrier( int64_t __theAmount, ^ In file included from ../../src/appc/spec.cpp:19: In file included from ../../3rdparty/stout/include/stout/protobuf.hpp:31: In file included from ../3rdparty/protobuf-2.6.1/src/google/protobuf/repeated_field.h:58: In file included from ../3rdparty/protobuf-2.6.1/src/google/protobuf/generated_message_util.h:44: In file included from ../3rdparty/protobuf-2.6.1/src/google/protobuf/stubs/once.h:81: In file included from ../3rdparty/protobuf-2.6.1/src/google/protobuf/stubs/atomicops.h:184: 
../3rdparty/protobuf-2.6.1/src/google/protobuf/stubs/atomicops_internals_macosx.h:173:9: error: 'OSAtomicCompareAndSwap64Barrier' is deprecated: first deprecated in macOS 10.12 - Use std::atomic_compare_exchange_strong() from <atomic> instead [-Werror,-Wdeprecated-declarations] if (OSAtomicCompareAndSwap64Barrier( ^ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.12.sdk/usr/include/libkern/OSAtomicDeprecated.h:645:9: note: 'OSAtomicCompareAndSwap64Barrier' has been explicitly marked deprecated here bool OSAtomicCompareAndSwap64Barrier( int64_t __oldValue, int64_t __newValue, ^ 12 errors generated. make[2]: *** [appc/libmesos_no_3rdparty_la-spec.lo] Error 1 make[1]: *** [all] Error 2 make: *** [all-recursive] Error 1 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6325) Boolean member Executor::commandExecutor not always properly initialized
Benjamin Bannier created MESOS-6325: --- Summary: Boolean member Executor::commandExecutor not always properly initialized Key: MESOS-6325 URL: https://issues.apache.org/jira/browse/MESOS-6325 Project: Mesos Issue Type: Bug Components: slave Reporter: Benjamin Bannier The constructor of {{Executor}} in {{src/slave/slave}} does not make sure that the member variable {{commandExecutor}} is always set. The following logic is used to determine its value, {code} Result<string> executorPath = os::realpath(path::join(slave->flags.launcher_dir, MESOS_EXECUTOR)); if (executorPath.isSome()) { commandExecutor = strings::contains(info.command().value(), executorPath.get()); } {code} Should we fail to determine the realpath of the mesos executor, {{commandExecutor}} will not be set. Since {{commandExecutor}} is a scalar field, no default initialization happens and its value will be random memory (which might often evaluate to {{true}}). We need to make sure the member variable is set on all branches. Looking at the code, it seems we might be able to just explicitly assert that {{executorPath}} is some. This was pointed out by coverity, https://scan5.coverity.com/reports.htm#v10074/p10429/fileInstanceId=100298128=28784922=1373526. -- This message was sent by Atlassian JIRA (v6.3.4#6332)