[jira] [Commented] (MESOS-5600) "dirty" was never set back as false in sorter
[ https://issues.apache.org/jira/browse/MESOS-5600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325712#comment-15325712 ] Klaus Ma commented on MESOS-5600: - Thinking more about this, the current behaviour is right: if the total is updated, the sorter re-calculates everything; otherwise, only the related client is updated. When updating a weight, we should do the same thing. > "dirty" was never set back as false in sorter > - > > Key: MESOS-5600 > URL: https://issues.apache.org/jira/browse/MESOS-5600 > Project: Mesos > Issue Type: Bug > Components: allocation > Reporter: Guangya Liu > Assignee: Guangya Liu > > dirty is set to true when the total resources in the cluster are updated, but > it is never set back to false. It should be reset to false in DRFSorter::sort: > https://github.com/apache/mesos/blob/master/src/master/allocator/sorter/drf/sorter.cpp#L320-L334 > The reason we have not detected this is that once an agent is added to the > cluster, dirty is set to true and the sorter always calls sort() to calculate > the share for each framework, which hurts performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
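For readers following this thread without the source open, the pattern under discussion is the standard dirty-flag memoization sketched below. This is a minimal illustrative stand-in, not the actual DRFSorter code: the real sorter orders clients by weighted DRF share, but the shape of the missing reset is the same.

{code:cpp}
#include <algorithm>
#include <string>
#include <vector>

// Illustrative stand-in for DRFSorter's lazy re-sort pattern;
// not the actual Mesos implementation.
class Sorter
{
public:
  void add(const std::string& name, double share)
  {
    clients.push_back({name, share});
    dirty = true; // Membership or totals changed; the cached order is stale.
  }

  std::vector<std::string> sort()
  {
    if (dirty) {
      std::sort(clients.begin(), clients.end(),
                [](const Client& a, const Client& b) {
                  return a.share < b.share;
                });

      // The fix this ticket asks for: reset the flag once the order
      // has been recomputed, so later calls can skip the work.
      dirty = false;
    }

    std::vector<std::string> result;
    for (const Client& client : clients) {
      result.push_back(client.name);
    }
    return result;
  }

private:
  struct Client
  {
    std::string name;
    double share;
  };

  std::vector<Client> clients;
  bool dirty = false;
};
{code}

With the reset in place, repeated sort() calls between allocation events become cheap copies instead of full recalculations, which is exactly the performance concern the ticket description raises.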
[jira] [Commented] (MESOS-5600) "dirty" was never set back as false in sorter
[ https://issues.apache.org/jira/browse/MESOS-5600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325693#comment-15325693 ] Klaus Ma commented on MESOS-5600: - OK, thanks; one more comment :). In {{DRFSorter::allocated()}}, the {{allocations}} of the client are updated, so {{dirty}} should be set to true to trigger {{sort()}}; the call to {{update(name)}} seems unnecessary, since the {{share}} will be re-calculated in {{sort()}} anyway. Furthermore, I wonder how much of a performance win {{dirty}} really buys us: in each allocator loop the allocations may have changed, so the sorter has to re-calculate the order anyway. To improve performance, I think we could update only the single affected client in {{allocated}} & {{unallocated}}, and have {{sort()}} simply return the client list, since each client would be sorted into place when inserted by {{allocated}} & {{unallocated}}. > "dirty" was never set back as false in sorter > - > > Key: MESOS-5600 > URL: https://issues.apache.org/jira/browse/MESOS-5600 > Project: Mesos > Issue Type: Bug > Components: allocation > Reporter: Guangya Liu > Assignee: Guangya Liu > > dirty is set to true when the total resources in the cluster are updated, but > it is never set back to false. It should be reset to false in DRFSorter::sort: > https://github.com/apache/mesos/blob/master/src/master/allocator/sorter/drf/sorter.cpp#L320-L334 > The reason we have not detected this is that once an agent is added to the > cluster, dirty is set to true and the sorter always calls sort() to calculate > the share for each framework, which hurts performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
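A minimal sketch of the incremental alternative suggested in the comment above: keep clients in a share-ordered container so that allocated()/unallocated() re-position a single client and sort() becomes a plain copy. The container choice and names here are hypothetical, not the actual DRFSorter design.

{code:cpp}
#include <set>
#include <string>
#include <vector>

// Hypothetical incremental sorter: re-positioning one client costs a
// find plus an O(log n) re-insert, and sort() does no sorting at all.
struct Client
{
  std::string name;
  double share;

  bool operator<(const Client& other) const { return share < other.share; }
};

class IncrementalSorter
{
public:
  // Called when a single client's allocation (and hence share) changes.
  void allocated(const std::string& name, double newShare)
  {
    for (std::multiset<Client>::iterator it = clients.begin();
         it != clients.end(); ++it) {
      if (it->name == name) {
        clients.erase(it); // Remove the stale entry...
        break;
      }
    }
    clients.insert(Client{name, newShare}); // ...and re-insert in order.
  }

  // No recalculation needed: the container is already ordered by share.
  std::vector<std::string> sort() const
  {
    std::vector<std::string> result;
    for (const Client& client : clients) {
      result.push_back(client.name);
    }
    return result;
  }

private:
  std::multiset<Client> clients;
};
{code}

A production version would keep a name-to-iterator index to avoid the linear scan, but the trade-off is the same: a little work on every allocation change instead of a full re-sort on the next sort().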
[jira] [Commented] (MESOS-5600) "dirty" was never set back as false in sorter
[ https://issues.apache.org/jira/browse/MESOS-5600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325675#comment-15325675 ] Guangya Liu commented on MESOS-5600: I filed another JIRA to call update() when updating the weight for a client: https://issues.apache.org/jira/browse/MESOS-5601 > "dirty" was never set back as false in sorter > - > > Key: MESOS-5600 > URL: https://issues.apache.org/jira/browse/MESOS-5600 > Project: Mesos > Issue Type: Bug > Components: allocation > Reporter: Guangya Liu > Assignee: Guangya Liu > > dirty is set to true when the total resources in the cluster are updated, but > it is never set back to false. It should be reset to false in DRFSorter::sort: > https://github.com/apache/mesos/blob/master/src/master/allocator/sorter/drf/sorter.cpp#L320-L334 > The reason we have not detected this is that once an agent is added to the > cluster, dirty is set to true and the sorter always calls sort() to calculate > the share for each framework, which hurts performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5601) The sorter should re-calculate share if weight was updated
Guangya Liu created MESOS-5601: -- Summary: The sorter should re-calculate share if weight was updated Key: MESOS-5601 URL: https://issues.apache.org/jira/browse/MESOS-5601 Project: Mesos Issue Type: Bug Components: allocation Reporter: Guangya Liu Assignee: Guangya Liu When updating the weight for a client, if dirty is false, the sorter should still re-calculate the share. https://github.com/apache/mesos/blob/master/src/master/allocator/sorter/drf/sorter.cpp#L64 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
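The fix amounts to treating a weight change like any other event that invalidates cached shares. A hedged sketch below reuses the illustrative (non-Mesos) sorter shape from the earlier example; the real change would go in the DRFSorter code linked above.

{code:cpp}
#include <map>
#include <string>

// Illustrative only: weight divides into the DRF share, so changing a
// weight must invalidate any cached ordering.
class WeightedSorter
{
public:
  void updateWeight(const std::string& name, double weight)
  {
    weights[name] = weight;
    dirty = true; // Shares derived from this weight are now stale.
  }

  // Simplified share: allocation scaled down by the client's weight.
  double calculateShare(const std::string& name, double allocation) const
  {
    std::map<std::string, double>::const_iterator it = weights.find(name);
    double weight = (it != weights.end()) ? it->second : 1.0;
    return allocation / weight;
  }

private:
  std::map<std::string, double> weights;
  bool dirty = false;
};
{code}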
[jira] [Commented] (MESOS-5600) "dirty" was never set back as false in sorter
[ https://issues.apache.org/jira/browse/MESOS-5600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325674#comment-15325674 ] Klaus Ma commented on MESOS-5600: - It's not only about {{total}}; {{sort()}} updates the order of clients based on share: in {{DRFSorter::update}}, the weight is updated, which affects the result of {{DRFSorter::calculateShare}}; in {{DRFSorter::remove}}, the client is removed, and {{sort()}} should not return removed clients. > "dirty" was never set back as false in sorter > - > > Key: MESOS-5600 > URL: https://issues.apache.org/jira/browse/MESOS-5600 > Project: Mesos > Issue Type: Bug > Components: allocation > Reporter: Guangya Liu > Assignee: Guangya Liu > > dirty is set to true when the total resources in the cluster are updated, but > it is never set back to false. It should be reset to false in DRFSorter::sort: > https://github.com/apache/mesos/blob/master/src/master/allocator/sorter/drf/sorter.cpp#L320-L334 > The reason we have not detected this is that once an agent is added to the > cluster, dirty is set to true and the sorter always calls sort() to calculate > the share for each framework, which hurts performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5600) "dirty" was never set back as false in sorter
[ https://issues.apache.org/jira/browse/MESOS-5600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325673#comment-15325673 ] Guangya Liu commented on MESOS-5600: Why? I see that the total resources are not updated in that code block. Comments? > "dirty" was never set back as false in sorter > - > > Key: MESOS-5600 > URL: https://issues.apache.org/jira/browse/MESOS-5600 > Project: Mesos > Issue Type: Bug > Components: allocation > Reporter: Guangya Liu > Assignee: Guangya Liu > > dirty is set to true when the total resources in the cluster are updated, but > it is never set back to false. It should be reset to false in DRFSorter::sort: > https://github.com/apache/mesos/blob/master/src/master/allocator/sorter/drf/sorter.cpp#L320-L334 > The reason we have not detected this is that once an agent is added to the > cluster, dirty is set to true and the sorter always calls sort() to calculate > the share for each framework, which hurts performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-5600) "dirty" was never set back as false in sorter
[ https://issues.apache.org/jira/browse/MESOS-5600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325668#comment-15325668 ] Klaus Ma edited comment on MESOS-5600 at 6/11/16 2:49 AM: -- [~gyliu]/[~bmahler], please also check the following code; I think {{dirty}} should also be updated accordingly: https://github.com/apache/mesos/blob/master/src/master/allocator/sorter/drf/sorter.cpp#L48-L85 was (Author: klaus1982): [~gyliu]/[~bmahler], please also check the following code; I think `dirty` should also be updated accordingly: https://github.com/apache/mesos/blob/master/src/master/allocator/sorter/drf/sorter.cpp#L48-L85 > "dirty" was never set back as false in sorter > - > > Key: MESOS-5600 > URL: https://issues.apache.org/jira/browse/MESOS-5600 > Project: Mesos > Issue Type: Bug > Components: allocation > Reporter: Guangya Liu > Assignee: Guangya Liu > > dirty is set to true when the total resources in the cluster are updated, but > it is never set back to false. It should be reset to false in DRFSorter::sort: > https://github.com/apache/mesos/blob/master/src/master/allocator/sorter/drf/sorter.cpp#L320-L334 > The reason we have not detected this is that once an agent is added to the > cluster, dirty is set to true and the sorter always calls sort() to calculate > the share for each framework, which hurts performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5600) "dirty" was never set back as false in sorter
[ https://issues.apache.org/jira/browse/MESOS-5600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325668#comment-15325668 ] Klaus Ma commented on MESOS-5600: - [~gyliu]/[~bmahler], please also check the following code; I think `dirty` should also be updated accordingly: https://github.com/apache/mesos/blob/master/src/master/allocator/sorter/drf/sorter.cpp#L48-L85 > "dirty" was never set back as false in sorter > - > > Key: MESOS-5600 > URL: https://issues.apache.org/jira/browse/MESOS-5600 > Project: Mesos > Issue Type: Bug > Components: allocation > Reporter: Guangya Liu > Assignee: Guangya Liu > > dirty is set to true when the total resources in the cluster are updated, but > it is never set back to false. It should be reset to false in DRFSorter::sort: > https://github.com/apache/mesos/blob/master/src/master/allocator/sorter/drf/sorter.cpp#L320-L334 > The reason we have not detected this is that once an agent is added to the > cluster, dirty is set to true and the sorter always calls sort() to calculate > the share for each framework, which hurts performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5600) "dirty" was never set back as false in sorter
[ https://issues.apache.org/jira/browse/MESOS-5600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325648#comment-15325648 ] Guangya Liu commented on MESOS-5600: [~bmahler] Compared with moving the false initialization into a constructor initializer list, I prefer to keep the current logic of initializing {{dirty}} in sorter.hpp. My thinking is that moving it into an initializer list would mean adding a new constructor to the sorter just to initialize dirty, and updating all of the code that constructs sorters; that involves many code changes, whereas the current in-class initialization is simple and easy to understand. Comments? > "dirty" was never set back as false in sorter > - > > Key: MESOS-5600 > URL: https://issues.apache.org/jira/browse/MESOS-5600 > Project: Mesos > Issue Type: Bug > Components: allocation > Reporter: Guangya Liu > Assignee: Guangya Liu > > dirty is set to true when the total resources in the cluster are updated, but > it is never set back to false. It should be reset to false in DRFSorter::sort: > https://github.com/apache/mesos/blob/master/src/master/allocator/sorter/drf/sorter.cpp#L320-L334 > The reason we have not detected this is that once an agent is added to the > cluster, dirty is set to true and the sorter always calls sort() to calculate > the share for each framework, which hurts performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
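For context, the two options being weighed in the comment above look like this. With C++11, an in-class default member initializer needs no extra constructor at all, which supports keeping the initialization in sorter.hpp; the class names below are sketches, not the actual header.

{code:cpp}
// Option 1: in-class default member initializer (C++11), matching the
// current sorter.hpp approach. Every constructor, including the
// implicit default one, picks this up with no extra code.
class SorterSketchA
{
  bool dirty = false;
};

// Option 2: constructor initializer list. This requires declaring and
// maintaining a constructor on each sorter implementation.
class SorterSketchB
{
public:
  SorterSketchB() : dirty(false) {}

private:
  bool dirty;
};
{code}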
[jira] [Updated] (MESOS-5600) "dirty" was never set back as false in sorter
[ https://issues.apache.org/jira/browse/MESOS-5600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu updated MESOS-5600: --- Component/s: allocation > "dirty" was never set back as false in sorter > - > > Key: MESOS-5600 > URL: https://issues.apache.org/jira/browse/MESOS-5600 > Project: Mesos > Issue Type: Bug > Components: allocation > Reporter: Guangya Liu > Assignee: Guangya Liu > > dirty is set to true when the total resources in the cluster are updated, but > it is never set back to false. It should be reset to false in DRFSorter::sort: > https://github.com/apache/mesos/blob/master/src/master/allocator/sorter/drf/sorter.cpp#L320-L334 > The reason we have not detected this is that once an agent is added to the > cluster, dirty is set to true and the sorter always calls sort() to calculate > the share for each framework, which hurts performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5600) "dirty" was never set back as false in sorter
Guangya Liu created MESOS-5600: -- Summary: "dirty" was never set back as false in sorter Key: MESOS-5600 URL: https://issues.apache.org/jira/browse/MESOS-5600 Project: Mesos Issue Type: Bug Reporter: Guangya Liu Assignee: Guangya Liu dirty is set to true when the total resources in the cluster are updated, but it is never set back to false. It should be reset to false in DRFSorter::sort: https://github.com/apache/mesos/blob/master/src/master/allocator/sorter/drf/sorter.cpp#L320-L334 The reason we have not detected this is that once an agent is added to the cluster, dirty is set to true and the sorter always calls sort() to calculate the share for each framework, which hurts performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5524) Expose resource allocation constraints (quota, shares) to schedulers.
[ https://issues.apache.org/jira/browse/MESOS-5524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-5524: --- Summary: Expose resource allocation constraints (quota, shares) to schedulers. (was: Expose resource consumption constraints (quota, shares) to schedulers.) > Expose resource allocation constraints (quota, shares) to schedulers. > - > > Key: MESOS-5524 > URL: https://issues.apache.org/jira/browse/MESOS-5524 > Project: Mesos > Issue Type: Epic > Components: allocation, scheduler api > Reporter: Benjamin Mahler > > Currently, schedulers do not have visibility into their quota or shares of > the cluster. By providing this information, we give the scheduler the ability > to make better decisions. As we start to allow schedulers to decide how > they'd like to use a particular resource (e.g. as non-revocable or > revocable), schedulers need visibility into their quota and shares to make an > effective decision (otherwise they may accidentally exceed their quota and > will not find out until Mesos replies with TASK_LOST and reason REASON_QUOTA_EXCEEDED). > We would start by exposing the following information: > * quota: e.g. cpus:10, mem:20, disk:40 > * shares: e.g. cpus:20, mem:40, disk:80 > Currently, quota is used for non-revocable resources, and the idea is to use > shares only for consuming revocable resources, since the number of shares > available to a role changes dynamically as resources come and go, frameworks > come and go, or the operator manipulates the amount of resources sectioned > off for quota. > By exposing quota and shares, the framework knows when it can consume > additional non-revocable resources (i.e. when it has fewer non-revocable > resources allocated to it than its quota) or when it can consume revocable > resources (always! though in the future it will not be able to revoke another user's > revocable resources if the framework is above its fair share). > This also allows schedulers to determine whether they have sufficient quota > assigned to them, and to alert the operator if they need more to run safely. > Also, by viewing its fair share, the framework can expose monitoring > information that shows the discrepancy between how much it would like and its > fair share (note that the framework can actually exceed its fair share, but in > the future this will mean increased potential for revocation). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
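The rule described in this epic ("when it has fewer non-revocable resources allocated to it than its quota") reduces to a per-resource comparison on the scheduler side. Below is a hypothetical sketch of how a scheduler might apply it; the epic does not define an API, so the types and names here are purely illustrative.

{code:cpp}
#include <map>
#include <string>

// Hypothetical scalar resource vectors keyed by name ("cpus", "mem", ...).
typedef std::map<std::string, double> Resources;

// Returns true if accepting `offered` as non-revocable keeps the
// framework within its quota. Illustrative only; not a Mesos API.
bool withinQuota(
    const Resources& allocated,
    const Resources& offered,
    const Resources& quota)
{
  for (const auto& entry : offered) {
    const std::string& name = entry.first;

    double current = allocated.count(name) ? allocated.at(name) : 0.0;
    double limit = quota.count(name) ? quota.at(name) : 0.0;

    if (current + entry.second > limit) {
      return false; // Accepting would exceed quota for this resource.
    }
  }
  return true;
}
{code}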
[jira] [Commented] (MESOS-5578) Support static address allocation in CNI
[ https://issues.apache.org/jira/browse/MESOS-5578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325545#comment-15325545 ] Qian Zhang commented on MESOS-5578: --- Agreed, we definitely need to avoid plugin-specific implementations in our CNI isolator. > Support static address allocation in CNI > > > Key: MESOS-5578 > URL: https://issues.apache.org/jira/browse/MESOS-5578 > Project: Mesos > Issue Type: Task > Components: containerization > Affects Versions: 1.0.0 > Environment: Linux > Reporter: Avinash Sridharan > Assignee: Avinash Sridharan > Labels: mesosphere > > Currently a framework can't specify a static IP address for the container > when using the network/cni isolator. > The `ipaddress` field in the `NetworkInfo` protobuf was designed for this > specific purpose, but since the CNI spec does not specify a means to allocate > an IP address to the container, the `network/cni` isolator cannot honor this > field even when it is filled in by the framework. > Creating this ticket to act as a placeholder to track this limitation. As > and when the CNI spec allows us to specify a static IP address for the > container, we can resolve this ticket. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5593) Devolve v1 operator protos before using them in Master/Agent.
[ https://issues.apache.org/jira/browse/MESOS-5593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-5593: -- Fix Version/s: 1.0.0 > Devolve v1 operator protos before using them in Master/Agent. > - > > Key: MESOS-5593 > URL: https://issues.apache.org/jira/browse/MESOS-5593 > Project: Mesos > Issue Type: Improvement > Reporter: Anand Mazumdar > Assignee: haosdent > Priority: Critical > Labels: mesosphere > Fix For: 1.0.0 > > > We had adopted the following workflow for the Scheduler/Executor endpoints on > the Master/Agent. > - The user makes a call to the versioned endpoint with a versioned protobuf, > e.g., {{v1::mesos::Call}}. > - We {{devolve}} the versioned protobuf into an unversioned protobuf before > using it internally. > {code} > scheduler::Call call = devolve(v1Call); > {code} > The above approach has the advantage that the internal Mesos code only has to > deal with unversioned protobufs. It looks like we have not been following > this idiom for the Operator API. We should create an unversioned protobuf file > similar to what we did for the Scheduler/Executor API and then {{devolve}} the > versioned protobufs. (e.g., mesos/master/master.proto) > The signature of some of the operator endpoints would then change to deal only > with unversioned protobufs: > {code} > Future<Response> Master::Http::getHealth( > const master::Call& call, > const Option<string>& principal, > const ContentType& contentType) const > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
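The devolve idiom mentioned above relies on the versioned (v1) and unversioned protobufs sharing a wire format, so a conversion can be a serialize/parse round trip. A sketch of that idea follows; it is hypothetical, and the real Mesos helpers may differ in detail.

{code:cpp}
#include <string>

#include <google/protobuf/message.h>

// Sketch of the devolve idiom: because the versioned and unversioned
// protobufs are wire-compatible, converting one to the other is a
// serialize/parse round trip. Hypothetical helper, not the real one.
template <typename T>
T devolve(const google::protobuf::Message& message)
{
  std::string data;
  message.SerializeToString(&data);

  T t;
  // ParsePartialFromString tolerates unset required fields, which can
  // differ between the versioned and unversioned definitions.
  t.ParsePartialFromString(data);
  return t;
}

// Usage sketch:
//   scheduler::Call call = devolve<scheduler::Call>(v1Call);
{code}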
[jira] [Commented] (MESOS-5586) Move design docs from wiki to web page
[ https://issues.apache.org/jira/browse/MESOS-5586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325310#comment-15325310 ] Vinod Kone commented on MESOS-5586: --- Thanks for breaking down the tickets. Looks like you missed creating the *Epic* type ticket? > Move design docs from wiki to web page > -- > > Key: MESOS-5586 > URL: https://issues.apache.org/jira/browse/MESOS-5586 > Project: Mesos > Issue Type: Documentation > Reporter: Tomasz Janiszewski > Assignee: Tomasz Janiszewski > Priority: Minor > > {quote} > Hi folks, > I am proposing moving our content in Wiki (e.g., working groups, release > tracking, etc.) to our docs in the code repo. I personally found that wiki > is hard to use and there's no reviewing process for changes in the Wiki. > The content in Wiki historically received less attention than that in the > docs. > What do you think? > - Jie > {quote} > http://www.mail-archive.com/dev@mesos.apache.org/msg35506.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5576) Masters may drop the first message they send between masters after a network partition
[ https://issues.apache.org/jira/browse/MESOS-5576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Wu updated MESOS-5576: - Sprint: Mesosphere Sprint 37 Story Points: 5 > Masters may drop the first message they send between masters after a network > partition > -- > > Key: MESOS-5576 > URL: https://issues.apache.org/jira/browse/MESOS-5576 > Project: Mesos > Issue Type: Bug > Components: leader election, master, replicated log >Affects Versions: 0.28.2 > Environment: Observed in an OpenStack environment where each master > lives on a separate VM. >Reporter: Joseph Wu > Labels: mesosphere > > We observed the following situation in a cluster of five masters: > || Time || Master 1 || Master 2 || Master 3 || Master 4 || Master 5 || > | 0 | Follower | Follower | Follower | Follower | Leader | > | 1 | Follower | Follower | Follower | Follower || Partitioned from cluster > by downing this VM's network || > | 2 || Elected Leader by ZK | Voting | Voting | Voting | Suicides due to lost > leadership | > | 3 | Performs consensus | Replies to leader | Replies to leader | Replies to > leader | Still down | > | 4 | Performs writing | Acks to leader | Acks to leader | Acks to leader | > Still down | > | 5 | Leader | Follower | Follower | Follower | Still down | > | 6 | Leader | Follower | Follower | Follower | Comes back up | > | 7 | Leader | Follower | Follower | Follower | Follower | > | 8 || Partitioned in the same way as Master 5 | Follower | Follower | > Follower | Follower | > | 9 | Suicides due to lost leadership || Elected Leader by ZK | Follower | > Follower | Follower | > | 10 | Still down | Performs consensus | Replies to leader | Replies to > leader || Doesn't get the message! || > | 11 | Still down | Performs writing | Acks to leader | Acks to leader || > Acks to leader || > | 12 | Still down | Leader | Follower | Follower | Follower | > Master 2 sends a series of messages to the recently-restarted Master 5. The > first message is dropped, but subsequent messages are not dropped. > This appears to be due to a stale link between the masters. Before leader > election, the replicated log actors create a network watcher, which adds > links to masters that join the ZK group: > https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/network.hpp#L157-L159 > This link does not appear to break (Master 2 -> 5) when Master 5 goes down, > perhaps due to how the network partition was induced (in the hypervisor > layer, rather than in the VM itself). > When Master 2 tries to send an {{PromiseRequest}} to Master 5, we do not > observe the [expected log > message|https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/replica.cpp#L493-L494] > Instead, we see a log line in Master 2: > {code} > process.cpp:2040] Failed to shutdown socket with fd 27: Transport endpoint is > not connected > {code} > The broken link is removed by the libprocess {{socket_manager}} and the > following {{WriteRequest}} from Master 2 to Master 5 succeeds via a new > socket. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5587) FullFrameworkWriter makes master segmentation fault.
[ https://issues.apache.org/jira/browse/MESOS-5587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325273#comment-15325273 ] Joerg Schad commented on MESOS-5587: Potentially the sandbox authorization has the same issue: see https://reviews.apache.org/r/48566/ > FullFrameworkWriter makes master segmentation fault. > > > Key: MESOS-5587 > URL: https://issues.apache.org/jira/browse/MESOS-5587 > Project: Mesos > Issue Type: Bug >Reporter: Gilbert Song >Assignee: Joerg Schad >Priority: Blocker > Labels: authentication, mesosphere > Fix For: 1.0.0 > > > FullFrameworkWriter::operator() may take down the master. Here is the log: > {noformat} > Jun 09 02:28:42 ip-10-10-0-180 mesos-master[18627]: I0609 02:28:42.147253 > 18633 master.cpp:5772] Sending 1 offers to framework > 6d4248cd-2832-4152-b5d0-defbf36f6759-0001 (chronos) at > scheduler-c9cb7c2c-ae6b-4a34-8663-6a52980161c1@10.10.0.20:39285 > Jun 09 02:28:42 ip-10-10-0-180 mesos-master[18627]: I0609 02:28:42.148890 > 18637 master.cpp:4066] Processing DECLINE call for offers: [ > 7567c338-3ae5-4a84-bf5b-6a75a8a49341-O992 ] for framework > 6d4248cd-2832-4152-b5d0-defbf36f6759-0001 (chronos) at > scheduler-c9cb7c2c-ae6b-4a34-8663-6a52980161c1@10.10.0.20:39285 > Jun 09 02:28:42 ip-10-10-0-180 mesos-master[18627]: I0609 02:28:42.639813 > 18632 http.cpp:483] HTTP GET for /master/state-summary from 10.10.0.180:45790 > with User-Agent='python-requests/2.6.0 CPython/3.4.2 > Linux/3.10.0-327.10.1.el7.x86_64' > Jun 09 02:28:42 ip-10-10-0-180 mesos-master[18627]: I0609 02:28:42.890702 > 18632 http.cpp:483] HTTP GET for /master/state from 10.10.0.181:33830 with > User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) > AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36' > Jun 09 02:28:43 ip-10-10-0-180 mesos-master[18627]: I0609 02:28:43.139240 > 18639 http.cpp:483] HTTP GET for /master/state-summary from 10.10.0.181:33831 > with User-Agent='python-requests/2.6.0 CPython/3.4.2 > Linux/3.10.0-327.18.2.el7.x86_64' > Jun 09 02:28:43 ip-10-10-0-180 mesos-master[18627]: I0609 02:28:43.148582 > 18633 master.cpp:5772] Sending 1 offers to framework > 4c6031e7-4cfd-4219-89b2-d19c7101e045-0001 (Long Lived Framework (C++)) > Jun 09 02:28:43 ip-10-10-0-180 mesos-master[18627]: I0609 02:28:43.150388 > 18635 http.cpp:483] HTTP POST for /master/api/v1/scheduler from > 10.10.0.178:51645 > Jun 09 02:28:43 ip-10-10-0-180 mesos-master[18627]: I0609 02:28:43.150645 > 18635 master.cpp:3457] Processing ACCEPT call for offers: [ > 7567c338-3ae5-4a84-bf5b-6a75a8a49341-O993 ] on agent > 091e9c3f-8a01-4890-8790-48b75fd81b40-S0 at slave(1)@10.10.0.20:5051 > (10.10.0.20) for framework 4c6031e7-4cfd-4219-89b2-d19c7101e045-0001 (Long > Lived Framework (C++)) > Jun 09 02:28:43 ip-10-10-0-180 mesos-master[18627]: I0609 02:28:43.151268 > 18635 master.hpp:178] Adding task 5699 with resources cpus(*):0.001; mem(*):1 > on agent 091e9c3f-8a01-4890-8790-48b75fd81b40-S0 (10.10.0.20) > Jun 09 02:28:43 ip-10-10-0-180 mesos-master[18627]: I0609 02:28:43.151322 > 18635 master.cpp:3946] Launching task 5699 of framework > 4c6031e7-4cfd-4219-89b2-d19c7101e045-0001 (Long Lived Framework (C++)) with > resources cpus(*):0.001; mem(*):1 on agent > 091e9c3f-8a01-4890-8790-48b75fd81b40-S0 at slave(1)@10.10.0.20:5051 > (10.10.0.20) > Jun 09 02:28:43 ip-10-10-0-180 mesos-master[18627]: I0609 02:28:43.160475 > 18635 master.cpp:5211] Status update TASK_RUNNING (UUID: > 3f651ba8-7c80-4ac0-ae18-579371ec82d5) for task 5699 of framework > 
4c6031e7-4cfd-4219-89b2-d19c7101e045-0001 from agent > 091e9c3f-8a01-4890-8790-48b75fd81b40-S0 at slave(1)@10.10.0.20:5051 > (10.10.0.20) > Jun 09 02:28:43 ip-10-10-0-180 mesos-master[18627]: I0609 02:28:43.160516 > 18635 master.cpp:5259] Forwarding status update TASK_RUNNING (UUID: > 3f651ba8-7c80-4ac0-ae18-579371ec82d5) for task 5699 of framework > 4c6031e7-4cfd-4219-89b2-d19c7101e045-0001 > Jun 09 02:28:43 ip-10-10-0-180 mesos-master[18627]: I0609 02:28:43.160645 > 18635 master.cpp:6871] Updating the state of task 5699 of framework > 4c6031e7-4cfd-4219-89b2-d19c7101e045-0001 (latest state: TASK_RUNNING, status > update state: TASK_RUNNING) > Jun 09 02:28:43 ip-10-10-0-180 mesos-master[18627]: I0609 02:28:43.161842 > 18639 http.cpp:483] HTTP POST for /master/api/v1/scheduler from > 10.10.0.178:51645 > Jun 09 02:28:43 ip-10-10-0-180 mesos-master[18627]: I0609 02:28:43.161912 > 18639 master.cpp:4365] Processing ACKNOWLEDGE call > 3f651ba8-7c80-4ac0-ae18-579371ec82d5 for task 5699 of framework > 4c6031e7-4cfd-4219-89b2-d19c7101e045-0001 (Long Lived Framework (C++)) on > agent 091e9c3f-8a01-4890-8790-48b75fd81b40-S0 > Jun 09 02:28:43 ip-10-10-0-180 mesos-master[18627]: I0609 02:28:43.556354 >
[jira] [Commented] (MESOS-2105) Reliably report OOM even if the executor exits normally
[ https://issues.apache.org/jira/browse/MESOS-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325265#comment-15325265 ] Greg Mann commented on MESOS-2105: -- We recently observed this on an internal test cluster. An executor was OOM-killed before the cgroup mem isolator was able to destroy the offending container. Here are the kernel logs from the agent machine interleaved with the Mesos agent logs: {code} Jun 10 16:14:47 ip-10-10-0-87 mesos-slave[3038]: I0610 16:14:47.434166 3044 mem.cpp:644] OOM detected for container d9d84892-1165-43a2-9675-10b88be141f4 Jun 10 16:14:47 ip-10-10-0-87 kernel: docker0: port 1(vethb30b136) entered forwarding state Jun 10 16:14:47 ip-10-10-0-87 kernel: balloon-executo invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0 Jun 10 16:14:47 ip-10-10-0-87 kernel: balloon-executo cpuset=/ mems_allowed=0 Jun 10 16:14:47 ip-10-10-0-87 kernel: CPU: 2 PID: 23924 Comm: balloon-executo Tainted: G T 3.10.0-327.10.1.el7.x86_64 #1 Jun 10 16:14:47 ip-10-10-0-87 kernel: Hardware name: Xen HVM domU, BIOS 4.2.amazon 05/12/2016 Jun 10 16:14:47 ip-10-10-0-87 kernel: 8803a6463980 9a29939c 88025f85bcd0 816352cc Jun 10 16:14:47 ip-10-10-0-87 kernel: 88025f85bd60 8163026c 8802ec7265b8 0001 Jun 10 16:14:47 ip-10-10-0-87 kernel: 0003 fffeefff 0001 8803a6467803 Jun 10 16:14:47 ip-10-10-0-87 kernel: Call Trace: Jun 10 16:14:47 ip-10-10-0-87 kernel: [] dump_stack+0x19/0x1b Jun 10 16:14:47 ip-10-10-0-87 kernel: [] dump_header+0x8e/0x214 Jun 10 16:14:47 ip-10-10-0-87 kernel: [] oom_kill_process+0x24e/0x3b0 Jun 10 16:14:47 ip-10-10-0-87 kernel: [] ? has_capability_noaudit+0x1e/0x30 Jun 10 16:14:47 ip-10-10-0-87 kernel: [] mem_cgroup_oom_synchronize+0x555/0x580 Jun 10 16:14:47 ip-10-10-0-87 kernel: [] ? mem_cgroup_charge_common+0xc0/0xc0 Jun 10 16:14:47 ip-10-10-0-87 kernel: [] pagefault_out_of_memory+0x14/0x90 Jun 10 16:14:47 ip-10-10-0-87 kernel: [] mm_fault_error+0x68/0x12b Jun 10 16:14:47 ip-10-10-0-87 kernel: [] __do_page_fault+0x3e2/0x450 Jun 10 16:14:47 ip-10-10-0-87 kernel: [] do_page_fault+0x23/0x80 Jun 10 16:14:47 ip-10-10-0-87 kernel: [] page_fault+0x28/0x30 Jun 10 16:14:47 ip-10-10-0-87 kernel: Task in /mesos/d9d84892-1165-43a2-9675-10b88be141f4 killed as a result of limit of /mesos/d9d84892-1165-43a2-9675-10b88be141f4 Jun 10 16:14:47 ip-10-10-0-87 kernel: memory: usage 196608kB, limit 196608kB, failcnt 50 Jun 10 16:14:47 ip-10-10-0-87 kernel: memory+swap: usage 196608kB, limit 9007199254740991kB, failcnt 0 Jun 10 16:14:47 ip-10-10-0-87 kernel: kmem: usage 0kB, limit 9007199254740991kB, failcnt 0 Jun 10 16:14:47 ip-10-10-0-87 kernel: Memory cgroup stats for /mesos/d9d84892-1165-43a2-9675-10b88be141f4: cache:0KB rss:196608KB rss_huge:188416KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon: Jun 10 16:14:47 ip-10-10-0-87 kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name Jun 10 16:14:47 ip-10-10-0-87 kernel: [23886] 0 23886 2378 288 110 0 sh Jun 10 16:14:47 ip-10-10-0-87 kernel: [23914] 0 23914 22324052827 1590 0 balloon-executo Jun 10 16:14:47 ip-10-10-0-87 kernel: Memory cgroup out of memory: Kill process 23924 (balloon-executo) score 1045 or sacrifice child Jun 10 16:14:47 ip-10-10-0-87 kernel: Killed process 23914 (balloon-executo) total-vm:892960kB, anon-rss:196168kB, file-rss:15140kB Jun 10 16:14:47 ip-10-10-0-87 mesos-slave[3038]: I0610 16:14:47.600641 3043 slave.cpp:3788] executor(1)@10.10.0.87:37878 exited {code} > Reliably report OOM even if the executor exits normally > --- > > Key: MESOS-2105 > URL: 
https://issues.apache.org/jira/browse/MESOS-2105 > Project: Mesos > Issue Type: Improvement > Components: isolation >Affects Versions: 0.20.0 >Reporter: Ian Downes > > Container OOMs are asynchronously reported by the kernel and the following > sequence can occur: > 1) Container OOMs > 2) Kernel chooses to kill the task > 3) Executor notices, reports TASK_FAILED, then exits > 4) MesosContainerizer sees executor exit, *doesn't check for an OOM*, and > destroys the container > 5) Memory isolator may or may not have seen the OOM event but the container > is destroyed anyway. > The task is reported to have failed but without including the cause. > Suggest always checking if an OOM has occurred, even if the executor exits > normally. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
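One way to implement the suggested check is to inspect the container's memory cgroup at destruction time, regardless of how the executor exited. Below is a rough sketch against the cgroup v1 file interface; note that the actual isolator receives OOM notifications via eventfd, so this is only an illustration of the last-chance check, not the real mechanism.

{code:cpp}
#include <fstream>
#include <string>

// Best-effort check for evidence of an OOM in a cgroup v1 memory
// controller, meant to run when destroying a container even if the
// executor exited normally. Illustrative only.
bool sawOom(const std::string& cgroup)
{
  // memory.oom_control holds "key value" lines such as "under_oom 0"
  // and, on newer kernels, an "oom_kill <count>" counter.
  std::ifstream file(
      "/sys/fs/cgroup/memory/" + cgroup + "/memory.oom_control");

  std::string key;
  long value;
  while (file >> key >> value) {
    if ((key == "under_oom" || key == "oom_kill") && value > 0) {
      return true;
    }
  }
  return false;
}
{code}

If a check like this returned true in the containerizer's destroy path, the task's termination could carry the OOM cause instead of a bare TASK_FAILED.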
[jira] [Commented] (MESOS-5536) Completed executors presented as alive
[ https://issues.apache.org/jira/browse/MESOS-5536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325264#comment-15325264 ] Anand Mazumdar commented on MESOS-5536: --- Thanks for reporting this issue, [~janisz]. From the logs, this looks related to MESOS-5380, which we recently fixed. Can you try this with Mesos 0.28.2? > Completed executors presented as alive > -- > > Key: MESOS-5536 > URL: https://issues.apache.org/jira/browse/MESOS-5536 > Project: Mesos > Issue Type: Bug > Affects Versions: 0.28.0 > Environment: Ubuntu 14.04.3 LTS > Reporter: Tomasz Janiszewski > > I'm running Mesos 0.28.0. The {{slave(1)/state}} endpoint returns some > completed executors in frameworks.executors instead of > frameworks.completed_executors. These executors are also present in {{monitor/statistics}}. > {code:JavaScript:title=slave(1)/state} > { > "attributes": {...}, > "completed_frameworks": [], > "flags": {...}, > "frameworks": [ > { > "checkpoint": true, > "completed_executors": [...], > "executors": [ > { > "queued_tasks": [], > "tasks": [], > "completed_tasks": [ > { > "discovery": {...}, > "executor_id": "", > "framework_id": > "f65b163c-0faf-441f-ac14-91739fa4394c-", > "id": > "service.a3b609b8-27ec-11e6-8044-02c89eb9127e", > "labels": [...], > "name": "service", > "resources": {...}, > "slave_id": > "ef232fd9-5114-4d8f-adc3-1669c1e6fdc5-S13", > "state": "TASK_KILLED", > "statuses": [] > } > ], > "container": "ead42e63-ac92-4ad0-a99c-4af9c3fa5e31", > "directory": "...", > "id": "service.a3b609b8-27ec-11e6-8044-02c89eb9127e", > "name": "Command Executor (Task: > service.a3b609b8-27ec-11e6-8044-02c89eb9127e) (Command: sh -c 'cd > service...')", > "resources": {...}, > "source": "service.a3b609b8-27ec-11e6-8044-02c89eb9127e" > > }, > ... > ], > } > ], > "git_sha": "961edbd82e691a619a4c171a7aadc9c32957fa73", > "git_tag": "0.28.0", > "version": "0.28.0", > ... > } > {code} > {code:title="var/log/mesos/mesos-slave.INFO"} > 13:33:19.479182 [slave.cpp:1361] Got assigned task > service.a3b609b8-27ec-11e6-8044-02c89eb9127e for framework > f65b163c-0faf-441f-ac14-91739fa4394c- > 13:33:19.482566 [slave.cpp:1480] Launching task > service.a3b609b8-27ec-11e6-8044-02c89eb9127e for framework > f65b163c-0faf-441f-ac14-91739fa4394c- > 13:33:19.483921 [paths.cpp:528] Trying to chown > '/tmp/mesos/slaves/ef232fd9-5114-4d8f-adc3-1669c1e6fdc5-S13/frameworks/f65b163c-0faf-441f-ac14-91739fa4394c-/executors/service.a3b609b8-27ec-11e6-8044-02c89eb9127e/runs/ead42e63-ac92-4ad0-a99c-4af9c3fa5e31' > to user 'mesosuser' > 13:33:19.504173 [slave.cpp:5367] Launching executor > service.a3b609b8-27ec-11e6-8044-02c89eb9127e of framework > f65b163c-0faf-441f-ac14-91739fa4394c- with resources cpus(*):0.1; > mem(*):32 in work directory > '/tmp/mesos/slaves/ef232fd9-5114-4d8f-adc3-1669c1e6fdc5-S13/frameworks/f65b163c-0faf-441f-ac14-91739fa4394c-/executors/service.a3b609b8-27ec-11e6-8044-02c89eb9127e/runs/ead42e63-ac92-4ad0-a99c-4af9c3fa5e31' > 13:33:19.505537 [containerizer.cpp:666] Starting container > 'ead42e63-ac92-4ad0-a99c-4af9c3fa5e31' for executor > 'service.a3b609b8-27ec-11e6-8044-02c89eb9127e' of framework > 'f65b163c-0faf-441f-ac14-91739fa4394c-' > 13:33:19.505734 [slave.cpp:1698] Queuing task > 'service.a3b609b8-27ec-11e6-8044-02c89eb9127e' for executor > 'service.a3b609b8-27ec-11e6-8044-02c89eb9127e' of framework > f65b163c-0faf-441f-ac14-91739fa4394c- > ...
> 13:33:19.977483 [containerizer.cpp:1118] Checkpointing executor's forked pid > 25576 to > '/tmp/mesos/meta/slaves/ef232fd9-5114-4d8f-adc3-1669c1e6fdc5-S13/frameworks/f65b163c-0faf-441f-ac14-91739fa4394c-/executors/service.a3b609b8-27ec-11e6-8044-02c89eb9127e/runs/ead42e63-ac92-4ad0-a99c-4af9c3fa5e31/pids/forked.pid' > 13:33:35.775195 [slave.cpp:1891] Asked to kill task > service.a3b609b8-27ec-11e6-8044-02c89eb9127e of framework > f65b163c-0faf-441f-ac14-91739fa4394c- > 13:33:35.775645 [slave.cpp:3002] Handling status update TASK_KILLED (UUID: > eba64915-7df2-483d-8982-a9a46a48a81b) for task > service.a3b609b8-27ec-11e6-8044-02c89eb9127e of framework >
[jira] [Created] (MESOS-5599) oom
Greg Mann created MESOS-5599: Summary: oom Key: MESOS-5599 URL: https://issues.apache.org/jira/browse/MESOS-5599 Project: Mesos Issue Type: Bug Reporter: Greg Mann -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4749) Move HTB out of containers
[ https://issues.apache.org/jira/browse/MESOS-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-4749: -- Component/s: network > Move HTB out of containers > -- > > Key: MESOS-4749 > URL: https://issues.apache.org/jira/browse/MESOS-4749 > Project: Mesos > Issue Type: Task > Components: network > Reporter: Cong Wang > Assignee: Cong Wang > Priority: Minor > > Currently we set a fixed HTB bandwidth in each of the containers, which makes > it impossible to share the link when a container is idle. As the first step, we should move it > out of the containers, into the qdisc hierarchy of the physical interface. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4749) Move HTB out of containers
[ https://issues.apache.org/jira/browse/MESOS-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-4749: -- Shepherd: Jie Yu Sprint: Mesosphere Sprint 36 Story Points: 3 > Move HTB out of containers > -- > > Key: MESOS-4749 > URL: https://issues.apache.org/jira/browse/MESOS-4749 > Project: Mesos > Issue Type: Task > Reporter: Cong Wang > Assignee: Cong Wang > Priority: Minor > > Currently we set a fixed HTB bandwidth in each of the containers, which makes > it impossible to share the link when a container is idle. As the first step, we should move it > out of the containers, into the qdisc hierarchy of the physical interface. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5598) pailer dies and no longer spools logs from docker container
John Camelon created MESOS-5598: --- Summary: pailer dies and no longer spools logs from docker container Key: MESOS-5598 URL: https://issues.apache.org/jira/browse/MESOS-5598 Project: Mesos Issue Type: Bug Affects Versions: 0.22.2 Reporter: John Camelon There are numerous instances where we see the pailer choke on logs and stop updating. When I ssh into the host where the container is running, "docker logs" yields much more output past the point where the pailer stopped. I am not sure which logs I should gather to diagnose this; please let me know. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5597) Document Mesos "health check" feature
Neil Conway created MESOS-5597: -- Summary: Document Mesos "health check" feature Key: MESOS-5597 URL: https://issues.apache.org/jira/browse/MESOS-5597 Project: Mesos Issue Type: Bug Components: documentation Reporter: Neil Conway We don't talk about this feature at all. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5596) Document agent SIGTERM behavior
Neil Conway created MESOS-5596: -- Summary: Document agent SIGTERM behavior Key: MESOS-5596 URL: https://issues.apache.org/jira/browse/MESOS-5596 Project: Mesos Issue Type: Bug Components: documentation Reporter: Neil Conway Priority: Minor Sending a SIGTERM to agents can be useful; we should document how agents/masters handle this situation, versus a spontaneous agent disconnection. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5595) GMock warning in FaultToleranceTest.SchedulerReregisterAfterFailoverTimeout
[ https://issues.apache.org/jira/browse/MESOS-5595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neil Conway updated MESOS-5595: --- Shepherd: Anand Mazumdar > GMock warning in FaultToleranceTest.SchedulerReregisterAfterFailoverTimeout > --- > > Key: MESOS-5595 > URL: https://issues.apache.org/jira/browse/MESOS-5595 > Project: Mesos > Issue Type: Bug > Components: tests >Reporter: Neil Conway >Assignee: Neil Conway >Priority: Trivial > Labels: mesosphere > > {noformat} > [ RUN ] FaultToleranceTest.SchedulerReregisterAfterFailoverTimeout > GMOCK WARNING: > Uninteresting mock function call - returning directly. > Function call: error(0x7fff573067c0, @0x7fbccb44a920 "Framework > disconnected") > Stack trace: > [ OK ] FaultToleranceTest.SchedulerReregisterAfterFailoverTimeout (181 > ms) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5595) GMock warning in FaultToleranceTest.SchedulerReregisterAfterFailoverTimeout
Neil Conway created MESOS-5595: -- Summary: GMock warning in FaultToleranceTest.SchedulerReregisterAfterFailoverTimeout Key: MESOS-5595 URL: https://issues.apache.org/jira/browse/MESOS-5595 Project: Mesos Issue Type: Bug Components: tests Reporter: Neil Conway Assignee: Neil Conway Priority: Trivial {noformat} [ RUN ] FaultToleranceTest.SchedulerReregisterAfterFailoverTimeout GMOCK WARNING: Uninteresting mock function call - returning directly. Function call: error(0x7fff573067c0, @0x7fbccb44a920 "Framework disconnected") Stack trace: [ OK ] FaultToleranceTest.SchedulerReregisterAfterFailoverTimeout (181 ms) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5588) Improve error handling when parsing acls.
[ https://issues.apache.org/jira/browse/MESOS-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15324043#comment-15324043 ] Joerg Schad commented on MESOS-5588: The problem arises from the fact that the protobuf parser ignores unknown field values: https://github.com/apache/mesos/blob/master/3rdparty/stout/include/stout/protobuf.hpp#L599 > Improve error handling when parsing acls. > - > > Key: MESOS-5588 > URL: https://issues.apache.org/jira/browse/MESOS-5588 > Project: Mesos > Issue Type: Improvement > Reporter: Joerg Schad > Assignee: Joerg Schad > > During parsing of the authorizer configuration, errors are ignored. This can lead to > undetected security issues. > Consider the following acl with a typo ({{usr}} instead of {{user}}): > {code} >"view_frameworks": [ > { > "principals": { "type": "ANY" }, > "usr": { "type": "NONE" } > } > ] > {code} > When the master is started with these flags, it will interpret the acl in the > following way, which gives any principal access to any framework: > {noformat} > view_frameworks { > principals { > type: ANY > } > } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
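A hedged illustration of how the typo could be surfaced rather than silently dropped: protobuf 3's JSON utility rejects unknown fields by default, unlike the stout parser linked above. The function below assumes some ACLs protobuf message type and is a sketch of the idea, not the actual Mesos fix.

{code:cpp}
#include <iostream>
#include <string>

#include <google/protobuf/message.h>
#include <google/protobuf/util/json_util.h>

// Strict parse: with ignore_unknown_fields left false (the default),
// the "usr" key above would produce an error instead of an ACL that
// silently grants ANY principal access.
bool parseAcls(const std::string& json, google::protobuf::Message* acls)
{
  google::protobuf::util::JsonParseOptions options;
  options.ignore_unknown_fields = false; // Default; shown for emphasis.

  google::protobuf::util::Status status =
      google::protobuf::util::JsonStringToMessage(json, acls, options);

  if (!status.ok()) {
    std::cerr << "Invalid ACLs: " << status.ToString() << std::endl;
    return false;
  }

  return true;
}
{code}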