[jira] [Commented] (MESOS-4279) Graceful restart of docker task
[ https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15144279#comment-15144279 ]

Martin Bydzovsky commented on MESOS-4279:
-----------------------------------------

Hello [~qianzhang], did you have time to check on this Vagrantfile?

> Graceful restart of docker task
> -------------------------------
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
> Issue Type: Bug
> Components: containerization, docker
> Affects Versions: 0.25.0
> Reporter: Martin Bydzovsky
> Assignee: Qian Zhang
>
> I'm implementing graceful restarts of our mesos-marathon-docker setup and I came across the following issue
> (it was already discussed on https://github.com/mesosphere/marathon/issues/2876, and the guys from mesosphere got to the point that it's probably a docker containerizer problem).
> To sum it up: when I deploy this simple python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
>
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
>
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
>
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and I run it through Marathon like:
> {code:javascript}
> data = {
>     args: ["/tmp/script.py"],
>     instances: 1,
>     cpus: 0.1,
>     mem: 256,
>     id: "marathon-test-api"
> }
> {code}
> During the app restart I get the expected result: the task receives SIGTERM and dies peacefully (within my script-specified 2-second period).
> But when I wrap this python script in a docker image:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run the corresponding application through Marathon:
> {code:javascript}
> data = {
>     args: ["./script.py"],
>     container: {
>         type: "DOCKER",
>         docker: {
>             image: "bydga/marathon-test-api"
>         },
>         forcePullImage: true
>     },
>     cpus: 0.1,
>     mem: 256,
>     instances: 1,
>     id: "marathon-test-api"
> }
> {code}
> then the task dies immediately during a restart (issued from marathon), without having a chance to do any cleanup.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
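The report above hinges on whether the process in the container ever receives SIGTERM. A minimal sketch (outside Docker and Mesos, with an illustrative short cleanup window instead of the reporter's 2 s) of the behavior the reporter expects: a process that traps SIGTERM gets to run its handler before exiting.

```python
# Sketch only: reproduces, without Docker, the graceful-stop behavior the
# reporter sees when the script runs directly on the slave. All names here
# are illustrative, not Mesos code.
import os
import signal
import subprocess
import sys
import tempfile
import textwrap
import time

CHILD = textwrap.dedent("""\
    import signal, sys, time
    def sigterm_handler(signo, frame):
        print("got %i" % signo, flush=True)
        time.sleep(0.2)          # stand-in for the reporter's 2 s cleanup
        print("ending", flush=True)
        sys.exit(0)
    signal.signal(signal.SIGTERM, sigterm_handler)
    print("Hello", flush=True)
    while True:
        time.sleep(0.1)
""")

with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(CHILD)
    path = f.name

proc = subprocess.Popen([sys.executable, path],
                        stdout=subprocess.PIPE, text=True)
time.sleep(1.0)                   # let the child install its handler
proc.send_signal(signal.SIGTERM)  # what a graceful stop should deliver
out, _ = proc.communicate(timeout=10)
os.unlink(path)
```

If the container's entrypoint wraps the script in a shell, the shell runs as PID 1 and may not forward SIGTERM to the script, which would match the "dies immediately without cleanup" symptom inside Docker.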
[jira] [Commented] (MESOS-4162) SlaveTest.MetricsSlaveLaunchErrors is slow
[ https://issues.apache.org/jira/browse/MESOS-4162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15144237#comment-15144237 ] haosdent commented on MESOS-4162: - This one is slow because it triggers the RateLimiter of MetricsProcess. {code} MetricsProcess() : ProcessBase("metrics"), limiter(2, Seconds(1)) {} {code} > SlaveTest.MetricsSlaveLaunchErrors is slow > -- > > Key: MESOS-4162 > URL: https://issues.apache.org/jira/browse/MESOS-4162 > Project: Mesos > Issue Type: Improvement > Components: technical debt, test >Reporter: Alexander Rukletsov >Assignee: haosdent >Priority: Minor > Labels: mesosphere, newbie++, tech-debt > > The {{SlaveTest.MetricsSlaveLaunchErrors}} test takes around {{1s}} to finish > on my Mac OS 10.10.4: > {code} > SlaveTest.MetricsSlaveLaunchErrors (1009 ms) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
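The {{limiter(2, Seconds(1))}} above admits only two requests per one-second window. A rough Python model of that behavior (illustrative, not the libprocess implementation) shows why a test that has to push a third metrics request through the limiter stalls for about a second:

```python
# Toy sliding-window rate limiter modeling RateLimiter(2, Seconds(1)):
# at most 2 admissions per 1-second window; a 3rd caller waits out the window.
import time

class RateLimiter:
    def __init__(self, permits, window_secs):
        self.permits = permits
        self.window = window_secs
        self.stamps = []          # admission times within the current window

    def acquire(self):
        now = time.monotonic()
        # Drop admissions that have aged out of the window.
        self.stamps = [t for t in self.stamps if now - t < self.window]
        if len(self.stamps) >= self.permits:
            # Window full: wait until the oldest admission expires, then retry.
            time.sleep(self.window - (now - self.stamps[0]))
            return self.acquire()
        self.stamps.append(time.monotonic())

limiter = RateLimiter(2, 1.0)
start = time.monotonic()
for _ in range(3):
    limiter.acquire()
elapsed = time.monotonic() - start   # the 3rd acquire had to wait ~1 s
```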
[jira] [Commented] (MESOS-4160) Log recover tests are slow
[ https://issues.apache.org/jira/browse/MESOS-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15144251#comment-15144251 ] haosdent commented on MESOS-4160: - I think we could advance the clock 1 sec here to avoid this delay. > Log recover tests are slow > -- > > Key: MESOS-4160 > URL: https://issues.apache.org/jira/browse/MESOS-4160 > Project: Mesos > Issue Type: Improvement > Components: technical debt, test >Reporter: Alexander Rukletsov >Assignee: Shuai Lin >Priority: Minor > Labels: mesosphere, newbie++, tech-debt > > On Mac OS 10.10.4, some tests take longer than {{1s}} to finish: > {code} > RecoverTest.AutoInitialization (1003 ms) > RecoverTest.AutoInitializationRetry (1000 ms) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
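"Advance the clock" here is the usual test-clock trick: instead of sleeping through a real 1-second timeout, a virtual clock is moved forward so pending timers fire immediately. A sketch of the idea (TestClock is illustrative, not the libprocess Clock API):

```python
# Minimal test-clock sketch: timers fire when the virtual clock reaches their
# deadline, so a 1 s timeout costs no wall-clock time in a test.
class TestClock:
    def __init__(self):
        self.now = 0.0
        self.timers = []          # list of (fire_time, callback)

    def schedule(self, delay, callback):
        self.timers.append((self.now + delay, callback))

    def advance(self, secs):
        self.now += secs
        due = [t for t in self.timers if t[0] <= self.now]
        self.timers = [t for t in self.timers if t[0] > self.now]
        for _, cb in due:
            cb()

fired = []
clock = TestClock()
clock.schedule(1.0, lambda: fired.append("initialized"))
clock.advance(1.0)                # no real second is spent
```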
[jira] [Created] (MESOS-4663) Speed up ExamplesTest.PersistentVolumeFramework
haosdent created MESOS-4663: --- Summary: Speed up ExamplesTest.PersistentVolumeFramework Key: MESOS-4663 URL: https://issues.apache.org/jira/browse/MESOS-4663 Project: Mesos Issue Type: Improvement Reporter: haosdent Assignee: haosdent Priority: Minor Currently {{ExamplesTest.PersistentVolumeFramework}} elapsed time: {code} [ OK ] ExamplesTest.PersistentVolumeFramework (5860 ms) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3986) Tests for allocator recovery.
[ https://issues.apache.org/jira/browse/MESOS-3986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15144504#comment-15144504 ] Joerg Schad commented on MESOS-3986: The test plan as discussed with AlexR: - Test that, in the presence of quota, allocations are paused after failover until either enough agents reregister or the timeout completes. - Test that allocations are issued again after enough agents have reregistered (and behavior is still correct after the additional timeout). - Test that allocations are issued again after the timeout. > Tests for allocator recovery. > - > > Key: MESOS-3986 > URL: https://issues.apache.org/jira/browse/MESOS-3986 > Project: Mesos > Issue Type: Task > Components: allocation >Reporter: Alexander Rukletsov >Assignee: Joerg Schad > Labels: mesosphere > > The allocator recover() call was introduced for correct recovery in presence > of quota. We should add tests verifying the correct behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4441) Allocate revocable resources beyond quota guarantee.
[ https://issues.apache.org/jira/browse/MESOS-4441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-4441: --- Summary: Allocate revocable resources beyond quota guarantee. (was: Do not allocate non-revocable resources beyond quota guarantee.) > Allocate revocable resources beyond quota guarantee. > > > Key: MESOS-4441 > URL: https://issues.apache.org/jira/browse/MESOS-4441 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Alexander Rukletsov >Assignee: Michael Park >Priority: Blocker > Labels: mesosphere > > h4. Status Quo > Currently resources allocated to frameworks in a role with quota (aka > quota'ed role) beyond the quota guarantee are marked non-revocable. This impacts > our flexibility for revoking them if we decide so in the future. > h4. Proposal > Once the quota guarantee is satisfied, we need not allocate further > resources as non-revocable. Instead, we can mark all offered resources beyond the > guarantee as revocable. When {{RevocableInfo}} evolves in the future, > frameworks will get additional information about the "revocability" of the > resource (i.e. allocation slack). > h4. Caveats > Though it seems like a simple change, it has several implications. > h6. Fairness > Currently the hierarchical allocator considers revocable resources as regular > resources when doing fairness calculations. This may prevent frameworks from > getting non-revocable resources as part of their role's quota guarantee if > they accept some revocable resources as well. > Consider the following scenario. A single framework in a role with quota set > to {{10}} CPUs is allocated {{10}} CPUs as non-revocable resources as part of > its quota and additionally {{2}} revocable CPUs. Now a task using {{2}} > non-revocable CPUs finishes and its resources are returned. Total allocation > for the role is {{8}} non-revocable + {{2}} revocable. However, the role may > not be offered the additional {{2}} non-revocable CPUs since its total allocation > satisfies the quota. > h6. Resource math > If we allocate non-revocable resources as revocable, we should make sure we > do the accounting right: either we should update total agent resources and mark > them as revocable as well, or bookkeep resources as non-revocable and convert > them to revocable when necessary. > h6. Coarse-grained nature of allocation > The hierarchical allocator performs "coarse-grained" allocation, meaning it > always allocates the entire remaining agent resources to a single framework. > This may lead to over-allocating some resources as non-revocable beyond the quota > guarantee. > h6. Quotas smaller than fair share > If a quota set for a role is smaller than its fair share, it may reduce the > amount of resources offered to this role if frameworks in it do not accept > revocable resources. This is probably the most important consequence of the > proposed change. Operators may set quota to get guarantees, but may observe a > decrease in the amount of resources a role gets, which is not intuitive. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
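The fairness caveat's arithmetic can be checked directly. A small sketch of the scenario described above, under the assumption that the allocator counts revocable CPUs toward the quota:

```python
# Worked example of the fairness caveat: counting revocable CPUs toward quota
# makes the role look satisfied even though only 8 of its 10 guaranteed
# (non-revocable) CPUs remain allocated. Numbers are from the ticket's scenario.
quota_guarantee = 10
non_revocable = 10     # allocated as part of the quota
revocable = 2          # additionally accepted by the framework

non_revocable -= 2     # a 2-CPU non-revocable task finishes

total_counted = non_revocable + revocable            # 8 + 2 = 10
satisfied_counting_revocable = total_counted >= quota_guarantee   # looks satisfied
satisfied_non_revocable_only = non_revocable >= quota_guarantee   # actually short
```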
[jira] [Updated] (MESOS-2742) Draft design doc on global resources
[ https://issues.apache.org/jira/browse/MESOS-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joerg Schad updated MESOS-2742: --- Summary: Draft design doc on global resources (was: Architecture doc on global resources) > Draft design doc on global resources > > > Key: MESOS-2742 > URL: https://issues.apache.org/jira/browse/MESOS-2742 > Project: Mesos > Issue Type: Task >Reporter: Niklas Quarfot Nielsen >Assignee: Joerg Schad > Labels: mesosphere > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3078) Recovered resources are not re-allocated until the next allocation delay.
[ https://issues.apache.org/jira/browse/MESOS-3078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-3078: --- Shepherd: Benjamin Mahler Assignee: (was: Klaus Ma) > Recovered resources are not re-allocated until the next allocation delay. > - > > Key: MESOS-3078 > URL: https://issues.apache.org/jira/browse/MESOS-3078 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Benjamin Mahler > > Currently, when resources are recovered, we do not perform an allocation for > that slave. Rather, we wait until the next allocation interval. > For small task, high throughput frameworks, this can have a significant > impact on overall throughput, see the following thread: > http://markmail.org/thread/y6mzfwzlurv6nik3 > We should consider immediately performing a re-allocation for the slave upon > resource recovery. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3986) Tests for allocator recovery.
[ https://issues.apache.org/jira/browse/MESOS-3986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joerg Schad updated MESOS-3986: --- Description: The allocator recover() call was introduced for correct recovery in presence of quota. We should add test verifying the correct behavior. (was: The allocator recovery() call was introduced for correct recovery in presence of quota. We should add test verifying the correct behavior.) > Tests for allocator recovery. > - > > Key: MESOS-3986 > URL: https://issues.apache.org/jira/browse/MESOS-3986 > Project: Mesos > Issue Type: Task > Components: allocation >Reporter: Alexander Rukletsov >Assignee: Joerg Schad > Labels: mesosphere > > The allocator recover() call was introduced for correct recovery in presence > of quota. We should add test verifying the correct behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4160) Log recover tests are slow
[ https://issues.apache.org/jira/browse/MESOS-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15144389#comment-15144389 ] Shuai Lin commented on MESOS-4160: -- Yeah, your suggestion makes sense. I uploaded a patch based on it. > Log recover tests are slow > -- > > Key: MESOS-4160 > URL: https://issues.apache.org/jira/browse/MESOS-4160 > Project: Mesos > Issue Type: Improvement > Components: technical debt, test >Reporter: Alexander Rukletsov >Assignee: Shuai Lin >Priority: Minor > Labels: mesosphere, newbie++, tech-debt > > On Mac OS 10.10.4, some tests take longer than {{1s}} to finish: > {code} > RecoverTest.AutoInitialization (1003 ms) > RecoverTest.AutoInitializationRetry (1000 ms) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1571) Signal escalation timeout is not configurable.
[ https://issues.apache.org/jira/browse/MESOS-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-1571: --- Shepherd: Benjamin Mahler Assignee: Alexander Rukletsov Sprint: Mesosphere Q4 Sprint 2 - 11/14, Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Sprint 29 (was: Mesosphere Q4 Sprint 2 - 11/14, Mesosphere Q4 Sprint 3 - 12/7) Story Points: 8 (was: 2) Summary: Signal escalation timeout is not configurable. (was: Signal escalation timeout is not configurable) > Signal escalation timeout is not configurable. > -- > > Key: MESOS-1571 > URL: https://issues.apache.org/jira/browse/MESOS-1571 > Project: Mesos > Issue Type: Bug >Reporter: Niklas Quarfot Nielsen >Assignee: Alexander Rukletsov > Labels: mesosphere > > Even though the executor shutdown grace period is set to a larger interval, > the signal escalation timeout will still be 3 seconds. It should either be > configurable or dependent on EXECUTOR_SHUTDOWN_GRACE_PERIOD. > Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2742) Draft design doc on global resources.
[ https://issues.apache.org/jira/browse/MESOS-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joerg Schad updated MESOS-2742: --- Fix Version/s: (was: 0.28.0) > Draft design doc on global resources. > - > > Key: MESOS-2742 > URL: https://issues.apache.org/jira/browse/MESOS-2742 > Project: Mesos > Issue Type: Task >Reporter: Niklas Quarfot Nielsen >Assignee: Joerg Schad > Labels: mesosphere > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3986) Tests for allocator recovery.
[ https://issues.apache.org/jira/browse/MESOS-3986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joerg Schad updated MESOS-3986: --- Description: The allocator recovery() call was introduced for correct recovery in presence of quota. We should add test verifying the correct behavior. > Tests for allocator recovery. > - > > Key: MESOS-3986 > URL: https://issues.apache.org/jira/browse/MESOS-3986 > Project: Mesos > Issue Type: Task > Components: allocation >Reporter: Alexander Rukletsov >Assignee: Joerg Schad > Labels: mesosphere > > The allocator recovery() call was introduced for correct recovery in presence > of quota. We should add test verifying the correct behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4664) Add allocator metrics.
Benjamin Mahler created MESOS-4664: -- Summary: Add allocator metrics. Key: MESOS-4664 URL: https://issues.apache.org/jira/browse/MESOS-4664 Project: Mesos Issue Type: Improvement Components: allocation Reporter: Benjamin Mahler Priority: Critical There are currently no metrics that provide visibility into the allocator, except for the event queue size. This makes monitoring and debugging allocation behavior in a multi-framework setup difficult. Some thoughts for initial metrics to add: * How many allocation runs have completed? (counter) * Current allocation breakdown: allocated / available / total (gauges) * Current maximum shares (gauges) * How many active filters are there for the role / framework? (gauges) * How many frameworks are suppressing offers? (gauges) * How long does an allocation run take? (timers) * Maintenance related metrics: ** How many maintenance events are active? (gauges) ** How many maintenance events are scheduled but not active (gauges) * Quota related metrics: ** How much quota is set for each role? (gauges) ** How much quota is satisfied? How much unsatisfied? (gauges) Some of these are already exposed from the master's metrics, but we should not assume this within the allocator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
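The metric kinds proposed above behave differently: counters only grow, gauges snapshot the current value, timers record durations. A toy sketch of the distinction (metric names are illustrative, not Mesos's actual endpoints):

```python
# Illustrative-only sketch of counter vs. gauge vs. timer semantics for the
# proposed allocator metrics; not the Mesos metrics API.
import time

metrics = {
    "allocator/allocation_runs": 0,             # counter: monotonically grows
    "allocator/resources/cpus/allocated": 0.0,  # gauge: current snapshot
    "allocator/allocation_run_ms": [],          # timer: duration samples
}

def allocation_run(allocated_cpus):
    start = time.monotonic()
    metrics["allocator/resources/cpus/allocated"] = allocated_cpus  # gauge: set
    metrics["allocator/allocation_runs"] += 1                       # counter: inc
    metrics["allocator/allocation_run_ms"].append(
        (time.monotonic() - start) * 1000.0)                        # timer: record

allocation_run(8.0)
allocation_run(10.0)   # gauge now reflects only the latest run
```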
[jira] [Assigned] (MESOS-2971) Implement OverlayFS based provisioner backend
[ https://issues.apache.org/jira/browse/MESOS-2971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuai Lin reassigned MESOS-2971: Assignee: Shuai Lin (was: Mei Wan) > Implement OverlayFS based provisioner backend > - > > Key: MESOS-2971 > URL: https://issues.apache.org/jira/browse/MESOS-2971 > Project: Mesos > Issue Type: Improvement >Reporter: Timothy Chen >Assignee: Shuai Lin > Labels: mesosphere, twitter, unified-containerizer-mvp > > Part of the image provisioning process is to call a backend to create a root > filesystem based on the image's on-disk layout. > The problem with the copy backend is that it wastes both IO and space, > and the bind backend can only deal with one layer. > An OverlayFS backend allows us to use the filesystem to merge multiple > filesystems into one efficiently. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
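The layering the ticket describes can be modeled without touching any real filesystem: an overlay presents a union of layers in which entries from upper layers shadow the same paths in lower layers. A filesystem-free sketch (paths and contents are invented for illustration):

```python
# Toy union-of-layers model of what an overlay mount provides: later (upper)
# layers shadow earlier (lower) ones. Paths and contents are invented.
lower1 = {"/bin/sh": "busybox", "/etc/os-release": "base image"}
lower2 = {"/app/script.py": "v1"}
upper  = {"/app/script.py": "v2"}          # container-local change wins

def overlay(*layers):
    merged = {}
    for layer in layers:                   # later layers shadow earlier ones
        merged.update(layer)
    return merged

root = overlay(lower1, lower2, upper)      # the merged root filesystem view
```

The copy backend would materialize this merge by duplicating every file; OverlayFS produces the same view lazily, which is where the IO and space savings come from.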
[jira] [Updated] (MESOS-4665) Reverse DNS for cert validation ?
[ https://issues.apache.org/jira/browse/MESOS-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pawan updated MESOS-4665: - Description: I have three mesos master nodes configured to use SSL and with cert validation enabled. All the machines are failing cert-validation and hence the peering with the following error: I0212 14:02:22.019564 20544 network.hpp:463] ZooKeeper group PIDs: { log-replica(1)@192.168.1.16:5050, log-replica(1)@192.168.1.27:5050, log-replica(1)@192.168.1.30:5050 } I0212 14:02:22.037328 20545 libevent_ssl_socket.cpp:973] Failed accept, verification error: Presented Certificate Name: mesos01.p.qa.a.com does not match peer hostname name: 192.168.1.16 I0212 14:02:22.041191 20545 libevent_ssl_socket.cpp:973] Failed accept, verification error: Presented Certificate Name: mesos02.p.qa.a.com does not match peer hostname name: 192.168.1.27 I0212 14:02:22.061522 20545 libevent_ssl_socket.cpp:973] Failed accept, verification error: Presented Certificate Name: mesos01.p.qa.a.com does not match peer hostname name: 192.168.1.16 I0212 14:02:22.065572 20545 libevent_ssl_socket.cpp:373] Failed connect, verification error: Presented Certificate Name: mesos01.p.qa.a.com does not match peer hostname name: 192.168.1.16 I0212 14:02:22.065839 20545 process.cpp:1281] Failed to link, connect: Presented Certificate Name: mesos01.p.qa.a.com does not match peer hostname name: 192.168.1.16 E0212 14:02:22.065994 20545 process.cpp:1911] Failed to shutdown socket with fd 27: Transport endpoint is not connected I0212 14:02:22.068665 20545 libevent_ssl_socket.cpp:373] Failed connect, verification error: Presented Certificate Name: mesos02.p.qa.a.com does not match peer hostname name: 192.168.1.27 I0212 14:02:22.068761 20545 process.cpp:1281] Failed to link, connect: Presented Certificate Name: mesos02.p.qa.a.com does not match peer hostname name: 192.168.1.27 E0212 14:02:22.068830 20545 process.cpp:1911] Failed to shutdown socket with fd 28: Transport 
endpoint is not connected -- >From my understanding and looking at the source, during cert validation, mesos >uses getnameinfo call to get the hostname of the connecting peer using the IP >address on the socket connection. Everything worked when I added host-ip >mappings of all peers to /etc/hosts on each host. Does mesos inherently expect reverse DNS (PTR records) to be provisioned ? If so, this is very challenging and unrealistic expectation. Even worse if you are deploying mesos in a firewalled/NAT-ed environment. Is my understanding right ? Am I missing anything here ? How would you recommend me to proceed ? Also, I use --hostname to set hostname of all mesos nodes and see the right [ip, hostname] info in zookeeper node. Looks like mesos is not using it during cert validation. was: I have three mesos master nodes configured to use SSL and with cert validation enabled. All the machines are failing cert-validation and hence the peering with the following error: I0212 14:02:22.019564 20544 network.hpp:463] ZooKeeper group PIDs: { log-replica(1)@192.168.1.16:5050, log-replica(1)@192.168.1.27:5050, log-replica(1)@192.168.1.30:5050 } I0212 14:02:22.037328 20545 libevent_ssl_socket.cpp:973] Failed accept, verification error: Presented Certificate Name: mesos01.p.qa.a.com does not match peer hostname name: 192.168.1.16 I0212 14:02:22.041191 20545 libevent_ssl_socket.cpp:973] Failed accept, verification error: Presented Certificate Name: mesos02.p.qa.a.com does not match peer hostname name: 192.168.1.27 I0212 14:02:22.061522 20545 libevent_ssl_socket.cpp:973] Failed accept, verification error: Presented Certificate Name: mesos01.p.qa.a.com does not match peer hostname name: 192.168.1.16 I0212 14:02:22.065572 20545 libevent_ssl_socket.cpp:373] Failed connect, verification error: Presented Certificate Name: mesos01.p.qa.a.com does not match peer hostname name: 192.168.1.16 I0212 14:02:22.065839 20545 process.cpp:1281] Failed to link, connect: Presented Certificate Name: 
mesos01.p.qa.a.com does not match peer hostname name: 192.168.1.16 E0212 14:02:22.065994 20545 process.cpp:1911] Failed to shutdown socket with fd 27: Transport endpoint is not connected I0212 14:02:22.068665 20545 libevent_ssl_socket.cpp:373] Failed connect, verification error: Presented Certificate Name: mesos02.p.qa.a.com does not match peer hostname name: 192.168.1.27 I0212 14:02:22.068761 20545 process.cpp:1281] Failed to link, connect: Presented Certificate Name: mesos02.p.qa.a.com does not match peer hostname name: 192.168.1.27 E0212 14:02:22.068830 20545 process.cpp:1911] Failed to shutdown socket with fd 28: Transport endpoint is not connected >From my understanding and looking at the source, during cert validation, mesos >uses getnameinfo call to get the hostname of the connecting peer using the IP >address on the socket
[jira] [Updated] (MESOS-4664) Add allocator metrics.
[ https://issues.apache.org/jira/browse/MESOS-4664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-4664: --- Description: There are currently no metrics that provide visibility into the allocator, except for the event queue size. This makes monitoring an debugging allocation behavior in a multi-framework setup difficult. Some thoughts for initial metrics to add: * How many allocation runs have completed? (counter) * How many allocations each framework got? (counter) * Current allocation breakdown: allocated / available / total (gauges) * Current maximum shares (gauges) * How many active filters are there for the role / framework? (gauges) * How many frameworks are suppressing offers? (gauges) * How long does an allocation run take? (timers) * Maintenance related metrics: ** How many maintenance events are active? (gauges) ** How many maintenance events are scheduled but not active (gauges) * Quota related metrics: ** How much quota is set for each role? (gauges) ** How much quota is satisfied? How much unsatisfied? (gauges) Some of these are already exposed from the master's metrics, but we should not assume this within the allocator. was: There are currently no metrics that provide visibility into the allocator, except for the event queue size. This makes monitoring an debugging allocation behavior in a multi-framework setup difficult. Some thoughts for initial metrics to add: * How many allocation runs have completed? (counter) * Current allocation breakdown: allocated / available / total (gauges) * Current maximum shares (gauges) * How many active filters are there for the role / framework? (gauges) * How many frameworks are suppressing offers? (gauges) * How long does an allocation run take? (timers) * Maintenance related metrics: ** How many maintenance events are active? 
(gauges) ** How many maintenance events are scheduled but not active (gauges) * Quota related metrics: ** How much quota is set for each role? (gauges) ** How much quota is satisfied? How much unsatisfied? (gauges) Some of these are already exposed from the master's metrics, but we should not assume this within the allocator. > Add allocator metrics. > -- > > Key: MESOS-4664 > URL: https://issues.apache.org/jira/browse/MESOS-4664 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Benjamin Mahler >Assignee: Benjamin Bannier >Priority: Critical > > There are currently no metrics that provide visibility into the allocator, > except for the event queue size. This makes monitoring an debugging > allocation behavior in a multi-framework setup difficult. > Some thoughts for initial metrics to add: > * How many allocation runs have completed? (counter) > * How many allocations each framework got? (counter) > * Current allocation breakdown: allocated / available / total (gauges) > * Current maximum shares (gauges) > * How many active filters are there for the role / framework? (gauges) > * How many frameworks are suppressing offers? (gauges) > * How long does an allocation run take? (timers) > * Maintenance related metrics: > ** How many maintenance events are active? (gauges) > ** How many maintenance events are scheduled but not active (gauges) > * Quota related metrics: > ** How much quota is set for each role? (gauges) > ** How much quota is satisfied? How much unsatisfied? (gauges) > > Some of these are already exposed from the master's metrics, but we should > not assume this within the allocator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-4664) Add allocator metrics.
[ https://issues.apache.org/jira/browse/MESOS-4664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Klaus Ma reassigned MESOS-4664: --- Assignee: Klaus Ma > Add allocator metrics. > -- > > Key: MESOS-4664 > URL: https://issues.apache.org/jira/browse/MESOS-4664 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Benjamin Mahler >Assignee: Klaus Ma >Priority: Critical > > There are currently no metrics that provide visibility into the allocator, > except for the event queue size. This makes monitoring and debugging > allocation behavior in a multi-framework setup difficult. > Some thoughts for initial metrics to add: > * How many allocation runs have completed? (counter) > * Current allocation breakdown: allocated / available / total (gauges) > * Current maximum shares (gauges) > * How many active filters are there for the role / framework? (gauges) > * How many frameworks are suppressing offers? (gauges) > * How long does an allocation run take? (timers) > * Maintenance related metrics: > ** How many maintenance events are active? (gauges) > ** How many maintenance events are scheduled but not active (gauges) > * Quota related metrics: > ** How much quota is set for each role? (gauges) > ** How much quota is satisfied? How much unsatisfied? (gauges) > > Some of these are already exposed from the master's metrics, but we should > not assume this within the allocator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4665) Reverse DNS for cert validation ?
[ https://issues.apache.org/jira/browse/MESOS-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pawan updated MESOS-4665: - Description: I have three mesos master nodes configured to use SSL and with cert validation enabled. All the machines are failing cert-validation and hence the peering with the following error: I0212 14:02:22.019564 20544 network.hpp:463] ZooKeeper group PIDs: { log-replica(1)@192.168.1.16:5050, log-replica(1)@192.168.1.27:5050, log-replica(1)@192.168.1.30:5050 } I0212 14:02:22.037328 20545 libevent_ssl_socket.cpp:973] Failed accept, verification error: Presented Certificate Name: mesos01.p.qa.a.com does not match peer hostname name: 192.168.1.16 I0212 14:02:22.041191 20545 libevent_ssl_socket.cpp:973] Failed accept, verification error: Presented Certificate Name: mesos02.p.qa.a.com does not match peer hostname name: 192.168.1.27 I0212 14:02:22.061522 20545 libevent_ssl_socket.cpp:973] Failed accept, verification error: Presented Certificate Name: mesos01.p.qa.a.com does not match peer hostname name: 192.168.1.16 I0212 14:02:22.065572 20545 libevent_ssl_socket.cpp:373] Failed connect, verification error: Presented Certificate Name: mesos01.p.qa.a.com does not match peer hostname name: 192.168.1.16 I0212 14:02:22.065839 20545 process.cpp:1281] Failed to link, connect: Presented Certificate Name: mesos01.p.qa.a.com does not match peer hostname name: 192.168.1.16 E0212 14:02:22.065994 20545 process.cpp:1911] Failed to shutdown socket with fd 27: Transport endpoint is not connected I0212 14:02:22.068665 20545 libevent_ssl_socket.cpp:373] Failed connect, verification error: Presented Certificate Name: mesos02.p.qa.a.com does not match peer hostname name: 192.168.1.27 I0212 14:02:22.068761 20545 process.cpp:1281] Failed to link, connect: Presented Certificate Name: mesos02.p.qa.a.com does not match peer hostname name: 192.168.1.27 E0212 14:02:22.068830 20545 process.cpp:1911] Failed to shutdown socket with fd 28: Transport 
endpoint is not connected -- >From my understanding and looking at the source, during cert validation, mesos >uses getnameinfo call to get the hostname of the connecting peer using the IP >address on the socket connection. And this call would return the IP as a >string which is resulting in failures as our cert has a CN of only the peer >hostname. But, everything worked when I added host-ip mappings of all peers to >/etc/hosts on each host. Does mesos inherently expect reverse DNS (PTR records) to be provisioned ? If so, this is very challenging and unrealistic expectation. Even worse if you are deploying mesos in a firewalled/NAT-ed environment. Is my understanding right ? Am I missing anything here ? How would you recommend me to proceed ? Also, I use --hostname to set hostname of all mesos nodes and see the right [ip, hostname] info in zookeeper node. Looks like mesos is not using it during cert validation. was: I have three mesos master nodes configured to use SSL and with cert validation enabled. 
All the machines are failing cert-validation and hence the peering with the following error: I0212 14:02:22.019564 20544 network.hpp:463] ZooKeeper group PIDs: { log-replica(1)@192.168.1.16:5050, log-replica(1)@192.168.1.27:5050, log-replica(1)@192.168.1.30:5050 } I0212 14:02:22.037328 20545 libevent_ssl_socket.cpp:973] Failed accept, verification error: Presented Certificate Name: mesos01.p.qa.a.com does not match peer hostname name: 192.168.1.16 I0212 14:02:22.041191 20545 libevent_ssl_socket.cpp:973] Failed accept, verification error: Presented Certificate Name: mesos02.p.qa.a.com does not match peer hostname name: 192.168.1.27 I0212 14:02:22.061522 20545 libevent_ssl_socket.cpp:973] Failed accept, verification error: Presented Certificate Name: mesos01.p.qa.a.com does not match peer hostname name: 192.168.1.16 I0212 14:02:22.065572 20545 libevent_ssl_socket.cpp:373] Failed connect, verification error: Presented Certificate Name: mesos01.p.qa.a.com does not match peer hostname name: 192.168.1.16 I0212 14:02:22.065839 20545 process.cpp:1281] Failed to link, connect: Presented Certificate Name: mesos01.p.qa.a.com does not match peer hostname name: 192.168.1.16 E0212 14:02:22.065994 20545 process.cpp:1911] Failed to shutdown socket with fd 27: Transport endpoint is not connected I0212 14:02:22.068665 20545 libevent_ssl_socket.cpp:373] Failed connect, verification error: Presented Certificate Name: mesos02.p.qa.a.com does not match peer hostname name: 192.168.1.27 I0212 14:02:22.068761 20545 process.cpp:1281] Failed to link, connect: Presented Certificate Name: mesos02.p.qa.a.com does not match peer hostname name: 192.168.1.27 E0212 14:02:22.068830 20545 process.cpp:1911] Failed to shutdown socket with fd 28: Transport endpoint is not connected
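The lookup described above can be reproduced with {{getnameinfo}} directly: without a PTR record (or an /etc/hosts entry) for the peer's address, the resolver can only return the numeric IP, which then fails to match the certificate's CN. A sketch using the loopback address, which resolves everywhere:

```python
# Sketch of the reverse lookup the reporter describes. 127.0.0.1 is used only
# because it is universally resolvable; for a peer with no PTR record, the
# "resolved" name degenerates to the numeric IP, which cannot match a cert CN
# like mesos01.p.qa.a.com.
import socket

addr = ("127.0.0.1", 5050)

# Force the numeric form: effectively what validation ends up comparing
# against when no reverse mapping is provisioned.
numeric_host, _ = socket.getnameinfo(addr, socket.NI_NUMERICHOST)

# Without the flag, the resolver consults PTR records and /etc/hosts, which is
# why adding host-ip mappings to /etc/hosts made validation pass.
resolved_host, _ = socket.getnameinfo(addr, 0)
```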
[jira] [Updated] (MESOS-4658) process::Connection can lead to deadlock around execution in the same context.
[ https://issues.apache.org/jira/browse/MESOS-4658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-4658: -- Description: The {{Connection}} abstraction is prone to deadlocks arising from the object being destroyed inside the same execution context. Consider this example:
{code}
Option<Connection> connection = process::http::connect(...).get();
connection.disconnected()
  .onAny(defer(self(), , connection));
connection.disconnect();
connection = None();
{code}
In the above snippet, if {{connection = None()}} executes before the actual dispatch to {{ConnectionProcess}} happens, you might lose the only existing reference to the {{Connection}} object inside {{ConnectionProcess::disconnect}}. This would lead to the destruction of the {{Connection}} object in the {{ConnectionProcess}} execution context. We do have a snippet in our existing code that alludes to such occurrences happening: https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/http.cpp#L1325
{code}
// This is a one time request which will close the connection when
// the response is received. Since 'Connection' is reference-counted,
// we must keep a copy around until the disconnection occurs. Note
// that in order to avoid a deadlock (Connection destruction occurring
// from the ConnectionProcess execution context), we use 'async'.
{code}
AFAICT, for scenarios where we need to hold on to the {{Connection}} object for later, this approach does not suffice.

was: The {{Connection}} abstraction is prone to deadlocks arising from the object being destroyed inside the same execution context. Consider this example:
{code}
Option<Connection> connection = process::http::connect(...);
connection.disconnected()
  .onAny(defer(self(), , connection));
connection.disconnect();
connection = None();
{code}
In the above snippet, if {{connection = None()}} executes before the actual dispatch to {{ConnectionProcess}} happens, you might lose the only existing reference to the {{Connection}} object inside {{ConnectionProcess::disconnect}}. This would lead to the destruction of the {{Connection}} object in the {{ConnectionProcess}} execution context. We do have a snippet in our existing code that alludes to such occurrences happening: https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/http.cpp#L1325
{code}
// This is a one time request which will close the connection when
// the response is received. Since 'Connection' is reference-counted,
// we must keep a copy around until the disconnection occurs. Note
// that in order to avoid a deadlock (Connection destruction occurring
// from the ConnectionProcess execution context), we use 'async'.
{code}
AFAICT, for scenarios where we need to hold on to the {{Connection}} object for later, this approach does not suffice.

> process::Connection can lead to deadlock around execution in the same context.
> --
>
> Key: MESOS-4658
> URL: https://issues.apache.org/jira/browse/MESOS-4658
> Project: Mesos
> Issue Type: Bug
> Components: HTTP API, libprocess
>Reporter: Anand Mazumdar
>Assignee: Shuai Lin
> Labels: mesosphere
>
> The {{Connection}} abstraction is prone to deadlocks arising from the object
> being destroyed inside the same execution context.
> Consider this example:
> {code}
> Option<Connection> connection = process::http::connect(...).get();
> connection.disconnected()
>   .onAny(defer(self(), , connection));
> connection.disconnect();
> connection = None();
> {code}
> In the above snippet, if {{connection = None()}} executes before the actual
> dispatch to {{ConnectionProcess}} happens, you might lose the only existing
> reference to the {{Connection}} object inside {{ConnectionProcess::disconnect}}.
> This would lead to the destruction of the {{Connection}} object in the
> {{ConnectionProcess}} execution context.
> We do have a snippet in our existing code that alludes to such occurrences
> happening:
> https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/http.cpp#L1325
> {code}
> // This is a one time request which will close the connection when
> // the response is received. Since 'Connection' is reference-counted,
> // we must keep a copy around until the disconnection occurs. Note
> // that in order to avoid a deadlock (Connection destruction occurring
> // from the ConnectionProcess execution context), we use 'async'.
> {code}
> AFAICT, for scenarios where we need to hold on to the {{Connection}} object
> for later, this approach does not suffice. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
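The hazard described in this ticket is not libprocess-specific: any reference-counted object whose teardown must join or drain its own worker context will fail (or deadlock) if the last reference dies on that worker. A minimal Python analogy, with made-up names and Python threads standing in for libprocess processes, makes the failure mode concrete:

```python
import threading

teardown_errors = []

def worker(state):
    # Wait for "disconnect", then attempt to tear ourselves down from
    # our own execution context -- the analogue of the last Connection
    # reference dying inside ConnectionProcess.
    state["event"].wait()
    try:
        state["thread"].join()
    except RuntimeError as exc:  # "cannot join current thread"
        teardown_errors.append(exc)

state = {"event": threading.Event()}
state["thread"] = threading.Thread(target=worker, args=(state,))
state["thread"].start()
state["event"].set()      # trigger the "disconnect" from outside
state["thread"].join()    # safe: teardown driven from another context
print(len(teardown_errors))  # 1
```

Bouncing the final release to a different execution context (the 'async' trick that the http.cpp comment describes) is exactly what makes the last `join()` here safe while the in-worker one is not.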
[jira] [Updated] (MESOS-4664) Add allocator metrics.
[ https://issues.apache.org/jira/browse/MESOS-4664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Klaus Ma updated MESOS-4664: Assignee: (was: Klaus Ma)
> Add allocator metrics.
> --
>
> Key: MESOS-4664
> URL: https://issues.apache.org/jira/browse/MESOS-4664
> Project: Mesos
> Issue Type: Improvement
> Components: allocation
>Reporter: Benjamin Mahler
>Priority: Critical
>
> There are currently no metrics that provide visibility into the allocator,
> except for the event queue size. This makes monitoring and debugging
> allocation behavior in a multi-framework setup difficult.
> Some thoughts for initial metrics to add:
> * How many allocation runs have completed? (counter)
> * Current allocation breakdown: allocated / available / total (gauges)
> * Current maximum shares (gauges)
> * How many active filters are there for the role / framework? (gauges)
> * How many frameworks are suppressing offers? (gauges)
> * How long does an allocation run take? (timers)
> * Maintenance related metrics:
> ** How many maintenance events are active? (gauges)
> ** How many maintenance events are scheduled but not active? (gauges)
> * Quota related metrics:
> ** How much quota is set for each role? (gauges)
> ** How much quota is satisfied? How much unsatisfied? (gauges)
>
> Some of these are already exposed from the master's metrics, but we should
> not assume this within the allocator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2930) Allow the Resource Estimator to express over-allocation of revocable resources.
[ https://issues.apache.org/jira/browse/MESOS-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Klaus Ma updated MESOS-2930: Assignee: (was: Klaus Ma)
> Allow the Resource Estimator to express over-allocation of revocable
> resources.
> ---
>
> Key: MESOS-2930
> URL: https://issues.apache.org/jira/browse/MESOS-2930
> Project: Mesos
> Issue Type: Improvement
> Components: slave
>Reporter: Benjamin Mahler
>
> Currently the resource estimator returns the amount of oversubscription
> resources that are available. Since resources cannot be negative, this allows
> the resource estimator to express the following:
> (1) Return empty resources: We are fully allocated for oversubscription
> resources.
> (2) Return non-empty resources: We are under-allocated for oversubscription
> resources. In other words, some are available.
> However, there is an additional situation that we cannot express:
> (3) Analogous to returning non-empty "negative" resources: We are
> over-allocated for oversubscription resources. Do not re-offer any of the
> over-allocated oversubscription resources that are recovered.
> Without (3), the slave can only shrink the total pool of oversubscription
> resources by returning (1) as resources are recovered, until the pool is
> shrunk to the desired size. However, this approach is only best-effort: it's
> possible for a framework to launch more tasks during the window of time (15
> seconds by default) between the slave's polls of the estimator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1961) Ensure executor state is correctly reconciled between master and slave.
[ https://issues.apache.org/jira/browse/MESOS-1961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Klaus Ma updated MESOS-1961: Assignee: (was: Klaus Ma) > Ensure executor state is correctly reconciled between master and slave. > --- > > Key: MESOS-1961 > URL: https://issues.apache.org/jira/browse/MESOS-1961 > Project: Mesos > Issue Type: Epic > Components: master, slave >Reporter: Benjamin Mahler > > The master and slave should correctly reconcile the state of executors, much > like the master and slave now correctly reconcile task state (MESOS-1407). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4588) Set title for documentation webpages.
[ https://issues.apache.org/jira/browse/MESOS-4588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-4588: --- Summary: Set title for documentation webpages. (was: Set title for documentation webpages)
> Set title for documentation webpages.
> -
>
> Key: MESOS-4588
> URL: https://issues.apache.org/jira/browse/MESOS-4588
> Project: Mesos
> Issue Type: Improvement
> Components: documentation, project website
>Reporter: Neil Conway
>Assignee: Abhishek Dasgupta
> Labels: documentation, mesosphere, website
>
> The HTML we generate for the documentation pages (e.g.,
> https://mesos.apache.org/documentation/latest/authorization/) has an empty
> {{<title>}} tag. This seems bad: we probably lose search engine karma, and it
> is hard to identify a documentation page in your browser tabs, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4665) Reverse DNS for cert validation ?
pawan created MESOS-4665:
Summary: Reverse DNS for cert validation ?
Key: MESOS-4665
URL: https://issues.apache.org/jira/browse/MESOS-4665
Project: Mesos
Issue Type: Bug
Affects Versions: 0.26.0
Reporter: pawan

I have three mesos master nodes configured to use SSL and with cert validation enabled. All the machines are failing cert validation, and hence the peering, with the following error:
I0212 14:02:22.019564 20544 network.hpp:463] ZooKeeper group PIDs: { log-replica(1)@192.168.1.16:5050, log-replica(1)@192.168.1.27:5050, log-replica(1)@192.168.1.30:5050 }
I0212 14:02:22.037328 20545 libevent_ssl_socket.cpp:973] Failed accept, verification error: Presented Certificate Name: mesos01.p.qa.a.com does not match peer hostname name: 192.168.1.16
I0212 14:02:22.041191 20545 libevent_ssl_socket.cpp:973] Failed accept, verification error: Presented Certificate Name: mesos02.p.qa.a.com does not match peer hostname name: 192.168.1.27
I0212 14:02:22.061522 20545 libevent_ssl_socket.cpp:973] Failed accept, verification error: Presented Certificate Name: mesos01.p.qa.a.com does not match peer hostname name: 192.168.1.16
I0212 14:02:22.065572 20545 libevent_ssl_socket.cpp:373] Failed connect, verification error: Presented Certificate Name: mesos01.p.qa.a.com does not match peer hostname name: 192.168.1.16
I0212 14:02:22.065839 20545 process.cpp:1281] Failed to link, connect: Presented Certificate Name: mesos01.p.qa.a.com does not match peer hostname name: 192.168.1.16
E0212 14:02:22.065994 20545 process.cpp:1911] Failed to shutdown socket with fd 27: Transport endpoint is not connected
I0212 14:02:22.068665 20545 libevent_ssl_socket.cpp:373] Failed connect, verification error: Presented Certificate Name: mesos02.p.qa.a.com does not match peer hostname name: 192.168.1.27
I0212 14:02:22.068761 20545 process.cpp:1281] Failed to link, connect: Presented Certificate Name: mesos02.p.qa.a.com does not match peer hostname name: 192.168.1.27
E0212 14:02:22.068830 20545 process.cpp:1911] Failed to shutdown socket with fd 28: Transport endpoint is not connected
From my understanding and from looking at the source, during cert validation Mesos uses a getnameinfo call to get the hostname of the connecting peer from the IP address on the socket connection. Everything worked when I added host-IP mappings of all peers to /etc/hosts on each host.
Does Mesos inherently expect reverse DNS (PTR records) to be provisioned? If so, this is a very challenging and unrealistic expectation, even worse if you are deploying Mesos in a firewalled/NAT-ed environment. Is my understanding right? Am I missing anything here? How would you recommend I proceed?
Also, I use --hostname to set the hostname of all Mesos nodes and I see the right [ip, hostname] info in the ZooKeeper node. It looks like Mesos is not using it during cert validation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
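The behavior the reporter describes is easy to reproduce outside Mesos. A small sketch using Python's wrapper around the same getnameinfo(3) call (the `peer_name` helper is made up for illustration; results depend on the local resolver and /etc/hosts):

```python
import socket

def peer_name(ip, port):
    """Reverse-resolve a peer address the way a getnameinfo-based
    certificate validator would. If no PTR record (or /etc/hosts
    entry) exists for the address, the numeric IP string is returned
    unchanged, so it can never match a hostname CN such as
    mesos01.p.qa.a.com."""
    host, _service = socket.getnameinfo((ip, port), 0)
    return host

# On most hosts 127.0.0.1 maps to "localhost" via /etc/hosts; an
# address with no reverse mapping is simply echoed back verbatim.
print(peer_name("127.0.0.1", 5050))
```

Adding host-to-IP mappings in /etc/hosts works around the problem precisely because getnameinfo consults the hosts file before (or instead of) PTR lookups, depending on the resolver configuration.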
[jira] [Assigned] (MESOS-4664) Add allocator metrics.
[ https://issues.apache.org/jira/browse/MESOS-4664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier reassigned MESOS-4664: --- Assignee: Benjamin Bannier
> Add allocator metrics.
> --
>
> Key: MESOS-4664
> URL: https://issues.apache.org/jira/browse/MESOS-4664
> Project: Mesos
> Issue Type: Improvement
> Components: allocation
>Reporter: Benjamin Mahler
>Assignee: Benjamin Bannier
>Priority: Critical
>
> There are currently no metrics that provide visibility into the allocator,
> except for the event queue size. This makes monitoring and debugging
> allocation behavior in a multi-framework setup difficult.
> Some thoughts for initial metrics to add:
> * How many allocation runs have completed? (counter)
> * Current allocation breakdown: allocated / available / total (gauges)
> * Current maximum shares (gauges)
> * How many active filters are there for the role / framework? (gauges)
> * How many frameworks are suppressing offers? (gauges)
> * How long does an allocation run take? (timers)
> * Maintenance related metrics:
> ** How many maintenance events are active? (gauges)
> ** How many maintenance events are scheduled but not active? (gauges)
> * Quota related metrics:
> ** How much quota is set for each role? (gauges)
> ** How much quota is satisfied? How much unsatisfied? (gauges)
>
> Some of these are already exposed from the master's metrics, but we should
> not assume this within the allocator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4664) Add allocator metrics.
[ https://issues.apache.org/jira/browse/MESOS-4664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier updated MESOS-4664: Sprint: Mesosphere Sprint 29
> Add allocator metrics.
> --
>
> Key: MESOS-4664
> URL: https://issues.apache.org/jira/browse/MESOS-4664
> Project: Mesos
> Issue Type: Improvement
> Components: allocation
>Reporter: Benjamin Mahler
>Assignee: Benjamin Bannier
>Priority: Critical
>
> There are currently no metrics that provide visibility into the allocator,
> except for the event queue size. This makes monitoring and debugging
> allocation behavior in a multi-framework setup difficult.
> Some thoughts for initial metrics to add:
> * How many allocation runs have completed? (counter)
> * How many allocations each framework got? (counter)
> * Current allocation breakdown: allocated / available / total (gauges)
> * Current maximum shares (gauges)
> * How many active filters are there for the role / framework? (gauges)
> * How many frameworks are suppressing offers? (gauges)
> * How long does an allocation run take? (timers)
> * Maintenance related metrics:
> ** How many maintenance events are active? (gauges)
> ** How many maintenance events are scheduled but not active? (gauges)
> * Quota related metrics:
> ** How much quota is set for each role? (gauges)
> ** How much quota is satisfied? How much unsatisfied? (gauges)
>
> Some of these are already exposed from the master's metrics, but we should
> not assume this within the allocator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
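The metrics proposed in MESOS-4664 map onto the usual counter / gauge / timer trio. A rough Python sketch (class and attribute names are hypothetical; Mesos uses its own C++ metrics library, so this only illustrates the bookkeeping an allocator loop would do):

```python
import time
from collections import defaultdict

class AllocatorMetrics:
    """Toy metrics bag for an allocator loop. Illustrative only:
    shows how counters, gauges, and timers from the ticket's list
    would be maintained around each allocation cycle."""

    def __init__(self):
        self.allocation_runs = 0                # counter: completed runs
        self.allocation_run_secs = []           # timer: per-run durations
        self.frameworks_suppressed = 0          # gauge: current value
        self.active_filters = defaultdict(int)  # gauge, keyed by role/framework

    def timed_allocation(self, allocate):
        """Run one allocation cycle, updating the counter and timer."""
        start = time.perf_counter()
        allocate()
        self.allocation_runs += 1
        self.allocation_run_secs.append(time.perf_counter() - start)

metrics = AllocatorMetrics()
metrics.timed_allocation(lambda: None)  # one (empty) allocation cycle
print(metrics.allocation_runs)          # 1
```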
[jira] [Created] (MESOS-4666) Expose total resources of a slave in offer for scheduling decisions
Anindya Sinha created MESOS-4666:
Summary: Expose total resources of a slave in offer for scheduling decisions
Key: MESOS-4666
URL: https://issues.apache.org/jira/browse/MESOS-4666
Project: Mesos
Issue Type: Improvement
Components: general
Affects Versions: 0.25.0
Reporter: Anindya Sinha
Assignee: Anindya Sinha
Priority: Minor

To effectively schedule certain classes of tasks, the scheduler might need to know not only the available resources (as exposed currently) but also the maximum resources available on that slave. This is specifically true for clusters having different configurations of the slave nodes in terms of resources such as cpu, memory, disk, etc. Certain classes of tasks might need to be scheduled on the same slave (esp. those needing shared persistent volumes, MESOS-3421). Instead of dedicating a slave to a framework, the framework could make a very good determination if it had exposure to both available and total resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4666) Expose total resources of a slave in offer for scheduling decisions
[ https://issues.apache.org/jira/browse/MESOS-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15144942#comment-15144942 ] Anindya Sinha commented on MESOS-4666: -- We can expose `repeated Resource total_resources` in the protobuf `Offer` to notify frameworks of the maximum resources. This would be a summation of all resources (cpu, mem, disk, etc.) across all roles (+ unreserved), and would be the value that mesos-slave was started with (either via the --resources flag, or derived from system resources). To preserve the existing behavior, total_resources can be an opt-in via a new `FrameworkInfo::Capability` set at framework registration, or we could make it a slave flag as well (defaulting to the current behavior of not exposing total_resources in Offer).
> Expose total resources of a slave in offer for scheduling decisions
> ---
>
> Key: MESOS-4666
> URL: https://issues.apache.org/jira/browse/MESOS-4666
> Project: Mesos
> Issue Type: Improvement
> Components: general
>Affects Versions: 0.25.0
>Reporter: Anindya Sinha
>Assignee: Anindya Sinha
>Priority: Minor
>
> To effectively schedule certain classes of tasks, the scheduler might need to
> know not only the available resources (as exposed currently) but also the
> maximum resources available on that slave. This is specifically true for
> clusters having different configurations of the slave nodes in terms of
> resources such as cpu, memory, disk, etc.
> Certain classes of tasks might need to be scheduled on the same slave
> (esp. those needing shared persistent volumes, MESOS-3421). Instead of
> dedicating a slave to a framework, the framework could make a very good
> determination if it had exposure to both available and total resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
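Sketched as a protobuf fragment, the proposal in the comment above might look roughly like the following (the field tag number and the comment wording are illustrative assumptions, not an actual Mesos change):

```
message Offer {
  ...
  // Total resources of the slave this offer originates from,
  // aggregated across all roles plus unreserved resources; equal to
  // what mesos-slave was started with (--resources flag or derived
  // from system resources). Hypothetically only populated for
  // frameworks that opt in via a new FrameworkInfo::Capability.
  repeated Resource total_resources = 12;
}
```

Since `repeated` fields default to empty, frameworks that do not opt in would simply see no `total_resources`, preserving today's behavior on the wire.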
[jira] [Updated] (MESOS-3915) Upgrade vendored Boost
[ https://issues.apache.org/jira/browse/MESOS-3915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neil Conway updated MESOS-3915: --- Summary: Upgrade vendored Boost (was: Upgrade vendored Boost to 1.59) > Upgrade vendored Boost > -- > > Key: MESOS-3915 > URL: https://issues.apache.org/jira/browse/MESOS-3915 > Project: Mesos > Issue Type: Bug >Reporter: Neil Conway >Priority: Minor > Labels: boost, mesosphere, tech-debt > > We should upgrade the vendored version of Boost to a newer version. Benefits: > * -Should properly fix MESOS-688- > * -Should fix MESOS-3799- > * Generally speaking, using a more modern version of Boost means we can take > advantage of bug fixes, optimizations, and new features. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4095) Flaky test: MasterAllocatorTest/1.FrameworkExited
[ https://issues.apache.org/jira/browse/MESOS-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neil Conway updated MESOS-4095: --- Labels: flaky flaky-test mesosphere (was: flaky-test mesosphere) > Flaky test: MasterAllocatorTest/1.FrameworkExited > - > > Key: MESOS-4095 > URL: https://issues.apache.org/jira/browse/MESOS-4095 > Project: Mesos > Issue Type: Bug > Components: allocation, master, test >Reporter: Neil Conway > Labels: flaky, flaky-test, mesosphere > Attachments: wily64_master_allocator_framework_exited-1.log > > > This test fails about ~10% of the time for me on Ubuntu 15.10 (running in a > crappy Virtualbox VM). Passes consistently on Mac OSX 10.10. > Verbose test log attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4591) `/reserve` endpoint allows reservations for any role
[ https://issues.apache.org/jira/browse/MESOS-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145035#comment-15145035 ] Neil Conway commented on MESOS-4591: Note that the same issue applies to the {{/create-volumes}} endpoint as well. > `/reserve` endpoint allows reservations for any role > > > Key: MESOS-4591 > URL: https://issues.apache.org/jira/browse/MESOS-4591 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.27.0 >Reporter: Greg Mann > Labels: mesosphere, reservations > > When frameworks reserve resources, the validation of the operation ensures > that the {{role}} of the reservation matches the {{role}} of the framework. > For the case of the {{/reserve}} operator endpoint, however, the operator has > no role to validate, so this check isn't performed. > This means that if an ACL exists which authorizes a framework's principal to > reserve resources, that same principal can be used to reserve resources for > _any_ role through the operator endpoint. > We should restrict reservations made through the operator endpoint to > specified roles. A few possibilities: > * The {{object}} of the {{reserve_resources}} ACL could be changed from > {{resources}} to {{roles}} > * A second ACL could be added for authorization of {{reserve}} operations, > with an {{object}} of {{role}} > * Our conception of the {{resources}} object in the {{reserve_resources}} ACL > could be expanded to include role information, i.e., > {{disk(role1);mem(role1)}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4591) `/reserve` and `/create-volumes` endpoints allow reservations for any role
[ https://issues.apache.org/jira/browse/MESOS-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann updated MESOS-4591: - Summary: `/reserve` and `/create-volumes` endpoints allow reservations for any role (was: `/reserve` endpoint allows reservations for any role) > `/reserve` and `/create-volumes` endpoints allow reservations for any role > -- > > Key: MESOS-4591 > URL: https://issues.apache.org/jira/browse/MESOS-4591 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.27.0 >Reporter: Greg Mann > Labels: mesosphere, reservations > > When frameworks reserve resources, the validation of the operation ensures > that the {{role}} of the reservation matches the {{role}} of the framework. > For the case of the {{/reserve}} operator endpoint, however, the operator has > no role to validate, so this check isn't performed. > This means that if an ACL exists which authorizes a framework's principal to > reserve resources, that same principal can be used to reserve resources for > _any_ role through the operator endpoint. > We should restrict reservations made through the operator endpoint to > specified roles. A few possibilities: > * The {{object}} of the {{reserve_resources}} ACL could be changed from > {{resources}} to {{roles}} > * A second ACL could be added for authorization of {{reserve}} operations, > with an {{object}} of {{role}} > * Our conception of the {{resources}} object in the {{reserve_resources}} ACL > could be expanded to include role information, i.e., > {{disk(role1);mem(role1)}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4591) `/reserve` and `/create-volumes` endpoints allow operations for any role
[ https://issues.apache.org/jira/browse/MESOS-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann updated MESOS-4591: - Summary: `/reserve` and `/create-volumes` endpoints allow operations for any role (was: `/reserve` and `/create-volumes` endpoints allow reservations for any role) > `/reserve` and `/create-volumes` endpoints allow operations for any role > > > Key: MESOS-4591 > URL: https://issues.apache.org/jira/browse/MESOS-4591 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.27.0 >Reporter: Greg Mann > Labels: mesosphere, reservations > > When frameworks reserve resources, the validation of the operation ensures > that the {{role}} of the reservation matches the {{role}} of the framework. > For the case of the {{/reserve}} operator endpoint, however, the operator has > no role to validate, so this check isn't performed. > This means that if an ACL exists which authorizes a framework's principal to > reserve resources, that same principal can be used to reserve resources for > _any_ role through the operator endpoint. > We should restrict reservations made through the operator endpoint to > specified roles. A few possibilities: > * The {{object}} of the {{reserve_resources}} ACL could be changed from > {{resources}} to {{roles}} > * A second ACL could be added for authorization of {{reserve}} operations, > with an {{object}} of {{role}} > * Our conception of the {{resources}} object in the {{reserve_resources}} ACL > could be expanded to include role information, i.e., > {{disk(role1);mem(role1)}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3801) Flaky test on Ubuntu Wily: ReservationTest.DropReserveTooLarge
[ https://issues.apache.org/jira/browse/MESOS-3801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neil Conway updated MESOS-3801: --- Labels: flaky flaky-test mesosphere (was: flaky-test mesosphere) > Flaky test on Ubuntu Wily: ReservationTest.DropReserveTooLarge > -- > > Key: MESOS-3801 > URL: https://issues.apache.org/jira/browse/MESOS-3801 > Project: Mesos > Issue Type: Bug > Environment: Linux vagrant-ubuntu-wily-64 4.2.0-16-generic #19-Ubuntu > SMP Thu Oct 8 15:35:06 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Neil Conway >Priority: Minor > Labels: flaky, flaky-test, mesosphere > Attachments: test_fail_verbose.txt, test_run_15sec.txt > > > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from ReservationTest > [ RUN ] ReservationTest.DropReserveTooLarge > /mesos/src/tests/reservation_tests.cpp:449: Failure > Failed to wait 15secs for offers > /mesos/src/tests/reservation_tests.cpp:439: Failure > Actual function call count doesn't match EXPECT_CALL(sched, > resourceOffers(, _))... > Expected: to be called once >Actual: never called - unsatisfied and active > /mesos/src/tests/reservation_tests.cpp:421: Failure > Actual function call count doesn't match EXPECT_CALL(allocator, addSlave(_, > _, _, _, _))... > Expected: to be called once >Actual: never called - unsatisfied and active > [ FAILED ] ReservationTest.DropReserveTooLarge (15302 ms) > [--] 1 test from ReservationTest (15303 ms total) > [--] Global test environment tear-down > [==] 1 test from 1 test case ran. (15308 ms total) > [ PASSED ] 0 tests. > [ FAILED ] 1 test, listed below: > [ FAILED ] ReservationTest.DropReserveTooLarge > 1 FAILED TEST > {noformat} > Repro'd via "mesos-tests --gtest_filter=ReservationTest.DropReserveTooLarge > --gtest_repeat=100". ~4 runs out of 100 resulted in the error. Note that test > runtime varied pretty widely: most test runs completed in < 500ms, but many > (1/3?) 
of runs took 5000ms or longer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4667) Expose persistent volume information in state.json
Neil Conway created MESOS-4667: -- Summary: Expose persistent volume information in state.json Key: MESOS-4667 URL: https://issues.apache.org/jira/browse/MESOS-4667 Project: Mesos Issue Type: Bug Components: master Reporter: Neil Conway Priority: Minor The per-slave {{reserved_resources}} information returned by {{/state}} does not seem to include information about persistent volumes. This makes it hard for operators to use the {{/destroy-volumes}} endpoint. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4667) Expose persistent volume information in state.json
[ https://issues.apache.org/jira/browse/MESOS-4667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145146#comment-15145146 ] Neil Conway commented on MESOS-4667: Probably also need to expose the {{reserver_principal}} for dynamically reserved resources. > Expose persistent volume information in state.json > -- > > Key: MESOS-4667 > URL: https://issues.apache.org/jira/browse/MESOS-4667 > Project: Mesos > Issue Type: Bug > Components: master >Reporter: Neil Conway >Priority: Minor > Labels: endpoint, mesosphere > > The per-slave {{reserved_resources}} information returned by {{/state}} does > not seem to include information about persistent volumes. This makes it hard > for operators to use the {{/destroy-volumes}} endpoint. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4669) Add common compression utility
Jojy Varghese created MESOS-4669:
Summary: Add common compression utility
Key: MESOS-4669
URL: https://issues.apache.org/jira/browse/MESOS-4669
Project: Mesos
Issue Type: Bug
Components: containerization
Reporter: Jojy Varghese
Assignee: Jojy Varghese

We need a gzip decompression utility for the Appc image fetching functionality. The images are tar + gzip'ed, and they need to be uncompressed first so that we can compute the SHA-512 checksum on them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
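For reference, the order of operations the ticket describes (strip the gzip layer first, then hash the uncompressed tar) can be sketched with Python's standard library. This is an illustration only; the Mesos utility itself is C++, and the archive built here is a stand-in for a real Appc image:

```python
import gzip
import hashlib
import io
import tarfile

# Build a tiny tar+gzip archive in memory to stand in for an Appc image.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    data = b"hello appc"
    info = tarfile.TarInfo(name="rootfs/hello")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

# Step 1: uncompress the gzip layer.
tar_bytes = gzip.decompress(buf.getvalue())

# Step 2: the checksum is taken over the *uncompressed* tar bytes,
# not the gzip'ed payload that was fetched.
digest = hashlib.sha512(tar_bytes).hexdigest()
print(len(digest))  # 128 hex characters
```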
[jira] [Commented] (MESOS-4666) Expose total resources of a slave in offer for scheduling decisions
[ https://issues.apache.org/jira/browse/MESOS-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145359#comment-15145359 ] Anindya Sinha commented on MESOS-4666: -- Yes, if all these "related tasks" are launched by the same framework, but not if they are launched by different frameworks (and/or different schedulers).
> Expose total resources of a slave in offer for scheduling decisions
> ---
>
> Key: MESOS-4666
> URL: https://issues.apache.org/jira/browse/MESOS-4666
> Project: Mesos
> Issue Type: Improvement
> Components: general
>Affects Versions: 0.25.0
>Reporter: Anindya Sinha
>Assignee: Anindya Sinha
>Priority: Minor
>
> To effectively schedule certain classes of tasks, the scheduler might need to
> know not only the available resources (as exposed currently) but also the
> maximum resources available on that slave. This is specifically true for
> clusters having different configurations of the slave nodes in terms of
> resources such as cpu, memory, disk, etc.
> Certain classes of tasks might need to be scheduled on the same slave
> (esp. those needing shared persistent volumes, MESOS-3421). Instead of
> dedicating a slave to a framework, the framework could make a very good
> determination if it had exposure to both available and total resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4666) Expose total resources of a slave in offer for scheduling decisions
[ https://issues.apache.org/jira/browse/MESOS-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145376#comment-15145376 ] Vinod Kone commented on MESOS-4666: --- Hmm. Having to know what other frameworks have launched/reserved on a slave seems like a security issue. Can you be more specific about your use case? Do you have 2 different frameworks in the same role that launch tasks (or make dynamic reservations) that depend on each other? Or are you using static reservations?
> Expose total resources of a slave in offer for scheduling decisions
> ---
>
> Key: MESOS-4666
> URL: https://issues.apache.org/jira/browse/MESOS-4666
> Project: Mesos
> Issue Type: Improvement
> Components: general
>Affects Versions: 0.25.0
>Reporter: Anindya Sinha
>Assignee: Anindya Sinha
>Priority: Minor
>
> To effectively schedule certain classes of tasks, the scheduler might need to
> know not only the available resources (as exposed currently) but also the
> maximum resources available on that slave. This is specifically true for
> clusters having different configurations of the slave nodes in terms of
> resources such as cpu, memory, disk, etc.
> Certain classes of tasks might need to be scheduled on the same slave
> (esp. those needing shared persistent volumes, MESOS-3421). Instead of
> dedicating a slave to a framework, the framework could make a very good
> determination if it had exposure to both available and total resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4665) Reverse DNS for cert validation ?
[ https://issues.apache.org/jira/browse/MESOS-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145306#comment-15145306 ] Vinod Kone commented on MESOS-4665: --- Any comments here [~kaysoky] [~jvanremoortere] ? > Reverse DNS for cert validation ? > - > > Key: MESOS-4665 > URL: https://issues.apache.org/jira/browse/MESOS-4665 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.26.0 >Reporter: pawan > > I have three mesos master nodes configured to use SSL and with cert > validation enabled. All the machines are failing cert-validation and hence > the peering with the following error: > > I0212 14:02:22.019564 20544 network.hpp:463] ZooKeeper group PIDs: { > log-replica(1)@192.168.1.16:5050, log-replica(1)@192.168.1.27:5050, > log-replica(1)@192.168.1.30:5050 } > I0212 14:02:22.037328 20545 libevent_ssl_socket.cpp:973] Failed accept, > verification error: Presented Certificate Name: mesos01.p.qa.a.com does not > match peer hostname name: 192.168.1.16 > I0212 14:02:22.041191 20545 libevent_ssl_socket.cpp:973] Failed accept, > verification error: Presented Certificate Name: mesos02.p.qa.a.com does not > match peer hostname name: 192.168.1.27 > I0212 14:02:22.061522 20545 libevent_ssl_socket.cpp:973] Failed accept, > verification error: Presented Certificate Name: mesos01.p.qa.a.com does not > match peer hostname name: 192.168.1.16 > I0212 14:02:22.065572 20545 libevent_ssl_socket.cpp:373] Failed connect, > verification error: Presented Certificate Name: mesos01.p.qa.a.com does not > match peer hostname name: 192.168.1.16 > I0212 14:02:22.065839 20545 process.cpp:1281] Failed to link, connect: > Presented Certificate Name: mesos01.p.qa.a.com does not match peer hostname > name: 192.168.1.16 > E0212 14:02:22.065994 20545 process.cpp:1911] Failed to shutdown socket with > fd 27: Transport endpoint is not connected > I0212 14:02:22.068665 20545 libevent_ssl_socket.cpp:373] Failed connect, > verification error: Presented Certificate Name: 
mesos02.p.qa.a.com does not > match peer hostname name: 192.168.1.27 > I0212 14:02:22.068761 20545 process.cpp:1281] Failed to link, connect: > Presented Certificate Name: mesos02.p.qa.a.com does not match peer hostname > name: 192.168.1.27 > E0212 14:02:22.068830 20545 process.cpp:1911] Failed to shutdown socket with > fd 28: Transport endpoint is not connected > -- > From my understanding and looking at the source, during cert validation, > mesos uses the getnameinfo call to get the hostname of the connecting peer using > the IP address on the socket connection. And this call would return the IP as > a string, which is resulting in failures as our cert has a CN of only the peer > hostname. But everything worked when I added host-ip mappings of all peers > to /etc/hosts on each host. > Does mesos inherently expect reverse DNS (PTR records) to be provisioned? If > so, this is a very challenging and unrealistic expectation. Even worse if you > are deploying mesos in a firewalled/NAT-ed environment. > Is my understanding right? Am I missing anything here? How would you > recommend I proceed? > Also, I use --hostname to set the hostname of all mesos nodes and see the right > [ip, hostname] info in the zookeeper node. Looks like mesos is not using it > during cert validation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4670) `cgroup_info` not being exposed in state.json when ComposingContainerizer is used.
Avinash Sridharan created MESOS-4670: Summary: `cgroup_info` not being exposed in state.json when ComposingContainerizer is used. Key: MESOS-4670 URL: https://issues.apache.org/jira/browse/MESOS-4670 Project: Mesos Issue Type: Bug Reporter: Avinash Sridharan Assignee: Avinash Sridharan The ComposingContainerizer currently does not have a `status` method. This results in no `ContainerStatus` being updated in the agent when the `ComposingContainerizer` is used to launch containers. This specifically happens when the agent is launched with `--containerizers=docker,mesos` -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4668) Agent's /state endpoint does not include reservation information
Neil Conway created MESOS-4668: -- Summary: Agent's /state endpoint does not include reservation information Key: MESOS-4668 URL: https://issues.apache.org/jira/browse/MESOS-4668 Project: Mesos Issue Type: Bug Components: slave Reporter: Neil Conway Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4353) Limit the number of processes created by libprocess
[ https://issues.apache.org/jira/browse/MESOS-4353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145367#comment-15145367 ] Maged Michael commented on MESOS-4353: -- If this is OK, then I propose the following design:
* Introduce a new environment variable to allow the operator to set the number of libprocess worker threads.
* The environment variable is named LIBPROCESS_WORKER_THREADS.
* Valid values of the environment variable are integers in the range 1 to 1024.
* All other values are invalid and generate a warning.
* The proposed environment variable can be set directly for the Mesos master, agents (slaves), and tests.
* For executors, the proposed environment variable can be set indirectly by including it in the setting of the agent (slave) --executor_environment_variables option (see the documentation of Mesos configuration: http://mesos.apache.org/documentation/latest/configuration/).
* Update the documentation of Mesos configuration to reflect the addition of this libprocess environment variable.
> Limit the number of processes created by libprocess > --- > > Key: MESOS-4353 > URL: https://issues.apache.org/jira/browse/MESOS-4353 > Project: Mesos > Issue Type: Improvement > Components: libprocess >Reporter: Qian Zhang >Assignee: Qian Zhang > > Currently libprocess will create {{max(8, number of CPU cores)}} processes > during initialization, see > https://github.com/apache/mesos/blob/0.26.0/3rdparty/libprocess/src/process.cpp#L2146 > for details. This should be OK for a normal machine which does not have many cores > (e.g., 16, 32), but for a powerful machine which may have a large number of > cores (e.g., an IBM Power machine may have 192 cores), this will cause too > many worker threads, which are not necessary. > And since libprocess is widely used in Mesos (master, agent, scheduler, > executor), it may also cause performance issues. For example, when a user > creates a Docker container via Mesos in a Mesos agent which is running on a > powerful machine with 192 cores, the DockerContainerizer in the Mesos agent will > create a dedicated executor for the container, and there will be 192 worker > threads in that executor. And if a user creates 1000 Docker containers on that > machine, then there will be 1000 executors, i.e., 1000 * 192 worker threads, > which is a large number and may thrash the OS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
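The validation proposed above could be sketched roughly as follows. This is an illustrative C++ sketch, not libprocess code; the `workerThreads` helper and its `fallback` parameter are hypothetical names, and in practice the value would come from `getenv("LIBPROCESS_WORKER_THREADS")` during libprocess initialization.

```cpp
#include <cassert>
#include <cstdlib>
#include <iostream>
#include <string>

// Hypothetical helper (not libprocess code): validate the proposed
// LIBPROCESS_WORKER_THREADS value. Returns `fallback` (e.g. the current
// default of max(8, number of CPU cores)) when the value is unset or
// invalid, warning on invalid input as the proposal suggests.
size_t workerThreads(const char* value, size_t fallback)
{
  if (value == nullptr) {
    return fallback;  // Variable not set; keep the default.
  }

  char* end = nullptr;
  long n = std::strtol(value, &end, 10);

  // Reject non-numeric input and integers outside the range [1, 1024].
  if (end == value || *end != '\0' || n < 1 || n > 1024) {
    std::cerr << "Warning: invalid LIBPROCESS_WORKER_THREADS value '"
              << value << "'; using default" << std::endl;
    return fallback;
  }

  return static_cast<size_t>(n);
}
```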
[jira] [Commented] (MESOS-4666) Expose total resources of a slave in offer for scheduling decisions
[ https://issues.apache.org/jira/browse/MESOS-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145515#comment-15145515 ] Anindya Sinha commented on MESOS-4666: -- BTW, in total resources, the proposal is to send what the mesos-slave started off with (either via the --resources flag or derived from the system). Say a slave starts with cpus(*): 8;mem(*):16384;disk(*): 163840, we send this info in Offer::total_resources to denote that this is the max capability for this slave, in addition to the available resources for this specific framework. Another example: if a slave starts with cpus(*): 6;cpus(role1): 2;mem(*):8192;mem(role1): 8192;disk(*): 102400;disk(role1): 61440, we send the Offer::total_resources as a summation of all resource types across roles, i.e. cpus(*): 8;mem(*):16384;disk(*): 163840. So we want to send the total resources the slave started off with when it registered. I do not think it exposes what other frameworks have launched/reserved, so I am not sure if this would still qualify as a security issue. My use case is that we have 2 different frameworks in the same role to launch these related tasks on the same slave. Also, we do not want any other tasks to be running on this slave. We do not use static reservations but reserve resources dynamically. So the idea is that when the 1st framework comes up, we reserve all the resources on the slave if Offer::total_resources == Offer::resources (which suggests that the slave is not running any tasks across any frameworks). Once the reservation is successful, we launch task1 from framework1. Then framework2 comes up and uses the reserved resources to launch task2. 
> Expose total resources of a slave in offer for scheduling decisions > --- > > Key: MESOS-4666 > URL: https://issues.apache.org/jira/browse/MESOS-4666 > Project: Mesos > Issue Type: Improvement > Components: general >Affects Versions: 0.25.0 >Reporter: Anindya Sinha >Assignee: Anindya Sinha >Priority: Minor > > To effectively schedule certain class of tasks, the scheduler might need to > know not only the available resources (as exposed currently) but also the > maximum resources available on that slave. This is specifically true for > clusters having different configurations of the slave nodes in terms of > resources such as cpu, memory, disk, etc. > Certain class of tasks might have a need to be scheduled on the same slave > (esp needing shared persistent volumes, MESOS-3421). Instead of dedicating a > slave to a framework, the framework can make a very good determination if it > had exposure to both available as well as total resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
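The aggregation described in this comment (summing each resource type across roles into Offer::total_resources) can be sketched with a toy example. The function and types below are illustrative only, not Mesos code:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Toy model (not Mesos code): resources keyed by (name, role), e.g.
// {"cpus", "role1"} -> 2. The proposed Offer::total_resources would be
// the per-name sum across all roles, exposing no reservation (role)
// information to the framework receiving the offer.
std::map<std::string, double> totalAcrossRoles(
    const std::map<std::pair<std::string, std::string>, double>& byRole)
{
  std::map<std::string, double> total;
  for (const auto& entry : byRole) {
    total[entry.first.first] += entry.second;  // Sum by resource name only.
  }
  return total;
}
```

With the second example from the comment (cpus(*): 6; cpus(role1): 2; mem(*): 8192; mem(role1): 8192; disk(*): 102400; disk(role1): 61440), this yields cpus: 8, mem: 16384, disk: 163840, matching the summation above.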
[jira] [Commented] (MESOS-4671) Status updates from executor can be forwarded out of order by the Agent.
[ https://issues.apache.org/jira/browse/MESOS-4671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145712#comment-15145712 ] Anand Mazumdar commented on MESOS-4671: --- cc: [~avin...@mesosphere.io], [~jieyu] > Status updates from executor can be forwarded out of order by the Agent. > > > Key: MESOS-4671 > URL: https://issues.apache.org/jira/browse/MESOS-4671 > Project: Mesos > Issue Type: Bug > Components: containerization, HTTP API >Affects Versions: 0.28.0 >Reporter: Anand Mazumdar > Labels: mesosphere > > Previously, all status update messages from the executor were forwarded by > the agent to the master in the order that they had been received. > However, that seems to be no longer valid due to a recently introduced change > in the agent: > {code} > // Before sending update, we need to retrieve the container status. > containerizer->status(executor->containerId) > .onAny(defer(self(), > ::_statusUpdate, > update, > pid, > executor->id, > lambda::_1)); > {code} > This can sometimes lead to status updates being sent out of order depending > on the order the {{Future}} is fulfilled from the call to {{status(...)}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4671) Status updates from executor can be forwarded out of order by the Agent.
[ https://issues.apache.org/jira/browse/MESOS-4671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-4671: -- Description: Previously, all status update messages from the executor were forwarded by the agent to the master in the order that they had been received. However, that seems to be no longer valid due to a recently introduced change in the agent: {code} // Before sending update, we need to retrieve the container status. containerizer->status(executor->containerId) .onAny(defer(self(), ::_statusUpdate, update, pid, executor->id, lambda::_1)); {code} This can sometimes lead to status updates being sent out of order depending on the order the {{Future}} is fulfilled from the call to {{status(...)}}. was: Previously, all status update message from the executor were forwarded by the agent to the master in the order that they had been received. However, that seems to be no longer valid due to a recently introduced change in the agent: {code} // Before sending update, we need to retrieve the container status. containerizer->status(executor->containerId) .onAny(defer(self(), ::_statusUpdate, update, pid, executor->id, lambda::_1)); {code} This can sometimes lead to status updates being sent out of order depending on the order the {{Future}} is fulfilled from the call to {{status(...)}}. > Status updates from executor can be forwarded out of order by the Agent. > > > Key: MESOS-4671 > URL: https://issues.apache.org/jira/browse/MESOS-4671 > Project: Mesos > Issue Type: Bug > Components: containerization, HTTP API >Affects Versions: 0.28.0 >Reporter: Anand Mazumdar > Labels: mesosphere > > Previously, all status update messages from the executor were forwarded by > the agent to the master in the order that they had been received. > However, that seems to be no longer valid due to a recently introduced change > in the agent: > {code} > // Before sending update, we need to retrieve the container status. 
> containerizer->status(executor->containerId) > .onAny(defer(self(), > ::_statusUpdate, > update, > pid, > executor->id, > lambda::_1)); > {code} > This can sometimes lead to status updates being sent out of order depending > on the order the {{Future}} is fulfilled from the call to {{status(...)}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4671) Status updates from executor can be forwarded out of order by the Agent.
Anand Mazumdar created MESOS-4671: - Summary: Status updates from executor can be forwarded out of order by the Agent. Key: MESOS-4671 URL: https://issues.apache.org/jira/browse/MESOS-4671 Project: Mesos Issue Type: Bug Components: containerization, HTTP API Affects Versions: 0.28.0 Reporter: Anand Mazumdar Previously, all status update messages from the executor were forwarded by the agent to the master in the order that they had been received. However, that seems to be no longer valid due to a recently introduced change in the agent:
{code}
// Before sending update, we need to retrieve the container status.
containerizer->status(executor->containerId)
  .onAny(defer(self(),
               ::_statusUpdate,
               update,
               pid,
               executor->id,
               lambda::_1));
{code}
This can sometimes lead to status updates being sent out of order depending on the order in which the {{Future}} is fulfilled from the call to {{status(...)}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
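The reordering described here can be reproduced outside Mesos with plain std::async. This is an illustrative sketch, not Mesos code; `runDemo`, `forward`, and the artificial delays are made up. Each update is forwarded only once its own asynchronous status lookup completes, so a slow lookup for the first update lets the second one overtake it:

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Illustrative sketch (not Mesos code): two updates are "sent" in order,
// but each is forwarded only after its own async status lookup finishes,
// so forwarding follows completion order rather than send order.
std::vector<std::string> runDemo()
{
  std::vector<std::string> forwarded;
  std::mutex mu;

  auto forward = [&](const std::string& update,
                     std::chrono::milliseconds lookupDelay) {
    // Simulate containerizer->status(...) completing after some delay.
    std::this_thread::sleep_for(lookupDelay);
    std::lock_guard<std::mutex> lock(mu);
    forwarded.push_back(update);
  };

  // TASK_RUNNING is sent first but its status lookup is slow;
  // TASK_FINISHED is sent second with a fast lookup.
  auto first = std::async(std::launch::async, forward,
                          std::string("TASK_RUNNING"),
                          std::chrono::milliseconds(200));
  auto second = std::async(std::launch::async, forward,
                           std::string("TASK_FINISHED"),
                           std::chrono::milliseconds(10));

  first.wait();
  second.wait();
  return forwarded;
}
```

In this toy setup TASK_FINISHED would typically be forwarded before TASK_RUNNING, which is exactly the inversion the bug report describes.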
[jira] [Commented] (MESOS-4666) Expose total resources of a slave in offer for scheduling decisions
[ https://issues.apache.org/jira/browse/MESOS-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145590#comment-15145590 ] Vinod Kone commented on MESOS-4666: --- {quote} So we want to send the total resources the slave started off with when it registered. I do not think it exposes what other frameworks have launched/reserved {quote} My concern is that if a slave has reservations (static/dynamic) for 'roleA', it doesn't make sense (without some sort of ACLs) to send that information to a framework that was registered with 'roleB'. IIUC, you have two slightly orthogonal requirements: 1) Two frameworks to launch tasks that are dependent on each other. 2) Needing exclusive access to a slave *dynamically* instead of static reservations. For 1) you can simply use dynamic reservations. Framework1 can make a reservation for the resources required for task1 and task2. It sounds like Framework1 already knows task2's resources out of band, which is a bit weird but not too bad. For 2) why do you want exclusive access? Is it because the current isolation (resource and security) in Mesos is not good/compliant enough, or something else? Anyway, the ability to somehow dynamically reserve a *whole* slave is an interesting use case. For that we might have to expose the total *amount* of resources, without exposing the reservation information. > Expose total resources of a slave in offer for scheduling decisions > --- > > Key: MESOS-4666 > URL: https://issues.apache.org/jira/browse/MESOS-4666 > Project: Mesos > Issue Type: Improvement > Components: general >Affects Versions: 0.25.0 >Reporter: Anindya Sinha >Assignee: Anindya Sinha >Priority: Minor > > To effectively schedule certain class of tasks, the scheduler might need to > know not only the available resources (as exposed currently) but also the > maximum resources available on that slave. 
This is specifically true for > clusters having different configurations of the slave nodes in terms of > resources such as cpu, memory, disk, etc. > Certain class of tasks might have a need to be scheduled on the same slave > (esp needing shared persistent volumes, MESOS-3421). Instead of dedicating a > slave to a framework, the framework can make a very good determination if it > had exposure to both available as well as total resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4666) Expose total resources of a slave in offer for scheduling decisions
[ https://issues.apache.org/jira/browse/MESOS-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145745#comment-15145745 ] Guangya Liu commented on MESOS-4666: Is this use case a kind of {{Exclusive Resources}}, with the {{Exclusive Resources}} at the host level? There is indeed a document talking about {{Exclusive Resources}}, but it focuses only on a special kind of resource and not a whole host. https://docs.google.com/document/d/1Aby-U3-MPKE51s4aYd41L4Co2S97eM6LPtyzjyR_ecI/edit# > Expose total resources of a slave in offer for scheduling decisions > --- > > Key: MESOS-4666 > URL: https://issues.apache.org/jira/browse/MESOS-4666 > Project: Mesos > Issue Type: Improvement > Components: general >Affects Versions: 0.25.0 >Reporter: Anindya Sinha >Assignee: Anindya Sinha >Priority: Minor > > To effectively schedule certain class of tasks, the scheduler might need to > know not only the available resources (as exposed currently) but also the > maximum resources available on that slave. This is specifically true for > clusters having different configurations of the slave nodes in terms of > resources such as cpu, memory, disk, etc. > Certain class of tasks might have a need to be scheduled on the same slave > (esp needing shared persistent volumes, MESOS-3421). Instead of dedicating a > slave to a framework, the framework can make a very good determination if it > had exposure to both available as well as total resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-4545) Propose design doc for reliable floating point behavior
[ https://issues.apache.org/jira/browse/MESOS-4545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neil Conway reassigned MESOS-4545: -- Assignee: Neil Conway > Propose design doc for reliable floating point behavior > --- > > Key: MESOS-4545 > URL: https://issues.apache.org/jira/browse/MESOS-4545 > Project: Mesos > Issue Type: Task > Components: master >Reporter: Neil Conway >Assignee: Neil Conway > Labels: mesosphere, resources > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4591) `/reserve` and `/create-volumes` endpoints allow operations for any role
[ https://issues.apache.org/jira/browse/MESOS-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145700#comment-15145700 ] Guangya Liu commented on MESOS-4591: [~neilc], I think that [~greggomann] gave some explanation of why he thinks that the {{create-volumes}} case is low priority: With regard to the /create-volumes endpoint, the difference there is that an operator can only create volumes using resources that have already been reserved for a particular role. You raise a good point, and perhaps we should restrict the creation of volumes to certain roles as well. However, that case seems less harmful to me since the operator can't create any persistent volume for any arbitrary role; they can only create volumes on disk resources that have already been reserved for a particular role. > `/reserve` and `/create-volumes` endpoints allow operations for any role > > > Key: MESOS-4591 > URL: https://issues.apache.org/jira/browse/MESOS-4591 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.27.0 >Reporter: Greg Mann > Labels: mesosphere, reservations > > When frameworks reserve resources, the validation of the operation ensures > that the {{role}} of the reservation matches the {{role}} of the framework. > For the case of the {{/reserve}} operator endpoint, however, the operator has > no role to validate, so this check isn't performed. > This means that if an ACL exists which authorizes a framework's principal to > reserve resources, that same principal can be used to reserve resources for > _any_ role through the operator endpoint. > We should restrict reservations made through the operator endpoint to > specified roles. 
A few possibilities: > * The {{object}} of the {{reserve_resources}} ACL could be changed from > {{resources}} to {{roles}} > * A second ACL could be added for authorization of {{reserve}} operations, > with an {{object}} of {{role}} > * Our conception of the {{resources}} object in the {{reserve_resources}} ACL > could be expanded to include role information, i.e., > {{disk(role1);mem(role1)}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
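The second possibility listed above (a dedicated ACL whose {{object}} is a role) could be sketched as a toy authorization check. The {{ReserveACL}} struct and {{authorized}} function below are hypothetical illustrations, not the actual Mesos authorizer:

```cpp
#include <cassert>
#include <set>
#include <string>

// Toy sketch of a reserve ACL whose object is a role; the ReserveACL
// struct and authorized() are hypothetical, not the Mesos authorizer.
struct ReserveACL
{
  std::set<std::string> principals;  // Who may reserve.
  std::set<std::string> roles;       // For which roles.
};

// A /reserve request is allowed only when the requesting principal is
// listed AND the target role of the reservation is listed.
bool authorized(const ReserveACL& acl,
                const std::string& principal,
                const std::string& role)
{
  return acl.principals.count(principal) > 0 && acl.roles.count(role) > 0;
}
```

With an ACL of this shape, a principal authorized to reserve for one role could no longer reserve resources for arbitrary roles through the operator endpoint.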
[jira] [Commented] (MESOS-3307) Configurable size of completed task / framework history
[ https://issues.apache.org/jira/browse/MESOS-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145572#comment-15145572 ] Alexander Rukletsov commented on MESOS-3307: We currently do not work on the event streaming, hence the JSON endpoint is the best you can get now. I think adding filters to the endpoint is a good idea. > Configurable size of completed task / framework history > --- > > Key: MESOS-3307 > URL: https://issues.apache.org/jira/browse/MESOS-3307 > Project: Mesos > Issue Type: Bug >Reporter: Ian Babrou >Assignee: Kevin Klues > Labels: mesosphere > Fix For: 0.27.0 > > > We try to make Mesos work with multiple frameworks and mesos-dns at the same > time. The goal is to have set of frameworks per team / project on a single > Mesos cluster. > At this point our mesos state.json is at 4mb and it takes a while to > assembly. 5 mesos-dns instances hit state.json every 5 seconds, effectively > pushing mesos-master CPU usage through the roof. It's at 100%+ all the time. 
> Here's the problem: > {noformat} > mesos λ curl -s http://mesos-master:5050/master/state.json | jq > .frameworks[].completed_tasks[].framework_id | sort | uniq -c | sort -n >1 "20150606-001827-252388362-5050-5982-0003" > 16 "20150606-001827-252388362-5050-5982-0005" > 18 "20150606-001827-252388362-5050-5982-0029" > 73 "20150606-001827-252388362-5050-5982-0007" > 141 "20150606-001827-252388362-5050-5982-0009" > 154 "20150820-154817-302720010-5050-15320-" > 289 "20150606-001827-252388362-5050-5982-0004" > 510 "20150606-001827-252388362-5050-5982-0012" > 666 "20150606-001827-252388362-5050-5982-0028" > 923 "20150116-002612-269165578-5050-32204-0003" > 1000 "20150606-001827-252388362-5050-5982-0001" > 1000 "20150606-001827-252388362-5050-5982-0006" > 1000 "20150606-001827-252388362-5050-5982-0010" > 1000 "20150606-001827-252388362-5050-5982-0011" > 1000 "20150606-001827-252388362-5050-5982-0027" > mesos λ fgrep 1000 -r src/master > src/master/constants.cpp:const size_t MAX_REMOVED_SLAVES = 10; > src/master/constants.cpp:const uint32_t MAX_COMPLETED_TASKS_PER_FRAMEWORK = > 1000; > {noformat} > Active tasks are just 6% of state.json response: > {noformat} > mesos λ cat ~/temp/mesos-state.json | jq -c . | wc >1 14796 4138942 > mesos λ cat ~/temp/mesos-state.json | jq .frameworks[].tasks | jq -c . | wc > 16 37 252774 > {noformat} > I see four options that can improve the situation: > 1. Add query string param to exclude completed tasks from state.json and use > it in mesos-dns and similar tools. There is no need for mesos-dns to know > about completed tasks, it's just extra load on master and mesos-dns. > 2. Make history size configurable. > 3. Make JSON serialization faster. With 1s of tasks even without history > it would take a lot of time to serialize tasks for mesos-dns. Doing it every > 60 seconds instead of every 5 seconds isn't really an option. > 4. Create event bus for mesos master. Marathon has it and it'd be nice to > have it in Mesos. 
This way mesos-dns could avoid polling master state and > switch to listening for events. > All can be done independently. > Note to mesosphere folks: please start distributing debug symbols with your > distribution. I was asking for it for a while and it is really helpful: > https://github.com/mesosphere/marathon/issues/1497#issuecomment-104182501 > Perf report for leading master: > !http://i.imgur.com/iz7C3o0.png! > I'm on 0.23.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4666) Expose total resources of a slave in offer for scheduling decisions
[ https://issues.apache.org/jira/browse/MESOS-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145769#comment-15145769 ] Klaus Ma commented on MESOS-4666: - Agree with [~vinodkone]; if there's a use case that requires reserving a whole slave, Mesos should support that directly instead of providing `total_resources` in the offer. > Expose total resources of a slave in offer for scheduling decisions > --- > > Key: MESOS-4666 > URL: https://issues.apache.org/jira/browse/MESOS-4666 > Project: Mesos > Issue Type: Improvement > Components: general >Affects Versions: 0.25.0 >Reporter: Anindya Sinha >Assignee: Anindya Sinha >Priority: Minor > > To effectively schedule certain class of tasks, the scheduler might need to > know not only the available resources (as exposed currently) but also the > maximum resources available on that slave. This is specifically true for > clusters having different configurations of the slave nodes in terms of > resources such as cpu, memory, disk, etc. > Certain class of tasks might have a need to be scheduled on the same slave > (esp needing shared persistent volumes, MESOS-3421). Instead of dedicating a > slave to a framework, the framework can make a very good determination if it > had exposure to both available as well as total resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4667) Expose persistent volume information in state.json
[ https://issues.apache.org/jira/browse/MESOS-4667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145771#comment-15145771 ] Klaus Ma commented on MESOS-4667: - Is there any security concern if we expose {{reserver_principal}}? > Expose persistent volume information in state.json > -- > > Key: MESOS-4667 > URL: https://issues.apache.org/jira/browse/MESOS-4667 > Project: Mesos > Issue Type: Bug > Components: master >Reporter: Neil Conway >Priority: Minor > Labels: endpoint, mesosphere > > The per-slave {{reserved_resources}} information returned by {{/state}} does > not seem to include information about persistent volumes. This makes it hard > for operators to use the {{/destroy-volumes}} endpoint. -- This message was sent by Atlassian JIRA (v6.3.4#6332)