[jira] [Commented] (MESOS-4279) Graceful restart of docker task

2016-02-12 Thread Martin Bydzovsky (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15144279#comment-15144279
 ] 

Martin Bydzovsky commented on MESOS-4279:
-

Hello [~qianzhang], did you have time to check on this Vagrantfile?

> Graceful restart of docker task
> ---
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0
>Reporter: Martin Bydzovsky
>Assignee: Qian Zhang
>
> I'm implementing graceful restarts of our mesos-marathon-docker setup and I 
> came across the following issue:
> (it was already discussed on 
> https://github.com/mesosphere/marathon/issues/2876 and the guys from mesosphere 
> got to the point that it's probably a docker containerizer problem...)
> To sum it up:
> When I deploy a simple python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and I run it through Marathon like
> {code:javascript}
> data = {
>   args: ["/tmp/script.py"],
>   instances: 1,
>   cpus: 0.1,
>   mem: 256,
>   id: "marathon-test-api"
> }
> {code}
> During the app restart I get the expected result - the task receives SIGTERM and 
> dies peacefully (within my script-specified 2-second period).
> But when I wrap this python script in a docker image:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run the corresponding application via Marathon:
> {code:javascript}
> data = {
>   args: ["./script.py"],
>   container: {
>     type: "DOCKER",
>     docker: {
>       image: "bydga/marathon-test-api"
>     },
>     forcePullImage: true
>   },
>   cpus: 0.1,
>   mem: 256,
>   instances: 1,
>   id: "marathon-test-api"
> }
> {code}
> During the restart (issued from Marathon), the task dies immediately without 
> having a chance to do any cleanup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4162) SlaveTest.MetricsSlaveLaunchErrors is slow

2016-02-12 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15144237#comment-15144237
 ] 

haosdent commented on MESOS-4162:
-

This one is slow because it triggers the RateLimiter of MetricsProcess:
{code}
  MetricsProcess()
: ProcessBase("metrics"),
  limiter(2, Seconds(1))
  {}
{code}
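
For context, here is a minimal standalone sketch (not the test code itself; the 
usage is illustrative) of why a {{RateLimiter(2, Seconds(1))}} makes the third 
metrics request within one second wait:
{code}
// Sketch only, not Mesos test code: with 2 permits per second, the third
// acquire() is not satisfied immediately, which is where the extra ~1s goes.
#include <process/future.hpp>
#include <process/limiter.hpp>

#include <stout/duration.hpp>
#include <stout/nothing.hpp>

using process::Future;
using process::RateLimiter;

int main()
{
  RateLimiter limiter(2, Seconds(1));

  Future<Nothing> first = limiter.acquire();   // satisfied immediately
  Future<Nothing> second = limiter.acquire();  // satisfied immediately
  Future<Nothing> third = limiter.acquire();   // pending until the limiter
                                               // hands out another permit

  third.await();  // this wait is what shows up in the test duration
  return 0;
}
{code}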

> SlaveTest.MetricsSlaveLaunchErrors is slow
> --
>
> Key: MESOS-4162
> URL: https://issues.apache.org/jira/browse/MESOS-4162
> Project: Mesos
>  Issue Type: Improvement
>  Components: technical debt, test
>Reporter: Alexander Rukletsov
>Assignee: haosdent
>Priority: Minor
>  Labels: mesosphere, newbie++, tech-debt
>
> The {{SlaveTest.MetricsSlaveLaunchErrors}} test takes around {{1s}} to finish 
> on my Mac OS 10.10.4:
> {code}
> SlaveTest.MetricsSlaveLaunchErrors (1009 ms)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4160) Log recover tests are slow

2016-02-12 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15144251#comment-15144251
 ] 

haosdent commented on MESOS-4160:
-

I think we could advance the clock by 1 second here to avoid this delay.
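
For reference, a minimal sketch (assumed shape only, not the actual patch) of the 
libprocess test-clock pattern this refers to, which advances time instead of 
sleeping through the real delay:
{code}
// Sketch only: pause the test clock, jump past the ~1s delay the recover path
// waits on, then let the resulting timers and dispatches drain.
#include <process/clock.hpp>

#include <stout/duration.hpp>

void skipRecoverDelay()
{
  process::Clock::pause();
  process::Clock::advance(Seconds(1));  // fire the pending timer immediately
  process::Clock::settle();             // wait for triggered events to finish
  process::Clock::resume();
}
{code}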

> Log recover tests are slow
> --
>
> Key: MESOS-4160
> URL: https://issues.apache.org/jira/browse/MESOS-4160
> Project: Mesos
>  Issue Type: Improvement
>  Components: technical debt, test
>Reporter: Alexander Rukletsov
>Assignee: Shuai Lin
>Priority: Minor
>  Labels: mesosphere, newbie++, tech-debt
>
> On Mac OS 10.10.4, some tests take longer than {{1s}} to finish:
> {code}
> RecoverTest.AutoInitialization (1003 ms)
> RecoverTest.AutoInitializationRetry (1000 ms)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4663) Speed up ExamplesTest.PersistentVolumeFramework

2016-02-12 Thread haosdent (JIRA)
haosdent created MESOS-4663:
---

 Summary: Speed up ExamplesTest.PersistentVolumeFramework
 Key: MESOS-4663
 URL: https://issues.apache.org/jira/browse/MESOS-4663
 Project: Mesos
  Issue Type: Improvement
Reporter: haosdent
Assignee: haosdent
Priority: Minor


The elapsed time of {{ExamplesTest.PersistentVolumeFramework}} is currently:
{code}
[   OK ] ExamplesTest.PersistentVolumeFramework (5860 ms)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3986) Tests for allocator recovery.

2016-02-12 Thread Joerg Schad (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15144504#comment-15144504
 ] 

Joerg Schad commented on MESOS-3986:


The test plan as discussed with AlexR:
- Test that, in the presence of quota, allocations are paused after failover 
until either enough agents reregister or the timeout expires.
- Test that allocations are issued again after enough agents have reregistered (and 
that the behavior is still correct after the additional timeout).
- Test that allocations are issued again after the timeout.



> Tests for allocator recovery.
> -
>
> Key: MESOS-3986
> URL: https://issues.apache.org/jira/browse/MESOS-3986
> Project: Mesos
>  Issue Type: Task
>  Components: allocation
>Reporter: Alexander Rukletsov
>Assignee: Joerg Schad
>  Labels: mesosphere
>
> The allocator recover() call was introduced for correct recovery in the presence 
> of quota. We should add tests verifying the correct behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4441) Allocate revocable resources beyond quota guarantee.

2016-02-12 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-4441:
---
Summary: Allocate revocable resources beyond quota guarantee.  (was: Do not 
allocate non-revocable resources beyond quota guarantee.)

> Allocate revocable resources beyond quota guarantee.
> 
>
> Key: MESOS-4441
> URL: https://issues.apache.org/jira/browse/MESOS-4441
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Alexander Rukletsov
>Assignee: Michael Park
>Priority: Blocker
>  Labels: mesosphere
>
> h4. Status Quo
> Currently, resources allocated to frameworks in a role with quota (aka 
> quota'ed role) beyond the quota guarantee are marked non-revocable. This limits 
> our flexibility to revoke them if we decide to do so in the future.
> h4. Proposal
> Once the quota guarantee is satisfied, we need not necessarily allocate further 
> resources as non-revocable. Instead, we can mark all offered resources beyond the 
> guarantee as revocable. When {{RevocableInfo}} evolves in the future, 
> frameworks will get additional information about the "revocability" of the 
> resource (i.e. allocation slack).
> h4. Caveats
> Though it seems like a simple change, it has several implications.
> h6. Fairness
> Currently the hierarchical allocator considers revocable resources as regular 
> resources when doing fairness calculations. This may prevent frameworks from 
> getting non-revocable resources as part of their role's quota guarantee if 
> they accept some revocable resources as well.
> Consider the following scenario. A single framework in a role with quota set 
> to {{10}} CPUs is allocated {{10}} CPUs as non-revocable resources as part of 
> its quota and additionally {{2}} revocable CPUs. Now a task using {{2}} 
> non-revocable CPUs finishes and its resources are returned. Total allocation 
> for the role is {{8}} non-revocable + {{2}} revocable. However, the role may 
> not be offered an additional {{2}} non-revocable CPUs since its total allocation 
> already satisfies the quota.
> h6. Resource math
> If we allocate non-revocable resources as revocable, we should make sure we 
> do accounting right: either we should update total agent resources and mark 
> them as revocable as well, or bookkeep resources as non-revocable and convert 
> them to revocable when necessary.
> h6. Coarse-grained nature of allocation
> The hierarchical allocator performs "coarse-grained" allocation, meaning it 
> always allocates the entire remaining agent resources to a single framework. 
> This may lead to over-allocating some resources as non-revocable beyond the 
> quota guarantee.
> h6. Quotas smaller than fair share
> If a quota set for a role is smaller than its fair share, it may reduce the 
> amount of resources offered to this role, if frameworks in it do not accept 
> revocable resources. This is probably the most important consequence of the 
> proposed change. Operators may set quota to get guarantees, but may observe a 
> decrease in the amount of resources a role gets, which is not intuitive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2742) Draft design doc on global resources

2016-02-12 Thread Joerg Schad (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joerg Schad updated MESOS-2742:
---
Summary: Draft design doc on global resources  (was: Architecture doc on 
global resources)

> Draft design doc on global resources
> 
>
> Key: MESOS-2742
> URL: https://issues.apache.org/jira/browse/MESOS-2742
> Project: Mesos
>  Issue Type: Task
>Reporter: Niklas Quarfot Nielsen
>Assignee: Joerg Schad
>  Labels: mesosphere
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3078) Recovered resources are not re-allocated until the next allocation delay.

2016-02-12 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-3078:
---
Shepherd: Benjamin Mahler
Assignee: (was: Klaus Ma)

> Recovered resources are not re-allocated until the next allocation delay.
> -
>
> Key: MESOS-3078
> URL: https://issues.apache.org/jira/browse/MESOS-3078
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Benjamin Mahler
>
> Currently, when resources are recovered, we do not perform an allocation for 
> that slave. Rather, we wait until the next allocation interval.
> For small-task, high-throughput frameworks, this can have a significant 
> impact on overall throughput; see the following thread:
> http://markmail.org/thread/y6mzfwzlurv6nik3
> We should consider immediately performing a re-allocation for the slave upon 
> resource recovery.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3986) Tests for allocator recovery.

2016-02-12 Thread Joerg Schad (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joerg Schad updated MESOS-3986:
---
Description: The allocator recover() call was introduced for correct 
recovery in presence of quota. We should add test verifying the correct 
behavior.  (was: The allocator recovery() call was introduced for correct 
recovery in presence of quota. We should add test verifying the correct 
behavior.)

> Tests for allocator recovery.
> -
>
> Key: MESOS-3986
> URL: https://issues.apache.org/jira/browse/MESOS-3986
> Project: Mesos
>  Issue Type: Task
>  Components: allocation
>Reporter: Alexander Rukletsov
>Assignee: Joerg Schad
>  Labels: mesosphere
>
> The allocator recover() call was introduced for correct recovery in the presence 
> of quota. We should add tests verifying the correct behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4160) Log recover tests are slow

2016-02-12 Thread Shuai Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15144389#comment-15144389
 ] 

Shuai Lin commented on MESOS-4160:
--

Yeah, your suggestion makes sense. I uploaded a patch based on it.

> Log recover tests are slow
> --
>
> Key: MESOS-4160
> URL: https://issues.apache.org/jira/browse/MESOS-4160
> Project: Mesos
>  Issue Type: Improvement
>  Components: technical debt, test
>Reporter: Alexander Rukletsov
>Assignee: Shuai Lin
>Priority: Minor
>  Labels: mesosphere, newbie++, tech-debt
>
> On Mac OS 10.10.4, some tests take longer than {{1s}} to finish:
> {code}
> RecoverTest.AutoInitialization (1003 ms)
> RecoverTest.AutoInitializationRetry (1000 ms)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1571) Signal escalation timeout is not configurable.

2016-02-12 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-1571:
---
Shepherd: Benjamin Mahler
Assignee: Alexander Rukletsov
  Sprint: Mesosphere Q4 Sprint 2 - 11/14, Mesosphere Q4 Sprint 3 - 
12/7, Mesosphere Sprint 29  (was: Mesosphere Q4 Sprint 2 - 11/14, Mesosphere Q4 
Sprint 3 - 12/7)
Story Points: 8  (was: 2)
 Summary: Signal escalation timeout is not configurable.  (was: Signal 
escalation timeout is not configurable)

> Signal escalation timeout is not configurable.
> --
>
> Key: MESOS-1571
> URL: https://issues.apache.org/jira/browse/MESOS-1571
> Project: Mesos
>  Issue Type: Bug
>Reporter: Niklas Quarfot Nielsen
>Assignee: Alexander Rukletsov
>  Labels: mesosphere
>
> Even though the executor shutdown grace period is set to a larger interval, 
> the signal escalation timeout will still be 3 seconds. It should either be 
> configurable or dependent on EXECUTOR_SHUTDOWN_GRACE_PERIOD.
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2742) Draft design doc on global resources.

2016-02-12 Thread Joerg Schad (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joerg Schad updated MESOS-2742:
---
Fix Version/s: (was: 0.28.0)

> Draft design doc on global resources.
> -
>
> Key: MESOS-2742
> URL: https://issues.apache.org/jira/browse/MESOS-2742
> Project: Mesos
>  Issue Type: Task
>Reporter: Niklas Quarfot Nielsen
>Assignee: Joerg Schad
>  Labels: mesosphere
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3986) Tests for allocator recovery.

2016-02-12 Thread Joerg Schad (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joerg Schad updated MESOS-3986:
---
Description: The allocator recovery() call was introduced for correct 
recovery in presence of quota. We should add test verifying the correct 
behavior.

> Tests for allocator recovery.
> -
>
> Key: MESOS-3986
> URL: https://issues.apache.org/jira/browse/MESOS-3986
> Project: Mesos
>  Issue Type: Task
>  Components: allocation
>Reporter: Alexander Rukletsov
>Assignee: Joerg Schad
>  Labels: mesosphere
>
> The allocator recovery() call was introduced for correct recovery in the presence 
> of quota. We should add tests verifying the correct behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4664) Add allocator metrics.

2016-02-12 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-4664:
--

 Summary: Add allocator metrics.
 Key: MESOS-4664
 URL: https://issues.apache.org/jira/browse/MESOS-4664
 Project: Mesos
  Issue Type: Improvement
  Components: allocation
Reporter: Benjamin Mahler
Priority: Critical


There are currently no metrics that provide visibility into the allocator, 
except for the event queue size. This makes monitoring and debugging allocation 
behavior in a multi-framework setup difficult.

Some thoughts for initial metrics to add:

* How many allocation runs have completed? (counter)
* Current allocation breakdown: allocated / available / total (gauges)
* Current maximum shares (gauges)
* How many active filters are there for the role / framework? (gauges)
* How many frameworks are suppressing offers? (gauges)
* How long does an allocation run take? (timers)
* Maintenance related metrics:
** How many maintenance events are active? (gauges)
** How many maintenance events are scheduled but not active (gauges)
* Quota related metrics:
** How much quota is set for each role? (gauges)
** How much quota is satisfied? How much unsatisfied? (gauges)
 
Some of these are already exposed from the master's metrics, but we should not 
assume this within the allocator.
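
As a starting point, here is a minimal sketch (the metric name and placement are 
illustrative only, not an agreed design) of exposing one of these as a libprocess 
metric, e.g. a counter of completed allocation runs:
{code}
// Sketch only: illustrative metric, not the final naming scheme.
#include <process/metrics/counter.hpp>
#include <process/metrics/metrics.hpp>

// Bumped at the end of every allocation run; visible via /metrics/snapshot.
process::metrics::Counter allocation_runs("allocator/allocation_runs");

void initializeMetrics()
{
  process::metrics::add(allocation_runs);
}

void onAllocationRunCompleted()
{
  ++allocation_runs;
}
{code}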



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-2971) Implement OverlayFS based provisioner backend

2016-02-12 Thread Shuai Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuai Lin reassigned MESOS-2971:


Assignee: Shuai Lin  (was: Mei Wan)

> Implement OverlayFS based provisioner backend
> -
>
> Key: MESOS-2971
> URL: https://issues.apache.org/jira/browse/MESOS-2971
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Timothy Chen
>Assignee: Shuai Lin
>  Labels: mesosphere, twitter, unified-containerizer-mvp
>
> Part of the image provisioning process is to call a backend to create a root 
> filesystem based on the image's on-disk layout.
> The problem with the copy backend is that it's a waste of both IO and space, 
> and the bind backend can only deal with one layer.
> The overlayfs backend allows us to utilize the filesystem to merge multiple 
> filesystems into one efficiently.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4665) Reverse DNS for cert validation ?

2016-02-12 Thread pawan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

pawan updated MESOS-4665:
-
Description: 
I have three mesos master nodes configured to use SSL with cert validation 
enabled. All the machines are failing cert validation, and hence the peering 
fails, with the following error:


I0212 14:02:22.019564 20544 network.hpp:463] ZooKeeper group PIDs: { 
log-replica(1)@192.168.1.16:5050, log-replica(1)@192.168.1.27:5050, 
log-replica(1)@192.168.1.30:5050 }
I0212 14:02:22.037328 20545 libevent_ssl_socket.cpp:973] Failed accept, 
verification error: Presented Certificate Name: mesos01.p.qa.a.com does not 
match peer hostname name: 192.168.1.16
I0212 14:02:22.041191 20545 libevent_ssl_socket.cpp:973] Failed accept, 
verification error: Presented Certificate Name: mesos02.p.qa.a.com does not 
match peer hostname name: 192.168.1.27
I0212 14:02:22.061522 20545 libevent_ssl_socket.cpp:973] Failed accept, 
verification error: Presented Certificate Name: mesos01.p.qa.a.com does not 
match peer hostname name: 192.168.1.16
I0212 14:02:22.065572 20545 libevent_ssl_socket.cpp:373] Failed connect, 
verification error: Presented Certificate Name: mesos01.p.qa.a.com does not 
match peer hostname name: 192.168.1.16
I0212 14:02:22.065839 20545 process.cpp:1281] Failed to link, connect: 
Presented Certificate Name: mesos01.p.qa.a.com does not match peer hostname 
name: 192.168.1.16
E0212 14:02:22.065994 20545 process.cpp:1911] Failed to shutdown socket with fd 
27: Transport endpoint is not connected
I0212 14:02:22.068665 20545 libevent_ssl_socket.cpp:373] Failed connect, 
verification error: Presented Certificate Name: mesos02.p.qa.a.com does not 
match peer hostname name: 192.168.1.27
I0212 14:02:22.068761 20545 process.cpp:1281] Failed to link, connect: 
Presented Certificate Name: mesos02.p.qa.a.com does not match peer hostname 
name: 192.168.1.27
E0212 14:02:22.068830 20545 process.cpp:1911] Failed to shutdown socket with fd 
28: Transport endpoint is not connected
--

From my understanding and looking at the source, during cert validation, mesos 
uses the getnameinfo call to get the hostname of the connecting peer from the IP 
address on the socket connection. Everything worked when I added host-ip 
mappings of all peers to /etc/hosts on each host.

Does mesos inherently expect reverse DNS (PTR records) to be provisioned? If 
so, this is a very challenging and unrealistic expectation. It is even worse if 
you are deploying mesos in a firewalled/NAT-ed environment.

Is my understanding right? Am I missing anything here? How would you 
recommend I proceed?

Also, I use --hostname to set the hostname of all mesos nodes and I see the 
right [ip, hostname] info in the ZooKeeper node. It looks like mesos is not 
using it during cert validation.

  was:
I have three mesos master nodes configured to use SSL and with cert validation 
enabled. All the machines are failing cert-validation and hence the peering 
with the following error:

I0212 14:02:22.019564 20544 network.hpp:463] ZooKeeper group PIDs: { 
log-replica(1)@192.168.1.16:5050, log-replica(1)@192.168.1.27:5050, 
log-replica(1)@192.168.1.30:5050 }
I0212 14:02:22.037328 20545 libevent_ssl_socket.cpp:973] Failed accept, 
verification error: Presented Certificate Name: mesos01.p.qa.a.com does not 
match peer hostname name: 192.168.1.16
I0212 14:02:22.041191 20545 libevent_ssl_socket.cpp:973] Failed accept, 
verification error: Presented Certificate Name: mesos02.p.qa.a.com does not 
match peer hostname name: 192.168.1.27
I0212 14:02:22.061522 20545 libevent_ssl_socket.cpp:973] Failed accept, 
verification error: Presented Certificate Name: mesos01.p.qa.a.com does not 
match peer hostname name: 192.168.1.16
I0212 14:02:22.065572 20545 libevent_ssl_socket.cpp:373] Failed connect, 
verification error: Presented Certificate Name: mesos01.p.qa.a.com does not 
match peer hostname name: 192.168.1.16
I0212 14:02:22.065839 20545 process.cpp:1281] Failed to link, connect: 
Presented Certificate Name: mesos01.p.qa.a.com does not match peer hostname 
name: 192.168.1.16
E0212 14:02:22.065994 20545 process.cpp:1911] Failed to shutdown socket with fd 
27: Transport endpoint is not connected
I0212 14:02:22.068665 20545 libevent_ssl_socket.cpp:373] Failed connect, 
verification error: Presented Certificate Name: mesos02.p.qa.a.com does not 
match peer hostname name: 192.168.1.27
I0212 14:02:22.068761 20545 process.cpp:1281] Failed to link, connect: 
Presented Certificate Name: mesos02.p.qa.a.com does not match peer hostname 
name: 192.168.1.27
E0212 14:02:22.068830 20545 process.cpp:1911] Failed to shutdown socket with fd 
28: Transport endpoint is not connected

From my understanding and looking at the source, during cert validation, mesos 
uses getnameinfo call to get the hostname of the connecting peer using the IP 
address on the socket 

[jira] [Updated] (MESOS-4664) Add allocator metrics.

2016-02-12 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-4664:
---
Description: 
There are currently no metrics that provide visibility into the allocator, 
except for the event queue size. This makes monitoring and debugging allocation 
behavior in a multi-framework setup difficult.

Some thoughts for initial metrics to add:

* How many allocation runs have completed? (counter)
* How many allocations each framework got? (counter)
* Current allocation breakdown: allocated / available / total (gauges)
* Current maximum shares (gauges)
* How many active filters are there for the role / framework? (gauges)
* How many frameworks are suppressing offers? (gauges)
* How long does an allocation run take? (timers)
* Maintenance related metrics:
** How many maintenance events are active? (gauges)
** How many maintenance events are scheduled but not active (gauges)
* Quota related metrics:
** How much quota is set for each role? (gauges)
** How much quota is satisfied? How much unsatisfied? (gauges)
 
Some of these are already exposed from the master's metrics, but we should not 
assume this within the allocator.

  was:
There are currently no metrics that provide visibility into the allocator, 
except for the event queue size. This makes monitoring and debugging allocation 
behavior in a multi-framework setup difficult.

Some thoughts for initial metrics to add:

* How many allocation runs have completed? (counter)
* Current allocation breakdown: allocated / available / total (gauges)
* Current maximum shares (gauges)
* How many active filters are there for the role / framework? (gauges)
* How many frameworks are suppressing offers? (gauges)
* How long does an allocation run take? (timers)
* Maintenance related metrics:
** How many maintenance events are active? (gauges)
** How many maintenance events are scheduled but not active (gauges)
* Quota related metrics:
** How much quota is set for each role? (gauges)
** How much quota is satisfied? How much unsatisfied? (gauges)
 
Some of these are already exposed from the master's metrics, but we should not 
assume this within the allocator.


> Add allocator metrics.
> --
>
> Key: MESOS-4664
> URL: https://issues.apache.org/jira/browse/MESOS-4664
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Benjamin Mahler
>Assignee: Benjamin Bannier
>Priority: Critical
>
> There are currently no metrics that provide visibility into the allocator, 
> except for the event queue size. This makes monitoring and debugging 
> allocation behavior in a multi-framework setup difficult.
> Some thoughts for initial metrics to add:
> * How many allocation runs have completed? (counter)
> * How many allocations each framework got? (counter)
> * Current allocation breakdown: allocated / available / total (gauges)
> * Current maximum shares (gauges)
> * How many active filters are there for the role / framework? (gauges)
> * How many frameworks are suppressing offers? (gauges)
> * How long does an allocation run take? (timers)
> * Maintenance related metrics:
> ** How many maintenance events are active? (gauges)
> ** How many maintenance events are scheduled but not active (gauges)
> * Quota related metrics:
> ** How much quota is set for each role? (gauges)
> ** How much quota is satisfied? How much unsatisfied? (gauges)
>  
> Some of these are already exposed from the master's metrics, but we should 
> not assume this within the allocator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-4664) Add allocator metrics.

2016-02-12 Thread Klaus Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Klaus Ma reassigned MESOS-4664:
---

Assignee: Klaus Ma

> Add allocator metrics.
> --
>
> Key: MESOS-4664
> URL: https://issues.apache.org/jira/browse/MESOS-4664
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Benjamin Mahler
>Assignee: Klaus Ma
>Priority: Critical
>
> There are currently no metrics that provide visibility into the allocator, 
> except for the event queue size. This makes monitoring and debugging 
> allocation behavior in a multi-framework setup difficult.
> Some thoughts for initial metrics to add:
> * How many allocation runs have completed? (counter)
> * Current allocation breakdown: allocated / available / total (gauges)
> * Current maximum shares (gauges)
> * How many active filters are there for the role / framework? (gauges)
> * How many frameworks are suppressing offers? (gauges)
> * How long does an allocation run take? (timers)
> * Maintenance related metrics:
> ** How many maintenance events are active? (gauges)
> ** How many maintenance events are scheduled but not active (gauges)
> * Quota related metrics:
> ** How much quota is set for each role? (gauges)
> ** How much quota is satisfied? How much unsatisfied? (gauges)
>  
> Some of these are already exposed from the master's metrics, but we should 
> not assume this within the allocator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4665) Reverse DNS for cert validation ?

2016-02-12 Thread pawan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

pawan updated MESOS-4665:
-
Description: 
I have three mesos master nodes configured to use SSL with cert validation 
enabled. All the machines are failing cert validation, and hence the peering 
fails, with the following error:


I0212 14:02:22.019564 20544 network.hpp:463] ZooKeeper group PIDs: { 
log-replica(1)@192.168.1.16:5050, log-replica(1)@192.168.1.27:5050, 
log-replica(1)@192.168.1.30:5050 }
I0212 14:02:22.037328 20545 libevent_ssl_socket.cpp:973] Failed accept, 
verification error: Presented Certificate Name: mesos01.p.qa.a.com does not 
match peer hostname name: 192.168.1.16
I0212 14:02:22.041191 20545 libevent_ssl_socket.cpp:973] Failed accept, 
verification error: Presented Certificate Name: mesos02.p.qa.a.com does not 
match peer hostname name: 192.168.1.27
I0212 14:02:22.061522 20545 libevent_ssl_socket.cpp:973] Failed accept, 
verification error: Presented Certificate Name: mesos01.p.qa.a.com does not 
match peer hostname name: 192.168.1.16
I0212 14:02:22.065572 20545 libevent_ssl_socket.cpp:373] Failed connect, 
verification error: Presented Certificate Name: mesos01.p.qa.a.com does not 
match peer hostname name: 192.168.1.16
I0212 14:02:22.065839 20545 process.cpp:1281] Failed to link, connect: 
Presented Certificate Name: mesos01.p.qa.a.com does not match peer hostname 
name: 192.168.1.16
E0212 14:02:22.065994 20545 process.cpp:1911] Failed to shutdown socket with fd 
27: Transport endpoint is not connected
I0212 14:02:22.068665 20545 libevent_ssl_socket.cpp:373] Failed connect, 
verification error: Presented Certificate Name: mesos02.p.qa.a.com does not 
match peer hostname name: 192.168.1.27
I0212 14:02:22.068761 20545 process.cpp:1281] Failed to link, connect: 
Presented Certificate Name: mesos02.p.qa.a.com does not match peer hostname 
name: 192.168.1.27
E0212 14:02:22.068830 20545 process.cpp:1911] Failed to shutdown socket with fd 
28: Transport endpoint is not connected
--

From my understanding and looking at the source, during cert validation, mesos 
uses the getnameinfo call to get the hostname of the connecting peer from the IP 
address on the socket connection. This call returns the IP as a string, which 
results in failures because our cert has a CN of only the peer hostname. But 
everything worked when I added host-ip mappings of all peers to /etc/hosts on 
each host.

Does mesos inherently expect reverse DNS (PTR records) to be provisioned? If 
so, this is a very challenging and unrealistic expectation. It is even worse if 
you are deploying mesos in a firewalled/NAT-ed environment.

Is my understanding right? Am I missing anything here? How would you 
recommend I proceed?

Also, I use --hostname to set the hostname of all mesos nodes and I see the 
right [ip, hostname] info in the ZooKeeper node. It looks like mesos is not 
using it during cert validation.

  was:
I have three mesos master nodes configured to use SSL and with cert validation 
enabled. All the machines are failing cert-validation and hence the peering 
with the following error:


I0212 14:02:22.019564 20544 network.hpp:463] ZooKeeper group PIDs: { 
log-replica(1)@192.168.1.16:5050, log-replica(1)@192.168.1.27:5050, 
log-replica(1)@192.168.1.30:5050 }
I0212 14:02:22.037328 20545 libevent_ssl_socket.cpp:973] Failed accept, 
verification error: Presented Certificate Name: mesos01.p.qa.a.com does not 
match peer hostname name: 192.168.1.16
I0212 14:02:22.041191 20545 libevent_ssl_socket.cpp:973] Failed accept, 
verification error: Presented Certificate Name: mesos02.p.qa.a.com does not 
match peer hostname name: 192.168.1.27
I0212 14:02:22.061522 20545 libevent_ssl_socket.cpp:973] Failed accept, 
verification error: Presented Certificate Name: mesos01.p.qa.a.com does not 
match peer hostname name: 192.168.1.16
I0212 14:02:22.065572 20545 libevent_ssl_socket.cpp:373] Failed connect, 
verification error: Presented Certificate Name: mesos01.p.qa.a.com does not 
match peer hostname name: 192.168.1.16
I0212 14:02:22.065839 20545 process.cpp:1281] Failed to link, connect: 
Presented Certificate Name: mesos01.p.qa.a.com does not match peer hostname 
name: 192.168.1.16
E0212 14:02:22.065994 20545 process.cpp:1911] Failed to shutdown socket with fd 
27: Transport endpoint is not connected
I0212 14:02:22.068665 20545 libevent_ssl_socket.cpp:373] Failed connect, 
verification error: Presented Certificate Name: mesos02.p.qa.a.com does not 
match peer hostname name: 192.168.1.27
I0212 14:02:22.068761 20545 process.cpp:1281] Failed to link, connect: 
Presented Certificate Name: mesos02.p.qa.a.com does not match peer hostname 
name: 192.168.1.27
E0212 14:02:22.068830 20545 process.cpp:1911] Failed to shutdown socket with fd 
28: Transport endpoint is not connected

[jira] [Updated] (MESOS-4658) process::Connection can lead to deadlock around execution in the same context.

2016-02-12 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-4658:
--
Description: 
The {{Connection}} abstraction is prone to deadlocks arising from the object 
being destroyed inside the same execution context.

Consider this example:

{code}
Option<Connection> connection = process::http::connect(...).get();
connection.disconnected()
  .onAny(defer(self(), , connection));

connection.disconnect();
connection = None();
{code}

In the above snippet, if the {{connection = None()}} gets executed before 
the actual dispatch to {{ConnectionProcess}} happens, you might lose the only 
existing reference to the {{Connection}} object inside 
{{ConnectionProcess::disconnect}}. This would lead to the destruction of the 
{{Connection}} object in the {{ConnectionProcess}} execution context.

We do have a snippet in our existing code that alludes to such occurrences 
happening: 
https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/http.cpp#L1325

{code}
  // This is a one time request which will close the connection when
  // the response is received. Since 'Connection' is reference-counted,
  // we must keep a copy around until the disconnection occurs. Note
  // that in order to avoid a deadlock (Connection destruction occurring
  // from the ConnectionProcess execution context), we use 'async'.
{code}

AFAICT, for scenarios where we need to hold on to the {{Connection}} object for 
later, this approach does not suffice.


  was:
The {{Connection}} abstraction is prone to deadlocks arising from the object 
being destroyed inside the same execution context.

Consider this example:

{code}
Option<Connection> connection = process::http::connect(...);
connection.disconnected()
  .onAny(defer(self(), , connection));

connection.disconnect();
connection = None();
{code}

In the above snippet, if the {{connection = None()}} gets executed first before 
the actual dispatch to {{ConnectionProcess}} happens. You might loose the only 
existing reference to {{Connection}} object inside 
{{ConnectionProcess::disconnect}}. This would lead to the destruction of the 
{{Connection}} object in the {{ConnectionProcess}} execution context.

We do have a snippet in our existing code that alludes to such occurrences 
happening: 
https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/http.cpp#L1325

{code}
  // This is a one time request which will close the connection when
  // the response is received. Since 'Connection' is reference-counted,
  // we must keep a copy around until the disconnection occurs. Note
  // that in order to avoid a deadlock (Connection destruction occurring
  // from the ConnectionProcess execution context), we use 'async'.
{code}

AFAICT, for scenarios where we need to hold on to the {{Connection}} object for 
later, this approach does not suffice.



> process::Connection can lead to deadlock around execution in the same context.
> --
>
> Key: MESOS-4658
> URL: https://issues.apache.org/jira/browse/MESOS-4658
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API, libprocess
>Reporter: Anand Mazumdar
>Assignee: Shuai Lin
>  Labels: mesosphere
>
> The {{Connection}} abstraction is prone to deadlocks arising from the object 
> being destroyed inside the same execution context.
> Consider this example:
> {code}
> Option<Connection> connection = process::http::connect(...).get();
> connection.disconnected()
>   .onAny(defer(self(), , connection));
> connection.disconnect();
> connection = None();
> {code}
> In the above snippet, if the {{connection = None()}} gets executed 
> before the actual dispatch to {{ConnectionProcess}} happens, you might lose 
> the only existing reference to the {{Connection}} object inside 
> {{ConnectionProcess::disconnect}}. This would lead to the destruction of the 
> {{Connection}} object in the {{ConnectionProcess}} execution context.
> We do have a snippet in our existing code that alludes to such occurrences 
> happening: 
> https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/http.cpp#L1325
> {code}
>   // This is a one time request which will close the connection when
>   // the response is received. Since 'Connection' is reference-counted,
>   // we must keep a copy around until the disconnection occurs. Note
>   // that in order to avoid a deadlock (Connection destruction occurring
>   // from the ConnectionProcess execution context), we use 'async'.
> {code}
> AFAICT, for scenarios where we need to hold on to the {{Connection}} object 
> for later, this approach does not suffice.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4664) Add allocator metrics.

2016-02-12 Thread Klaus Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Klaus Ma updated MESOS-4664:

Assignee: (was: Klaus Ma)

> Add allocator metrics.
> --
>
> Key: MESOS-4664
> URL: https://issues.apache.org/jira/browse/MESOS-4664
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Benjamin Mahler
>Priority: Critical
>
> There are currently no metrics that provide visibility into the allocator, 
> except for the event queue size. This makes monitoring and debugging 
> allocation behavior in a multi-framework setup difficult.
> Some thoughts for initial metrics to add:
> * How many allocation runs have completed? (counter)
> * Current allocation breakdown: allocated / available / total (gauges)
> * Current maximum shares (gauges)
> * How many active filters are there for the role / framework? (gauges)
> * How many frameworks are suppressing offers? (gauges)
> * How long does an allocation run take? (timers)
> * Maintenance related metrics:
> ** How many maintenance events are active? (gauges)
> ** How many maintenance events are scheduled but not active (gauges)
> * Quota related metrics:
> ** How much quota is set for each role? (gauges)
> ** How much quota is satisfied? How much unsatisfied? (gauges)
>  
> Some of these are already exposed from the master's metrics, but we should 
> not assume this within the allocator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2930) Allow the Resource Estimator to express over-allocation of revocable resources.

2016-02-12 Thread Klaus Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Klaus Ma updated MESOS-2930:

Assignee: (was: Klaus Ma)

> Allow the Resource Estimator to express over-allocation of revocable 
> resources.
> ---
>
> Key: MESOS-2930
> URL: https://issues.apache.org/jira/browse/MESOS-2930
> Project: Mesos
>  Issue Type: Improvement
>  Components: slave
>Reporter: Benjamin Mahler
>
> Currently the resource estimator returns the amount of oversubscription 
> resources that are available. Since resources cannot be negative, this allows 
> the resource estimator to express the following:
> (1) Return empty resources: We are fully allocated for oversubscription 
> resources.
> (2) Return non-empty resources: We are under-allocated for oversubscription 
> resources. In other words, some are available.
> However, there is an additional situation that we cannot express:
> (3) Analogous to returning non-empty "negative" resources: We are 
> over-allocated for oversubscription resources. Do not re-offer any of the 
> over-allocated oversubscription resources that are recovered.
> Without (3), the slave can only shrink the total pool of oversubscription 
> resources by returning (1) as resources are recovered, until the pool is 
> shrunk to the desired size. However, this approach is only best-effort; it's 
> possible for a framework to launch more tasks in the window of time (15 
> seconds by default) between the slave's polls of the estimator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1961) Ensure executor state is correctly reconciled between master and slave.

2016-02-12 Thread Klaus Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Klaus Ma updated MESOS-1961:

Assignee: (was: Klaus Ma)

> Ensure executor state is correctly reconciled between master and slave.
> ---
>
> Key: MESOS-1961
> URL: https://issues.apache.org/jira/browse/MESOS-1961
> Project: Mesos
>  Issue Type: Epic
>  Components: master, slave
>Reporter: Benjamin Mahler
>
> The master and slave should correctly reconcile the state of executors, much 
> like the master and slave now correctly reconcile task state (MESOS-1407).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4588) Set title for documentation webpages.

2016-02-12 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-4588:
---
Summary: Set title for documentation webpages.  (was: Set title for 
documentation webpages)

> Set title for documentation webpages.
> -
>
> Key: MESOS-4588
> URL: https://issues.apache.org/jira/browse/MESOS-4588
> Project: Mesos
>  Issue Type: Improvement
>  Components: documentation, project website
>Reporter: Neil Conway
>Assignee: Abhishek Dasgupta
>  Labels: documentation, mesosphere, website
>
> The HTML we generate for the documentation pages (e.g., 
> https://mesos.apache.org/documentation/latest/authorization/) has an empty 
> {{<title>}} tag. This seems bad: we probably lose search engine karma, it is 
> hard to identify a documentation page in your browser tabs, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4665) Reverse DNS for cert validation ?

2016-02-12 Thread pawan (JIRA)
pawan created MESOS-4665:


 Summary: Reverse DNS for cert validation ?
 Key: MESOS-4665
 URL: https://issues.apache.org/jira/browse/MESOS-4665
 Project: Mesos
  Issue Type: Bug
Affects Versions: 0.26.0
Reporter: pawan


I have three mesos master nodes configured to use SSL with cert validation 
enabled. All the machines are failing cert validation, and hence the peering 
fails, with the following error:

I0212 14:02:22.019564 20544 network.hpp:463] ZooKeeper group PIDs: { 
log-replica(1)@192.168.1.16:5050, log-replica(1)@192.168.1.27:5050, 
log-replica(1)@192.168.1.30:5050 }
I0212 14:02:22.037328 20545 libevent_ssl_socket.cpp:973] Failed accept, 
verification error: Presented Certificate Name: mesos01.p.qa.a.com does not 
match peer hostname name: 192.168.1.16
I0212 14:02:22.041191 20545 libevent_ssl_socket.cpp:973] Failed accept, 
verification error: Presented Certificate Name: mesos02.p.qa.a.com does not 
match peer hostname name: 192.168.1.27
I0212 14:02:22.061522 20545 libevent_ssl_socket.cpp:973] Failed accept, 
verification error: Presented Certificate Name: mesos01.p.qa.a.com does not 
match peer hostname name: 192.168.1.16
I0212 14:02:22.065572 20545 libevent_ssl_socket.cpp:373] Failed connect, 
verification error: Presented Certificate Name: mesos01.p.qa.a.com does not 
match peer hostname name: 192.168.1.16
I0212 14:02:22.065839 20545 process.cpp:1281] Failed to link, connect: 
Presented Certificate Name: mesos01.p.qa.a.com does not match peer hostname 
name: 192.168.1.16
E0212 14:02:22.065994 20545 process.cpp:1911] Failed to shutdown socket with fd 
27: Transport endpoint is not connected
I0212 14:02:22.068665 20545 libevent_ssl_socket.cpp:373] Failed connect, 
verification error: Presented Certificate Name: mesos02.p.qa.a.com does not 
match peer hostname name: 192.168.1.27
I0212 14:02:22.068761 20545 process.cpp:1281] Failed to link, connect: 
Presented Certificate Name: mesos02.p.qa.a.com does not match peer hostname 
name: 192.168.1.27
E0212 14:02:22.068830 20545 process.cpp:1911] Failed to shutdown socket with fd 
28: Transport endpoint is not connected

From my understanding and looking at the source, during cert validation, mesos 
uses the getnameinfo call to get the hostname of the connecting peer from the IP 
address on the socket connection. Everything worked when I added host-ip 
mappings of all peers to /etc/hosts on each host.

Does mesos inherently expect reverse DNS (PTR records) to be provisioned? If 
so, this is a very challenging and unrealistic expectation. It is even worse if 
you are deploying mesos in a firewalled/NAT-ed environment.

Is my understanding right? Am I missing anything here? How would you 
recommend I proceed?

Also, I use --hostname to set the hostname of all mesos nodes and I see the 
right [ip, hostname] info in the ZooKeeper node. It looks like mesos is not 
using it during cert validation.
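
For anyone who wants to check the reverse lookup described above in isolation, 
here is a small standalone sketch (not Mesos code; the IP is taken from the logs 
and NI_NAMEREQD is used only to make the missing-PTR failure explicit):
{code}
// Standalone sketch: getnameinfo() maps the peer's IP back to a hostname,
// which requires a PTR record or an /etc/hosts entry to succeed.
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>

#include <cstdio>

int main()
{
  sockaddr_in peer = {};
  peer.sin_family = AF_INET;
  inet_pton(AF_INET, "192.168.1.16", &peer.sin_addr);  // peer IP from the logs

  char host[NI_MAXHOST];
  int result = getnameinfo(
      reinterpret_cast<sockaddr*>(&peer), sizeof(peer),
      host, sizeof(host), nullptr, 0,
      NI_NAMEREQD);  // fail instead of echoing the IP if no PTR record exists

  if (result == 0) {
    printf("peer resolves to: %s\n", host);  // e.g. mesos01.p.qa.a.com
  } else {
    printf("reverse lookup failed: %s\n", gai_strerror(result));
  }

  return 0;
}
{code}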



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-4664) Add allocator metrics.

2016-02-12 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier reassigned MESOS-4664:
---

Assignee: Benjamin Bannier

> Add allocator metrics.
> --
>
> Key: MESOS-4664
> URL: https://issues.apache.org/jira/browse/MESOS-4664
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Benjamin Mahler
>Assignee: Benjamin Bannier
>Priority: Critical
>
> There are currently no metrics that provide visibility into the allocator, 
> except for the event queue size. This makes monitoring and debugging 
> allocation behavior in a multi-framework setup difficult.
> Some thoughts for initial metrics to add:
> * How many allocation runs have completed? (counter)
> * Current allocation breakdown: allocated / available / total (gauges)
> * Current maximum shares (gauges)
> * How many active filters are there for the role / framework? (gauges)
> * How many frameworks are suppressing offers? (gauges)
> * How long does an allocation run take? (timers)
> * Maintenance related metrics:
> ** How many maintenance events are active? (gauges)
> ** How many maintenance events are scheduled but not active (gauges)
> * Quota related metrics:
> ** How much quota is set for each role? (gauges)
> ** How much quota is satisfied? How much unsatisfied? (gauges)
>  
> Some of these are already exposed from the master's metrics, but we should 
> not assume this within the allocator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4664) Add allocator metrics.

2016-02-12 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-4664:

Sprint: Mesosphere Sprint 29

> Add allocator metrics.
> --
>
> Key: MESOS-4664
> URL: https://issues.apache.org/jira/browse/MESOS-4664
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Benjamin Mahler
>Assignee: Benjamin Bannier
>Priority: Critical
>
> There are currently no metrics that provide visibility into the allocator, 
> except for the event queue size. This makes monitoring and debugging 
> allocation behavior in a multi-framework setup difficult.
> Some thoughts for initial metrics to add:
> * How many allocation runs have completed? (counter)
> * How many allocations each framework got? (counter)
> * Current allocation breakdown: allocated / available / total (gauges)
> * Current maximum shares (gauges)
> * How many active filters are there for the role / framework? (gauges)
> * How many frameworks are suppressing offers? (gauges)
> * How long does an allocation run take? (timers)
> * Maintenance related metrics:
> ** How many maintenance events are active? (gauges)
> ** How many maintenance events are scheduled but not active (gauges)
> * Quota related metrics:
> ** How much quota is set for each role? (gauges)
> ** How much quota is satisfied? How much unsatisfied? (gauges)
>  
> Some of these are already exposed from the master's metrics, but we should 
> not assume this within the allocator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4666) Expose total resources of a slave in offer for scheduling decisions

2016-02-12 Thread Anindya Sinha (JIRA)
Anindya Sinha created MESOS-4666:


 Summary: Expose total resources of a slave in offer for scheduling 
decisions
 Key: MESOS-4666
 URL: https://issues.apache.org/jira/browse/MESOS-4666
 Project: Mesos
  Issue Type: Improvement
  Components: general
Affects Versions: 0.25.0
Reporter: Anindya Sinha
Assignee: Anindya Sinha
Priority: Minor


To effectively schedule certain classes of tasks, the scheduler might need to 
know not only the available resources (as exposed currently) but also the 
maximum resources available on that slave. This is especially true for 
clusters whose slave nodes have different configurations in terms of 
resources such as cpu, memory, disk, etc.
Certain classes of tasks might need to be scheduled on the same slave (especially 
those needing shared persistent volumes, MESOS-3421). Instead of dedicating a 
slave to a framework, the framework could make a much better determination if it 
had exposure to both the available and the total resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4666) Expose total resources of a slave in offer for scheduling decisions

2016-02-12 Thread Anindya Sinha (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15144942#comment-15144942
 ] 

Anindya Sinha commented on MESOS-4666:
--

We can expose `repeated Resource total_resources` in the protobuf `Offer` to 
notify frameworks of the maximum resources. This would be a summation of all 
resources based on cpu, mem, disk, etc. across all roles (+ unreserved), and 
would be the value that mesos-slave was started with (either via the --resources 
flag, or derived from system resources).

To preserve the existing behavior, total_resources can be an opt-in via a 
new `FrameworkInfo::Capability` set at framework registration, or we could 
make it a slave flag as well (defaulting to the current behavior of not exposing 
total_resources in Offer).
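
A hypothetical sketch of how a scheduler might use such a field; note that 
{{total_resources}} does not exist in the {{Offer}} protobuf today, it is exactly 
the addition proposed above:
{code}
// Hypothetical only: `total_resources()` is the accessor the proposed
// `repeated Resource total_resources` field would generate; it is not part of
// the current Offer protobuf.
#include <mesos/mesos.hpp>
#include <mesos/resources.hpp>

bool slaveCanEverFit(const mesos::Offer& offer, const mesos::Resources& needed)
{
  // Decide placement based on the slave's configured total rather than on
  // what happens to be unallocated right now.
  mesos::Resources total = offer.total_resources();  // proposed field
  return total.contains(needed);
}
{code}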

> Expose total resources of a slave in offer for scheduling decisions
> ---
>
> Key: MESOS-4666
> URL: https://issues.apache.org/jira/browse/MESOS-4666
> Project: Mesos
>  Issue Type: Improvement
>  Components: general
>Affects Versions: 0.25.0
>Reporter: Anindya Sinha
>Assignee: Anindya Sinha
>Priority: Minor
>
> To effectively schedule certain classes of tasks, the scheduler might need to 
> know not only the available resources (as exposed currently) but also the 
> maximum resources available on that slave. This is especially true for 
> clusters whose slave nodes have different configurations in terms of 
> resources such as cpu, memory, disk, etc.
> Certain classes of tasks might need to be scheduled on the same slave 
> (especially those needing shared persistent volumes, MESOS-3421). Instead of 
> dedicating a slave to a framework, the framework could make a much better 
> determination if it had exposure to both the available and the total resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3915) Upgrade vendored Boost

2016-02-12 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-3915:
---
Summary: Upgrade vendored Boost  (was: Upgrade vendored Boost to 1.59)

> Upgrade vendored Boost
> --
>
> Key: MESOS-3915
> URL: https://issues.apache.org/jira/browse/MESOS-3915
> Project: Mesos
>  Issue Type: Bug
>Reporter: Neil Conway
>Priority: Minor
>  Labels: boost, mesosphere, tech-debt
>
> We should upgrade the vendored version of Boost to a newer version. Benefits:
> * -Should properly fix MESOS-688-
> * -Should fix MESOS-3799-
> * Generally speaking, using a more modern version of Boost means we can take 
> advantage of bug fixes, optimizations, and new features.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4095) Flaky test: MasterAllocatorTest/1.FrameworkExited

2016-02-12 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-4095:
---
Labels: flaky flaky-test mesosphere  (was: flaky-test mesosphere)

> Flaky test: MasterAllocatorTest/1.FrameworkExited
> -
>
> Key: MESOS-4095
> URL: https://issues.apache.org/jira/browse/MESOS-4095
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation, master, test
>Reporter: Neil Conway
>  Labels: flaky, flaky-test, mesosphere
> Attachments: wily64_master_allocator_framework_exited-1.log
>
>
> This test fails roughly 10% of the time for me on Ubuntu 15.10 (running in a 
> crappy VirtualBox VM). It passes consistently on Mac OS X 10.10.
> Verbose test log attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4591) `/reserve` endpoint allows reservations for any role

2016-02-12 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145035#comment-15145035
 ] 

Neil Conway commented on MESOS-4591:


Note that the same issue applies to the {{/create-volumes}} endpoint as well.

> `/reserve` endpoint allows reservations for any role
> 
>
> Key: MESOS-4591
> URL: https://issues.apache.org/jira/browse/MESOS-4591
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.27.0
>Reporter: Greg Mann
>  Labels: mesosphere, reservations
>
> When frameworks reserve resources, the validation of the operation ensures 
> that the {{role}} of the reservation matches the {{role}} of the framework. 
> For the case of the {{/reserve}} operator endpoint, however, the operator has 
> no role to validate, so this check isn't performed.
> This means that if an ACL exists which authorizes a framework's principal to 
> reserve resources, that same principal can be used to reserve resources for 
> _any_ role through the operator endpoint.
> We should restrict reservations made through the operator endpoint to 
> specified roles. A few possibilities:
> * The {{object}} of the {{reserve_resources}} ACL could be changed from 
> {{resources}} to {{roles}}
> * A second ACL could be added for authorization of {{reserve}} operations, 
> with an {{object}} of {{role}}
> * Our conception of the {{resources}} object in the {{reserve_resources}} ACL 
> could be expanded to include role information, i.e., 
> {{disk(role1);mem(role1)}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4591) `/reserve` and `/create-volumes` endpoints allow reservations for any role

2016-02-12 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-4591:
-
Summary: `/reserve` and `/create-volumes` endpoints allow reservations for 
any role  (was: `/reserve` endpoint allows reservations for any role)

> `/reserve` and `/create-volumes` endpoints allow reservations for any role
> --
>
> Key: MESOS-4591
> URL: https://issues.apache.org/jira/browse/MESOS-4591
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.27.0
>Reporter: Greg Mann
>  Labels: mesosphere, reservations
>
> When frameworks reserve resources, the validation of the operation ensures 
> that the {{role}} of the reservation matches the {{role}} of the framework. 
> For the case of the {{/reserve}} operator endpoint, however, the operator has 
> no role to validate, so this check isn't performed.
> This means that if an ACL exists which authorizes a framework's principal to 
> reserve resources, that same principal can be used to reserve resources for 
> _any_ role through the operator endpoint.
> We should restrict reservations made through the operator endpoint to 
> specified roles. A few possibilities:
> * The {{object}} of the {{reserve_resources}} ACL could be changed from 
> {{resources}} to {{roles}}
> * A second ACL could be added for authorization of {{reserve}} operations, 
> with an {{object}} of {{role}}
> * Our conception of the {{resources}} object in the {{reserve_resources}} ACL 
> could be expanded to include role information, i.e., 
> {{disk(role1);mem(role1)}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4591) `/reserve` and `/create-volumes` endpoints allow operations for any role

2016-02-12 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-4591:
-
Summary: `/reserve` and `/create-volumes` endpoints allow operations for 
any role  (was: `/reserve` and `/create-volumes` endpoints allow reservations 
for any role)

> `/reserve` and `/create-volumes` endpoints allow operations for any role
> 
>
> Key: MESOS-4591
> URL: https://issues.apache.org/jira/browse/MESOS-4591
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.27.0
>Reporter: Greg Mann
>  Labels: mesosphere, reservations
>
> When frameworks reserve resources, the validation of the operation ensures 
> that the {{role}} of the reservation matches the {{role}} of the framework. 
> For the case of the {{/reserve}} operator endpoint, however, the operator has 
> no role to validate, so this check isn't performed.
> This means that if an ACL exists which authorizes a framework's principal to 
> reserve resources, that same principal can be used to reserve resources for 
> _any_ role through the operator endpoint.
> We should restrict reservations made through the operator endpoint to 
> specified roles. A few possibilities:
> * The {{object}} of the {{reserve_resources}} ACL could be changed from 
> {{resources}} to {{roles}}
> * A second ACL could be added for authorization of {{reserve}} operations, 
> with an {{object}} of {{role}}
> * Our conception of the {{resources}} object in the {{reserve_resources}} ACL 
> could be expanded to include role information, i.e., 
> {{disk(role1);mem(role1)}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3801) Flaky test on Ubuntu Wily: ReservationTest.DropReserveTooLarge

2016-02-12 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-3801:
---
Labels: flaky flaky-test mesosphere  (was: flaky-test mesosphere)

> Flaky test on Ubuntu Wily: ReservationTest.DropReserveTooLarge
> --
>
> Key: MESOS-3801
> URL: https://issues.apache.org/jira/browse/MESOS-3801
> Project: Mesos
>  Issue Type: Bug
> Environment: Linux vagrant-ubuntu-wily-64 4.2.0-16-generic #19-Ubuntu 
> SMP Thu Oct 8 15:35:06 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Neil Conway
>Priority: Minor
>  Labels: flaky, flaky-test, mesosphere
> Attachments: test_fail_verbose.txt, test_run_15sec.txt
>
>
> {noformat}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from ReservationTest
> [ RUN  ] ReservationTest.DropReserveTooLarge
> /mesos/src/tests/reservation_tests.cpp:449: Failure
> Failed to wait 15secs for offers
> /mesos/src/tests/reservation_tests.cpp:439: Failure
> Actual function call count doesn't match EXPECT_CALL(sched, 
> resourceOffers(, _))...
>  Expected: to be called once
>Actual: never called - unsatisfied and active
> /mesos/src/tests/reservation_tests.cpp:421: Failure
> Actual function call count doesn't match EXPECT_CALL(allocator, addSlave(_, 
> _, _, _, _))...
>  Expected: to be called once
>Actual: never called - unsatisfied and active
> [  FAILED  ] ReservationTest.DropReserveTooLarge (15302 ms)
> [--] 1 test from ReservationTest (15303 ms total)
> [--] Global test environment tear-down
> [==] 1 test from 1 test case ran. (15308 ms total)
> [  PASSED  ] 0 tests.
> [  FAILED  ] 1 test, listed below:
> [  FAILED  ] ReservationTest.DropReserveTooLarge
>  1 FAILED TEST
> {noformat}
> Repro'd via "mesos-tests --gtest_filter=ReservationTest.DropReserveTooLarge 
> --gtest_repeat=100". ~4 runs out of 100 resulted in the error. Note that test 
> runtime varied pretty widely: most test runs completed in < 500ms, but many 
> (1/3?) of runs took 5000ms or longer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4667) Expose persistent volume information in state.json

2016-02-12 Thread Neil Conway (JIRA)
Neil Conway created MESOS-4667:
--

 Summary: Expose persistent volume information in state.json
 Key: MESOS-4667
 URL: https://issues.apache.org/jira/browse/MESOS-4667
 Project: Mesos
  Issue Type: Bug
  Components: master
Reporter: Neil Conway
Priority: Minor


The per-slave {{reserved_resources}} information returned by {{/state}} does 
not seem to include information about persistent volumes. This makes it hard 
for operators to use the {{/destroy-volumes}} endpoint.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4667) Expose persistent volume information in state.json

2016-02-12 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145146#comment-15145146
 ] 

Neil Conway commented on MESOS-4667:


Probably also need to expose the {{reserver_principal}} for dynamically 
reserved resources.

> Expose persistent volume information in state.json
> --
>
> Key: MESOS-4667
> URL: https://issues.apache.org/jira/browse/MESOS-4667
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Neil Conway
>Priority: Minor
>  Labels: endpoint, mesosphere
>
> The per-slave {{reserved_resources}} information returned by {{/state}} does 
> not seem to include information about persistent volumes. This makes it hard 
> for operators to use the {{/destroy-volumes}} endpoint.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4669) Add common compression utility

2016-02-12 Thread Jojy Varghese (JIRA)
Jojy Varghese created MESOS-4669:


 Summary: Add common compression utility
 Key: MESOS-4669
 URL: https://issues.apache.org/jira/browse/MESOS-4669
 Project: Mesos
  Issue Type: Bug
  Components: containerization
Reporter: Jojy Varghese
Assignee: Jojy Varghese


We need a GZIP decompression utility for the Appc image fetching functionality. The 
images are tar + gzip'ed and need to be uncompressed first so that we can 
compute a SHA-512 checksum on them.
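
For illustration, here is a minimal sketch of that decompression step using zlib 
directly (the helper name and buffer handling are assumptions for illustration, 
not the eventual Mesos API):

{code}
// Sketch only: gunzip an in-memory buffer with zlib, as the step that would
// precede hashing the decompressed tarball. Assumes zlib is available.
#include <stdexcept>
#include <string>

#include <zlib.h>

std::string gunzip(const std::string& compressed)
{
  z_stream stream = {};

  // 16 + MAX_WBITS tells zlib to expect a gzip (not raw deflate) header.
  if (inflateInit2(&stream, 16 + MAX_WBITS) != Z_OK) {
    throw std::runtime_error("inflateInit2 failed");
  }

  stream.next_in =
    reinterpret_cast<Bytef*>(const_cast<char*>(compressed.data()));
  stream.avail_in = static_cast<uInt>(compressed.size());

  std::string result;
  char buffer[4096];
  int status = Z_OK;

  do {
    stream.next_out = reinterpret_cast<Bytef*>(buffer);
    stream.avail_out = sizeof(buffer);

    status = inflate(&stream, Z_NO_FLUSH);
    if (status != Z_OK && status != Z_STREAM_END) {
      inflateEnd(&stream);
      throw std::runtime_error("inflate failed");
    }

    result.append(buffer, sizeof(buffer) - stream.avail_out);
  } while (status != Z_STREAM_END);

  inflateEnd(&stream);
  return result;
}
{code}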



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4666) Expose total resources of a slave in offer for scheduling decisions

2016-02-12 Thread Anindya Sinha (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145359#comment-15145359
 ] 

Anindya Sinha commented on MESOS-4666:
--

Yes, if all these "related tasks" are launched by the same framework, but not if 
they are launched by different frameworks (and/or different schedulers).

> Expose total resources of a slave in offer for scheduling decisions
> ---
>
> Key: MESOS-4666
> URL: https://issues.apache.org/jira/browse/MESOS-4666
> Project: Mesos
>  Issue Type: Improvement
>  Components: general
>Affects Versions: 0.25.0
>Reporter: Anindya Sinha
>Assignee: Anindya Sinha
>Priority: Minor
>
> To schedule certain classes of tasks effectively, the scheduler might need to 
> know not only the available resources (as exposed currently) but also the 
> maximum resources available on that slave. This is especially true for 
> clusters whose slave nodes have different configurations in terms of 
> resources such as cpu, memory, disk, etc.
> Certain classes of tasks might need to be scheduled on the same slave 
> (esp. those needing shared persistent volumes, MESOS-3421). Instead of 
> dedicating a slave to a framework, the framework could make a much better 
> decision if it had exposure to both the available and the total resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4666) Expose total resources of a slave in offer for scheduling decisions

2016-02-12 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145376#comment-15145376
 ] 

Vinod Kone commented on MESOS-4666:
---

Hmm. Having to know what other frameworks have launched/reserved on a slave 
seems like a security issue.

Can you be more specific about your use case? Do you have 2 different frameworks 
in the same role that launch tasks (or make dynamic reservations) that depend 
on each other? Or are you using static reservations?

> Expose total resources of a slave in offer for scheduling decisions
> ---
>
> Key: MESOS-4666
> URL: https://issues.apache.org/jira/browse/MESOS-4666
> Project: Mesos
>  Issue Type: Improvement
>  Components: general
>Affects Versions: 0.25.0
>Reporter: Anindya Sinha
>Assignee: Anindya Sinha
>Priority: Minor
>
> To schedule certain classes of tasks effectively, the scheduler might need to 
> know not only the available resources (as exposed currently) but also the 
> maximum resources available on that slave. This is especially true for 
> clusters whose slave nodes have different configurations in terms of 
> resources such as cpu, memory, disk, etc.
> Certain classes of tasks might need to be scheduled on the same slave 
> (esp. those needing shared persistent volumes, MESOS-3421). Instead of 
> dedicating a slave to a framework, the framework could make a much better 
> decision if it had exposure to both the available and the total resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4665) Reverse DNS for cert validation ?

2016-02-12 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145306#comment-15145306
 ] 

Vinod Kone commented on MESOS-4665:
---

Any comments here, [~kaysoky] [~jvanremoortere]?

> Reverse DNS for cert validation ?
> -
>
> Key: MESOS-4665
> URL: https://issues.apache.org/jira/browse/MESOS-4665
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.26.0
>Reporter: pawan
>
> I have three mesos master nodes configured to use SSL with cert validation 
> enabled. All the machines are failing cert validation, and hence peering is 
> failing, with the following error:
> 
> I0212 14:02:22.019564 20544 network.hpp:463] ZooKeeper group PIDs: { 
> log-replica(1)@192.168.1.16:5050, log-replica(1)@192.168.1.27:5050, 
> log-replica(1)@192.168.1.30:5050 }
> I0212 14:02:22.037328 20545 libevent_ssl_socket.cpp:973] Failed accept, 
> verification error: Presented Certificate Name: mesos01.p.qa.a.com does not 
> match peer hostname name: 192.168.1.16
> I0212 14:02:22.041191 20545 libevent_ssl_socket.cpp:973] Failed accept, 
> verification error: Presented Certificate Name: mesos02.p.qa.a.com does not 
> match peer hostname name: 192.168.1.27
> I0212 14:02:22.061522 20545 libevent_ssl_socket.cpp:973] Failed accept, 
> verification error: Presented Certificate Name: mesos01.p.qa.a.com does not 
> match peer hostname name: 192.168.1.16
> I0212 14:02:22.065572 20545 libevent_ssl_socket.cpp:373] Failed connect, 
> verification error: Presented Certificate Name: mesos01.p.qa.a.com does not 
> match peer hostname name: 192.168.1.16
> I0212 14:02:22.065839 20545 process.cpp:1281] Failed to link, connect: 
> Presented Certificate Name: mesos01.p.qa.a.com does not match peer hostname 
> name: 192.168.1.16
> E0212 14:02:22.065994 20545 process.cpp:1911] Failed to shutdown socket with 
> fd 27: Transport endpoint is not connected
> I0212 14:02:22.068665 20545 libevent_ssl_socket.cpp:373] Failed connect, 
> verification error: Presented Certificate Name: mesos02.p.qa.a.com does not 
> match peer hostname name: 192.168.1.27
> I0212 14:02:22.068761 20545 process.cpp:1281] Failed to link, connect: 
> Presented Certificate Name: mesos02.p.qa.a.com does not match peer hostname 
> name: 192.168.1.27
> E0212 14:02:22.068830 20545 process.cpp:1911] Failed to shutdown socket with 
> fd 28: Transport endpoint is not connected
> --
> From my understanding and from looking at the source, during cert validation 
> mesos uses the getnameinfo call to get the hostname of the connecting peer from 
> the IP address on the socket connection. This call returns the IP as a string, 
> which results in failures because our cert has a CN of only the peer hostname. 
> But everything worked when I added host-ip mappings of all peers to /etc/hosts 
> on each host.
> Does mesos inherently expect reverse DNS (PTR records) to be provisioned? If 
> so, this is a very challenging and unrealistic expectation, and even worse if 
> you are deploying mesos in a firewalled/NAT-ed environment.
> Is my understanding right? Am I missing anything here? How would you recommend 
> I proceed?
> Also, I use --hostname to set the hostname of all mesos nodes and I see the 
> right [ip, hostname] info in the zookeeper node. It looks like mesos is not 
> using it during cert validation.
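
To make the failure mode concrete, here is a minimal standalone sketch (plain 
POSIX calls, not Mesos code) of the reverse lookup described above: without a 
PTR record, getnameinfo() simply returns the numeric address as a string, which 
then cannot match the hostname in the certificate's CN.

{code}
// Sketch only: reverse-resolve a peer address the way a cert validator might.
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>

#include <iostream>

int main()
{
  sockaddr_in peer{};
  peer.sin_family = AF_INET;
  inet_pton(AF_INET, "192.168.1.16", &peer.sin_addr);  // Example peer IP.

  char host[NI_MAXHOST];
  int rc = getnameinfo(
      reinterpret_cast<sockaddr*>(&peer), sizeof(peer),
      host, sizeof(host), nullptr, 0, 0);

  if (rc == 0) {
    // Without a PTR record this prints "192.168.1.16", not the hostname
    // that appears in the presented certificate.
    std::cout << "peer resolves to: " << host << std::endl;
  } else {
    std::cerr << "getnameinfo: " << gai_strerror(rc) << std::endl;
  }

  return 0;
}
{code}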



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4670) `cgroup_info` not being exposed in state.json when ComposingContainerizer is used.

2016-02-12 Thread Avinash Sridharan (JIRA)
Avinash Sridharan created MESOS-4670:


 Summary: `cgroup_info` not being exposed in state.json when 
ComposingContainerizer is used.
 Key: MESOS-4670
 URL: https://issues.apache.org/jira/browse/MESOS-4670
 Project: Mesos
  Issue Type: Bug
Reporter: Avinash Sridharan
Assignee: Avinash Sridharan


The ComposingContainerizer currently does not have a `status` method. This 
results in no `ContainerStatus` being updated in the agent when the 
`ComposingContainerizer` is used to launch containers. This specifically happens 
when the agent is launched with `--containerizers=docker,mesos`.
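
A simplified standalone sketch (the types and member names below are 
illustrative, not the actual Mesos classes) of the missing piece: the composing 
containerizer needs to forward {{status()}} to whichever wrapped containerizer 
owns the container instead of dropping the call.

{code}
// Sketch only: a "composing" containerizer delegating status() to the owner.
#include <iostream>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>

struct ContainerStatus { std::string cgroupInfo; };

struct Containerizer
{
  virtual ~Containerizer() = default;
  virtual ContainerStatus status(const std::string& containerId) = 0;
};

struct MesosContainerizer : Containerizer
{
  ContainerStatus status(const std::string& containerId) override
  {
    return {"cgroup_info for " + containerId};
  }
};

struct ComposingContainerizer : Containerizer
{
  // containerId -> containerizer that launched it.
  std::map<std::string, std::shared_ptr<Containerizer>> owners;

  ContainerStatus status(const std::string& containerId) override
  {
    auto it = owners.find(containerId);
    if (it == owners.end()) {
      throw std::runtime_error("Unknown container: " + containerId);
    }
    return it->second->status(containerId);  // Delegate to the owner.
  }
};

int main()
{
  ComposingContainerizer composing;
  composing.owners["c1"] = std::make_shared<MesosContainerizer>();
  std::cout << composing.status("c1").cgroupInfo << std::endl;
  return 0;
}
{code}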



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4668) Agent's /state endpoint does not include reservation information

2016-02-12 Thread Neil Conway (JIRA)
Neil Conway created MESOS-4668:
--

 Summary: Agent's /state endpoint does not include reservation 
information
 Key: MESOS-4668
 URL: https://issues.apache.org/jira/browse/MESOS-4668
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: Neil Conway
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4353) Limit the number of processes created by libprocess

2016-02-12 Thread Maged Michael (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145367#comment-15145367
 ] 

Maged Michael commented on MESOS-4353:
--

If this is OK, then I propose the following design (an illustrative parsing 
sketch follows the list below):

* Introduce a new environment variable to allow the operator to set the number 
of libprocess worker threads.
* The environment variable is named LIBPROCESS_WORKER_THREADS
* Valid values of the environment variable are integers in the range 1 to 1024. 
* All other values are invalid and generate a warning.
* The proposed environment variable can be set directly for Mesos master, 
agents (slaves), and tests.
* For executors, the proposed environment variable can be set indirectly by 
including it in the setting of the agent (slave) 
--executor_environment_variables option (See documentation of Mesos 
configuration http://mesos.apache.org/documentation/latest/configuration/).
* Update documentation of Mesos configuration to reflect the addition of this 
libprocess environment variable.
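
An illustrative parsing sketch (plain C++; the default and the 1..1024 range are 
taken from the proposal above, everything else is an assumption, not the final 
implementation):

{code}
#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <thread>

// Sketch only: read LIBPROCESS_WORKER_THREADS, accept 1..1024, warn otherwise.
static size_t workerThreads()
{
  // Current default: max(8, number of CPU cores).
  size_t threads = std::max(8u, std::thread::hardware_concurrency());

  if (const char* value = std::getenv("LIBPROCESS_WORKER_THREADS")) {
    char* end = nullptr;
    long parsed = std::strtol(value, &end, 10);

    if (end != value && *end == '\0' && parsed >= 1 && parsed <= 1024) {
      threads = static_cast<size_t>(parsed);
    } else {
      std::cerr << "Invalid LIBPROCESS_WORKER_THREADS '" << value
                << "'; using default of " << threads << std::endl;
    }
  }

  return threads;
}

int main()
{
  std::cout << "worker threads: " << workerThreads() << std::endl;
  return 0;
}
{code}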


> Limit the number of processes created by libprocess
> ---
>
> Key: MESOS-4353
> URL: https://issues.apache.org/jira/browse/MESOS-4353
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>
> Currently libprocess will create {{max(8, number of CPU cores)}} processes 
> during the initialization, see 
> https://github.com/apache/mesos/blob/0.26.0/3rdparty/libprocess/src/process.cpp#L2146
> for details. This should be OK for a normal machine that does not have many 
> cores (e.g., 16, 32), but for a powerful machine which may have a large number 
> of cores (e.g., an IBM Power machine may have 192 cores), this will create too 
> many worker threads, which are not necessary.
> And since libprocess is widely used in Mesos (master, agent, scheduler, 
> executor), it may also cause performance issues. For example, when a user 
> creates a Docker container via Mesos on a Mesos agent which is running on a 
> powerful machine with 192 cores, the DockerContainerizer in the Mesos agent 
> will create a dedicated executor for the container, and there will be 192 
> worker threads in that executor. And if a user creates 1000 Docker containers 
> on that machine, then there will be 1000 executors, i.e., 1000 * 192 worker 
> threads, which is a large number and may thrash the OS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4666) Expose total resources of a slave in offer for scheduling decisions

2016-02-12 Thread Anindya Sinha (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145515#comment-15145515
 ] 

Anindya Sinha commented on MESOS-4666:
--

btw, in total resources, the proposal is to send what the mesos-slave started 
off with (either from the --resources flag or derived from the system).
Say a slave starts with cpus(*): 8;mem(*):16384;disk(*): 163840; we send this 
info in Offer::total_resources to denote that this is the maximum capability of 
this slave, in addition to the available resources for this specific framework.
Another example: if a slave starts with cpus(*): 6;cpus(role1): 
2;mem(*):8192;mem(role1): 8192;disk(*): 102400;disk(role1): 61440, we send 
Offer::total_resources as the summation of each resource type across roles, i.e. 
cpus(*): 8;mem(*):16384;disk(*): 163840.
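
(As a toy restatement of that arithmetic in plain C++, not the Mesos 
{{Resources}} class:)

{code}
#include <iostream>
#include <map>
#include <string>

int main()
{
  // role -> cpus the slave advertised at registration.
  std::map<std::string, double> cpusByRole = {{"*", 6.0}, {"role1", 2.0}};

  double totalCpus = 0.0;
  for (const auto& entry : cpusByRole) {
    totalCpus += entry.second;
  }

  std::cout << "cpus(total): " << totalCpus << std::endl;  // Prints 8.
  return 0;
}
{code}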

So we want to send the total resources the slave started off with when it 
registered. I do not think it exposes what other frameworks have 
launched/reserved. So I am not sure if this would still qualify as a security 
issue.

My use case is that we have 2 different frameworks in the same role to launch 
these related tasks on the same slave. Also, we do not want any other tasks to 
be running on this slave. We do not use static reservations but reserve 
resources dynamically.

So the idea is that when the 1st framework comes up, we reserve all the 
resources on the slave if the Offer::total_resources == Offer::resources (which 
suggests that the slave is not running any tasks across any frameworks). Once 
the reservation is successful, we launch task1 from framework1. Then framework2 
comes up and uses the reserved resources to launch task2.


> Expose total resources of a slave in offer for scheduling decisions
> ---
>
> Key: MESOS-4666
> URL: https://issues.apache.org/jira/browse/MESOS-4666
> Project: Mesos
>  Issue Type: Improvement
>  Components: general
>Affects Versions: 0.25.0
>Reporter: Anindya Sinha
>Assignee: Anindya Sinha
>Priority: Minor
>
> To schedule certain classes of tasks effectively, the scheduler might need to 
> know not only the available resources (as exposed currently) but also the 
> maximum resources available on that slave. This is especially true for 
> clusters whose slave nodes have different configurations in terms of 
> resources such as cpu, memory, disk, etc.
> Certain classes of tasks might need to be scheduled on the same slave 
> (esp. those needing shared persistent volumes, MESOS-3421). Instead of 
> dedicating a slave to a framework, the framework could make a much better 
> decision if it had exposure to both the available and the total resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4671) Status updates from executor can be forwarded out of order by the Agent.

2016-02-12 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145712#comment-15145712
 ] 

Anand Mazumdar commented on MESOS-4671:
---

cc: [~avin...@mesosphere.io], [~jieyu]

> Status updates from executor can be forwarded out of order by the Agent.
> 
>
> Key: MESOS-4671
> URL: https://issues.apache.org/jira/browse/MESOS-4671
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, HTTP API
>Affects Versions: 0.28.0
>Reporter: Anand Mazumdar
>  Labels: mesosphere
>
> Previously, all status update messages from the executor were forwarded by 
> the agent to the master in the order that they had been received. 
> However, that seems to be no longer valid due to a recently introduced change 
> in the agent:
> {code}
> // Before sending update, we need to retrieve the container status.
>   containerizer->status(executor->containerId)
> .onAny(defer(self(),
>  ::_statusUpdate,
>  update,
>  pid,
>  executor->id,
>  lambda::_1));
> {code}
> This can sometimes lead to status updates being sent out of order depending 
> on the order the {{Future}} is fulfilled from the call to {{status(...)}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4671) Status updates from executor can be forwarded out of order by the Agent.

2016-02-12 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-4671:
--
Description: 
Previously, all status update messages from the executor were forwarded by the 
agent to the master in the order that they had been received. 

However, that seems to be no longer valid due to a recently introduced change 
in the agent:

{code}
// Before sending update, we need to retrieve the container status.
  containerizer->status(executor->containerId)
.onAny(defer(self(),
 ::_statusUpdate,
 update,
 pid,
 executor->id,
 lambda::_1));
{code}

This can sometimes lead to status updates being sent out of order depending on 
the order the {{Future}} is fulfilled from the call to {{status(...)}}.

  was:
Previously, all status update message from the executor were forwarded by the 
agent to the master in the order that they had been received. 

However, that seems to be no longer valid due to a recently introduced change 
in the agent:

{code}
// Before sending update, we need to retrieve the container status.
  containerizer->status(executor->containerId)
.onAny(defer(self(),
 ::_statusUpdate,
 update,
 pid,
 executor->id,
 lambda::_1));
{code}

This can sometimes lead to status updates being sent out of order depending on 
the order the {{Future}} is fulfilled from the call to {{status(...)}}.


> Status updates from executor can be forwarded out of order by the Agent.
> 
>
> Key: MESOS-4671
> URL: https://issues.apache.org/jira/browse/MESOS-4671
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, HTTP API
>Affects Versions: 0.28.0
>Reporter: Anand Mazumdar
>  Labels: mesosphere
>
> Previously, all status update messages from the executor were forwarded by 
> the agent to the master in the order that they had been received. 
> However, that seems to be no longer valid due to a recently introduced change 
> in the agent:
> {code}
> // Before sending update, we need to retrieve the container status.
>   containerizer->status(executor->containerId)
> .onAny(defer(self(),
>  ::_statusUpdate,
>  update,
>  pid,
>  executor->id,
>  lambda::_1));
> {code}
> This can sometimes lead to status updates being sent out of order depending 
> on the order the {{Future}} is fulfilled from the call to {{status(...)}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4671) Status updates from executor can be forwarded out of order by the Agent.

2016-02-12 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-4671:
-

 Summary: Status updates from executor can be forwarded out of 
order by the Agent.
 Key: MESOS-4671
 URL: https://issues.apache.org/jira/browse/MESOS-4671
 Project: Mesos
  Issue Type: Bug
  Components: containerization, HTTP API
Affects Versions: 0.28.0
Reporter: Anand Mazumdar


Previously, all status update message from the executor were forwarded by the 
agent to the master in the order that they had been received. 

However, that seems to be no longer valid due to a recently introduced change 
in the agent:

{code}
// Before sending update, we need to retrieve the container status.
  containerizer->status(executor->containerId)
.onAny(defer(self(),
 ::_statusUpdate,
 update,
 pid,
 executor->id,
 lambda::_1));
{code}

This can sometimes lead to status updates being sent out of order depending on 
the order the {{Future}} is fulfilled from the call to {{status(...)}}.
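
For illustration, a minimal standalone sketch (plain C++, not Mesos code) of the 
hazard: two updates are dispatched in order, but each "forward" step runs only 
when its own asynchronous status lookup completes, so they can be forwarded out 
of order.

{code}
#include <chrono>
#include <future>
#include <iostream>
#include <thread>
#include <vector>

int main()
{
  std::vector<std::future<void>> pending;

  for (int update = 1; update <= 2; ++update) {
    pending.push_back(std::async(std::launch::async, [update]() {
      // Simulate a containerizer status lookup whose latency varies per call.
      std::this_thread::sleep_for(
          std::chrono::milliseconds(update == 1 ? 50 : 5));

      // The "forward to master" step runs whenever the lookup finishes,
      // so update 2 is typically printed before update 1.
      std::cout << "forwarding update " << update << std::endl;
    }));
  }

  for (auto& f : pending) {
    f.wait();
  }

  return 0;
}
{code}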



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4666) Expose total resources of a slave in offer for scheduling decisions

2016-02-12 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145590#comment-15145590
 ] 

Vinod Kone commented on MESOS-4666:
---

{quote}
So we want to send the total resources the slave started off with when it 
registered. I do not think it exposes what other frameworks have 
launched/reserved
{quote}

My concern is that if a slave has reservations (static/dynamic) for 'roleA' it 
doesn't make sense (without some sort of ACLs) to send that information to a 
framework that was registered with 'roleB'.

IIUC, you have two slightly orthogonal requirements:
1) Two frameworks to launch tasks that are dependent on each other.
2) Needing exclusive access to a slave *dynamically* instead of static 
reservations.

For 1) you can simply use dynamic reservations. Framework1 can make a 
reservation for the resources required for task1 and task2. It sounds like 
Framework1 already knows task2's resources out of band, which is a bit weird 
but not too bad.

For 2) Why do you want exclusive access? Is it because the current isolation 
(resource and security) in Mesos is not good/compliant enough or something else?

Anyway, the ability to somehow dynamically reserve a *whole* slave is an 
interesting use case. For that we might have to expose the total *amount* of 
resources, without exposing the reservation information.


> Expose total resources of a slave in offer for scheduling decisions
> ---
>
> Key: MESOS-4666
> URL: https://issues.apache.org/jira/browse/MESOS-4666
> Project: Mesos
>  Issue Type: Improvement
>  Components: general
>Affects Versions: 0.25.0
>Reporter: Anindya Sinha
>Assignee: Anindya Sinha
>Priority: Minor
>
> To schedule certain classes of tasks effectively, the scheduler might need to 
> know not only the available resources (as exposed currently) but also the 
> maximum resources available on that slave. This is especially true for 
> clusters whose slave nodes have different configurations in terms of 
> resources such as cpu, memory, disk, etc.
> Certain classes of tasks might need to be scheduled on the same slave 
> (esp. those needing shared persistent volumes, MESOS-3421). Instead of 
> dedicating a slave to a framework, the framework could make a much better 
> decision if it had exposure to both the available and the total resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4666) Expose total resources of a slave in offer for scheduling decisions

2016-02-12 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145745#comment-15145745
 ] 

Guangya Liu commented on MESOS-4666:


Is this use case a kind of {{Exclusive Resources}}, where the {{Exclusive 
Resources}} are at the host level? There is indeed a document talking about 
{{Exclusive Resources}}, but it focuses only on a special kind of resource and 
not on a whole host: 
https://docs.google.com/document/d/1Aby-U3-MPKE51s4aYd41L4Co2S97eM6LPtyzjyR_ecI/edit#

> Expose total resources of a slave in offer for scheduling decisions
> ---
>
> Key: MESOS-4666
> URL: https://issues.apache.org/jira/browse/MESOS-4666
> Project: Mesos
>  Issue Type: Improvement
>  Components: general
>Affects Versions: 0.25.0
>Reporter: Anindya Sinha
>Assignee: Anindya Sinha
>Priority: Minor
>
> To schedule certain classes of tasks effectively, the scheduler might need to 
> know not only the available resources (as exposed currently) but also the 
> maximum resources available on that slave. This is especially true for 
> clusters whose slave nodes have different configurations in terms of 
> resources such as cpu, memory, disk, etc.
> Certain classes of tasks might need to be scheduled on the same slave 
> (esp. those needing shared persistent volumes, MESOS-3421). Instead of 
> dedicating a slave to a framework, the framework could make a much better 
> decision if it had exposure to both the available and the total resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-4545) Propose design doc for reliable floating point behavior

2016-02-12 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway reassigned MESOS-4545:
--

Assignee: Neil Conway

> Propose design doc for reliable floating point behavior
> ---
>
> Key: MESOS-4545
> URL: https://issues.apache.org/jira/browse/MESOS-4545
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Neil Conway
>Assignee: Neil Conway
>  Labels: mesosphere, resources
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4591) `/reserve` and `/create-volumes` endpoints allow operations for any role

2016-02-12 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145700#comment-15145700
 ] 

Guangya Liu commented on MESOS-4591:


[~neilc], I think that [~greggomann] gave some explanation of why he thinks that 
{{/create-volumes}} is lower priority: "With regard to the /create-volumes 
endpoint, the difference there is that an operator can only create volumes 
using resources that have already been reserved for a particular role. You 
raise a good point, and perhaps we should restrict the creation of volumes to 
certain roles as well. However, that case seems less harmful to me since the 
operator can't create any persistent volume for an arbitrary role; they can 
only create volumes on disk resources that have already been reserved for a 
particular role."

> `/reserve` and `/create-volumes` endpoints allow operations for any role
> 
>
> Key: MESOS-4591
> URL: https://issues.apache.org/jira/browse/MESOS-4591
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.27.0
>Reporter: Greg Mann
>  Labels: mesosphere, reservations
>
> When frameworks reserve resources, the validation of the operation ensures 
> that the {{role}} of the reservation matches the {{role}} of the framework. 
> For the case of the {{/reserve}} operator endpoint, however, the operator has 
> no role to validate, so this check isn't performed.
> This means that if an ACL exists which authorizes a framework's principal to 
> reserve resources, that same principal can be used to reserve resources for 
> _any_ role through the operator endpoint.
> We should restrict reservations made through the operator endpoint to 
> specified roles. A few possibilities:
> * The {{object}} of the {{reserve_resources}} ACL could be changed from 
> {{resources}} to {{roles}}
> * A second ACL could be added for authorization of {{reserve}} operations, 
> with an {{object}} of {{role}}
> * Our conception of the {{resources}} object in the {{reserve_resources}} ACL 
> could be expanded to include role information, i.e., 
> {{disk(role1);mem(role1)}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3307) Configurable size of completed task / framework history

2016-02-12 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145572#comment-15145572
 ] 

Alexander Rukletsov commented on MESOS-3307:


We are currently not working on event streaming, hence the JSON endpoint is the 
best you can get for now. I think adding filters to the endpoint is a good idea.

> Configurable size of completed task / framework history
> ---
>
> Key: MESOS-3307
> URL: https://issues.apache.org/jira/browse/MESOS-3307
> Project: Mesos
>  Issue Type: Bug
>Reporter: Ian Babrou
>Assignee: Kevin Klues
>  Labels: mesosphere
> Fix For: 0.27.0
>
>
> We try to make Mesos work with multiple frameworks and mesos-dns at the same 
> time. The goal is to have set of frameworks per team / project on a single 
> Mesos cluster.
> At this point our mesos state.json is at 4 MB and it takes a while to 
> assemble. 5 mesos-dns instances hit state.json every 5 seconds, effectively 
> pushing mesos-master CPU usage through the roof. It's at 100%+ all the time.
> Here's the problem:
> {noformat}
> mesos λ curl -s http://mesos-master:5050/master/state.json | jq 
> .frameworks[].completed_tasks[].framework_id | sort | uniq -c | sort -n
>1 "20150606-001827-252388362-5050-5982-0003"
>   16 "20150606-001827-252388362-5050-5982-0005"
>   18 "20150606-001827-252388362-5050-5982-0029"
>   73 "20150606-001827-252388362-5050-5982-0007"
>  141 "20150606-001827-252388362-5050-5982-0009"
>  154 "20150820-154817-302720010-5050-15320-"
>  289 "20150606-001827-252388362-5050-5982-0004"
>  510 "20150606-001827-252388362-5050-5982-0012"
>  666 "20150606-001827-252388362-5050-5982-0028"
>  923 "20150116-002612-269165578-5050-32204-0003"
> 1000 "20150606-001827-252388362-5050-5982-0001"
> 1000 "20150606-001827-252388362-5050-5982-0006"
> 1000 "20150606-001827-252388362-5050-5982-0010"
> 1000 "20150606-001827-252388362-5050-5982-0011"
> 1000 "20150606-001827-252388362-5050-5982-0027"
> mesos λ fgrep 1000 -r src/master
> src/master/constants.cpp:const size_t MAX_REMOVED_SLAVES = 10;
> src/master/constants.cpp:const uint32_t MAX_COMPLETED_TASKS_PER_FRAMEWORK = 
> 1000;
> {noformat}
> Active tasks are just 6% of state.json response:
> {noformat}
> mesos λ cat ~/temp/mesos-state.json | jq -c . | wc
>1   14796 4138942
> mesos λ cat ~/temp/mesos-state.json | jq .frameworks[].tasks | jq -c . | wc
>   16  37  252774
> {noformat}
> I see four options that can improve the situation:
> 1. Add query string param to exclude completed tasks from state.json and use 
> it in mesos-dns and similar tools. There is no need for mesos-dns to know 
> about completed tasks, it's just extra load on master and mesos-dns.
> 2. Make history size configurable.
> 3. Make JSON serialization faster. With 1s of tasks even without history 
> it would take a lot of time to serialize tasks for mesos-dns. Doing it every 
> 60 seconds instead of every 5 seconds isn't really an option.
> 4. Create event bus for mesos master. Marathon has it and it'd be nice to 
> have it in Mesos. This way mesos-dns could avoid polling master state and 
> switch to listening for events.
> All can be done independently.
> Note to mesosphere folks: please start distributing debug symbols with your 
> distribution. I was asking for it for a while and it is really helpful: 
> https://github.com/mesosphere/marathon/issues/1497#issuecomment-104182501
> Perf report for leading master: 
> !http://i.imgur.com/iz7C3o0.png!
> I'm on 0.23.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4666) Expose total resources of a slave in offer for scheduling decisions

2016-02-12 Thread Klaus Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145769#comment-15145769
 ] 

Klaus Ma commented on MESOS-4666:
-

Agree with [~vinodkone]; if there's a use case that requires reserving a whole 
slave, Mesos should support that directly instead of providing `total_resources` 
in the offer.

> Expose total resources of a slave in offer for scheduling decisions
> ---
>
> Key: MESOS-4666
> URL: https://issues.apache.org/jira/browse/MESOS-4666
> Project: Mesos
>  Issue Type: Improvement
>  Components: general
>Affects Versions: 0.25.0
>Reporter: Anindya Sinha
>Assignee: Anindya Sinha
>Priority: Minor
>
> To schedule certain classes of tasks effectively, the scheduler might need to 
> know not only the available resources (as exposed currently) but also the 
> maximum resources available on that slave. This is especially true for 
> clusters whose slave nodes have different configurations in terms of 
> resources such as cpu, memory, disk, etc.
> Certain classes of tasks might need to be scheduled on the same slave 
> (esp. those needing shared persistent volumes, MESOS-3421). Instead of 
> dedicating a slave to a framework, the framework could make a much better 
> decision if it had exposure to both the available and the total resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4667) Expose persistent volume information in state.json

2016-02-12 Thread Klaus Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145771#comment-15145771
 ] 

Klaus Ma commented on MESOS-4667:
-

Is there any security concern if we expose {{reserver_principal}}?

> Expose persistent volume information in state.json
> --
>
> Key: MESOS-4667
> URL: https://issues.apache.org/jira/browse/MESOS-4667
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Neil Conway
>Priority: Minor
>  Labels: endpoint, mesosphere
>
> The per-slave {{reserved_resources}} information returned by {{/state}} does 
> not seem to include information about persistent volumes. This makes it hard 
> for operators to use the {{/destroy-volumes}} endpoint.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)