[jira] [Commented] (MESOS-1660) Lower ReaperProcess::wait() delay to 500ms or 250ms

2014-09-23 Thread Till Toenshoff (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144532#comment-14144532
 ] 

Till Toenshoff commented on MESOS-1660:
---

Hey Guys, this is very promising. Would you need any help to get this committed?

 Lower ReaperProcess::wait() delay to 500ms or 250ms
 ---

 Key: MESOS-1660
 URL: https://issues.apache.org/jira/browse/MESOS-1660
 Project: Mesos
  Issue Type: Improvement
  Components: libprocess
Reporter: Craig Hansen-Sturm
Assignee: Craig Hansen-Sturm

 See https://issues.apache.org/jira/browse/MESOS-1199



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1143) Add a TASK_ERROR task status.

2014-09-23 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144634#comment-14144634
 ] 

Alexander Rukletsov commented on MESOS-1143:


Maybe we should agree on a general approach to adding new task states or 
substates. There are several related issues, like [tasks that are implicitly 
killed|https://issues.apache.org/jira/browse/MESOS-1736] as a reaction to 
{{UnregisterFrameworkMessage}}, or [exposing additional 
info|https://issues.apache.org/jira/browse/MESOS-343] through substates.

I would suggest we introduce new task states sparingly and create substates 
instead. I would opt for a single protobuf enum, something 
like {{TaskStateExplained}}. There we can have fine-grained states, e.g. 
whether the task failed because of an OOM event or whether it was killed 
explicitly or implicitly.
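
To make the substate idea concrete, here is a minimal sketch in Python rather than protobuf. The enum members, the `StatusUpdate` type, and the relaunch policy are all assumptions for illustration; only the name {{TaskStateExplained}} comes from the comment above, and nothing here is the actual Mesos API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TaskState(Enum):
    # Coarse states a framework already reacts to (illustrative subset).
    TASK_FAILED = "TASK_FAILED"
    TASK_KILLED = "TASK_KILLED"
    TASK_LOST = "TASK_LOST"


class TaskStateExplained(Enum):
    # Hypothetical fine-grained substates; member names are assumptions.
    FAILED_OOM = "FAILED_OOM"
    KILLED_EXPLICITLY = "KILLED_EXPLICITLY"
    KILLED_IMPLICITLY = "KILLED_IMPLICITLY"
    LOST_INVALID_TASK = "LOST_INVALID_TASK"


@dataclass(frozen=True)
class StatusUpdate:
    # Hypothetical stand-in for a status update carrying an optional substate.
    state: TaskState
    explained: Optional[TaskStateExplained] = None


def should_relaunch(update: StatusUpdate) -> bool:
    """Assumed policy: relaunch lost tasks unless the task itself was invalid."""
    if update.state is not TaskState.TASK_LOST:
        return False
    return update.explained is not TaskStateExplained.LOST_INVALID_TASK
```

With something like this, a framework that only understands the coarse state keeps working, while one that inspects the substate can avoid resubmitting a malformed task forever.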

 Add a TASK_ERROR task status.
 -

 Key: MESOS-1143
 URL: https://issues.apache.org/jira/browse/MESOS-1143
 Project: Mesos
  Issue Type: Improvement
  Components: framework, master
Reporter: Benjamin Hindman

 During task validation we drop tasks that have errors and send TASK_LOST 
 status updates. In most circumstances a framework will want to relaunch a 
 task that has gone lost, and in the event the task is actually malformed 
 (thus invalid) this will result in an infinite loop of sending a task and 
 having it go lost.





[jira] [Commented] (MESOS-1791) Introduce Master / Offer Resource Reservations

2014-09-23 Thread Timothy St. Clair (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144879#comment-14144879
 ] 

Timothy St. Clair commented on MESOS-1791:
--

Don't reservations indirectly imply preemption?


 Introduce Master / Offer Resource Reservations
 --

 Key: MESOS-1791
 URL: https://issues.apache.org/jira/browse/MESOS-1791
 Project: Mesos
  Issue Type: Epic
Reporter: Tom Arnfeld

 Currently Mesos supports the ability to reserve resources (for a given role) 
 on a per-slave basis, as introduced in MESOS-505. This allows you to almost 
 statically partition off a set of resources on a set of machines, to 
 guarantee certain types of frameworks get some resources.
 This is very useful, but it would also be useful to control these 
 reservations through the master (instead of per-slave) for cases where I 
 don't care which nodes I get, as long as I get X CPUs and Y RAM, or Z sets 
 of (X,Y).
 I'm not sure what structure this could take, but apparently it has already 
 been discussed. Would this be a CLI flag? Could there be an (authenticated) 
 web interface to control these reservations?





[jira] [Updated] (MESOS-1827) Blog post link is incorrect

2014-09-23 Thread Jon Bringhurst (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jon Bringhurst updated MESOS-1827:
--
Description: 
On the page at:

https://mesos.apache.org/blog/mesos-0-20-0-released/

The following text:

{noformat}
On top of network monitoring, evaluating and providing support for 
per-container network isolation (MESOS-1585)
{noformat}

Links to MESOS-1407. However, it should probably link to MESOS-1585 instead.

  was:
On the page at:

https://mesos.apache.org/blog/mesos-0-20-0-released/

The following text:

On top of network monitoring, evaluating and providing support for 
per-container network isolation (MESOS-1585)

Links to MESOS-1407. However, it should probably link to MESOS-1585 instead.


 Blog post link is incorrect
 ---

 Key: MESOS-1827
 URL: https://issues.apache.org/jira/browse/MESOS-1827
 Project: Mesos
  Issue Type: Bug
  Components: project website
Reporter: Jon Bringhurst

 On the page at:
 https://mesos.apache.org/blog/mesos-0-20-0-released/
 The following text:
 {noformat}
 On top of network monitoring, evaluating and providing support for 
 per-container network isolation (MESOS-1585)
 {noformat}
 Links to MESOS-1407. However, it should probably link to MESOS-1585 instead.





[jira] [Commented] (MESOS-1240) Update Mesos in Wikipedia

2014-09-23 Thread Alina Nicolae (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145079#comment-14145079
 ] 

Alina Nicolae commented on MESOS-1240:
--

I would like to try to solve this issue, but I have one question: which 
official Mesos docs should appear on the Wikipedia page?


 Update Mesos in Wikipedia
 -

 Key: MESOS-1240
 URL: https://issues.apache.org/jira/browse/MESOS-1240
 Project: Mesos
  Issue Type: Documentation
  Components: documentation
Reporter: Adam B
Priority: Minor
  Labels: documentation, wikipedia

 Right now, Wikipedia's Mesos disambiguation page 
 (https://en.wikipedia.org/wiki/Mesos) links to a section of the 
 GrandLogic/JobServer page: 
 https://en.wikipedia.org/wiki/JobServer#Mesos_Clustering
 We should grab some of the official Mesos docs and create a legit Mesos 
 Wikipedia page (with relevant links).





[jira] [Commented] (MESOS-1199) Subprocess is slow - gated by process::reap poll interval

2014-09-23 Thread Ian Downes (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145123#comment-14145123
 ] 

Ian Downes commented on MESOS-1199:
---

Understood. This race has existed in the codebase for a long time. We could 
consider looking at /proc/{pid}/exe to confirm that the pid at least 
corresponds to the expected executable - still not perfect though.
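
A rough sketch of that check, in Python for brevity (libprocess itself is C++). The helper names `read_exe` and `pid_matches_executable` are hypothetical; the only real interface used is the Linux `/proc/<pid>/exe` symlink. As noted above, this narrows the race but does not close it: the pid can still be recycled between the check and whatever acts on the result.

```python
import os
from typing import Optional


def read_exe(pid: int) -> Optional[str]:
    """Return the path /proc/<pid>/exe points at, or None if it cannot be read.

    None covers a pid that no longer exists, missing permissions, and
    platforms without a Linux-style /proc.
    """
    try:
        return os.readlink(f"/proc/{pid}/exe")
    except OSError:
        return None


def pid_matches_executable(pid: int, expected: str) -> Optional[bool]:
    """True/False if the pid's executable could be compared, None if unknown."""
    exe = read_exe(pid)
    if exe is None:
        return None
    return exe == expected
```

A caller would record the expected executable path when forking and treat anything other than `True` as "this pid may not be ours anymore".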

 Subprocess is slow - gated by process::reap poll interval
 

 Key: MESOS-1199
 URL: https://issues.apache.org/jira/browse/MESOS-1199
 Project: Mesos
  Issue Type: Improvement
Affects Versions: 0.18.0
Reporter: Ian Downes
Assignee: Craig Hansen-Sturm
 Attachments: wiatpid.pdf


 Subprocess uses process::reap to wait on the subprocess pid and set the exit 
 status. However, process::reap polls with a one second interval resulting in 
 a delay up to the interval duration before the status future is set.
 This means if you need to wait for the subprocess to complete you get hit 
 with E(delay) = 0.5 seconds, independent of the execution time. For example, 
 the MesosContainerizer uses mesos-fetcher in a Subprocess to fetch the 
 executor during launch. At Twitter we fetch a local file, i.e., a very fast 
 operation, but the launch is blocked until the mesos-fetcher pid is reaped - 
 adding 0 to 1 seconds for every launch!
 The problem is even worse with a chain of short Subprocesses because after 
 the first Subprocess completes you'll be synchronized with the reap interval 
 and you'll see nearly the full interval before notification, i.e., 10 
 Subprocesses, each of well under 1 second duration, will take ~10 seconds!
 This has become particularly apparent in some new tests I'm working on where 
 test durations are now greatly extended with each taking several seconds.
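
 The chaining effect can be checked with a little arithmetic. This is only a sketch in Python, assuming the reaper polls on a fixed grid (every interval, starting at time 0) and that each Subprocess starts the moment the previous one is reaped; the function names are made up for illustration.

```python
from typing import Iterable


def reap_time_ms(exit_ms: int, interval_ms: int) -> int:
    """Time of the first poll tick at or after the process exits."""
    # Ceiling division on integers, then scale back up to a tick time.
    return -(-exit_ms // interval_ms) * interval_ms


def chain_elapsed_ms(start_ms: int, durations_ms: Iterable[int], interval_ms: int) -> int:
    """Wall time for a chain where each step starts once the previous is reaped."""
    now = start_ms
    for duration in durations_ms:
        now = reap_time_ms(now + duration, interval_ms)
    return now - start_ms
```

 Ten 10 ms Subprocesses started 300 ms after a tick give `chain_elapsed_ms(300, [10] * 10, 1000) == 9700`, i.e. about 9.7 seconds for 100 ms of work. Under the same assumptions a 250 ms interval gives 2450 ms, which is the kind of improvement MESOS-1660 is after.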





[jira] [Commented] (MESOS-1199) Subprocess is slow - gated by process::reap poll interval

2014-09-23 Thread Ian Downes (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145147#comment-14145147
 ] 

Ian Downes commented on MESOS-1199:
---

New review with dynamic poll interval: https://reviews.apache.org/r/25947/

 Subprocess is slow - gated by process::reap poll interval
 

 Key: MESOS-1199
 URL: https://issues.apache.org/jira/browse/MESOS-1199
 Project: Mesos
  Issue Type: Improvement
Affects Versions: 0.18.0
Reporter: Ian Downes
Assignee: Craig Hansen-Sturm
 Attachments: wiatpid.pdf


 Subprocess uses process::reap to wait on the subprocess pid and set the exit 
 status. However, process::reap polls with a one second interval resulting in 
 a delay up to the interval duration before the status future is set.
 This means if you need to wait for the subprocess to complete you get hit 
 with E(delay) = 0.5 seconds, independent of the execution time. For example, 
 the MesosContainerizer uses mesos-fetcher in a Subprocess to fetch the 
 executor during launch. At Twitter we fetch a local file, i.e., a very fast 
 operation, but the launch is blocked until the mesos-fetcher pid is reaped - 
 adding 0 to 1 seconds for every launch!
 The problem is even worse with a chain of short Subprocesses because after 
 the first Subprocess completes you'll be synchronized with the reap interval 
 and you'll see nearly the full interval before notification, i.e., 10 
 Subprocesses, each of well under 1 second duration, will take ~10 seconds!
 This has become particularly apparent in some new tests I'm working on where 
 test durations are now greatly extended with each taking several seconds.





[jira] [Comment Edited] (MESOS-1199) Subprocess is slow - gated by process::reap poll interval

2014-09-23 Thread Ian Downes (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145123#comment-14145123
 ] 

Ian Downes edited comment on MESOS-1199 at 9/23/14 6:07 PM:


Understood. This race has existed in the codebase for a long time. We could 
consider looking at /proc/\{pid\}/exe to confirm that the pid at least 
corresponds to the expected executable - still not perfect though.


was (Author: idownes):
Understood. This race has existed in the codebase for a long time. We could 
consider looking at /proc/{pid}/exe to confirm that the pid at least 
corresponds to the expected executable - still not perfect though.

 Subprocess is slow - gated by process::reap poll interval
 

 Key: MESOS-1199
 URL: https://issues.apache.org/jira/browse/MESOS-1199
 Project: Mesos
  Issue Type: Improvement
Affects Versions: 0.18.0
Reporter: Ian Downes
Assignee: Craig Hansen-Sturm
 Attachments: wiatpid.pdf


 Subprocess uses process::reap to wait on the subprocess pid and set the exit 
 status. However, process::reap polls with a one second interval resulting in 
 a delay up to the interval duration before the status future is set.
 This means if you need to wait for the subprocess to complete you get hit 
 with E(delay) = 0.5 seconds, independent of the execution time. For example, 
 the MesosContainerizer uses mesos-fetcher in a Subprocess to fetch the 
 executor during launch. At Twitter we fetch a local file, i.e., a very fast 
 operation, but the launch is blocked until the mesos-fetcher pid is reaped - 
 adding 0 to 1 seconds for every launch!
 The problem is even worse with a chain of short Subprocesses because after 
 the first Subprocess completes you'll be synchronized with the reap interval 
 and you'll see nearly the full interval before notification, i.e., 10 
 Subprocesses, each of well under 1 second duration, will take ~10 seconds!
 This has become particularly apparent in some new tests I'm working on where 
 test durations are now greatly extended with each taking several seconds.





[jira] [Updated] (MESOS-1199) Subprocess is slow - gated by process::reap poll interval

2014-09-23 Thread Ian Downes (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Downes updated MESOS-1199:
--
Story Points: 1

 Subprocess is slow - gated by process::reap poll interval
 

 Key: MESOS-1199
 URL: https://issues.apache.org/jira/browse/MESOS-1199
 Project: Mesos
  Issue Type: Improvement
Affects Versions: 0.18.0
Reporter: Ian Downes
Assignee: Ian Downes
 Attachments: wiatpid.pdf


 Subprocess uses process::reap to wait on the subprocess pid and set the exit 
 status. However, process::reap polls with a one second interval resulting in 
 a delay up to the interval duration before the status future is set.
 This means if you need to wait for the subprocess to complete you get hit 
 with E(delay) = 0.5 seconds, independent of the execution time. For example, 
 the MesosContainerizer uses mesos-fetcher in a Subprocess to fetch the 
 executor during launch. At Twitter we fetch a local file, i.e., a very fast 
 operation, but the launch is blocked until the mesos-fetcher pid is reaped - 
 adding 0 to 1 seconds for every launch!
 The problem is even worse with a chain of short Subprocesses because after 
 the first Subprocess completes you'll be synchronized with the reap interval 
 and you'll see nearly the full interval before notification, i.e., 10 
 Subprocesses, each of well under 1 second duration, will take ~10 seconds!
 This has become particularly apparent in some new tests I'm working on where 
 test durations are now greatly extended with each taking several seconds.





[jira] [Commented] (MESOS-1107) Create screencast for Getting Started with Mesos

2014-09-23 Thread Alina Nicolae (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145154#comment-14145154
 ] 

Alina Nicolae commented on MESOS-1107:
--

Should it also be with audio instructions or just video?

 Create screencast for Getting Started with Mesos
 

 Key: MESOS-1107
 URL: https://issues.apache.org/jira/browse/MESOS-1107
 Project: Mesos
  Issue Type: Task
  Components: documentation
Reporter: Dave Lester
Priority: Minor

 It'd be great to have a video screencast that walks people through the steps 
 of building Mesos and running example framework instructions that are 
 included on the Getting Started page http://mesos.apache.org/gettingstarted/





[jira] [Commented] (MESOS-1675) Decouple version of the mesos library from the package release version

2014-09-23 Thread Timothy St. Clair (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145273#comment-14145273
 ] 

Timothy St. Clair commented on MESOS-1675:
--

If we remove the non-standard major version numbering that is tacked on to the 
.so file, then I would say yes.

I think I will need to revise the patch to reflect that.

 Decouple version of the mesos library from the package release version
 --

 Key: MESOS-1675
 URL: https://issues.apache.org/jira/browse/MESOS-1675
 Project: Mesos
  Issue Type: Bug
Reporter: Vinod Kone

 This discussion should be rolled into the larger discussion around how to 
 version Mesos (APIs, packages, libraries etc).
 Some notes from libtool docs.
 http://www.gnu.org/software/libtool/manual/html_node/Updating-version-info.html
 http://www.gnu.org/software/libtool/manual/html_node/Release-numbers.html#Release-numbers





[jira] [Commented] (MESOS-1791) Introduce Master / Offer Resource Reservations

2014-09-23 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145449#comment-14145449
 ] 

Adam B commented on MESOS-1791:
---

Preemption (or at least inverse offers) would certainly help in the case where 
there are not enough available resources to satisfy a new reservation, but the 
reservations could also be seen as a desired state to try to reach once tasks 
complete and enough resources become available to reallocate to the new 
role/framework.

A few definitions/clarifications:
Master reservations are a way for the master to provide centralized control 
over the reservations on every slave. These may be the same static role-based 
reservations, except they are configurable through the master, perhaps as an 
(authenticated) web/REST interface. It could be as simple as the same 
reservation info forwarded to all slaves (reserve 1GB RAM on every slave), or 
configurable per-slave.

Offer reservations are not tied to particular slaves, but allow a framework 
to request a certain amount of global resources (perhaps split into multiple 
sets, one per host), regardless of placement. An example would be: make sure 
my framework/role always has 20 CPUs and 20GB RAM available somewhere in the 
cluster, but I don't care where. In this scenario, slaves do not need to 
specifically reserve any of their resources for that framework, so long as the 
offer reservation can be satisfied by resources from the rest of the cluster. 
The master will be responsible for enforcing these reservations. The 
framework/role gets the benefit of defining its desired resources at a higher 
level of abstraction, without knowledge of how many machines there are or how 
many resources are available on each.
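
As a rough illustration of the check the master would need for an offer reservation, here is a minimal sketch in Python. The `SlaveResources` type and both function names are hypothetical, and it assumes each requested set must fit entirely on one slave; real enforcement would also have to account for roles, existing reservations, and running tasks.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class SlaveResources:
    # Hypothetical model: unreserved resources currently free on one slave.
    cpus: int
    mem_gb: int


def sets_available(slaves: Iterable[SlaveResources], cpus: int, mem_gb: int) -> int:
    """Count how many (cpus, mem_gb) sets fit, each set on a single slave."""
    if cpus <= 0 or mem_gb <= 0:
        raise ValueError("each set must request positive cpus and memory")
    return sum(min(s.cpus // cpus, s.mem_gb // mem_gb) for s in slaves)


def can_satisfy(slaves: Iterable[SlaveResources], cpus: int, mem_gb: int, sets: int) -> bool:
    """True if the cluster can currently host the requested number of sets."""
    return sets_available(slaves, cpus, mem_gb) >= sets
```

For the example above, 20 CPUs and 20GB split into 5 sets of (4 CPUs, 4GB): slaves with (8, 16), (16, 8) and (4, 4) free can host 2 + 2 + 1 = 5 sets, so the reservation is satisfiable without any slave reserving resources up front.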

 Introduce Master / Offer Resource Reservations
 --

 Key: MESOS-1791
 URL: https://issues.apache.org/jira/browse/MESOS-1791
 Project: Mesos
  Issue Type: Epic
Reporter: Tom Arnfeld

 Currently Mesos supports the ability to reserve resources (for a given role) 
 on a per-slave basis, as introduced in MESOS-505. This allows you to almost 
 statically partition off a set of resources on a set of machines, to 
 guarantee certain types of frameworks get some resources.
 This is very useful, but it would also be useful to control these 
 reservations through the master (instead of per-slave) for cases where I 
 don't care which nodes I get, as long as I get X CPUs and Y RAM, or Z sets 
 of (X,Y).
 I'm not sure what structure this could take, but apparently it has already 
 been discussed. Would this be a CLI flag? Could there be an (authenticated) 
 web interface to control these reservations?





[jira] [Created] (MESOS-1828) The Mesos UI should link to framework UIs

2014-09-23 Thread Tobi Knaup (JIRA)
Tobi Knaup created MESOS-1828:
-

 Summary: The Mesos UI should link to framework UIs
 Key: MESOS-1828
 URL: https://issues.apache.org/jira/browse/MESOS-1828
 Project: Mesos
  Issue Type: Wish
Reporter: Tobi Knaup


Most frameworks have a web UI or HTTP API. It would be nice to show a direct 
link from the Mesos web UI so it's easy to navigate there. Currently a user 
needs to know where schedulers are running; Mesos doesn't have that knowledge.

A framework should provide a URL to its API/UI when it connects, as part of 
FrameworkInfo. This could be explicit proto fields (e.g. gui_url, api_url), or 
more generic key/value pairs.
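
A minimal sketch of the explicit-fields variant, in Python rather than protobuf. The `FrameworkInfo` class here is a stand-in and not the real message; `gui_url` and `api_url` are the example field names from above, and `framework_links` is a hypothetical helper for the web UI.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class FrameworkInfo:
    # Stand-in for the real message; only the fields needed for the sketch.
    name: str
    gui_url: Optional[str] = None  # hypothetical field
    api_url: Optional[str] = None  # hypothetical field


def framework_links(info: FrameworkInfo) -> List[Tuple[str, str]]:
    """Return (label, url) pairs the web UI could render for a framework."""
    links = []
    if info.gui_url:
        links.append(("UI", info.gui_url))
    if info.api_url:
        links.append(("API", info.api_url))
    return links
```

Frameworks that don't set either field simply get no links, so the change would be backwards compatible for existing schedulers.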


