Re: Slave gets oom killed when using cgroups isolation?

2013-08-22 Thread Benjamin Mahler
Hi Li, Why do you think the slave was OOM killed? Is there something that pointed you to that conclusion? All I see is the slave launched an executor, and the executor was killed by framework a few seconds after the task was launched. Also, what version are you running? Ben On Thu, Aug 22,

Re: conf flag change from .12 to .13

2013-09-25 Thread Benjamin Mahler
Looks like 0.13.0 does not have the fix cherry-picked. 0.14.0 is incoming and will include the fix, until then you can cherry-pick the following commit, although it may depend on additional patches! commit 028d500c6d00023c8a56b37f885207adfd1e9a50 Author: Charles Reiss wog...@apache.org Date:

Re: [VOTE][Result] Release Apache Mesos 0.14.0 (rc6)

2013-10-16 Thread Benjamin Mahler
: Hi, I'm happy to announce the passing of 0.14.0 vote with 3 +1 binding votes, no -1 and no 0 votes: +1: Benjamin Hindman (binding) Benjamin Mahler (binding) Dave Lester (binding) -1: 0: Please find the release at http://www.apache.org/dist/mesos/0.14.0 (or preferably please use

[VOTE][RESULT] Release Apache Mesos 0.14.1 (rc1)

2013-10-21 Thread Benjamin Mahler
Benjamin Mahler benjamin.mah...@gmail.com gpg: WARNING: This key is not certified with a trusted signature! gpg: There is no indication that the signature belongs to the owner. Primary key fingerprint: E3A6 E5EF 7B67 C142 5B53 F072 D0BE BB95 D141 A5B6 [chipotle:~/tmp/apache-mesos-0.14.1

Re: UI remote Task Sandbox displays error

2013-10-23 Thread Benjamin Mahler
There could be an issue related to using the ':' character in your executor id (which ultimately gets mapped to a path). A while back I filed this: https://issues.apache.org/jira/browse/MESOS-361 That's all I have to go on with the given information, the slave log might be more informative here,
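As a rough illustration of the workaround implied above, a framework could avoid ':' in executor ids before handing them to Mesos. This is only a sketch: the function name and the choice of '_' as the replacement are assumptions, not anything Mesos prescribes, and the replacement can make two distinct ids collide.

```python
def path_safe_executor_id(executor_id: str) -> str:
    """Replace ':' so the id is less likely to cause trouble when mapped to a path.

    Hypothetical helper: '_' is an arbitrary replacement character.
    """
    return executor_id.replace(":", "_")


# Example id is made up for illustration.
safe_id = path_safe_executor_id("job:42:executor")
```

Ids without a ':' pass through unchanged.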

Stateful Master

2013-10-31 Thread Benjamin Mahler
Hi All, I'd like to mention some changes that have been discussed amongst the committers but have not yet been shared broadly with the list. The central component of Mesos is the Master. The Master is responsible for administering slaves, frameworks, and resource offers. It also handles task

Re: Jenkins mesos plugin failing

2013-11-07 Thread Benjamin Mahler
We should fix that so that it reconnects with Mesos after a restart of Jenkins! Can you file an issue for this? On Thu, Nov 7, 2013 at 12:31 PM, Whitney Sorenson wsoren...@hubspot.com wrote: I should also point out the scheduler didn't seem to survive a reboot of Jenkins - I had to delete the

Re: Jenkins mesos plugin failing

2013-11-07 Thread Benjamin Mahler
, Nov 7, 2013 at 5:10 PM, Benjamin Mahler benjamin.mah...@gmail.com wrote: We should fix that so that it reconnects with Mesos after a restart of Jenkins! Can you file an issue for this? On Thu, Nov 7, 2013 at 12:31 PM, Whitney Sorenson wsoren...@hubspot.com wrote: I should also point out

Re: resource reservation through a long running service

2013-11-25 Thread Benjamin Mahler
In the longer term, utilization will be improved through oversubscription of the slave's resources. This means giving out resources inside existing resource allocations if they remain unused. These over-subscribed resources would be revocable at any time. On Mon, Nov 25, 2013 at 12:29 PM, ricky

Re: Porting an app

2014-01-13 Thread Benjamin Mahler
On Sun, Jan 12, 2014 at 9:14 PM, Ankur Chauhan an...@malloc64.com wrote: Thanks everyone for all the help. Marathon does seem like a good framework but my use case requires the app to evaluate its own health and scale up based on internal load stats (SLA requirements) and I don't know if

Re: Framework is not offered any resources

2014-01-13 Thread Benjamin Mahler
Can you provide the commands you're using to run the master and slaves? Can you provide the master and slave logs? On Mon, Jan 13, 2014 at 6:07 PM, Sai Sagar jsaisa...@gmail.com wrote: Hi all, I am implementing a framework on the top of mesos. The framework is registered successfully but

Re: Framework is not offered any resources

2014-01-14 Thread Benjamin Mahler
Running the master / slave with GLOG_v=2 in your environment may prove more insightful. Can you provide the full textual version of the logs instead of screenshots please? :) On Mon, Jan 13, 2014 at 8:36 PM, Sai Sagar jsaisa...@gmail.com wrote: Hi, Please find the attached images of master

Re: Mesos logging configuration questions

2014-01-21 Thread Benjamin Mahler
I'm afraid that document is out of date, please ignore the comments related to MESOS_HOME until we fix that document. https://issues.apache.org/jira/browse/MESOS-934 --log_dir is a parameter of the mesos-master and mesos-slave binaries. The directory you linked is to an executor

Re: Long running storage service on mesos.

2014-01-23 Thread Benjamin Mahler
On Wed, Jan 22, 2014 at 10:27 PM, coocood cooc...@gmail.com wrote: I want to run redis server cluster on mesos, but have some problems. The first problem is the storage path, since it is storage service, I need to set the storage path out of the sandbox, so the next run of the service will

Re: Compiling Mesos on Mac OSX Mountain Lion 10.9

2014-01-30 Thread Benjamin Mahler
I don't believe that we compile with C++11 on gcc 4.2, and C++11 support did not land in 0.16.0 IIRC. You should remove your -std=c++11 flag. Let us know if that does not work. On Thu, Jan 30, 2014 at 11:40 AM, Tom Arnfeld t...@duedil.com wrote: I'm trying to get going with Mesos to do a bit

Re: Compiling Mesos on Mac OSX Mountain Lion 10.9

2014-01-30 Thread Benjamin Mahler
]: *** [stl_logging_unittest-stl_logging_unittest.o] Error 1 # -- Tom Arnfeld Developer // DueDil On 30 Jan 2014, at 19:46, Benjamin Mahler benjamin.mah...@gmail.com wrote: I don't believe that we compile with C++11

Re: [VOTE] Release Apache Mesos 0.16.0 (rc5)

2014-02-04 Thread Benjamin Mahler
Looks like the 'm4' directory in stout was not added to the distribution: $ ./bootstrap ... m4/acx_pthread.m4:363: ACX_PTHREAD is expanded from... configure.ac:87: the top level autoreconf: configure.ac: adding subdirectory 3rdparty/stout to autoreconf autoreconf: Entering directory

Re: [VOTE] Release Apache Mesos 0.16.0 (rc5)

2014-02-05 Thread Benjamin Mahler
to track the fix? Meanwhile, I don't think this should be a blocker for releasing this? If you agree, I'm still looking for 2 more votes. On Tue, Feb 4, 2014 at 12:47 PM, Benjamin Mahler benjamin.mah...@gmail.com wrote: 0.15.0 has this same issue actually. Probably we should either

Re: GitLab CI on Mesos

2014-02-07 Thread Benjamin Mahler
Any link to the project? :) On Fri, Feb 7, 2014 at 9:09 AM, Benjamin Hindman benjamin.hind...@gmail.com wrote: Very cool Tomas! On Fri, Feb 7, 2014 at 8:15 AM, Tomas Barton barton.to...@gmail.com wrote: Hi, I wrote a Mesos framework for executing CI tasks from GitLab CI. The GitLab

Re: http://mesos.apache.org/downloads/ is saying the latest release is 0.16.0

2014-03-06 Thread Benjamin Mahler
+jie who has been doing the 0.17.0 release On Wed, Mar 5, 2014 at 11:44 PM, Chengwei Yang chengwei.yang...@gmail.com wrote: Hi List, I'm new to mesos and trying to download one from http://mesos.apache.org/downloads/ And it's saying that the most recent stable release is 0.16.0, however,

Re: Question on executors

2014-03-10 Thread Benjamin Mahler
I have the status of TASK_COMPLETED being sent via the driver, followed by a wait of about 5 secs This is needed because of https://issues.apache.org/jira/browse/MESOS-243. A 1 second sleep should be ample. However, I'd still like to hear any thoughts on the approach of using one task per

Re: Load simulator/benchmark tool

2014-03-21 Thread Benjamin Mahler
You can run N slaves on one machine, or you can run meta-slaves (slaves within slaves). We've used meta-slaves in the past to run scaling simulations as it is more accurate and easier than stubbing out the task launching. On Fri, Mar 21, 2014 at 12:55 PM, Sharma Podila spod...@netflix.com wrote:

Re: [VOTE] Release Apache Mesos 0.18.1 (rc2)

2014-05-02 Thread Benjamin Mahler
Looks like --without-cxx11 is broken for gcc-4.2.1, not sure if that should be a blocker for this because I don't believe there is a fix for this yet! [bmahler@smf1-aye-26-sr4 mesos-0.18.1]$ ./configure --disable-optimize --without-cxx11 make check -j8 ... libtool: compile: g++

Re: callback port

2014-05-16 Thread Benjamin Mahler
You can set LIBPROCESS_PORT in the environment. On Wed, May 14, 2014 at 1:58 PM, Scott Clasen sc...@heroku.com wrote: I raised this question on the Spark ML but it may be more a Mesos question. I would like to be able to configure the port used to communicate between the Mesos master and
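A minimal sketch of setting LIBPROCESS_PORT for a child process, assuming the scheduler is launched from Python. The helper name and the example port 9090 are illustrative assumptions; only the LIBPROCESS_PORT variable itself comes from the message above.

```python
import os


def env_with_libprocess_port(port, base_env=None):
    """Return a copy of the environment with LIBPROCESS_PORT set to `port`."""
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    # Copy so the caller's environment (or os.environ) is not modified.
    env = dict(os.environ if base_env is None else base_env)
    env["LIBPROCESS_PORT"] = str(port)
    return env


# 9090 is an arbitrary example port, not a Mesos default.
example_env = env_with_libprocess_port(9090, {"PATH": "/usr/bin"})
```

The resulting dictionary can be passed as the `env` argument of `subprocess.Popen` when starting the framework process.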

Re: Question on resource offers and framework failover

2014-05-16 Thread Benjamin Mahler
. What about tasks that Mesos is running for my framework, but my framework lost track of them (there could be some operational causes for this, even if we assume my code is bug free)? How are frameworks handling such a scenario? On Wed, May 14, 2014 at 4:05 PM, Benjamin Mahler benjamin.mah

Re: n00b isolation docs?

2014-06-09 Thread Benjamin Mahler
In addition to cpu and memory isolation, you will get process isolation. With posix isolation, processes can escape from the slave (e.g. something that double-forks and uses setsid). On Mon, Jun 9, 2014 at 9:02 AM, Jie Yu yujie@gmail.com wrote: Hi Dick, what cgroup isolation provides

Re: [VOTE] Release Apache Mesos 0.19.0 (rc3)

2014-06-09 Thread Benjamin Mahler
OS X (clang-503.0.40) and Ubuntu 13.10 (gcc 4.8.1) Thanks for the hard work everyone - this is going to be a great release! Niklas On Thu, Jun 5, 2014 at 11:43 PM, Benjamin Mahler benjamin.mah...@gmail.com wrote: Hi all, Please vote on releasing the following candidate

Re: cgroups isolation: at most vs at least

2014-06-23 Thread Benjamin Mahler
On Mon, Jun 23, 2014 at 9:24 AM, Chris Burroughs chris.burrou...@gmail.com wrote: I have two related questions about the cpu and memory subsystems and at most vs at least semantics. ## cpu As I understand it, by default cpu.shares is used to give tasks at least the specified cpu resources.

Re: Mesos 0.19.0 stats.json endpoint Collectd plugin

2014-06-26 Thread Benjamin Mahler
This is great! If only supporting 0.19.0+ I would recommend just collecting from /metrics/snapshot because it obviates /stats.json. On Thu, Jun 26, 2014 at 12:13 PM, Ray Rodriguez rayrod2...@gmail.com wrote: Hey everyone just thought I'd post a simple collectd python plugin that I just
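For a collector that reads /metrics/snapshot, the parsing side might look like the sketch below. It assumes the endpoint returns a flat JSON object mapping metric names to numbers; the metric names in the sample payload are placeholders rather than a guaranteed schema, and the HTTP fetch is left out.

```python
import json


def select_metrics(snapshot_text, prefix):
    """Return the metrics whose names start with `prefix`.

    Assumes `snapshot_text` is a flat JSON object of name -> number.
    """
    snapshot = json.loads(snapshot_text)
    return {name: value for name, value in snapshot.items()
            if name.startswith(prefix)}


# Illustrative payload only; real metric names may differ.
sample = ('{"master/tasks_running": 3, "master/tasks_lost": 1, '
          '"system/load_1min": 0.5}')
master_metrics = select_metrics(sample, "master/")
```

Filtering by prefix keeps the plugin independent of any single metric name, which matters if names change between releases.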

Re: Mesos 19 startup error

2014-06-27 Thread Benjamin Mahler
I created a ticket to track it for 0.19.1: https://issues.apache.org/jira/browse/MESOS-1551 On Fri, Jun 27, 2014 at 8:26 AM, Benjamin Hindman benjamin.hind...@gmail.com wrote: Yes, this was fixed in this commit https://github.com/apache/mesos/commit/1ce8d31fda545d69aea0637107f507c2b512adc9

Re: number of masters and quorum

2014-07-07 Thread Benjamin Mahler
Just to clarify: You can run a single master (--quorum=1) if you are looking to experiment and don't care about high availability. You can run 3 masters (--quorum=2) if you want to remain operating with 1 machine being down (planned or unplanned). You can also operate in the face of the complete
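The relationship between master count and quorum described above can be written down as a small sketch. The function names are made up for illustration; the arithmetic is simply "smallest majority", which matches the 1 master / quorum 1 and 3 masters / quorum 2 cases in the message.

```python
def quorum_size(num_masters: int) -> int:
    """Smallest majority of `num_masters` (the value to pass as --quorum)."""
    if num_masters < 1:
        raise ValueError("need at least one master")
    return num_masters // 2 + 1


def tolerated_failures(num_masters: int) -> int:
    """How many masters can be down while a quorum is still reachable."""
    return num_masters - quorum_size(num_masters)
```

With three masters, `quorum_size(3)` is 2 and `tolerated_failures(3)` is 1, so the cluster keeps operating with one machine down.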

Re: 0.19.1

2014-07-09 Thread Benjamin Mahler
I've added it to the 0.19.1 list since it's trivial and helps those using S3. On Fri, Jul 4, 2014 at 12:52 PM, Tom Arnfeld t...@duedil.com wrote: Happy to. It surprised me that this wasn't supported, especially considering the fetcher is supposed to be able to download URIs from any URL using

Re: Mesos 19 startup error

2014-07-09 Thread Benjamin Mahler
/ to --work_dir the process crashes as well. For ex: --work_dir=/mnt/data/mesos/ On Fri, Jun 27, 2014 at 2:02 PM, Benjamin Mahler benjamin.mah...@gmail.com wrote: I created a ticket to track it for 0.19.1: https://issues.apache.org/jira/browse/MESOS-1551 On Fri, Jun 27, 2014 at 8:26 AM

Re: Mesos language bindings in the wild

2014-07-11 Thread Benjamin Mahler
Naming suggestion, let's call these pure language bindings. Native is overloaded. On Fri, Jul 11, 2014 at 8:40 AM, Niklas Nielsen nik...@mesosphere.io wrote: Embraced language repos is a great path too. +1 for not having to tie automake into the respective language build systems. Ensuring

Re: 0.19.1

2014-07-14 Thread Benjamin Mahler
Will also pull in MESOS-1538 https://issues.apache.org/jira/browse/MESOS-1538 for this. On Wed, Jul 9, 2014 at 11:12 AM, Benjamin Mahler benjamin.mah...@gmail.com wrote: I've added it to the 0.19.1 list since it's trivial and helps those using S3. On Fri, Jul 4, 2014 at 12:52 PM, Tom

[VOTE] Release Apache Mesos 0.19.1 (rc1)

2014-07-14 Thread Benjamin Mahler
Hi all, Please vote on releasing the following candidate as Apache Mesos 0.19.1. 0.19.1 includes the following: Fixes a long standing critical bug in the JNI bindings that can lead to framework unregistration.

[RESULT][VOTE] Release Apache Mesos 0.19.1 (rc1)

2014-07-18 Thread Benjamin Mahler
vinodk...@gmail.com wrote: +1 (binding) Tested on OSX Mavericks w/ gcc-4.8 On Mon, Jul 14, 2014 at 2:35 PM, Timothy Chen tnac...@gmail.com wrote: +1 (non-binding). Tim On Mon, Jul 14, 2014 at 2:32 PM, Benjamin Mahler benjamin.mah...@gmail.com wrote: Hi all

Re: Mesos 0.19 registrar upgrade

2014-07-22 Thread Benjamin Mahler
At the current time, you need an odd number of masters as there is an assumption built into the replicated log that the number of masters = 2*quorum - 1. This assumption is present when bootstrapping the log from no data. To recover from this, you need to run an odd number of masters, and set your

Re: Breaking JSON changes (0.19.1)

2014-07-24 Thread Benjamin Mahler
Hey Whitney, sorry to hear you were bit by this! I just realized that I misclassified MESOS-1406, the problem ran deeper than I initially thought. In particular, all of the boolean JSON values that were being exported implicitly changed from 1/0 to true/false due to this change: commit
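A client that has to talk to masters on both sides of this change could accept either encoding. The sketch below is one way to do that; the field name "active" in the sample payloads is used purely for illustration and is not taken from the message.

```python
import json


def as_bool(value):
    """Accept both the old 1/0 and the new true/false JSON encodings."""
    if isinstance(value, bool):
        return value
    if isinstance(value, int) and value in (0, 1):
        return bool(value)
    raise ValueError(f"not a boolean-like JSON value: {value!r}")


# Hypothetical payloads showing the two encodings.
old_style = json.loads('{"active": 1}')
new_style = json.loads('{"active": true}')
```

Rejecting anything other than 0, 1, true, or false keeps a genuinely numeric field from being silently treated as a flag.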

Re: why does mesos require resolving all zookeeper hostnames?

2014-07-29 Thread Benjamin Mahler
Thanks for bringing this up! This is part of the ZK C library. We have seen failing slaves with sporadic DNS lookup failures in our clusters. After speaking to a ZK expert, I believe one of the things going into 3.5.0 is the ability to only need to resolve one of the zk hosts correctly, as you

Re: Python bindings are changing!

2014-08-04 Thread Benjamin Mahler
It might work to use 0.19 with a 0.20 mesos (or vice versa), but there be dragons =) Is there a deprecation cycle? How should folks be upgrading Python schedulers and executors to 0.20.0 if they are not statically bundling libmesos? Is there an upgrade order required? We will need to

Re: Getting sandbox data from Web UI

2014-08-13 Thread Benjamin Mahler
Just to confirm I understand correctly: You have a framework, it needs to be a framework because it launches tasks in the cluster. As part of this framework, you need to look into all of the sandboxes of all currently running mesos tasks. This includes those tasks that don't belong to your

Re: mesos scheduling

2014-08-18 Thread Benjamin Mahler
Mesos also provides the ability to reserve resources, if you need guarantees about the resources available to a particular framework. For now, resources can be reserved at the per-slave level and they will *only* be offered to the role that has them reserved. On Mon, Aug 18, 2014 at 2:13 AM,

Design Review: Maintenance Primitives

2014-08-25 Thread Benjamin Mahler
Hi all, I wanted to take a moment to thank Alexandra Sava, who completed her OPW internship this past week. We worked together in the second half of her internship to create a design document for maintenance primitives in Mesos (the original ticket is MESOS-1474

Re: Design Review: Maintenance Primitives

2014-08-26 Thread Benjamin Mahler
even when the deadline has passed if tasks are still running? This is not explicit in the document and we want to make sure operators have the information about this and could avoid unfortunate rolling restarts. On Aug 25, 2014 9:25 PM, Benjamin Mahler benjamin.mah...@gmail.com wrote: Hi all

Re: OOM not always detected by Mesos Slave

2014-09-02 Thread Benjamin Mahler
Looks like you're using the JVM, can you set all of your JVM flags to limit the memory consumption? This would favor an OutOfMemoryError instead of OOMing the cgroup. On Thu, Aug 28, 2014 at 5:51 AM, Whitney Sorenson wsoren...@hubspot.com wrote: Recently, I've seen at least one case where a

Re: Design Review: Maintenance Primitives

2014-09-02 Thread Benjamin Mahler
a link between a previous offerID and this but then I saw the Resource field. Wouldn't it be clearer to have InverseOfferID? Thanks for the work! I really want to have these primitives. On Aug 26, 2014 10:59 PM, Benjamin Mahler benjamin.mah...@gmail.com wrote: You're right, we don't account

Re: OOM not always detected by Mesos Slave

2014-09-12 Thread Benjamin Mahler
to the total usage of memory for the task. So you can't have the same amount of memory for the task as you pass to java, -Xmx parameter. On 2 September 2014 20:43, Benjamin Mahler benjamin.mah...@gmail.com wrote: Looks like you're using the JVM, can you set all of your JVM flags to limit
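The point about -Xmx can be illustrated with a small calculation: the heap limit is set below the task's memory so that non-heap JVM usage still fits inside the cgroup. The 25% headroom default below is an arbitrary assumption for the example, not a recommendation from the thread, and the function name is hypothetical.

```python
def jvm_heap_flag(task_mem_mb: int, headroom_fraction: float = 0.25) -> str:
    """Build an -Xmx flag that leaves part of the task memory for non-heap use.

    `headroom_fraction` is an assumed value; tune it for the workload.
    """
    if not 0 <= headroom_fraction < 1:
        raise ValueError("headroom_fraction must be in [0, 1)")
    heap_mb = int(task_mem_mb * (1 - headroom_fraction))
    return f"-Xmx{heap_mb}m"


# A task given 1024 MB gets a 768 MB heap with the default headroom.
flag = jvm_heap_flag(1024)
```

With the heap capped this way, the JVM is more likely to raise OutOfMemoryError before the cgroup limit is reached.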

Re: TASK_LOST on storm task

2014-09-17 Thread Benjamin Mahler
Can you show us the slave log and more of the master log? There should be a TASK_LOST somewhere within them. On Wed, Sep 17, 2014 at 10:43 AM, Luyi Wang wangluyi1...@gmail.com wrote: Has anyone experienced TASK_LOST status for storm tasks on mesos? I checked the stderr. Everything seems

Re: TASK_LOST on storm task

2014-09-17 Thread Benjamin Mahler
-185627-326871232-5050-8074-2/frameworks/20140915-230424-326871232-5050-13574-' for gc 6.9960553185days in the future -Luyi. On Wed, Sep 17, 2014 at 12:56 PM, Benjamin Mahler benjamin.mah...@gmail.com wrote: Can you show us the slave log and more of the master log

Re: Task Reconciliation [MESOS-1453]

2014-09-29 Thread Benjamin Mahler
We want reconciliation to be a process that eventually terminates. In <= 0.19.0, the following two cases are conflated through no update being sent: (1) No state difference. (2) Master temporarily cannot reply / dropped message. As a result, a scheduler cannot determine when it is finished

Re: Task Reconciliation [MESOS-1453]

2014-10-01 Thread Benjamin Mahler
that the taskStatus objects sent as a result of calling reconciliation themselves are the exact same object as was (possibly) sent before, or would they have updated timestamps? On Tue, Sep 30, 2014 at 4:14 AM, Benjamin Mahler benjamin.mah...@gmail.com wrote: We want reconciliation to be a process

Re: Task Reconciliation [MESOS-1453]

2014-10-07 Thread Benjamin Mahler
also slaveId?) without knowing the current taskState or sending other metadata? Thanks, -Whitney On Wed, Oct 1, 2014 at 8:33 PM, Benjamin Mahler benjamin.mah...@gmail.com wrote: Updated timestamps will be generated. On Tue, Sep 30, 2014 at 6:21 AM, Whitney Sorenson wsoren...@hubspot.com

Re: Make Check error for 0.20.1

2014-10-07 Thread Benjamin Mahler
Can you link to the full logs from make check? P.S. This ended up in my spam folder. On Tue, Oct 7, 2014 at 10:47 AM, Sammy Steele sammy_ste...@stanford.edu wrote: Hi, I am trying to install the newest version of Mesos using Ubuntu 14.04 and libsasl2-dev 2.1.25.dfsg1-17build1. When I run

Re: Task Reconciliation [MESOS-1453]

2014-10-13 Thread Benjamin Mahler
But as a note, I would recommend against relying on updated timestamps, because the clocks may not be synchronized in the system. On Wed, Oct 1, 2014 at 12:33 PM, Benjamin Mahler benjamin.mah...@gmail.com wrote: Updated timestamps will be generated. On Tue, Sep 30, 2014 at 6:21 AM, Whitney

Reconciliation Document

2014-10-15 Thread Benjamin Mahler
Hi all, I've sent a review out for a document describing reconciliation, you can see the draft here: https://gist.github.com/bmahler/18409fc4f052df43f403 Would love to gather high level feedback on it from framework developers. Feel free to reply here, or on the review:

Re: Reconciliation Document

2014-10-21 Thread Benjamin Mahler
status update was ACKed, but the scheduler failed over before this information could be persisted. What task status (if any) does Mesos respond with? -- Connor Doyle http://mesosphere.io On Oct 15, 2014, at 14:05, Benjamin Mahler benjamin.mah...@gmail.com wrote: Hi all, I've sent

Re: Reconciliation Document

2014-10-21 Thread Benjamin Mahler
Inline. On Thu, Oct 16, 2014 at 7:43 PM, Sharma Podila spod...@netflix.com wrote: Response inline, below. On Thu, Oct 16, 2014 at 5:41 PM, Benjamin Mahler benjamin.mah...@gmail.com wrote: Thanks for the thoughtful questions, I will take these into account in the document. Addressing

Re: Reconciliation Document

2014-11-03 Thread Benjamin Mahler
for its task's either. On Mon, Nov 3, 2014 at 6:15 PM, Benjamin Mahler benjamin.mah...@gmail.com wrote: I'm pretty confused by what's occurring in your scheduler, let's start by looking at a particular task: https://gist.github.com/bmahler/6f6bdb0385ec245b2346 You received an update from

Re: A problem with resource offers

2014-11-06 Thread Benjamin Mahler
Which version of the master are you using and do you have the logs? The fact that no offers were coming back sounds like a bug! As for using O1 after a disconnection, all offers are invalid once a disconnection occurs. The scheduler driver does not automatically rescind offers upon disconnection,

Re: Problems of running mesos-0.20.0 with zookeeper

2014-11-07 Thread Benjamin Mahler
In addition to what Dick said, you need to make sure that you have a quorum of masters *online* in order for a master to recover correctly. This means you'll want to run the master under a tool (e.g. Monit) that restarts it promptly upon failure. You'll want to do this for the slaves as well. On

Re: Design Review: Maintenance Primitives

2014-11-07 Thread Benjamin Mahler
. There is no longer a split between the planned schedule and the actual draining, also useful for persistent frameworks. The updated high level design is here: https://docs.google.com/document/d/16k0lVwpSGVOyxPSyXKmGC-gbNmRlisNEe4p-fAUSojk/edit?usp=sharing On Mon, Aug 25, 2014 at 12:24 PM, Benjamin Mahler

Re: [VOTE] Release Apache Mesos 0.21.0 (rc2)

2014-11-11 Thread Benjamin Mahler
+till FYI for Mac OS X 10.10. Had to install subversion and apr: $ brew tap homebrew/versions $ brew tap homebrew/apache $ brew install subversion apr There is a flaky test in libprocess: [ RUN ] IO.Write make[5]: *** [check-local] Broken pipe: 13 Filed:

Re: [VOTE] Release Apache Mesos 0.21.0 (rc2)

2014-11-11 Thread Benjamin Mahler
Here's the patch for 10.10 that we should cherry-pick: commit 8adb36e3f72b575dea53013e7e790cb6c7957ae0 Author: Benjamin Mahler bmah...@twitter.com Date: Tue Nov 11 15:11:32 2014 -0800 Added missing ZooKeeper patch file to the Makefile. I'll mark https://issues.apache.org/jira/browse/MESOS

Re: OOM not always detected by Mesos Slave

2014-11-12 Thread Benjamin Mahler
://gist.github.com/wsorenson/d2e49b96e84af86c9492 On Fri, Sep 12, 2014 at 9:12 PM, Benjamin Mahler benjamin.mah...@gmail.com wrote: +Ian Sorry for the delay, when your cgroup OOMs a few things will occur: (1) The kernel will notify mesos-slave about the OOM event. (2) The kernel's OOM killer will pick

Re: [VOTE] Release Apache Mesos 0.21.0 (rc3)

2014-11-13 Thread Benjamin Mahler
+1 Thanks Ian! Compiles on Mac OS X 10.10 now, all tests pass. On Thu, Nov 13, 2014 at 1:07 PM, Cosmin Lehene cleh...@adobe.com wrote: Ian, Are there rpms in some public repo for this RC? Thanks, Cosmin From: Ian Downes

Re: Master memory usage

2014-11-20 Thread Benjamin Mahler
It shouldn't be that high, especially with the size of the cluster I see in your stats. Which scheduler(s) are you running, and do they create large TaskInfo objects? Just a hunch, as I do not recall any leaks in 0.19.1. On Tue, Nov 18, 2014 at 1:00 AM, Tom Arnfeld t...@duedil.com wrote: I've

Re: Master memory usage

2014-11-20 Thread Benjamin Mahler
On Thu, Nov 20, 2014 at 10:33 PM, Benjamin Mahler benjamin.mah...@gmail.com wrote: It shouldn't be that high, especially with the size of the cluster I see in your stats. Which scheduler(s) are you running, and do they create large TaskInfo objects? Just a hunch, as I do not recall any

Re: Task Checkpointing with Mesos, Marathon and Docker containers

2014-12-01 Thread Benjamin Mahler
Benjamin Mahler benjamin.mah...@gmail.com: I would like to be able to shutdown a mesos-slave for maintenance without altering the current tasks. What are you trying to do? If your maintenance operation does not affect the tasks, why do you need to stop the slave in the first place? On Wed

Re: Implementing an Executor

2014-12-01 Thread Benjamin Mahler
Sorry this is a bit of a tangent to the thread: For ex, if the executor crashes Mesos would get a TASK_LOST while the container might still be running. Mesos should destroy the container when the executor exits, are you seeing otherwise? So we are doing something similar to Aurora's GC

Re: Implementing an Executor

2014-12-01 Thread Benjamin Mahler
in a certain way, for example attach elastic network interfaces to containers, mount ZFS volumes etc, and that's why we had to write a custom executor in the first place. On Mon, Dec 1, 2014 at 1:01 PM, Benjamin Mahler benjamin.mah...@gmail.com wrote: Sorry this is a bit of a tangent

Re: Timeline for 0.22.0?

2014-12-02 Thread Benjamin Mahler
If anyone is interested in driving a 0.21.1 bug fix release, we could get bug fixes out more quickly than waiting for 0.22.0. On Tue, Dec 2, 2014 at 2:28 PM, Tim Chen t...@mesosphere.io wrote: Hi Scott, The patch for MESOS-1925 is already merged into master, so you should be able to just

Re: Authorization via a Module

2014-12-16 Thread Benjamin Mahler
Sorry for the delay, yes it's possible but I don't think anyone has looked at modularizing it. Hopefully others will speak up otherwise! We have only one authorizer implementation in the code (a LocalAuthorizer https://github.com/apache/mesos/blob/0.21.0/src/authorizer/authorizer.hpp#L64 that

Re: Python bindings are changing!

2015-02-02 Thread Benjamin Mahler
Hey Thomas, Could you share the scripts you're using to publish to pypi? It's not part of the release process as of yet: http://mesos.apache.org/documentation/latest/release-guide/ The 0.21.1 eggs were never published: https://issues.apache.org/jira/browse/MESOS-2310 On Thu, Aug 14, 2014 at

Re: Mesos 0.21.1 startup errors on OSX

2015-01-16 Thread Benjamin Mahler
It doesn't look like the exception here is being caught correctly: https://github.com/apache/mesos/blob/0.21.1/3rdparty/libprocess/3rdparty/stout/include/stout/numify.hpp#L32 Just to be sure, can you show your compilation output, in particular, that you're not seeing -fno-exceptions as a compiler

Upcoming change to the Scheduler API

2015-02-13 Thread Benjamin Mahler
Hi all, As part of https://issues.apache.org/jira/browse/MESOS-2347, there is a scalability concern with the reconciliation API. Performing an implicit reconciliation results in a status update being sent for each task in the cluster. For large clusters in the tens of thousands of slaves, this

Re: [VOTE] Release Apache Mesos 0.21.1 (rc2)

2015-02-16 Thread Benjamin Mahler
Hey Tim, Can you release 0.21.1 on JIRA with the correct date? https://issues.apache.org/jira/plugins/servlet/project-config/MESOS/versions Thanks! Ben On Fri, Jan 2, 2015 at 12:30 PM, Dave Lester daveles...@gmail.com wrote: Any of the recent blog posts are good templates, located in the

Re: Upcoming change to the Scheduler API

2015-02-16 Thread Benjamin Mahler
' is set. On Fri, Feb 13, 2015 at 11:46 AM, Benjamin Mahler benjamin.mah...@gmail.com wrote: Ok, don't see any issues with that approach, since executors will have to create their own UUIDs anyway with pure language bindings. Will go back to just updating TaskStatus to keep the API compatible

Re: Resize Mesos master quorum

2015-01-10 Thread Benjamin Mahler
I posted a review of the proper documentation for this: https://reviews.apache.org/r/29796/ On Tue, Jan 6, 2015 at 5:38 PM, Benjamin Mahler benjamin.mah...@gmail.com wrote: Sorry for the delay, operational documentation for the replicated log has been badly needed, I'll get some basic stuff

Re: Resize Mesos master quorum

2015-01-06 Thread Benjamin Mahler
Sorry for the delay, operational documentation for the replicated log has been badly needed, I'll get some basic stuff up on the website by next week. In the interim, if you're using *--registry_strict=false (the default)*, you can simply stop the original N masters, rm -rf all the data in

Re: Upcoming change to the Scheduler API

2015-02-13 Thread Benjamin Mahler
:35 AM, Kevin Sweeney kevi...@apache.org wrote: Regarding the backwards-compatibility concern, would it make sense to add a TaskStatusID field to the existing TaskStatus message instead of changing the Scheduler signature? On Friday, February 13, 2015, Benjamin Mahler benjamin.mah...@gmail.com

Re: Upcoming change to the Scheduler API

2015-02-13 Thread Benjamin Mahler
Ok, don't see any issues with that approach, since executors will have to create their own UUIDs anyway with pure language bindings. Will go back to just updating TaskStatus to keep the API compatible. Thanks Kevin! On Fri, Feb 13, 2015 at 10:33 AM, Benjamin Mahler benjamin.mah...@gmail.com

Re: Mesos resource allocation

2015-01-05 Thread Benjamin Mahler
Did you find what you were looking for? At a quick glance, this kind of configuration is left to the framework (Spark). Mesos doesn't make any decisions with respect to how the resources offered to Spark are being used. On Tue, Dec 23, 2014 at 5:44 AM, Josh Devins j...@soundcloud.com wrote:

Re: Mesos resource sharing between frameworks

2015-03-19 Thread Benjamin Mahler
Guaranteeing fairness in such situations requires pre-emption of running tasks/executors, which is not yet provided in mesos. For now, you can try reserving a minimum amount of resources for each framework, note however that this may reduce your efficiency if you over-estimate the minimum

Re: Log rotation

2015-02-25 Thread Benjamin Mahler
that, etc... our plan is to log all tasks remotely to a fluentd or logstash agent on each slave. On 25 February 2015 at 18:05, Benjamin Mahler benjamin.mah...@gmail.com wrote: +drob 0.22.0 introduces a flag on the master / slave called --external_log_file. This allows you to point

Re: Help us review #MesosCon 2015 proposals

2015-02-20 Thread Benjamin Mahler
Great to see so many proposals! Is it intentional that we have to review them in small subsets? It's hard to tell what to consider as an Average proposal when you can only see a small subset at a time. Just curious on the reasoning behind that. On Wed, Feb 18, 2015 at 2:44 PM, Dave Lester

Re: libsubversion-1 is required for mesos to build.

2015-01-29 Thread Benjamin Mahler
$ sudo apt-get install libsvn-dev From here: http://mesos.apache.org/gettingstarted/ On Thu, Jan 29, 2015 at 1:35 PM, Dan Dong dongda...@gmail.com wrote: Hi, When I tried to build mesos-0.21.0 on Ubuntu-14.04, I get this error:

Re: SEGV in 'make check'

2015-04-30 Thread Benjamin Mahler
This message can be a bit misleading, do you have perf installed? On Thu, Apr 30, 2015 at 11:18 AM, Brian Topping brian.topp...@gmail.com wrote: Getting closer. After finding http://garyzhu.net/notes/CentOS7-Systemd-Mesos-Marathon.html, I set up another new CentOS 7 machine, got a lot further

Re: Mesos 0.21.0 release page correction

2015-05-04 Thread Benjamin Mahler
Fixed. On Mon, May 4, 2015 at 12:56 AM, Ryan Thomas r.n.tho...@gmail.com wrote: Whilst this is a bit old, the docs here http://mesos.apache.org/blog/mesos-0-21-0-released/ for 0.21.0 link to the wrong ticket for the shared filesystem isolator. Cheers, ryan

Re: Brigade :: Powered By Mesos

2015-05-11 Thread Benjamin Mahler
Glad to hear it, are you able to share a little more with the list about how you're using mesos? :) Sent from my iPhone On May 7, 2015, at 10:26 AM, John Miller john.mil...@brigade.com wrote: We're utilizing Mesos within our organization for multiple projects. Anyone with access please

Re: fatal error: ac_nonexistent.h: No such file or directory in build from source

2015-05-19 Thread Benjamin Mahler
It sounds like that is expected, since it says nonexistent: http://lists.gnu.org/archive/html/autoconf/2011-03/msg9.html On Sat, May 16, 2015 at 2:38 PM, Joerg Maurer dev-ma...@gmx.net wrote: Hello, I am building from source git clone --branch 0.22.1 --depth 1

Re: Threading model of mesos API (C++)

2015-06-09 Thread Benjamin Mahler
If that's really what you're seeing, it is a bug and a very surprising one, so please provide evidence :) See the detailed description here: http://mesos.apache.org/api/latest/c++/classmesos_1_1Scheduler.html The scheduler driver will serially invoke methods on your Scheduler implementation.
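
To make "serially invoke" concrete, here is a minimal sketch in Python. It is not the Mesos API: `SerialDriver`, `CountingScheduler`, `deliver`, `close` and `run_demo` are hypothetical names invented for illustration. It assumes a driver that funnels every event through one dispatch thread, which is one common way to guarantee that callbacks never overlap.

```python
import queue
import threading


class SerialDriver:
    """Hypothetical stand-in for a scheduler driver; not the Mesos API."""

    def __init__(self, scheduler):
        self._scheduler = scheduler
        self._events = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def deliver(self, method, *args):
        # May be called from any thread; the event is only queued here.
        self._events.put((method, args))

    def close(self):
        # A None sentinel tells the dispatch thread to exit once it is reached.
        self._events.put(None)
        self._thread.join()

    def _run(self):
        while True:
            event = self._events.get()
            if event is None:
                return
            method, args = event
            # Only this thread invokes callbacks, so they never overlap.
            getattr(self._scheduler, method)(*args)


class CountingScheduler:
    """Records the largest number of callbacks active at the same time."""

    def __init__(self):
        self.active = 0
        self.max_active = 0
        self.seen = []

    def status_update(self, task_id):
        self.active += 1
        self.max_active = max(self.max_active, self.active)
        self.seen.append(task_id)
        self.active -= 1


def run_demo(producers=4, per_producer=50):
    scheduler = CountingScheduler()
    driver = SerialDriver(scheduler)

    def produce(producer_id):
        for i in range(per_producer):
            driver.deliver("status_update", (producer_id, i))

    threads = [threading.Thread(target=produce, args=(p,)) for p in range(producers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    driver.close()  # all events were queued before the sentinel
    return scheduler
```

Several producer threads call `deliver` concurrently, yet the scheduler object needs no locking of its own because only the dispatch thread ever touches it.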

RFC: Framework - Executor Message Passing Optimization Removal

2015-06-23 Thread Benjamin Mahler
The existing Mesos API provides unreliable message passing for framework - executor communication: -- Schedulers can call 'sendFrameworkMessage(executor, slave, data)' on the driver [1]; this sends a message to the executor. This has a best-effort optimization to bypass the master, and send the

Re: Broken link report

2015-06-17 Thread Benjamin Mahler
Should be removed from the website now. On Wed, Jun 17, 2015 at 7:21 AM, Ken Sipe kens...@gmail.com wrote: thanks! https://github.com/apache/mesos/pull/46 On Jun 17, 2015, at 5:22 AM, Brian Candler b.cand...@pobox.com wrote: At

Re: Executor Resource Requirements

2015-06-17 Thread Benjamin Mahler
The offer will contain the running executors on that slave: https://github.com/apache/mesos/blob/0.22.1/include/mesos/mesos.proto#L619 However, there are a few bugs to note if you're planning to use this in any kind of production setup: (1) If an executor exits, or a new one is started, we won't

Re: Setting minimum offer size

2015-07-06 Thread Benjamin Mahler
Interesting to see HTCondor has a defragmentation feature, this kind of thing has come up before for Mesos as well. Specifically, adding Inverse Offers as a generic mechanism for obtaining resources back from a framework unlocks a lot of functionality. The first use case was cluster maintenance.

Re: Framework still active after calling driver.stop

2015-05-19 Thread Benjamin Mahler
This is because stop() is asynchronously processed. A message is sent to the scheduler process and it will eventually send the message to the master. This is why you've noticed that sleeping helps to ensure that this occurs. There is no scheduler driver specific issue for this, but the executor
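
The behaviour described above can be sketched with a toy model in Python. This is not the Mesos driver: `AsyncDriver`, its `join` method, the `unregistered` event and the 50 ms delay are all hypothetical and stand in for "a message is queued and handled later by another thread". The point is only that returning from `stop()` does not mean the request has been processed, and that waiting on an explicit signal is more dependable than sleeping.

```python
import queue
import threading
import time


class AsyncDriver:
    """Hypothetical driver whose stop() is only a queued request."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self.unregistered = threading.Event()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def stop(self):
        # Returns immediately; the request has only been placed in the mailbox.
        self._mailbox.put("stop")

    def join(self, timeout=None):
        # Block until the background thread has handled the stop request.
        return self.unregistered.wait(timeout)

    def _run(self):
        message = self._mailbox.get()
        if message == "stop":
            time.sleep(0.05)  # simulated delay before the message goes out
            self.unregistered.set()


def stop_and_wait():
    driver = AsyncDriver()
    driver.stop()
    # Exiting the program here could lose the request; wait for it instead.
    return driver.join(timeout=5)
```

In this model, a program that exits right after `stop()` races with the worker thread, which is the same effect the sleep in the original report was papering over.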

Re: MesosCon Seattle attendee introduction thread

2015-08-18 Thread Benjamin Mahler
Hey all! I'm an engineer at Twitter and Mesos committer / PMC member. Looking forward to seeing some of the new faces this year and reconnecting with old friends :) I'll be talking about the upcoming maintenance primitives in mesos http://sched.co/35Cu, which is critical for operating

Re: High latency when scheduling and executing many tiny tasks.

2015-07-17 Thread Benjamin Mahler
Currently, recovered resources are not immediately re-offered as you noticed, and the default allocation interval is 1 second. I'd recommend lowering that (e.g. --allocation_interval=50ms), which should improve the second bullet you listed. Although, in your case it would be better to immediately
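
As a rough back-of-the-envelope illustration of why the interval matters, the Python sketch below computes how long freed resources sit idle before the next allocation. It assumes a simplified model that is not taken from the Mesos source: allocations happen at exact multiples of the interval, and nothing is re-offered in between. The function names are hypothetical.

```python
def wait_until_next_allocation_ms(freed_at_ms, interval_ms):
    """Milliseconds from resources being freed until the next allocation tick.

    Assumes ticks fall at exact multiples of interval_ms (a simplification).
    """
    return (-freed_at_ms) % interval_ms


def mean_wait_ms(interval_ms):
    """Average wait over every whole-millisecond offset within one interval."""
    waits = [wait_until_next_allocation_ms(t, interval_ms) for t in range(interval_ms)]
    return sum(waits) / len(waits)
```

Under this model, resources freed 1230 ms in wait 770 ms with a 1000 ms interval but only 20 ms with a 50 ms interval, and the average wait is roughly half the interval in both cases. For many tiny tasks run back to back, that per-task wait adds up.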

Re: High latency when scheduling and executing many tiny tasks.

2015-07-17 Thread Benjamin Mahler
there. I will try to find some logs that provide some insight into the execution times. I am using a command task. I haven't looked into executors yet; I had a hard time finding some examples in my language (Scala). On Fri, Jul 17, 2015 at 2:00 PM, Benjamin Mahler benjamin.mah...@gmail.com
