Re: Set LIBPROCESS_IP for frameworks launched with marathon

2016-06-07 Thread Tom Arnfeld
>>> Maybe it would be a good idea if Mesos was setting env variables named 
>>> AGENT_IP_0, AGENT_IP_1 and so on for every IP interface on the agent, maybe 
>>> AGENT_BIND_IP if bind IP is different than 0.0.0.0

That said, it’d be tricky to always be sure that IP_0 was the one you wanted. 
If the different interfaces are in distinct network subnets, have you 
considered wrapping your framework (e.g. if you’re relying on the ENTRYPOINT of 
the docker image) in a script that simply plucks out the right IP address by 
looking at the interfaces and grepping for the right-looking range?
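Something like this is roughly what I mean, as a sketch only (the subnet 
prefix is an assumption, and the real framework command is whatever your 
ENTRYPOINT would otherwise run):

    #!/usr/bin/env python
    # Hypothetical ENTRYPOINT wrapper: pick the first IPv4 address in the
    # subnet we expect the cluster to route to, export it as LIBPROCESS_IP,
    # then exec the real framework command passed as arguments.
    import os
    import re
    import subprocess
    import sys

    SUBNET_PREFIX = "10.100."  # assumption: the range the cluster routes to

    def routable_ip():
        # Equivalent of "ip -o -4 addr show | grep <prefix>".
        output = subprocess.check_output(["ip", "-o", "-4", "addr", "show"])
        for match in re.finditer(r"inet (\d+\.\d+\.\d+\.\d+)", output.decode()):
            ip = match.group(1)
            if ip.startswith(SUBNET_PREFIX):
                return ip
        raise RuntimeError("no interface in %s*" % SUBNET_PREFIX)

    if __name__ == "__main__":
        os.environ["LIBPROCESS_IP"] = routable_ip()
        # e.g. wrapper.py ./my-framework --flags
        os.execvp(sys.argv[1], sys.argv[1:])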

We did some experiments the other week pertaining to your issue, to see if we 
could find a way of exposing the LIBPROCESS_IP variable that the mesos agent 
provides to the executor (in this case, the docker-executor) with some fun env 
var hacks, but it doesn’t look like any shell expansion happens along the way 
(for good reason, really) so we couldn’t find a way.

Given that you’re using host networking, I’d suggest trying to detect the right 
interface to bind to yourself, on the executor side, and setting LIBPROCESS_IP 
to the result of that logic before spawning the framework. Alternatively you 
could ensure the “public” bind interface of the agent is announced via a 
reverse PTR record (allowing you to do a simple `host $HOST`).
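If the agent’s hostname resolves correctly, the executor-side logic can be as 
small as this; a sketch that assumes $HOST (which the agent exposes to the 
task) resolves to the address that is actually routable from the cluster:

    # Sketch: resolve the agent hostname the task sees as $HOST and hand the
    # result to libprocess before starting the framework.
    import os
    import socket
    import sys

    host = os.environ.get("HOST", socket.getfqdn())
    os.environ["LIBPROCESS_IP"] = socket.gethostbyname(host)
    os.execvp(sys.argv[1], sys.argv[1:])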

Not sure if this helps, I’m a little late to the thread. We essentially have 
the same problem when allowing user devices to connect to the cluster to run 
frameworks via a VPN: their machines have multiple IPs but only one is routable 
correctly from the cluster. A similar grep plus the LIBPROCESS_IP variable does 
the trick there.

> On 7 Jun 2016, at 15:43, Eli Jordan  wrote:
> 
> Currently I have it configured to use host networking
> 
> Thanks
> Eli
> 
> On 7 Jun 2016, at 11:25, Radoslaw Gruchalski wrote:
> 
>> Yes, because that runs in host network. This leads to a question: your 
>> docker task, is it bridge or host network.
>> 
>> -- 
>> Best regards,
>> Rad
>> 
>> 
>> 
>> 
>> On Tue, Jun 7, 2016 at 3:21 AM +0200, "Eli Jordan" wrote:
>> 
>> It's important to note that if you run a task with the command executor 
>> (I.e. Not using docker) LIBPROCESS_IP is defined, along with several other 
>> variables that are not defined in docker.
>> 
>> Thanks
>> Eli
>> 
>> On 7 Jun 2016, at 10:05, Radoslaw Gruchalski wrote:
>> 
>>> I think the problem is that it is not known which agent the task is running 
>>> on until the task is in the running state.
>>> Hence the master can’t pass that as an env variable to the task.
>>> However, I see your point. There is an agent host name available in the 
>>> task as $HOST. Maybe it would be a good idea if Mesos was setting env 
>>> variables named AGENT_IP_0, AGENT_IP_1 and so on for every IP interface on 
>>> the agent, maybe AGENT_BIND_IP if bind IP is different than 0.0.0.0. OTOH, 
>>> I can see how this could be considered a security issue. I am not 
>>> sure what the implications could be.
>>> 
>>> Anybody else care to comment?
>>> 
>>> – 
>>> Best regards,
>>> 
>>> Radek Gruchalski
>>> 
>>> ra...@gruchalski.com 
>>> de.linkedin.com/in/radgruchalski 
>>> 
>>> Confidentiality:
>>> This communication is intended for the above-named person and may be 
>>> confidential and/or legally privileged.
>>> If it has come to you in error you must take no action based on it, nor 
>>> must you copy or show it to anyone; please delete/destroy and inform the 
>>> sender immediately.
>>> 
>>> On June 7, 2016 at 1:42:46 AM, Eli Jordan (elias.k.jor...@gmail.com) wrote:
>>> 
 Thanks Radoslaw. I'm not really set on using host names, I just want a 
 reliable way to start the framework. For the meantime I have gone with a 
 solution similar to what you suggested. We use /etc/default/mesos file to 
 configure mesos, and it has the ip defined, so I just mounted that in the 
 container and read the ip.
 
 I would like to avoid having a dependency on the file system of the 
 agents though. I'm not sure why I can't have the docker executor set the 
 LIBPROCESS_IP variable in the same way the command executor does.
 
 Thanks
 Eli
 
 On 6 Jun 2016, at 21:44, Radoslaw Gruchalski wrote:
 
> Out of curiosity. Why are you insisting on using host names?
> Say you have 1 master and 2 agents with these IPs:
> 
> - mesos-master-0: 10.100.1.10
> - mesos-agent-0: 10.100.1.11
> - mesos-agent-1: 10.100.1.12
> 
> Your problem is that you have no way to obtain an IP address of the agent 
> in the container. Correct?
> One way 

Re: Removing the External Containerizer

2016-04-21 Thread Tom Arnfeld
Hey,

No objections from us here to remove it.

As far as our usage goes, we updated mesos-hadoop and also our own frameworks a 
little while ago (so we can switch to the native docker implementation).

I think it’s for the best to remove it!

Tom.

> On 20 Apr 2016, at 22:02, Kevin Klues  wrote:
> 
> Hello all,
> 
> The 'external' containerizer has been deprecated since August and we
> are now considering removing it permanently before the 0.29 release.
> Are there any objections to this?
> 
> The following JIRA suggests that Hadoop on Mesos was still using the
> External containerizer format.
> https://issues.apache.org/jira/browse/MESOS-3370
> 
> However, it looks like this has been fixed in:
> https://github.com/mesos/hadoop/pull/68
> 
> Is anyone else still using the external containerizer and would like
> to see it persist a bit longer?
> 
> -- 
> ~Kevin



Re: Mesos sometimes not allocating the entire cluster

2016-02-22 Thread Tom Arnfeld
Hi Guangya,

Most of the agents do not have a role, so they use the default wildcard role 
for resources. Also none of the frameworks have a role, therefore they fall 
into the wildcard role too.

Frameworks are being offered resources up to a certain level of fairness but no 
further. The issue appears to be inside the allocator, relating to how it is 
deciding how many resources each framework should get within the role (wildcard 
‘*') in relation to fairness.

We seem to have circumvented the problem by creating two completely new roles 
and putting one framework in each. Neither role is assigned to any agent 
resources, but by doing this we seem to have got around the bug in the 
allocator that’s causing the strange fairness allocations and resulting in no 
offers being sent.
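For reference, the framework side of that workaround is just a role on 
FrameworkInfo at registration time. A rough sketch with the Python bindings 
(the role name and master address are made up, and on 0.23.x the roles also 
need to be listed in the master’s --roles flag):

    # Rough sketch (Python bindings): pin a framework to its own role so the
    # allocator sorts it separately from everything in the wildcard role.
    from mesos.interface import Scheduler, mesos_pb2
    import mesos.native

    class NoopScheduler(Scheduler):
        pass  # resourceOffers / statusUpdate callbacks omitted here

    framework = mesos_pb2.FrameworkInfo()
    framework.user = ""          # let Mesos fill in the current user
    framework.name = "my-framework"
    framework.role = "batch-a"   # defaults to "*" (the wildcard role) if unset

    driver = mesos.native.MesosSchedulerDriver(
        NoopScheduler(), framework, "zk://zk1:2181/mesos")  # example master
    driver.run()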

I’m going to look into defining a reproducible test case for this scheduling 
situation to coax the allocator into behaving this way in a test environment.

Tom.

> On 22 Feb 2016, at 15:39, Guangya Liu <gyliu...@gmail.com> wrote:
> 
> If none of the frameworks has a role, then no framework can consume reserved 
> resources, so I think that at least the frameworks 
> 20160219-164457-67375276-5050-28802-0014 and 
> 20160219-164457-67375276-5050-28802-0015 should have a role.
> 
> Can you please show some detail for the following:
> 1) Master start command or master http endpoint for flags.
> 2) All slave start command or slave http endpoint for flags
> 3) master http endpoint for state 
> 
> Thanks,
> 
> Guangya
> 
> On Mon, Feb 22, 2016 at 10:57 PM, Tom Arnfeld <t...@duedil.com 
> <mailto:t...@duedil.com>> wrote:
> Ah yes sorry my mistake, there are a couple of agents with a dev role and 
> only one or two frameworks connect to the cluster with that role, but not 
> very often. Whether they’re connected or not doesn’t seem to cause any change 
> in allocation behaviour.
> 
> No other agents have roles.
> 
>> 974 2420 I0219 18:08:37.504587 28808 hierarchical.hpp:941] Allocating 
>> ports(*):[3000-5000]; cpus(*):0.5; mem(*):16384; disk(*):51200 on slave 
>> 20160112-174949-84152492-5050-19807-S316 to framework 
>> 20160219-164457-67375276-5050-28802-0014
>> 
>> This agent should have another 9.5 cpus reserved by some role and no 
>> framework is configured using resources from this role, thus the resources 
>> on this role are wasting.  I think that the following agent may also have 
>> some reserved resources configured: 
>> 20160112-174949-84152492-5050-19807-S317, 
>> 20160112-174949-84152492-5050-19807-S322 and even more agents.
> 
> 
> I don’t think that’s correct, this is likely to be an offer for a slave where 
> 9CPUs are currently allocated to an executor.
> 
> I can verify via the agent configuration and HTTP endpoints that most of the 
> agents do not have a role, and none of the frameworks do.
> 
> Tom.
> 
>> On 22 Feb 2016, at 14:09, Guangya Liu <gyliu...@gmail.com 
>> <mailto:gyliu...@gmail.com>> wrote:
>> 
>> Hi Tom,
>> 
>> I think that your cluster should have some role, weight configuration 
>> because I can see there are at least two agent has role with "dev" 
>> configured.
>> 
>> 56 1363 I0219 18:08:26.284010 28810 hierarchical.hpp:1025] Filtered 
>> ports(dev):[3000-5000]; cpus(dev):10; mem(dev):63488; disk(dev):153600 on 
>> slave 20160112-165226-67375276-5050-22401-S300 for framework 
>> 20160219-164457-67375276-5050-28802-0015
>> 57 1364 I0219 18:08:26.284162 28810 hierarchical.hpp:941] Allocating 
>> ports(dev):[3000-5000]; cpus(dev):10; mem(dev):63488; disk(dev):153600 on 
>> slave 20160112-165226-67375276-5050-22401-S300 to framework 
>> 20160219-164457-67375276-5050-28802-0014
>> 58 1365 I0219 18:08:26.286725 28810 hierarchical.hpp:1025] Filtered 
>> ports(dev):[3000-5000]; cpus(dev):10; mem(dev):63488; disk(dev):153600 on 
>> slave 20160112-165226-67375276-5050-22401-S303 for framework 
>> 20160219-164457-67375276-5050-28802-0015
>> 59 1366 I0219 18:08:26.286875 28810 hierarchical.hpp:941] Allocating 
>> ports(dev):[3000-5000]; cpus(dev):10; mem(dev):63488; disk(dev):153600 on 
>> slave 20160112-165226-67375276-5050-22401-S303 to framework 
>> 20160219-164457-67375276-5050-28802-0014
>> 
>> Also I think that the framework 20160219-164457-67375276-5050-28802-0014 and 
>> 20160219-164457-67375276-5050-28802-0015 may have a high weight cause I saw 
>> that framework  20160219-164457-67375276-5050-28802-0014 get 26 agents at 
>> 18:08:26.
>> 
>> Another is that some other agents may also have role configured but no 
>> frameworks are configured with the agent role and this caused some ag

Re: Mesos sometimes not allocating the entire cluster

2016-02-22 Thread Tom Arnfeld
Ah yes sorry my mistake, there are a couple of agents with a dev role and only 
one or two frameworks connect to the cluster with that role, but not very 
often. Whether they’re connected or not doesn’t seem to cause any change in 
allocation behaviour.

No other agents have roles.

> 974 2420 I0219 18:08:37.504587 28808 hierarchical.hpp:941] Allocating 
> ports(*):[3000-5000]; cpus(*):0.5; mem(*):16384; disk(*):51200 on slave 
> 20160112-174949-84152492-5050-19807-S316 to framework 
> 20160219-164457-67375276-5050-28802-0014
> 
> This agent should have another 9.5 cpus reserved by some role, and no 
> framework is configured to use resources from this role, thus the resources in 
> this role are wasted. I think that the following agents may also have some 
> reserved resources configured: 20160112-174949-84152492-5050-19807-S317, 
> 20160112-174949-84152492-5050-19807-S322 and even more agents.

I don’t think that’s correct, this is likely to be an offer for a slave where 
9CPUs are currently allocated to an executor.

I can verify via the agent configuration and HTTP endpoints that most of the 
agents do not have a role, and none of the frameworks do.

Tom.

> On 22 Feb 2016, at 14:09, Guangya Liu <gyliu...@gmail.com> wrote:
> 
> Hi Tom,
> 
> I think that your cluster has some role and weight configuration, because 
> I can see there are at least two agents with the "dev" role configured.
> 
> 56 1363 I0219 18:08:26.284010 28810 hierarchical.hpp:1025] Filtered 
> ports(dev):[3000-5000]; cpus(dev):10; mem(dev):63488; disk(dev):153600 on 
> slave 20160112-165226-67375276-5050-22401-S300 for framework 
> 20160219-164457-67375276-5050-28802-0015
> 57 1364 I0219 18:08:26.284162 28810 hierarchical.hpp:941] Allocating 
> ports(dev):[3000-5000]; cpus(dev):10; mem(dev):63488; disk(dev):153600 on 
> slave 20160112-165226-67375276-5050-22401-S300 to framework 
> 20160219-164457-67375276-5050-28802-0014
> 58 1365 I0219 18:08:26.286725 28810 hierarchical.hpp:1025] Filtered 
> ports(dev):[3000-5000]; cpus(dev):10; mem(dev):63488; disk(dev):153600 on 
> slave 20160112-165226-67375276-5050-22401-S303 for framework 
> 20160219-164457-67375276-5050-28802-0015
> 59 1366 I0219 18:08:26.286875 28810 hierarchical.hpp:941] Allocating 
> ports(dev):[3000-5000]; cpus(dev):10; mem(dev):63488; disk(dev):153600 on 
> slave 20160112-165226-67375276-5050-22401-S303 to framework 
> 20160219-164457-67375276-5050-28802-0014
> 
> Also I think that the frameworks 20160219-164457-67375276-5050-28802-0014 and 
> 20160219-164457-67375276-5050-28802-0015 may have a high weight, because I saw 
> that framework 20160219-164457-67375276-5050-28802-0014 got 26 agents at 
> 18:08:26.
> 
> Another possibility is that some other agents have a role configured but no 
> frameworks are configured with that role, which leaves those agents' statically 
> reserved resources unable to be allocated.
> 
> I searched 20160112-174949-84152492-5050-19807-S316 in the log and found that 
> it was allocating the following resources to frameworks:
> 
> 974 2420 I0219 18:08:37.504587 28808 hierarchical.hpp:941] Allocating 
> ports(*):[3000-5000]; cpus(*):0.5; mem(*):16384; disk(*):51200 on slave 
> 20160112-174949-84152492-5050-19807-S316 to framework 
> 20160219-164457-67375276-5050-28802-0014
> 
> This agent should have another 9.5 cpus reserved by some role, and no 
> framework is configured to use resources from this role, thus the resources in 
> this role are wasted. I think that the following agents may also have some 
> reserved resources configured: 20160112-174949-84152492-5050-19807-S317, 
> 20160112-174949-84152492-5050-19807-S322 and even more agents.
>  
> So I would suggest that you check the master and each slave start command to 
> see how the roles are configured. You can also check this via the command 
> `curl "http://master-ip:5050/master/state.json" 2>/dev/null | jq .` (note: 
> there is a dot at the end of the command) to get the status of all slave 
> resources: reserved, used, total etc.
> 
> Thanks,
> 
> Guangya
> 
> 
> On Mon, Feb 22, 2016 at 5:16 PM, Tom Arnfeld <t...@duedil.com 
> <mailto:t...@duedil.com>> wrote:
> No roles, no reservations.
> 
> We're using the default filter options with all frameworks and default 
> allocation interval.
> 
> On 21 Feb 2016, at 08:10, Guangya Liu <gyliu...@gmail.com 
> <mailto:gyliu...@gmail.com>> wrote:
> 
>> Hi Tom,
>> 
>> I traced the agent "20160112-165226-67375276-5050-22401-S199" and found 
>> that it keeps being declined by many frameworks: once a framework got it, the 
>> framework will decline it immedi

Re: Mesos sometimes not allocating the entire cluster

2016-02-18 Thread Tom Arnfeld
Hi Ben,

I've only just seen your email! Really appreciate the reply, that's
certainly an interesting bug and we'll try that patch and see how we get on.

Cheers,

Tom.

On 29 January 2016 at 19:54, Benjamin Mahler <bmah...@apache.org> wrote:

> Hi Tom,
>
> I suspect you may be tripping the following issue:
> https://issues.apache.org/jira/browse/MESOS-4302
>
> Please have a read through this and see if it applies here. You may also
> be able to apply the fix to your cluster to see if that helps things.
>
> Ben
>
> On Wed, Jan 20, 2016 at 10:19 AM, Tom Arnfeld <t...@duedil.com> wrote:
>
>> Hey,
>>
>> I've noticed some interesting behaviour recently when we have lots of
>> different frameworks connected to our Mesos cluster at once, all using a
>> variety of different shares. Some of the frameworks don't get offered more
>> resources (for long periods of time, hours even) leaving the cluster under
>> utilised.
>>
>> Here's an example state where we see this happen..
>>
>> Framework 1 - 13% (user A)
>> Framework 2 - 22% (user B)
>> Framework 3 - 4% (user C)
>> Framework 4 - 0.5% (user C)
>> Framework 5 - 1% (user C)
>> Framework 6 - 1% (user C)
>> Framework 7 - 1% (user C)
>> Framework 8 - 0.8% (user C)
>> Framework 9 - 11% (user D)
>> Framework 10 - 7% (user C)
>> Framework 11 - 1% (user C)
>> Framework 12 - 1% (user C)
>> Framework 13 - 6% (user E)
>>
>> In this example, there's another ~30% of the cluster that is unallocated,
>> and it stays like this for a significant amount of time until something
>> changes, perhaps another user joins and allocates the rest, or chunks of
>> this spare resource are offered to some of the frameworks, but not all of
>> them.
>>
>> I had always assumed that when lots of frameworks were involved,
>> eventually the frameworks that would keep accepting resources indefinitely
>> would consume the remaining resource, as every other framework had rejected
>> the offers.
>>
>> Could someone elaborate a little on how the DRF allocator / sorter
>> handles this situation, is this likely to be related to the different users
>> being used? Is there a way to mitigate this?
>>
>> We're running version 0.23.1.
>>
>> Cheers,
>>
>> Tom.
>>
>
>


Re: Mesos sometimes not allocating the entire cluster

2016-01-22 Thread Tom Arnfeld
I can’t send the entire log as there’s a lot of activity on the cluster all the 
time. Is there anything in particular you’re looking for?

> On 22 Jan 2016, at 12:46, Klaus Ma <klaus1982...@gmail.com> wrote:
> 
> Can you share the whole log of the master? It'll be helpful :).
> 
> 
> Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer 
> Platform OpenSource Technology, STG, IBM GCG 
> +86-10-8245 4084 | klaus1982...@gmail.com <mailto:klaus1982...@gmail.com> | 
> http://k82.me <http://k82.me/>
> On Thu, Jan 21, 2016 at 11:57 PM, Tom Arnfeld <t...@duedil.com 
> <mailto:t...@duedil.com>> wrote:
> Guangya - Nope, there's no outstanding offers for any frameworks, the ones 
> that are getting offers are responding properly.
> 
> Klaus - This was just a sample of logs for a single agent, the cluster has at 
>  least ~40 agents at any one time.
> 
> On 21 January 2016 at 15:20, Guangya Liu <gyliu...@gmail.com 
> <mailto:gyliu...@gmail.com>> wrote:
> Can you please check whether there are some outstanding offers in the cluster 
> that have not been accepted by any framework? You can check this via the 
> /master/state.json endpoint.
> 
> If there are some outstanding offers, you can start the master with an 
> offer_timeout flag to let the master rescind offers that are not accepted by 
> a framework.
> 
> Cited from https://github.com/apache/mesos/blob/master/docs/configuration.md 
> <https://github.com/apache/mesos/blob/master/docs/configuration.md>
> 
> --offer_timeout=VALUE Duration of time before an offer is rescinded from a 
> framework.
> This helps fairness when running frameworks that hold on to offers, or 
> frameworks that accidentally drop offers.
> 
> 
> Thanks,
> 
> Guangya
> 
> On Thu, Jan 21, 2016 at 9:44 PM, Tom Arnfeld <t...@duedil.com 
> <mailto:t...@duedil.com>> wrote:
> Hi Klaus,
> 
> Sorry I think I explained this badly, these are the logs for one slave 
> (that's empty) and we can see that it is making offers to some frameworks. In 
> this instance, the Hadoop framework (and others) are not among those getting 
> any offers, they get offered nothing. The allocator is deciding to send 
> offers in a loop to a certain set of frameworks, starving others.
> 
> On 21 January 2016 at 13:17, Klaus Ma <klaus1982...@gmail.com 
> <mailto:klaus1982...@gmail.com>> wrote:
> Yes, it seems the Hadoop framework did not consume all offered resources: if 
> a framework launches a task (1 CPU) on an offer (10 CPUs), the other 9 CPUs 
> will be returned to the master (recoverResources).
> 
> 
> Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer 
> Platform OpenSource Technology, STG, IBM GCG 
> +86-10-8245 4084 <tel:%2B86-10-8245%204084> | klaus1982...@gmail.com 
> <mailto:klaus1982...@gmail.com> | http://k82.me <http://k82.me/>
> On Thu, Jan 21, 2016 at 6:46 PM, Tom Arnfeld <t...@duedil.com 
> <mailto:t...@duedil.com>> wrote:
> Thanks everyone!
> 
> Stephan - There's a couple of useful points there, will definitely give it a 
> read.
> 
> Klaus - Thanks, we're running a bunch of different frameworks, in that list 
> there's Hadoop MRv1, Apache Spark, Marathon and a couple of home grown 
> frameworks we have. In this particular case the Hadoop framework is the major 
> concern, as it's designed to continually accept offers until it has enough 
> slots it needs. With the example I gave above, we observe that the master is 
> never sending any sizeable offers to some of these frameworks (the ones with 
> the larger shares), which is where my confusion stems from.
> 
> I've attached a snippet of our active master logs which show the activity for 
> a single slave (which has no active executors). We can see that it's cycling 
> through sending and recovering declined offers from a selection of different 
> frameworks (in order) but I can say that not all of the frameworks are 
> receiving these offers, in this case that's the Hadoop framework.
> 
> 
> On 21 January 2016 at 00:26, Klaus Ma <klaus1982...@gmail.com 
> <mailto:klaus1982...@gmail.com>> wrote:
> Hi Tom,
> 
> Which framework are you using, e.g. Swarm, Marathon or something else? and 
> which language package are you using?
> 
> DRF will sort role/framework by allocation ratio, and offer all "available" 
> resources by slave; but if the resources it too small (< 0.1CPU) or the 
> resources was reject/declined by framework, the resources will not offer it 
> until filter timeout. For example, in Swarm 1.0, the default filter timeout 
> 5s (because of go scheduler API); so here is case that may impact the 
> utilisation: the Swarm got one slav

Re: Mesos sometimes not allocating the entire cluster

2016-01-21 Thread Tom Arnfeld
Thanks everyone!

Stephan - There's a couple of useful points there, will definitely give it
a read.

Klaus - Thanks, we're running a bunch of different frameworks, in that list
there's Hadoop MRv1, Apache Spark, Marathon and a couple of home grown
frameworks we have. In this particular case the Hadoop framework is the
major concern, as it's designed to continually accept offers until it has
enough slots it needs. With the example I gave above, we observe that the
master is never sending any sizeable offers to some of these frameworks
(the ones with the larger shares), which is where my confusion stems from.

I've attached a snippet of our active master logs which show the activity
for a single slave (which has no active executors). We can see that it's
cycling through sending and recovering declined offers from a selection of
different frameworks (in order) but I can say that not all of the
frameworks are receiving these offers, in this case that's the Hadoop
framework.


On 21 January 2016 at 00:26, Klaus Ma <klaus1982...@gmail.com> wrote:

> Hi Tom,
>
> Which framework are you using, e.g. Swarm, Marathon or something else? and
> which language package are you using?
>
> DRF will sort roles/frameworks by allocation ratio, and offer all
> "available" resources by slave; but if the resources are too small (<
> 0.1 CPU) or the resources were rejected/declined by the framework, they
> will not be offered again until the filter timeout. For example, in Swarm 1.0,
> the default filter timeout is 5s (because of the Go scheduler API); so here is
> a case that may impact utilisation: Swarm got one slave with 16 CPUs, but
> only launched one container with 1 CPU; the other 15 CPUs were returned
> to the master and not re-offered until the filter timeout (5s).
> I had pull a request to make Swarm's parameters configurable, refer to
> https://github.com/docker/swarm/pull/1585. I think you can check this
> case by master log.
>
> If any comments, please let me know.
>
> 
> Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
> Platform OpenSource Technology, STG, IBM GCG
> +86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me
>
> On Thu, Jan 21, 2016 at 2:19 AM, Tom Arnfeld <t...@duedil.com> wrote:
>
>> Hey,
>>
>> I've noticed some interesting behaviour recently when we have lots of
>> different frameworks connected to our Mesos cluster at once, all using a
>> variety of different shares. Some of the frameworks don't get offered more
>> resources (for long periods of time, hours even) leaving the cluster under
>> utilised.
>>
>> Here's an example state where we see this happen..
>>
>> Framework 1 - 13% (user A)
>> Framework 2 - 22% (user B)
>> Framework 3 - 4% (user C)
>> Framework 4 - 0.5% (user C)
>> Framework 5 - 1% (user C)
>> Framework 6 - 1% (user C)
>> Framework 7 - 1% (user C)
>> Framework 8 - 0.8% (user C)
>> Framework 9 - 11% (user D)
>> Framework 10 - 7% (user C)
>> Framework 11 - 1% (user C)
>> Framework 12 - 1% (user C)
>> Framework 13 - 6% (user E)
>>
>> In this example, there's another ~30% of the cluster that is unallocated,
>> and it stays like this for a significant amount of time until something
>> changes, perhaps another user joins and allocates the rest, or chunks of
>> this spare resource are offered to some of the frameworks, but not all of
>> them.
>>
>> I had always assumed that when lots of frameworks were involved,
>> eventually the frameworks that would keep accepting resources indefinitely
>> would consume the remaining resource, as every other framework had rejected
>> the offers.
>>
>> Could someone elaborate a little on how the DRF allocator / sorter
>> handles this situation, is this likely to be related to the different users
>> being used? Is there a way to mitigate this?
>>
>> We're running version 0.23.1.
>>
>> Cheers,
>>
>> Tom.
>>
>
>
0121 10:43:27.513950 22408 hierarchical.hpp:761] Recovered 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200 (total: 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200, allocated: ) on 
slave 20151103-233456-100929708-5050-865-S36 from framework 
20160112-165226-67375276-5050-22401-0626
I0121 10:43:28.546314 22409 hierarchical.hpp:761] Recovered 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200 (total: 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200, allocated: ) on 
slave 20151103-233456-100929708-5050-865-S36 from framework 
20160112-165226-67375276-5050-22401-0644
I0121 10:43:30.095793 22403 hierarchical.hpp:761] Recovered 
ports(*):[3000-5000]; cpus(*):9.5; mem(*):59392; disk(*):51200 (total: 
ports(*):[3000-5000]; 

Re: Mesos sometimes not allocating the entire cluster

2016-01-21 Thread Tom Arnfeld
Guangya - Nope, there are no outstanding offers for any frameworks; the ones
that are getting offers are responding properly.

Klaus - This was just a sample of logs for a single agent, the cluster has
at  least ~40 agents at any one time.

On 21 January 2016 at 15:20, Guangya Liu <gyliu...@gmail.com> wrote:

> Can you please check whether there are some outstanding offers in the cluster
> that have not been accepted by any framework? You can check this via the
> /master/state.json endpoint.
>
> If there are some outstanding offers, you can start the master with an
> offer_timeout flag to let the master rescind offers that are not accepted by
> a framework.
>
> Cited from
> https://github.com/apache/mesos/blob/master/docs/configuration.md
>
> --offer_timeout=VALUE Duration of time before an offer is rescinded from
> a framework.
>
> This helps fairness when running frameworks that hold on to offers, or
> frameworks that accidentally drop offers.
>
> Thanks,
>
> Guangya
>
> On Thu, Jan 21, 2016 at 9:44 PM, Tom Arnfeld <t...@duedil.com> wrote:
>
>> Hi Klaus,
>>
>> Sorry I think I explained this badly, these are the logs for one slave
>> (that's empty) and we can see that it is making offers to some frameworks.
>> In this instance, the Hadoop framework (and others) are not among those
>> getting any offers, they get offered nothing. The allocator is deciding to
>> send offers in a loop to a certain set of frameworks, starving others.
>>
>> On 21 January 2016 at 13:17, Klaus Ma <klaus1982...@gmail.com> wrote:
>>
>>> Yes, it seems the Hadoop framework did not consume all offered resources: if
>>> a framework launches a task (1 CPU) on an offer (10 CPUs), the other 9 CPUs
>>> will be returned to the master (recoverResources).
>>>
>>> 
>>> Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
>>> Platform OpenSource Technology, STG, IBM GCG
>>> +86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me
>>>
>>> On Thu, Jan 21, 2016 at 6:46 PM, Tom Arnfeld <t...@duedil.com> wrote:
>>>
>>>> Thanks everyone!
>>>>
>>>> Stephan - There's a couple of useful points there, will definitely give
>>>> it a read.
>>>>
>>>> Klaus - Thanks, we're running a bunch of different frameworks, in that
>>>> list there's Hadoop MRv1, Apache Spark, Marathon and a couple of home grown
>>>> frameworks we have. In this particular case the Hadoop framework is the
>>>> major concern, as it's designed to continually accept offers until it has
>>>> enough slots it needs. With the example I gave above, we observe that the
>>>> master is never sending any sizeable offers to some of these frameworks
>>>> (the ones with the larger shares), which is where my confusion stems from.
>>>>
>>>> I've attached a snippet of our active master logs which show the
>>>> activity for a single slave (which has no active executors). We can see
 that it's cycling through sending and recovering declined offers from a
>>>> selection of different frameworks (in order) but I can say that not all of
>>>> the frameworks are receiving these offers, in this case that's the Hadoop
>>>> framework.
>>>>
>>>>
>>>> On 21 January 2016 at 00:26, Klaus Ma <klaus1982...@gmail.com> wrote:
>>>>
>>>>> Hi Tom,
>>>>>
>>>>> Which framework are you using, e.g. Swarm, Marathon or something else?
>>>>> and which language package are you using?
>>>>>
>>>>> DRF will sort role/framework by allocation ratio, and offer all
>>>>> "available" resources by slave; but if the resources it too small (<
>>>>> 0.1CPU) or the resources was reject/declined by framework, the resources
>>>>> will not offer it until filter timeout. For example, in Swarm 1.0, the
>>>>> default filter timeout 5s (because of go scheduler API); so here is case
>>>>> that may impact the utilisation: the Swarm got one slave with 16 CPUS, but
>>>>> only launch one container with 1 CPUS; the other 15 CPUS will return back
>>>>>  to master and did not re-offer until filter timeout (5s).
>>>>> I had pull a request to make Swarm's parameters configurable, refer to
>>>>> https://github.com/docker/swarm/pull/1585. I think you can check this
>>>>> case by master log.
>>>>>
>>>>> If any comments, please let me know.
>>>>>
>>

Mesos sometimes not allocating the entire cluster

2016-01-20 Thread Tom Arnfeld
Hey,

I've noticed some interesting behaviour recently when we have lots of
different frameworks connected to our Mesos cluster at once, all using a
variety of different shares. Some of the frameworks don't get offered more
resources (for long periods of time, hours even) leaving the cluster under
utilised.

Here's an example state where we see this happen..

Framework 1 - 13% (user A)
Framework 2 - 22% (user B)
Framework 3 - 4% (user C)
Framework 4 - 0.5% (user C)
Framework 5 - 1% (user C)
Framework 6 - 1% (user C)
Framework 7 - 1% (user C)
Framework 8 - 0.8% (user C)
Framework 9 - 11% (user D)
Framework 10 - 7% (user C)
Framework 11 - 1% (user C)
Framework 12 - 1% (user C)
Framework 13 - 6% (user E)

In this example, there's another ~30% of the cluster that is unallocated,
and it stays like this for a significant amount of time until something
changes, perhaps another user joins and allocates the rest, or chunks of
this spare resource are offered to some of the frameworks, but not all of
them.

I had always assumed that when lots of frameworks were involved, eventually
the frameworks that would keep accepting resources indefinitely would
consume the remaining resource, as every other framework had rejected the
offers.

Could someone elaborate a little on how the DRF allocator / sorter handles
this situation, is this likely to be related to the different users being
used? Is there a way to mitigate this?

We're running version 0.23.1.

Cheers,

Tom.


Re: Monitoring

2016-01-19 Thread Tom Arnfeld
We're using collectd (https://collectd.org/) to send system metrics to
Graphite, and also using the https://github.com/rayrod2030/collectd-mesos
collectd plugin to pull stats directly from the Apache Mesos stats endpoint.

This works pretty well for us, and seems kind-of similar to the Diamond
approach (TIL Diamond, will have to look into that).
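For anyone curious, the plugin boils down to something like this; a small 
sketch that scrapes an agent’s /metrics/snapshot endpoint (the host/port and 
the metric names are only examples, pick whatever your dashboards need):

    # Sketch: pull the JSON metrics snapshot from a Mesos agent (or master on
    # :5050) and print a few values; a real collector would push these on to
    # Graphite/StatsD instead.
    import json
    import urllib2  # urllib.request on Python 3

    ENDPOINT = "http://mesos-agent-1:5051/metrics/snapshot"  # example agent

    snapshot = json.load(urllib2.urlopen(ENDPOINT, timeout=5))
    for key in ("slave/tasks_running", "slave/cpus_percent", "slave/mem_percent"):
        if key in snapshot:
            print("%s = %s" % (key, snapshot[key]))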

On 19 January 2016 at 21:18, Joe Smith  wrote:

> TellApart also has a rather active fork of Diamond (they're working to
> merge it back upstream ~soonish) that you can take a look at
> https://github.com/tellapart/Diamond. They use it to monitor both Apache
> Mesos and Apache Aurora.
>
> Twitter has an internal monitoring system, and we have an agent which is
> installed via RPM/puppet on each host that scrapes the metrics pages and
> pushes data to our time series database. If you wanted to setup an agent
> through Aurora itself, you'd need support to have one-task per machine
>  (which would be cool,
> but could lead to a circular dependency since Aurora or Mesos could go down
> and not launch your monitoring agents).
>
> I'd likely recommend using the same system you use for deploying Mesos as
> that for getting your monitoring agents onto your hosts.
>
> On Tue, Jan 19, 2016 at 12:17 PM, Tomek Janiszewski 
> wrote:
>
>> Hi
>>
>> In our setup we are using Diamond with default system collectors and one
>> custom collector (based on
>> https://github.com/python-diamond/Diamond/pull/106 but with some
>> improvements). Some other solutions were presented at MesosCon:
>> https://www.youtube.com/watch?v=yLkc17HFEb8
>> https://www.youtube.com/watch?v=zlgAT_xFNzU
>>
>> Tomek
>>
>> wt., 19.01.2016 o 21:04 użytkownik Michał Łowicki 
>> napisał:
>>
>>> Hi,
>>>
>>> I've read Mesos Observability Metrics
>>>  which gives
>>> nice overview of cluster's health. What about other parameters like I/O
>>> usage (disk, network), number of processes etc. Maybe there are some tools
>>> or their configurations dedicated for Mesos? (we're mostly using Diamond
>>> and StatsD which reports to Graphite). How to launch such tools -
>>> separately from Mesos or launch as a part of long-running tasks?
>>>
>>> --
>>> BR,
>>> Michał Łowicki
>>>
>>
>


Re: statusUpdate() duplicate messages?

2015-11-18 Thread Tom Arnfeld
When you construct the scheduler, are you disabling implicit acknowledgements?

https://github.com/apache/mesos/blob/master/include/mesos/scheduler.hpp#L373 


I’d suggest having a read over this document, it explains some of this -> 
http://mesos.apache.org/documentation/latest/reconciliation/ 


a) Mesos may re-send messages if you don’t acknowledge them, and task status 
messages are guaranteed at least once
c) If you disable implicit status acknowledgement, yep
d) You should, they are guaranteed to be delivered at some point at least once 
by the slave / master. To keep your framework in sync with the cluster it is 
recommended to reconcile tasks often (as explained in the document above)
e) http://mesos.apache.org/documentation/latest/reconciliation/ 


Hope that helps, and I think that’s all correct! The docs will be able to 
clarify better :-)
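Since updates are delivered at least once, the simplest defence is to make 
your statusUpdate handler idempotent. A rough sketch with the Python bindings 
(field names come from the TaskStatus protobuf; the explicit-acknowledge call 
is only relevant, and only assumed available, if implicit acknowledgements are 
disabled):

    # Sketch: de-duplicate status updates on their uuid so each transition is
    # handled exactly once, even when the slave/master re-sends it.
    from mesos.interface import Scheduler, mesos_pb2

    class MyScheduler(Scheduler):
        def __init__(self):
            self.seen = set()  # uuids of updates we've already processed

        def statusUpdate(self, driver, status):
            key = (status.task_id.value, status.uuid)
            if key not in self.seen:
                self.seen.add(key)
                state = mesos_pb2.TaskState.Name(status.state)
                print("task %s -> %s" % (status.task_id.value, state))
                # ... react to the transition exactly once here ...
            # With implicit acknowledgements disabled you would also
            # acknowledge every delivery (duplicates included), e.g.
            # driver.acknowledgeStatusUpdate(status)  -- assumes 0.22+ bindings.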

> On 18 Nov 2015, at 12:09, James Vanns  wrote:
> 
> Hello list.
> 
> We have an experimental framework (C++ API) based on Mesos 0.24 and we're 
> seeing duplicate task status messages -- eg. 2 'FINISHED' messages for a 
> single task. This may well be normal behaviour but I wasn't prepared for it. 
> Could someone point me in the direction of a decent description on status 
> updates/messages somewhere in the Mesos documentation? Or explain the 
> following;
> 
> a) Is this normal (it's not just the FINISHED state)?
> b) What might cause this behaviour (it's intermittent)?
> c) I do not explicitly acknowledge receipt of these messages - should I!?
> d) Should I treat these status update messages as reliable and robust!?
> e) Where can I learn more about this kind of internal detail?
> 
> Cheers,
> 
> Jim
> 
> --
> Senior Code Pig
> Industrial Light & Magic



Re: Spark on Mesos / Executor Memory

2015-10-17 Thread Tom Arnfeld
Hi Bharath,

When running jobs in fine-grained mode, each Spark task is sent to mesos as a 
task, which allows the offers system to maintain fairness between different 
spark applications (as you've described). Having said that, unless your memory 
per node is hugely undersubscribed, running these jobs in parallel will behave 
exactly as you've described.

What you're seeing happens because even though there's a new mesos task for 
each Spark task (allowing CPU to be shared), the Spark executors don't get 
killed even when they aren't doing any work, which means the memory isn't 
released. The JVM doesn't allow for flexible memory re-allocation (as far as 
I'm aware), which makes it impossible for spark to dynamically scale up the 
memory of the executor over time as tasks start and finish.

As Dave pointed out, the simplest way to solve this is to use a higher level 
tool that can run your spark jobs through one mesos framework and then you can 
let spark distribute the resources more effectively.

I hope that helps!

Tom.

> On 17 Oct 2015, at 06:47, Bharath Ravi Kumar <reachb...@gmail.com> wrote:
> 
> Can someone respond if you're aware of the reason for such a memory 
> footprint? It seems unintuitive and hard to reason about. 
> 
> Thanks,
> Bharath
> 
> On Thu, Oct 15, 2015 at 12:29 PM, Bharath Ravi Kumar <reachb...@gmail.com 
> <mailto:reachb...@gmail.com>> wrote:
> Resending since user@mesos bounced earlier. My apologies.
> 
> On Thu, Oct 15, 2015 at 12:19 PM, Bharath Ravi Kumar <reachb...@gmail.com 
> <mailto:reachb...@gmail.com>> wrote:
> (Reviving this thread since I ran into similar issues...)
> 
> I'm running two spark jobs (in mesos fine grained mode), each belonging to a 
> different mesos role, say low and high. The low:high mesos weights are 1:10. 
> On expected lines, I see that the low priority job occupies cluster resources 
> to the maximum extent when running alone. However, when the high priority job 
> is submitted, it does not start and continues to await cluster resources (as 
> seen in the logs). Since the jobs run in fine grained mode and the low 
> priority tasks begin to finish, the high priority job should ideally be able 
> to start and gradually take over cluster resources as per the weights. 
> However, I noticed that while the "low" job gives up CPU cores with each 
> completing task (e.g. reduction from 72 -> 12 with default parallelism set to 
> 72), the memory resources are held on (~500G out of 768G). The 
> spark.executor.memory setting appears to directly impact the amount of memory 
> that the job holds on to. In this case, it was set to 200G in the low 
> priority task and 100G in the high priority task. The nature of these jobs is 
> such that setting the numbers to smaller values (say 32g) resulted in job 
> failures with outofmemoryerror.  It appears that the spark framework is 
> retaining memory (across tasks)  proportional to spark.executor.memory for 
> the duration of the job and not releasing memory as tasks complete. This 
> defeats the purpose of fine grained mode execution as the memory occupancy is 
> preventing the high priority job from accepting the prioritized cpu offers 
> and beginning execution. Can this be explained / documented better please? 
> 
> Thanks,
> Bharath
> 
> On Sat, Apr 11, 2015 at 10:59 PM, Tim Chen <t...@mesosphere.io 
> <mailto:t...@mesosphere.io>> wrote:
> (Adding spark user list)
> 
> Hi Tom,
> 
> If I understand correctly you're saying that you're running into memory 
> problems because the scheduler is allocating too many CPUs and not enough 
> memory to accommodate them, right?
> 
> In the case of fine grain mode I don't think that's a problem since we have a 
> fixed amount of CPU and memory per task. 
> However, in coarse grain you can run into that problem if you're with in the 
> spark.cores.max limit, and memory is a fixed number.
> 
> I have a patch out to configure how much max cpus should coarse grain 
> executor use, and it also allows multiple executors in coarse grain mode. So 
> you could say try to launch multiples of max 4 cores with 
> spark.executor.memory (+ overhead and etc) in a slave. 
> (https://github.com/apache/spark/pull/4027 
> <https://github.com/apache/spark/pull/4027>)
> 
> It also might be interesting to include a cores to memory multiplier so that 
> with a larger amount of cores we try to scale the memory with some factor, 
> but I'm not entirely sure that's intuitive to use and what people know what 
> to set it to, as that can likely change with different workload.
> 
> Tim
> 
> 
> 
> 
> 
> 
> 
> On Sat, Apr 11, 2015 at 9:51 AM, Tom Arnfeld <t...@duedil.com 
> <mailto:t...@due

Re: MesosCon Seattle attendee introduction thread

2015-08-17 Thread Tom Arnfeld
Hey everyone! I'm Tom Arnfeld, a software engineer working at DueDil (in 
London). We've been running Mesos for almost 18 months now, for large scale 
batch and stream data processing applications. Give me a shout if you want to 
talk data stuff! We're also experimenting with Mesos to deploy long running 
applications, so even better if you're doing both of these things :)


I'm giving a 5 minute lightning talk at #MesosCon on one of our projects, 
portainer (https://github.com/duedil-ltd/portainer) so if you're running Docker 
on Mesos, definitely come along! I'm also keen to talk about pure Python 
frameworks for Mesos and pesos (https://github.com/wickman/pesos).

@tarnfeld

On Sunday, 16 Aug 2015 at 20:22, David Greenberg dsg123456...@gmail.com, 
wrote:

Hello MesosCon Attendees! I'm David Greenberg, an architect at Two Sigma 
Investments. I'm a member of the MesosCon program committee, and I'm really 
excited to participate in the result of months of planning!

I'm also the author of the upcoming O'Reilly book Building Applications on 
Mesos (http://shop.oreilly.com/product/mobile/0636920039952.do)

I'd love to talk about what we're building with Mesos at Two Sigma, or about 
what you're building.

You can find me on Twitter at @dgrnbrg



On Sun, Aug 16, 2015 at 6:07 PM Jeff Schroeder jeffschroe...@computer.org 
wrote:



My name is Jeff, I'm a Software Engineer for a chicago based trading firm, but 
my passion is building distributed things. In a previous life, I was a systems 
administrator who got angry due to not being able to automate ALL THE THINGS. 
This led me to teach myself the tools necessary to do just that. I coded / 
automated myself out of one job right into another one. I'm an Aurora user and 
have contributed a few enhancements to aurproxy.





On Sun, Aug 16, 2015 at 5:27 PM, Roger Ignazio rigna...@gmail.com wrote:


Thanks for kicking this off Dave!




I'm Roger Ignazio, a QE Automation Engineer at Puppet Labs. As a part of 
Engineering Services, the QE team is responsible for providing automated 
testing infrastructure, tooling, and services to the greater Engineering 
organization. I'm also the author of the upcoming book Mesos In Action with 
Manning Publications. (By the way: conference attendees get 42% off with the 
code cftwmesos!).




I'll be speaking at ContainerCon about managing Mesos, Docker, and Chronos with 
Puppet. I'll also be presenting a case-study at MesosCon about using Mesos and 
Marathon to scale Jenkins infrastructure.




I'm also on Twitter: @rogerignazio







On Sun, Aug 16, 2015 at 1:58 PM, Dave Lester d...@davelester.org wrote:
Hi All,


I'd like to kick off a thread for folks to introduce themselves in

advance of #MesosCon

http://events.linuxfoundation.org/events/mesoscon. Here goes:


My name is Dave Lester, and I'm an Open Source Advocate at Twitter. I am

a member of the MesosCon program committee, along with a stellar group

of other community members who have volunteered

http://events.linuxfoundation.org/events/mesoscon/program/programcommittee.

Can't wait to meet as many of you as possible.


I'm eager to meet with folks interested in learning more about how we

deploy and manage services at Twitter using Mesos and Apache Aurora

http://aurora.apache.org. Twitter has a booth where I'll be hanging

out for a portion of the conference, feel free to stop by and say hi.

I'm also interested in connecting with companies that use Mesos; let's

make sure we add you to our #PoweredByMesos list

http://mesos.apache.org/documentation/latest/powered-by-mesos/.


I'm also on Twitter: @davelester


Next!

-- 
Jeff Schroeder

Don't drink and derive, alcohol and analysis don't mix.
http://www.digitalprognosis.com

Re: Cleaning out old mesos-slave sandbox directories

2015-07-09 Thread Tom Arnfeld
Ok, do you think that'd be a change that would be accepted into Mesos if I sent 
it in?

Thanks Vinod, btw.

--
Tom Arnfeld
Developer // DueDil
(+44) 7525940046
25 Christopher Street, London, EC2A 2BS

On Wed, Jul 8, 2015 at 7:24 PM, Vinod Kone vinodk...@gmail.com wrote:

 On Wed, Jul 8, 2015 at 11:20 AM, Tom Arnfeld t...@duedil.com wrote:
 Do you know if the mesos-slave will re-schedule something for GC if it
 fails deletion?

 No it doesn't.

Re: Cleaning out old mesos-slave sandbox directories

2015-07-08 Thread Tom Arnfeld
Good question, there are likely mounts, yup... they should be being unmounted 
cleanly, though perhaps not in all cases, so maybe we need to retry deleting 
things in the gc process.

Do you know if the mesos-slave will re-schedule something for GC if it fails 
deletion?



--
Tom Arnfeld
Senior Developer // DueDil

On Wednesday, Jul 8, 2015 at 7:19 pm, Vinod Kone vinodk...@gmail.com, wrote:
Are there any special files (mounts etc) in your slave directory? The logic 
Mesos uses to delete a directory is likely different from the shell utility 
'rm'.

On Wed, Jul 8, 2015 at 11:09 AM, Tom Arnfeld t...@duedil.com wrote:

In this instance there were three old slave directories, and there are three 
log lines in the mesos-slave.INFO file;





I0708 11:24:52.023453  2425 slave.cpp:3499] Garbage collecting old slave 
20150515-105200-84152492-5050-9915-S46

I0708 11:24:52.023923  2425 slave.cpp:3499] Garbage collecting old slave 
20150217-184553-67375276-5050-18563-S74

I0708 11:24:52.023921  2428 gc.cpp:56] Scheduling 
'/mnt/mesos/mesos-slave/slaves/20150515-105200-84152492-5050-9915-S46' for gc 
6.9972599407days in the future

I0708 11:24:52.054704  2425 slave.cpp:3499] Garbage collecting old slave 
20150515-105200-84152492-5050-9915-S22

I0708 11:24:52.054723  2424 gc.cpp:56] Scheduling 
'/mnt/mesos/mesos-slave/slaves/20150217-184553-67375276-5050-18563-S74' for gc 
6.9937182815days in the future

I0708 11:24:52.067934  2425 gc.cpp:56] Scheduling 
'/mnt/mesos/mesos-slave/slaves/20150515-105200-84152492-5050-9915-S22' for gc 
6.9922252444days in the future




This happens right after the recovery process finishes after the slave boots 
up. I've looked at another slave that's currently at 99% disk capacity and the 
slave has been up since 27th May 2015, it also has the Garbage collecting old 
slave log lines just after boot for ~6 days. Looking a little deeper in to 
this slave logs; this looks like an interesting error;





W0527 17:35:08.935755  1749 gc.cpp:139] Failed to delete 
'/mnt/mesos/mesos-slave/slaves/20150217-184553-67375276-5050-18563-S72': 
Directory not empty




I think I actually discussed this with BenH a while back, we're running 0.21.0 
on this cluster.




Anyone else seen this before? Using the standard `rm` unix tool clears out the 
directories fine currently, running as the same user as the slave (root).






--


Tom Arnfeld

Senior Developer // DueDil







On Wed, Jul 8, 2015 at 7:00 PM, Vinod Kone vinodk...@gmail.com wrote:





On Wed, Jul 8, 2015 at 10:54 AM, Tom Arnfeld t...@duedil.com wrote:

When this happens the old slave directories appear not to be tracked by the 
mesos GC process, and stay around indefinitely. Over time if enough full slave 
restarts happen (say, due to reconfiguration) the disks can be completely 
filled and the mesos slave won't do anything about it.







This shouldn't happen. Old slave directories should be gc'ed by the slave based 
on their last modification time. Do you see any log lines with "Garbage 
collecting old slave"?

Cleaning out old mesos-slave sandbox directories

2015-07-08 Thread Tom Arnfeld
Hey,


I'm wondering if anyone in the community has a decent solution to this; when a 
slave restarts and re-registers (perhaps it was offline for too long) it will 
get a new slave ID, and use a new directory inside the work_dir for sandboxes.


When this happens the old slave directories appear not to be tracked by the 
mesos GC process, and stay around indefinitely. Over time if enough full slave 
restarts happen (say, due to reconfiguration) the disks can be completely 
filled and the mesos slave won't do anything about it.


I'm guessing the simplest case would be a cron job that cleans out the 
directories based on the timestamps in the directory names...
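Something like the following is what I have in mind, as a sketch only (the 
work_dir path and retention period are assumptions, and you'd want to exclude 
the slave ID the agent is currently registered with):

    #!/usr/bin/env python
    # Sketch of a cron-able cleanup: remove sandbox trees whose directory name
    # starts with a YYYYMMDD-HHMMSS timestamp older than RETENTION_DAYS.
    import datetime
    import os
    import shutil

    WORK_DIR = "/mnt/mesos/mesos-slave/slaves"  # example <work_dir>/slaves path
    RETENTION_DAYS = 14

    cutoff = datetime.datetime.now() - datetime.timedelta(days=RETENTION_DAYS)

    for name in os.listdir(WORK_DIR):
        try:
            stamp = datetime.datetime.strptime(name[:15], "%Y%m%d-%H%M%S")
        except ValueError:
            continue  # not a slave ID directory
        if stamp < cutoff:
            # NB: make sure this isn't the currently registered slave ID.
            print("removing %s" % name)
            shutil.rmtree(os.path.join(WORK_DIR, name), ignore_errors=True)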


Any input would be great!


Tom.

--

Tom Arnfeld
Senior Developer // DueDil

Re: Cleaning out old mesos-slave sandbox directories

2015-07-08 Thread Tom Arnfeld
In this instance there were three old slave directories, and there are three 
log lines in the mesos-slave.INFO file;





I0708 11:24:52.023453  2425 slave.cpp:3499] Garbage collecting old slave 
20150515-105200-84152492-5050-9915-S46

I0708 11:24:52.023923  2425 slave.cpp:3499] Garbage collecting old slave 
20150217-184553-67375276-5050-18563-S74

I0708 11:24:52.023921  2428 gc.cpp:56] Scheduling 
'/mnt/mesos/mesos-slave/slaves/20150515-105200-84152492-5050-9915-S46' for gc 
6.9972599407days in the future

I0708 11:24:52.054704  2425 slave.cpp:3499] Garbage collecting old slave 
20150515-105200-84152492-5050-9915-S22

I0708 11:24:52.054723  2424 gc.cpp:56] Scheduling 
'/mnt/mesos/mesos-slave/slaves/20150217-184553-67375276-5050-18563-S74' for gc 
6.9937182815days in the future

I0708 11:24:52.067934  2425 gc.cpp:56] Scheduling 
'/mnt/mesos/mesos-slave/slaves/20150515-105200-84152492-5050-9915-S22' for gc 
6.9922252444days in the future




This happens right after the recovery process finishes when the slave boots 
up. I've looked at another slave that's currently at 99% disk capacity and has 
been up since 27th May 2015; it also has the "Garbage collecting old slave" 
log lines just after boot for ~6 days. Looking a little deeper into this 
slave's logs, this looks like an interesting error:





W0527 17:35:08.935755  1749 gc.cpp:139] Failed to delete 
'/mnt/mesos/mesos-slave/slaves/20150217-184553-67375276-5050-18563-S72': 
Directory not empty




I think I actually discussed this with BenH a while back, we're running 0.21.0 
on this cluster.




Anyone else seen this before? Using the standard `rm` unix tool clears out the 
directories fine currently, running as the same user as the slave (root).






--


Tom Arnfeld

Senior Developer // DueDil

On Wed, Jul 8, 2015 at 7:00 PM, Vinod Kone vinodk...@gmail.com wrote:

 On Wed, Jul 8, 2015 at 10:54 AM, Tom Arnfeld t...@duedil.com wrote:
 When this happens the old slave directories appear not to be tracked by
 the mesos GC process, and stay around indefinitely. Over time if enough
 full slave restarts happen (say, due to reconfiguration) the disks can be
 completely filled and the mesos slave won't do anything about it.

 This shouldn't happen. Old slave directories should be gc'ed by the slave
 based on their last modification time
 https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L4059. Do
 you see any log lines with "Garbage collecting old slave"?

RE: mesos-execute + docker_image

2015-07-07 Thread Tom Arnfeld
I've been using mesos-execute for a few little experiments, using docker 
images. The --docker_image flag will be passed straight through to mesos as the 
string to use for the actual image. There are no options at present for 
specifying docker port mapping/networking/volume configuration through 
mesos-execute.


Tom.



--
Tom Arnfeld
Senior Developer // DueDil

On Tuesday, Jul 7, 2015 at 5:21 pm, Nikolaos Ballas neXus 
nikolaos.bal...@nexusgroup.com, wrote:



Search for containerizers in the manual on apache or mesosphere sites

Sent from my Samsung device

 Original message 
From: tommy xiao xia...@gmail.com
Date: 07/07/2015 18:14 (GMT+01:00)
To: user@mesos.apache.org
Subject: Re: mesos-execute + docker_image

How about check marathon?

2015-07-07 22:26 GMT+08:00 Jürgen Jakobitsch j.jakobit...@semantic-web.at:

hi,

i just installed mesos-0.22.0 (from the mesossphere repos) on centOS6.

can anyone point me into the right direction on how to run a docker image
inside mesos using mesos-execute plus the docker_image parameter.

also note that i would like to pass some parameters to the docker run command

any pointer really appreciated.

wkr j

| Jürgen Jakobitsch,
| Software Developer
| Semantic Web Company GmbH
| Mariahilfer Straße 70 / Neubaugasse 1, Top 8
| A - 1070 Wien, Austria
| Mob +43 676 62 12 710 | Fax +43.1.402 12 35 - 22

COMPANY INFORMATION
| web       : http://www.semantic-web.at/
| foaf      : http://company.semantic-web.at/person/juergen_jakobitsch

PERSONAL INFORMATION
| web       : http://www.turnguard.com
| foaf      : http://www.turnguard.com/turnguard
| g+        : https://plus.google.com/111233759991616358206/posts
| skype     : jakobitsch-punkt
| xmlns:tg  = http://www.turnguard.com/turnguard#;

-- 
Deshi Xiao
Twitter: xds2000
E-mail: xiaods(AT)gmail.com

Re: Hadoop on Mesos. HDFS question.

2015-07-03 Thread Tom Arnfeld
It might be worth taking a look at the install documentation for the Hadoop on 
Mesos project here: https://github.com/mesos/hadoop



For our installations I don't think we really do much more than installing the 
apt packages you mentioned and then installing the hadoop-mesos jars.. plus 
adding the appropriate configuration.






On Friday, Jul 3, 2015 at 3:52 pm, Kk Bk kkbr...@gmail.com, wrote:

I am trying to install Hadoop on Mesos on ubuntu servers, So followed 
instruction as per link 
https://open.mesosphere.com/tutorials/run-hadoop-on-mesos/#step-2.




Step-2 of link says to install HDFS using as per link 
http://www.cloudera.com/content/cloudera/en/documentation/cdh4/latest/CDH4-Installation-Guide/cdh4ig_topic_4_4.html.




Question: Is it sufficient to run following commands




1) On Namenode: sudo apt-get install hadoop-hdfs-namenode

2) On Datanode: sudo apt-get install hadoop-0.20-mapreduce-tasktracker 
hadoop-hdfs-datanode




Or just follow the instructions on the mesosphere link that installs HDFS ?

Re: RFC: Framework - Executor Message Passing Optimization Removal

2015-06-30 Thread Tom Arnfeld
We're using it for streaming realtime logs to the framework. In our short-lived 
framework for building Docker images, the executor streams back stdout/stderr 
logs from the build to the client for ease of use/debugging and the 
executor-framework best-effort messaging stuff made this effortless.
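For anyone curious, the pattern is nothing more exotic than this; a sketch 
with the Python bindings (the log lines are made up, and a real executor would 
obviously also send status updates):

    # Sketch: stream build output from the executor to the scheduler as
    # framework messages. Delivery is best-effort, so treat this as a
    # convenience channel rather than a source of truth.
    from mesos.interface import Executor, Scheduler

    class BuildExecutor(Executor):
        def launchTask(self, driver, task):
            # ... start the build, then forward output lines as they appear ...
            for line in ["Step 1/5 : FROM ubuntu", "Step 2/5 : RUN make"]:
                driver.sendFrameworkMessage(line.encode("utf-8"))

    class BuildScheduler(Scheduler):
        def frameworkMessage(self, driver, executorId, slaveId, message):
            # Messages may be dropped or reordered; just surface them.
            print("[%s] %s" % (slaveId.value, message))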


--


Tom Arnfeld

Developer // DueDil

On Mon, Jun 29, 2015 at 10:48 PM, Benjamin Mahler
benjamin.mah...@gmail.com wrote:

 FYI Some folks reached out off thread that they are using this optimization
 for distributed health checking of tasks. This is on the order of O(10,000)
 framework messages per second for them, which may not be possible through
 the master.
 On Tue, Jun 23, 2015 at 6:08 PM, Benjamin Mahler benjamin.mah...@gmail.com
 wrote:
 The existing Mesos API provides unreliable message passing for framework
 - executor communication:

 -- Schedulers can call 'sendFrameworkMessage(executor, slave, data)' on
 the driver [1], this sends a message to the executor. This has a
 best-effort optimization to bypass the master, and send the message to the
 slave directly.

 -- Executors can call 'sendFrameworkMessage(data)' on the driver [2],
 which sends a message to the scheduler. This has a best-effort optimization
 to bypass the master, and send the message to the scheduler driver directly
 (via the slave).

 As part of the HTTP API [3], schedulers can only make Calls against the
 master, and all Events must be streamed back on the scheduler-initiated
 connection to the master. This means that we can no longer easily support
 bypassing the master as an optimization.

 The plan is also to remove this optimization in the existing driver, in
 order to conform to the upcoming Event/Call messages [4] used in the HTTP
 API, so:


 *** If anyone is relying on this best-effort optimization, please chime
 in! ***


 [1]
 https://github.com/apache/mesos/blob/0.22.1/include/mesos/scheduler.hpp#L289
 [2]
 https://github.com/apache/mesos/blob/0.22.1/include/mesos/executor.hpp#L185
 [3]
 https://docs.google.com/document/d/1pnIY_HckimKNvpqhKRhbc9eSItWNFT-priXh_urR-T0/edit
 [4]
 https://github.com/apache/mesos/blob/0.22.1/include/mesos/scheduler/scheduler.proto
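
For reference, a minimal sketch of the two message directions using the legacy Python bindings (mesos.interface); the class names are illustrative, error handling is omitted, and task status updates are left out:

from mesos.interface import Executor, Scheduler


class LogStreamingExecutor(Executor):
    """Executor side: push free-form bytes back towards the scheduler."""

    def launchTask(self, driver, task):
        # Best-effort and unordered -- this is the call whose slave-side
        # bypass of the master is being discussed above.
        # (Status updates for the task are omitted in this sketch.)
        driver.sendFrameworkMessage("starting task " + task.task_id.value)


class LogCollectingScheduler(Scheduler):
    """Scheduler side: receive whatever the executors send back."""

    def frameworkMessage(self, driver, executorId, slaveId, message):
        print("message from %s on %s: %r" % (executorId.value, slaveId.value, message))

    def resourceOffers(self, driver, offers):
        # This sketch just declines offers; a real framework launches tasks here.
        for offer in offers:
            driver.declineOffer(offer.id)

The scheduler-to-executor direction is the symmetric call, driver.sendFrameworkMessage(executor_id, slave_id, data), which is the path the best-effort optimization short-circuits past the master.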


Re: Debugging framework registration from inside docker

2015-06-11 Thread Tom Arnfeld
I believe you're correct Jim, if you set LIBPROCESS_IP=$HOST_IP libprocess will 
try to bind to that address as well as announce it, which won't work inside a 
bridged container.




We've been having a similar discussion on 
https://github.com/wickman/pesos/issues/25.



--


Tom Arnfeld

Developer // DueDil






On Thursday, Jun 11, 2015 at 10:00 am, James Vanns jvanns@gmail.com, 
wrote:


Looks like I share the same symptoms as this 'marathon inside container' 
problem;




https://groups.google.com/d/topic/marathon-framework/aFIlv-VnF58/discussion



I guess that sheds some light on the subject ;)








On 11 June 2015 at 09:43, James Vanns jvanns@gmail.com wrote:

For what exactly? I thought that was for slave-master communication? There is 
no problem there. Or are you suggesting that from inside the running container 
I set at least LIBPROCESS_IP to the host IP rather than the IP of eth0 the 
container sees? Won't that screw with the docker bridge routing?


This doesn't quite make sense. I have other network connections inside this 
container and those channels are established and communicating fine. It's just 
with the mesos master for some reason. Just to be clear;




* The running process is a scheduling framework

* It does not listen for any inbound connection requests

* It, of course, does attempt an outbound connection to the zookeeper to get 
the MM

  (this works)

* It then attempts to establish a connection with the MM

  (this also works)

* When the MM sends a response, it fails - it effectively tries to send the 
response back to the private/internal docker IP where my scheduler is running.

* This problem disappears when run with --net=host




TCPDump never shows any inbound traffic;





IP 172.17.1.197.55182 > 172.20.121.193.5050

...




Therefore there is never any ACK# that corresponds with the SEQ# and these are 
just re-transmissions. I think!






Jim










On 10 June 2015 at 18:16, Steven Schlansker sschlans...@opentable.com wrote:

On Jun 10, 2015, at 10:10 AM, James Vanns jvanns@gmail.com wrote:


 Hi. When attempting to run my scheduler inside a docker container in 
 --net=bridge mode it never receives acknowledgement or a reply to that 
 request. However, it works fine in --net=host mode. It does not listen on any 
 port as a service so does not expose any.



 The scheduler receives the mesos master (leader) from zookeeper fine but 
 fails to register the framework with that master. It just loops trying to do 
 so - the master sees the registration but deactivates it immediately as 
 apparently it disconnects. It doesn't disconnect but is obviously 
 unreachable. I see the reason for this in the sendto() and the master log 
 file -- because the internal docker bridge IP is included in the POST and 
 perhaps that is how the master is trying to talk back

 to the requesting framework??



 Inside the container is this;

 tcp        0      0 0.0.0.0:44431           0.0.0.0:*               LISTEN    
   1/scheduler



 This is not my code! I'm at a loss where to go from here. Anyone got any 
 further suggestions

 to fix this?



You may need to try setting LIBPROCESS_IP and LIBPROCESS_PORT to hide the fact 
that you are on a virtual Docker interface.
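
A minimal sketch of doing that from the scheduler's entrypoint, assuming host networking (or an otherwise routable address, since in bridge mode libprocess would also try to bind to an address the container doesn't own); the port value and driver wiring are illustrative:

import os
import socket

def outbound_ip():
    # Connect a UDP socket towards a routable address purely to discover which
    # local interface the kernel would use; no packets are actually sent.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect(("8.8.8.8", 53))
        return s.getsockname()[0]
    finally:
        s.close()

# libprocess reads these when it initialises, so they must be in the
# environment before the driver is constructed.
os.environ["LIBPROCESS_IP"] = outbound_ip()
os.environ["LIBPROCESS_PORT"] = "9400"  # any port you can actually reach

# e.g. (legacy Python bindings, purely illustrative):
# from mesos.native import MesosSchedulerDriver
# driver = MesosSchedulerDriver(MyScheduler(), framework_info, "zk://.../mesos")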












-- 
--
Senior Code Pig
Industrial Light & Magic

Re: Design doc for Mesos HTTP API

2015-05-01 Thread Tom Arnfeld
Thanks for sharing this Vinod, very clear and useful document!




Q: Could you explain in a little detail why the decision was made to use a 
single HTTP endpoint rather than something like /event (for the stream) and 
/call for making calls? It seems a little strange / contrived to me that the 
difference between sending data to the master and receiving a stream of events 
would be based on the order of the calls and via the same endpoint. For 
example, would there not be a failure case here where the initial HTTP 
connection (SUBSCRIBE) fails (perhaps due to application error) and the driver 
continues making subsequent POST requests to send messages? In this situation, 
what would happen? Would the next http request that sent a message start 
getting streamed events in the response?




Perhaps I've misread another section of the document that explains this, but 
it'd be great if you could help me understand.
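
For concreteness, this is roughly what the single-endpoint model looks like in the shape the v1 scheduler API eventually took (a sketch assuming the requests library; the master address and framework fields are placeholders). The SUBSCRIBE call holds a streaming response open for events, while every other call is an ordinary short-lived POST to the same path, identified by its headers and body rather than by connection ordering:

import json
import requests

MASTER = "http://mesos-master.example:5050/api/v1/scheduler"  # placeholder

# 1. SUBSCRIBE: the response to this POST stays open and the master streams
#    RecordIO-framed Events back on it.
subscribe = {
    "type": "SUBSCRIBE",
    "subscribe": {"framework_info": {"user": "root", "name": "demo-framework"}},
}
events = requests.post(
    MASTER,
    data=json.dumps(subscribe),
    headers={"Content-Type": "application/json", "Accept": "application/json"},
    stream=True,
)

# 2. Every other Call (ACCEPT, DECLINE, ACKNOWLEDGE, ...) is a separate,
#    short-lived POST to the same endpoint that carries the framework id and
#    subscription identity, and never receives an event stream in response.
for chunk in events.iter_lines():
    print(chunk)  # raw RecordIO framing; a real client parses length-prefixed records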



--


Tom Arnfeld

Developer // DueDil






On Thursday, Apr 30, 2015 at 10:26 pm, Vinod Kone vinodk...@gmail.com, wrote:





Re: MRv2 jobs on mesos

2015-04-28 Thread Tom Arnfeld
Hi Bharath,





As far as I'm aware there are no plans to make Hadoop MRv2 work with the Hadoop 
on Mesos framework (https://github.com/mesos/hadoop), unless someone else in 
the community is working on this and keeping quiet? We're certainly not working 
on this.




My understanding is that MRv2 is better described as the MapReduce application 
for YARN which means you can't run it without a YARN cluster, because it's 
intrinsically designed around it.




Cheers,





--


Tom Arnfeld

Developer // DueDil






On Tuesday, Apr 28, 2015 at 9:24 am, Bharath Ravi Kumar reachb...@gmail.com, 
wrote:



Hi,


I want to be able to run jobs compiled against MRv2 on mesos through the 
hadoop-on-mesos framework. Please let me know if support for this will be 
implemented in the near future. Thanks.
(Aside: Myraid isn't applicable since I'd like to run MRv2 jobs without the 
intervening YARN runtime.)


Thanks,

Bharath

Re: Current State of Service Discovery

2015-04-12 Thread Tom Arnfeld
Hi Yaron,

(Also apologies for the long reply)

 First, I think that we need to clearly distinguish that service
discovery, and request routing / load balancing are related but separate
concerns. In my mind, at least, discovery is about finding a way to
communicate with a service (much like a DNS query), whereas routing and
load balancing concerns effectively and efficiently directing (handling)
network traffic.

It's very refreshing to hear you talk about service discovery this way. I
think this is a very important point that often gets lost in discussions,
and implementations don't always truly take this to heart so the result
doesn't end up as system-agnostic as intended. We've spent the last ~year
deploying our own service discovery system because we felt nothing in the
community really fitted *truly* into the sentence you described above...

The result for us was something very similar to what you've come up with, a
DNS (DNS-SD, RFC 6763) system that runs multi-datacenter, backed by a
distributed etcd database for the names and records. We built our own DNS
server to do this as consul/weave didn't exist back then -
https://github.com/duedil-ltd/discodns. We're firm believers that DNS
itself can provide us with the framework to achieve 99% of all use cases,
assuming we ultimately build in support for things like dynamic updates via
DNS (rfc2136) and possibly even the push based long-polling features that
Apple use in Bonjour/mDNSResponder. Not to mention that using say,
Marathon, for service discovery is incredibly restricting. We'd like to
use the same system for every service we run, ranging from things deployed
in one cloud environment on Mesos to others in another environment deployed
using Chef. There's no reason why this can't be the case, imo.
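
For illustration, this is the kind of SRV lookup a DNS-SD style client ends up doing against such a system (a sketch assuming dnspython, where query() is the pre-2.0 name for what newer releases call resolve(); the record name is made up):

import dns.resolver

# _service._proto.domain is the SRV / DNS-SD naming convention (RFC 2782/6763).
answers = dns.resolver.query("_web._tcp.services.example.", "SRV")

for rr in sorted(answers, key=lambda r: (r.priority, -r.weight)):
    # Each SRV record names a target host and port; an A/AAAA lookup on
    # rr.target then gives the address to actually connect to.
    print(rr.target, rr.port, rr.priority, rr.weight)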

The key idea is very much the same: we want different systems to be able to
register services in the way they need to (either via http to etcd, or via
dns update) and using their own semantics. A good example is that some
systems might want to have TTLs on records (in etcd, so the record
automatically disappears) to remove unhealthy instances of services,
however other systems might not want to relate their existence in service
discovery with their health (think long running distributed databases).
Currently we have some command line tools and chef cookbooks for service
registering and a WIP branch for *dnsupdate* (
https://github.com/duedil-ltd/discodns/pull/31).

(I'd be very very interested to hear more about your experience with Weave
for this purpose, perhaps a blog post? :-))

 Regarding DNS: again I don’t think it makes sense to have a ‘mesos-dns’
and ‘weave-dns’ and ‘kubernets-dns’ - it makes much more sense to have a
single DNS that easily integrates with multiple data sources.

There's actually a ticket on mesos-dns to support plugins
https://github.com/mesosphere/mesos-dns/issues/62 and I had an idea to
write a discodns plugin that'd write the records into our etcd database,
which might be an interesting way to achieve integration with these tools.
Though I wonder whether this approach results in scalability problems
because the state becomes too large for a single system to re-process on a
regular basis, maybe it's best for the things running on mesos to register
themselves, or even a mesos module for the slaves.

 https://registry.hub.docker.com/u/yaronr/discovery

We'd played around with the LUA support in NGINX to create some sort of
dns-sd based service discovery proxy, though don't have anything worth
sharing yet as far as I know!

Thanks for sharing!

On 12 April 2015 at 10:03, Yaron Rosenbaum yaron.rosenb...@gmail.com
wrote:

 Ok, this is a bit long, I apologize in advance

 I’ve been researching and experimenting with various challenges around
 managing microservices at-scale. I’ve been working extensively with Docker,
 CoreOS and recently Mesos.

 First, I think that we need to clearly distinguish that service discovery,
 and request routing / load balancing are related but separate concerns. In
 my mind, at least, discovery is about finding a way to communicate with a
 service (much like a DNS query), whereas routing and load balancing
 concerns effectively and efficiently directing (handling) network traffic.

 There are multiple solutions and approaches out there, but I don’t know if
 any single technology could address all ‘microservices at-scale’ needs on
 its own effectively and efficiently. In other words - mixing multiple
 approaches, tools and technologies is probably the right way to go.
 I’m saying this because many of the existing tools come with a single
 technology in mind. Tools that come form the Mesos camp obviously have
 Mesos in mind, tools that come from Kube have Kube in mind, tools coming
 from CoreOS have CoreOS in mind, etc.

 I think It’s time to start mixing things together to really benefit from
 all the goodness in all the various camps.

 I’ll give an example:
 First, with respect to network traffic routing 

Re: Spark on Mesos / Executor Memory

2015-04-11 Thread Tom Arnfeld
Thanks for sharing the details Tim. Though I agree with James here, the
approach to cap cores doesn't really solve the underlying problem. In our
case we're running several frameworks, all of which consume varying amounts
of resources throughout their lifetime. When the cluster is busy this
results in lots of slaves tightly packed meaning when resources become
available we want to ensure frameworks have the ability to do _something_
if not at their desired capacity, ultimately this evens out over time to a
fair share.

An example of where this becomes a problem with spark, if we cap the cores
at 5 CPUs (in our case we'd see only a few executors per slave) we can set
the memory limit lower. However this means that if the framework can only
get 1 CPU for a slave it's going to be requiring a lot more memory than it
really needs, and that may not be available, so nothing gets launched.

 It also might be interesting to include a cores to memory multiplier so
that with a larger amount of cores we try to scale the memory with some
factor, but I'm not entirely sure that's intuitive to use and what people
know what to set it to, as that can likely change with different workload.

A cores multiplier is definitely an interesting route to go down, I think
specifying the memory for the executor on its own and adding in a
multiplication of some memory value * CPUs allocated would go towards
helping solve the problem. We're actually using coarse mode but I think the
same sort of issue still stands for fine grained, in fact it would probably
be worse, because the number of tasks per executor is a lot more fluid.

Tom.
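
To make the multiplier idea concrete, the arithmetic is nothing more than something like the following (a sketch only -- the knob names and numbers are invented, not real Spark configuration):

def executor_memory_mb(offered_cpus, base_mb=512, per_core_mb=1536, overhead_frac=0.1):
    """Scale an executor's memory request with the CPUs it is actually given.

    base_mb       -- fixed JVM/runtime floor per executor
    per_core_mb   -- memory budget per concurrent task slot (one per core here)
    overhead_frac -- extra headroom for off-heap/overhead, as Mesos sees it
    """
    heap = base_mb + per_core_mb * offered_cpus
    return int(heap * (1 + overhead_frac))

# An executor that only wins 1 CPU asks for far less than one that wins 8:
print(executor_memory_mb(1))  # ~2252 MB
print(executor_memory_mb(8))  # ~14080 MB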

On 11 April 2015 at 21:05, CCAAT cc...@tampabay.rr.com wrote:

 Hello Tim,

 Your approach seems most reasonable, particularly from an over arching
 viewpoint. However, it occurs to me the that as folks have several to many
 different frameworks (distributed applications)  running on a given mesos
 cluster, that the optimization of resource allocation (utilization) may
 ultimately need to be under some sort of tunable, dynamic scheme. Most
 distributed application, say it runs for a few hours, will usually not have
 a constant resource demand on memory  so how can any static configuration
 suffice for a dynamic mix of frequently changing distributed application
 work well with static configurations. This is particularly amplified as a
 problem, where
 Apache-spark is an in-memory resource demand, that is very different
 than other frameworks that may be active on the same cluster.

 I really think we are just experiencing the tip of the iceberg here
 as these mesos clusters grow, expand and take on a variety of problems,
 or did I miss some already existing robustness in the codes?


 James



 On 04/11/2015 12:29 PM, Tim Chen wrote:

 (Adding spark user list)

 Hi Tom,

 If I understand correctly you're saying that you're running into memory
 problems because the scheduler is allocating too much CPUs and not
 enough memory to acoomodate them right?

 In the case of fine grain mode I don't think that's a problem since we
 have a fixed amount of CPU and memory per task.
 However, in coarse grain you can run into that problem if you're with in
 the spark.cores.max limit, and memory is a fixed number.

 I have a patch out to configure how much max cpus should coarse grain
 executor use, and it also allows multiple executors in coarse grain
 mode. So you could say try to launch multiples of max 4 cores with
 spark.executor.memory (+ overhead and etc) in a slave.
 (https://github.com/apache/spark/pull/4027)

 It also might be interesting to include a cores to memory multiplier so
 that with a larger amount of cores we try to scale the memory with some
 factor, but I'm not entirely sure that's intuitive to use and what
 people know what to set it to, as that can likely change with different
 workload.

 Tim







 On Sat, Apr 11, 2015 at 9:51 AM, Tom Arnfeld t...@duedil.com
 mailto:t...@duedil.com wrote:

 We're running Spark 1.3.0 (with a couple of patches over the top for
 docker related bits).

 I don't think SPARK-4158 is related to what we're seeing, things do
 run fine on the cluster, given a ridiculously large executor memory
 configuration. As for SPARK-3535 although that looks useful I think
 we're seeing something else.

 Put a different way, the amount of memory required at any given time
 by the spark JVM process is directly proportional to the amount of
 CPU it has, because more CPU means more tasks and more tasks means
 more memory. Even if we're using coarse mode, the amount of executor
 memory should be proportionate to the amount of CPUs in the offer.

 On 11 April 2015 at 17:39, Brenden Matthews bren...@diddyinc.com
 mailto:bren...@diddyinc.com wrote:

 I ran into some issues with it a while ago, and submitted a
 couple PRs to fix it:

 https://github.com/apache/spark/pull/2401
 https://github.com/apache

Re: Spark on Mesos / Executor Memory

2015-04-11 Thread Tom Arnfeld
We're running Spark 1.3.0 (with a couple of patches over the top for docker
related bits).

I don't think SPARK-4158 is related to what we're seeing, things do run
fine on the cluster, given a ridiculously large executor memory
configuration. As for SPARK-3535, although that looks useful I think we're
seeing something else.

Put a different way, the amount of memory required at any given time by the
spark JVM process is directly proportional to the amount of CPU it has,
because more CPU means more tasks and more tasks means more memory. Even if
we're using coarse mode, the amount of executor memory should be
proportionate to the amount of CPUs in the offer.

On 11 April 2015 at 17:39, Brenden Matthews bren...@diddyinc.com wrote:

 I ran into some issues with it a while ago, and submitted a couple PRs to
 fix it:

 https://github.com/apache/spark/pull/2401
 https://github.com/apache/spark/pull/3024

 Do these look relevant? What version of Spark are you running?

 On Sat, Apr 11, 2015 at 9:33 AM, Tom Arnfeld t...@duedil.com wrote:

 Hey,

 Not sure whether it's best to ask this on the spark mailing list or the
 mesos one, so I'll try here first :-)

 I'm having a bit of trouble with out of memory errors in my spark jobs...
 it seems fairly odd to me that memory resources can only be set at the
 executor level, and not also at the task level. For example, as far as I
 can tell there's only a *spark.executor.memory* config option.

 Surely the memory requirements of a single executor are quite
 dramatically influenced by the number of concurrent tasks running? Given a
 shared cluster, I have no idea what % of an individual slave my executor is
 going to get, so I basically have to set the executor memory to a value
 that's correct when the whole machine is in use...

 Has anyone else running Spark on Mesos come across this, or maybe someone
 could correct my understanding of the config options?

 Thanks!

 Tom.





Spark on Mesos / Executor Memory

2015-04-11 Thread Tom Arnfeld
Hey,

Not sure whether it's best to ask this on the spark mailing list or the
mesos one, so I'll try here first :-)

I'm having a bit of trouble with out of memory errors in my spark jobs...
it seems fairly odd to me that memory resources can only be set at the
executor level, and not also at the task level. For example, as far as I
can tell there's only a *spark.executor.memory* config option.

Surely the memory requirements of a single executor are quite dramatically
influenced by the number of concurrent tasks running? Given a shared
cluster, I have no idea what % of an individual slave my executor is going
to get, so I basically have to set the executor memory to a value that's
correct when the whole machine is in use...

Has anyone else running Spark on Mesos come across this, or maybe someone
could correct my understanding of the config options?

Thanks!

Tom.


Re: Fwd: Questions about Mesos

2015-04-07 Thread Tom Arnfeld
Hi Robin,


It might be a little late to reply but I thought it would be worth weighing in. 
Given the Mesos master and slave are primarily configured using command line 
parameters, the main issue is getting a working install as configuration is 
fairly simple.




It can be quite easy to compile Mesos in your environment if you want to avoid 
using the publicly available Docker images or apt repositories provided by 
Mesosphere (https://www.mesosphere.com/downloads/). 






Instructions for compiling can be found at 
http://mesos.apache.org/gettingstarted/.





After that you can either use your own configuration management system for 
deployment, or maybe build some appropriate images using the tools provided, 
e.g. AMIs within EC2.




- https://github.com/mdsol/mesos_cookbook


- https://github.com/everpeace/cookbook-mesos


- https://github.com/deric/puppet-mesos





I hope this helps!







On Friday, 3 Apr 2015 at 06:43, Robin Anil robin.a...@gmail.com, wrote:

Fellow Mesos-ers

Firstly, I am loving the speed of Mesos so far. I set up a cluster from scratch 
and have been running docker applications with ease with mesos-dns generating 
the SRV records. Now I am looking for a serious production setup on AWS


I see few choices:




1) Start with linux machines, set up masters, zookeeper and slaves by getting 
packages from the apt repo

2) Somehow use the Mesosphere docker images for 
zookeeper/mesos-master/mesos-slave to bootstrap a cluster. 







2) is a lot cleaner but none of the docker images have any sort of help. I have 
to manually reverse engineer them. Before I invest in building my own docker 
images and configuration. I wanted to ask if those public docker images are 
even supported by the community, or if anyone is running a similar setup in 
production? Experiences/notes will help.




Secondly, I am trying to choose between Marathon and Aurora as the scheduler, 
Aurora has priority and is_production which is very attractive, I would love if 
some of you can share notes about your experiences with either.




Robin

Re: Custom python executor with Docker

2015-04-07 Thread Tom Arnfeld
It's not possible to invoke the docker containerizer from outside of Mesos, as 
far as I know.




If you pursue this route, you can run into issues with orphaned containers as 
your executor may die for some unknown reason, and the container is still 
running. Recovering from this can be tricky business, so it's better if you can 
adapt your framework design to fit within the Mesos Task/Executor pattern.
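
If someone does go the subprocess route regardless, the shape is roughly the following (a sketch using the legacy Python executor bindings and the docker CLI directly -- the Docker containerizer is not involved, and the image, names and task payload are illustrative). Note the shutdown handler trying to mitigate exactly the orphaned-container problem mentioned above:

import subprocess
import threading

from mesos.interface import Executor, mesos_pb2


class DockerWrappingExecutor(Executor):
    def __init__(self):
        self.containers = {}  # task_id -> container name

    def launchTask(self, driver, task):
        def run():
            name = "task-%s" % task.task_id.value
            self.containers[task.task_id.value] = name
            self._update(driver, task, mesos_pb2.TASK_RUNNING)
            # Mesos only accounts for this executor's own cgroup; the container
            # spawned here lives outside that accounting, which is one reason
            # this approach is discouraged above.
            code = subprocess.call(["docker", "run", "--name", name, "busybox", "sleep", "10"])
            state = mesos_pb2.TASK_FINISHED if code == 0 else mesos_pb2.TASK_FAILED
            self._update(driver, task, state)

        # launchTask must not block, so do the work on a thread.
        threading.Thread(target=run).start()

    def shutdown(self, driver):
        # Best-effort clean-up; if the executor dies outright the containers
        # are orphaned and something else has to reap them.
        for name in self.containers.values():
            subprocess.call(["docker", "rm", "-f", name])

    def _update(self, driver, task, state):
        status = mesos_pb2.TaskStatus()
        status.task_id.value = task.task_id.value
        status.state = state
        driver.sendStatusUpdate(status)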



--


Tom Arnfeld

Developer // DueDil





(+44) 7525940046

25 Christopher Street, London, EC2A 2BS

On Mon, Apr 6, 2015 at 7:00 PM, Vinod Kone vinodk...@apache.org wrote:

 Tim, do you want answer this?
 On Wed, Apr 1, 2015 at 7:27 AM, Tom Fordon tom.for...@gmail.com wrote:
 Hi.  I'm trying to understand using docker within a custom executor. For
 each of my tasks, I would like to perform some steps on the node before
 launching a docker container. I was planning on writing a custom python
 executor for this, but I wasn't sure how to launch docker from within this
 executor.

 Can I just call docker in a subprocess using the ContainerInfo from the
 Task? If I do this, how does the Containerizer fit in?
 Thank you,
 Tom Fordon


Re: Using mesos-dns in an enterprise

2015-04-02 Thread Tom Arnfeld
We're using a BGP based solution currently to solve the problem of highly 
available DNS resolvers.




That might be a route worth taking, and one that could still work via marathon 
on top of Mesos.



--


Tom Arnfeld

Developer // DueDil





(+44) 7525940046

25 Christopher Street, London, EC2A 2BS

On Thu, Apr 2, 2015 at 10:07 PM, John Omernik j...@omernik.com wrote:

 True :)
 On Thu, Apr 2, 2015 at 3:37 PM, Tom Arnfeld t...@duedil.com wrote:
 Last time I checked haproxy didn't support UDP which would be key for
 mesos-dns.

 --

 Tom Arnfeld
 Developer // DueDil

 (+44) 7525940046
 25 Christopher Street, London, EC2A 2BS


 On Thu, Apr 2, 2015 at 3:53 PM, John Omernik j...@omernik.com wrote:

 That was my first response as well... I work at a bank, and the thought
 of changing dns servers on the clients everywhere made me roll my eyes :)

 John


 On Thu, Apr 2, 2015 at 9:39 AM, Tom Arnfeld t...@duedil.com wrote:

 This is great, thanks for sharing!

 It's nice to see other members of the community sharing more realistic
 implementations of DNS rather than just update your resolv conf and it
 works :-)

 --

 Tom Arnfeld
 Developer // DueDil

 (+44) 7525940046
 25 Christopher Street, London, EC2A 2BS


 On Thu, Apr 2, 2015 at 3:30 PM, John Omernik j...@omernik.com wrote:

 Based on my earlier emails about the state of service discovery.  I did
 some research and a little writeup on how to use mesos-dns as a forward
 lookup zone in a enterprise bind installation. I feel this is more secure,
 and more comfortable for an enterprise DNS team as opposed to changing the
 first resolver on every client that may interact with mesos to be the
 mesos-dns server.  Please feel free to modify/correct and include this in
 the mesos-dns documentation if you feel it's valuable.


 Goals/Thought Process
 - Run mesos-dns on a non-standard port. (such as 8053).  This allows
 you to run it as a non-root user.
 - While most DNS clients may not understand this (a different port), in
 an enterprise, most DNS servers will respect a forward lookup zone with a
 server using a different port.
 - Setup below for BIND9 allows you to keep all your mesos servers AND
 clients in an enterprise pointing their requests at your enterprise DNS
 server, rather than mesos-dns.
   - This is easier from an enterprise configuration standpoint. Make
 one change on your dns servers, rather than adding a resolver on all the
 clients.
   - This is more secure in that you can run mesos-dns as non-root (53
 is a privileged port, 8053 is not) no sudo required
   - For more security, you can limit connections to the mesos-dns
 server to only your enterprise dns servers. This could help mitigate any
 unknown vulnerabilities in mesos-dns.
   - This allows you to HA mesos-dns in that you can specify multiple
 resolvers for your bind configuration.




 Bind9 Config
 This was put into my named.conf.local It sets up the .mesos zone and
 forwards to mesos dns. All my mesos servers already pointed at this 
 server,
 therefore no client changes required.


 #192.168.0.100 is my host running mesos DNS
zone "mesos" {
 type forward;
 forward only;
 forwarders { 192.168.0.100 port 8053; };
 };
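
As a quick sanity check from any client that already points at the enterprise DNS server, a lookup such as dig @192.168.0.10 leader.mesos (or any record mesos-dns generates for your tasks, assuming the default mesos domain configured below) should now come back answered through the 8053 forwarder.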




 config.json mesos-dns config file.
 I DID specify my internal DNS server in the resolvers (192.168.0.10)
 however, I am not sure if I need to do this.  Since only requests for
 .mesos will actually be sent to mesos-dns.

{
  "masters": ["192.168.0.98:5050"],
  "refreshSeconds": 60,
  "ttl": 60,
  "domain": "mesos",
  "port": 8053,
  "resolvers": ["192.168.0.10"],
  "timeout": 5,
  "listener": "0.0.0.0",
  "email": "root.mesos-dns.mesos"
}


 marathon start json
 Note the lack of sudo here. I also constrained it to one host for now,
 but that could change if needed.

{
  "cmd": "/mapr/brewpot/mesos/mesos-dns/mesos-dns -config=/mapr/brewpot/mesos/mesos-dns/config.json",
  "cpus": 1.0,
  "mem": 1024,
  "id": "mesos-dns",
  "instances": 1,
  "constraints": [["hostname", "CLUSTER", "hadoopmapr1.brewingintel.com"]]
}






Re: Mesos Hadoop Framework 0.1.0

2015-03-28 Thread Tom Arnfeld
We're running the framework to support our legacy jobs written in Hadoop MRv1. 
Essentially this is a feature that moves further towards getting Hadoop to play 
nicely on a shared cluster.


The Hadoop on Mesos framework is pretty greedy at the moment, and it can be 
quite problematic if you're trying to pack a multi-tenant cluster to the max.



--


Tom Arnfeld

Developer // DueDil






On Saturday, Mar 28, 2015 at 2:40 pm, Jeff Schroeder 
jeffschroe...@computer.org, wrote:

Does this have any pros / cons over Myriad, which runs Yarn on Mesos? Other 
than not requiring Yarn :)

On Saturday, March 28, 2015, Tom Arnfeld t...@duedil.com wrote:





Hey everyone,




I thought it best to send an email to the list before merging and tagging a 
0.1.0 release for the Hadoop on Mesos framework. This release is for a new 
feature we've been working on for quite some time, which allows Hadoop 
TaskTrackers to be semi-terminated when they are idle, without destroying any 
map output they may need to retain for running reduce tasks.




Essentially this means that over the lifetime of a job (one with more 
map/reduce tasks than the size of the cluster) the ratio of map and reduce 
slots can change, resulting in significantly better resource utilization, 
because the map slots can be freed up after they have finished doing work.




If anyone is running Hadoop on Mesos or would be kind enough to contribute to 
reviewing the code in the diff, or giving the branch a go on their cluster, 
that would be very much appreciated! We've been running the patch in production 
for several months and have seen some quite significant performance gains with 
our type of workload.




The pull request is here https://github.com/mesos/hadoop/pull/33.




Feel free to get in touch if you have any questions! Thanks!





--


Tom Arnfeld

Developer // DueDil









-- 
Text by Jeff, typos by iPhone

Re: Mesos Hadoop Framework 0.1.0

2015-03-28 Thread Tom Arnfeld
To follow up, this is also a decent solution to a nasty problem in the current 
framework detailed here, https://github.com/mesos/hadoop/issues/32.




--


Tom Arnfeld

Developer // DueDil

On Sat, Mar 28, 2015 at 2:40 PM, Jeff Schroeder
jeffschroe...@computer.org wrote:

 Does this have any pros / cons over Myriad, which runs Yarn on Mesos? Other
 than not requiring Yarn :)
 On Saturday, March 28, 2015, Tom Arnfeld t...@duedil.com wrote:
  Hey everyone,

 I thought it best to send an email to the list before merging and tagging
 a 0.1.0 release for the Hadoop on Mesos framework. This release is for a
 new feature we've been working on for quite some time, which allows Hadoop
 TaskTrackers to be semi-terminated when they are idle, without destroying
 any map output they may need to retain for running reduce tasks.

 Essentially this means that over the lifetime of a job (one with more
 map/reduce tasks than the size of the cluster) the ratio of map and reduce
 slots can change, resulting in significantly better resource utilization,
 because the map slots can be freed up after they have finished doing work.

 If anyone is running Hadoop on Mesos or would be kind enough to contribute
 to reviewing the code in the diff, or giving the branch a go on their
 cluster, that would be very much appreciated! We've been running the patch
 in production for several months and have seen some quite significant
 performance gains with our type of workload.

 The pull request is here https://github.com/mesos/hadoop/pull/33.

 Feel free to get in touch if you have any questions! Thanks!

  --

 Tom Arnfeld
 Developer // DueDil

 -- 
 Text by Jeff, typos by iPhone

Re: hadoop on mesos odd issues with heartbeat and ghost task trackers.

2015-03-03 Thread Tom Arnfeld
Hi John,

Not sure if you ended up getting to the bottom of the issue, but often when
the scheduler gives up and hits this timeout it's because something funky
happened in mesos and the scheduler wasn't updated correctly. Could you
describe the state (with some logs too if possible) of mesos while this
happens?

Tom.

On 25 February 2015 at 17:01, John Omernik j...@omernik.com wrote:

 I am running hadoop on mesos 0.0.8 on Mesos 0.21.0.  I am running into
 a weird issue where it appears two of my nodes, when a task tracker is
 run on them,  never really complete the check in process, the job
 tracker is waiting for their heartbeat, they think they are running
 successfully, and then tasks that would be assigned to them stay in a
 hung/pending state waiting for the heartbeat.

 Basically in the job tracker log, I see the below (where the pending
 tasks is one, the inactive slots is 2 (launched but no heartbeat yet)
 so the jobtracker just sits there waiting, and the node thinks it's
 running fine.

 Is there a way to have the JobTracker give up on a task tracker
 sooner?  This waiting for timeout period seems odd.

 Thanks!

 (if there is any other information I can provide, please let me know)



 Job Tracker Log:

Pending Map Tasks: 0

Pending Reduce Tasks: 1

   Running Map Tasks: 0

Running Reduce Tasks: 0

  Idle Map Slots: 2

   Idle Reduce Slots: 0

  Inactive Map Slots: 2 (launched but no hearbeat yet)

   Inactive Reduce Slots: 2 (launched but no hearbeat yet)

Needed Map Slots: 0

 Needed Reduce Slots: 0

  Unhealthy Trackers: 0

 2015-02-25 10:57:01,930 INFO mapred.ResourcePolicy [Thread-1290]:
 Satisfied map and reduce slots needed.

 2015-02-25 10:57:02,083 INFO mapred.MesosScheduler [IPC Server handler
 7 on 7676]: Unknown/exited TaskTracker: http://hadoopmapr3:31264.

 2015-02-25 10:57:02,097 INFO mapred.MesosScheduler [IPC Server handler
 0 on 7676]: Unknown/exited TaskTracker: http://hadoopmapr3:50060.

 2015-02-25 10:57:02,148 INFO mapred.MesosScheduler [IPC Server handler
 4 on 7676]: Unknown/exited TaskTracker: http://moonman:31182.

 2015-02-25 10:57:02,392 INFO mapred.MesosScheduler [IPC Server handler
 1 on 7676]: Unknown/exited TaskTracker: http://hadoopmapr3:31264.

 2015-02-25 10:57:02,403 INFO mapred.MesosScheduler [IPC Server handler
 3 on 7676]: Unknown/exited TaskTracker: http://hadoopmapr3:50060.

 2015-02-25 10:57:02,459 INFO mapred.MesosScheduler [IPC Server handler
 6 on 7676]: Unknown/exited TaskTracker: http://moonman:31182.

 2015-02-25 10:57:02,702 INFO mapred.MesosScheduler [IPC Server handler
 4 on 7676]: Unknown/exited TaskTracker: http://hadoopmapr3:31264.

 2015-02-25 10:57:02,714 INFO mapred.MesosScheduler [IPC Server handler
 5 on 7676]: Unknown/exited TaskTracker: http://hadoopmapr3:50060.



Re: Does MesosScheduler.resourceOffers need to be reentrant?

2015-02-26 Thread Tom Arnfeld
As far as I know the entire scheduler (the API at least) is single threaded so 
only one callback will fire at a given time.
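
In other words, the driver serialises callbacks, so per-scheduler state can be touched from them without locking -- a tiny sketch with the legacy Python bindings (illustrative only; threads your own code spawns are of course a different story):

from mesos.interface import Scheduler


class BookkeepingScheduler(Scheduler):
    def __init__(self):
        # Safe to mutate from callbacks without a lock: the driver invokes
        # resourceOffers, statusUpdate, etc. one at a time, never concurrently.
        self.pending_offers = []

    def resourceOffers(self, driver, offers):
        self.pending_offers.extend(offers)

    def statusUpdate(self, driver, update):
        # Also runs on the driver's callback thread, after any in-flight
        # resourceOffers call has returned.
        print("task %s -> %s" % (update.task_id.value, update.state))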



--


Tom Arnfeld

Developer // DueDil





(+44) 7525940046

25 Christopher Street, London, EC2A 2BS

On Thu, Feb 26, 2015 at 9:45 AM, Dario Rexin da...@mesosphere.io wrote:

 No, resourceOffers will only be called once at a time.
 Cheers,
 Dario
 On 26 Feb 2015, at 10:09, Itamar Ostricher ita...@yowza3d.com wrote:
 
 Making sure the question is clear:
 I'm implementing a framework scheduler,
 and I want to know if the resourceOffers method can be invoked while a 
 previous invocation hasn't returned yet (on another thread).
 
 Thanks,
 - Itamar.

Re: Hadoop on Mesos

2015-01-29 Thread Tom Arnfeld
Hi Alex,




Great to hear you're hoping to use Hadoop on Mesos. We've been running it for a 
good 6 months and it's been awesome.




I'll answer the simpler question first, running multiple job trackers should be 
just fine.. even multiple JTs with HA enabled (we do this). The mesos scheduler 
for Hadoop will ship all configuration options needed for each TaskTracker 
within mesos, so there's nothing you need to have specifically configured on 
each slave..




# Slow slot allocations




If you only have a few slaves, not many resources and a large amount of 
resources per slot, you might end up with a pretty small slot allocation (e.g 5 
mappers and 1 reducer). Because of the nature of Hadoop, slots are static for 
each TaskTracker and the framework does a best effort to figure out what 
balance of map/reduce slots to launch on the cluster.




Because of this, the current stable version of the framework has a few issues 
when running on small clusters, especially when you don't configure min/max 
slot capacity for each JobTracker. Few links below




- https://github.com/mesos/hadoop/issues/32

- https://github.com/mesos/hadoop/issues/31

- https://github.com/mesos/hadoop/issues/28

- https://github.com/mesos/hadoop/issues/26




Having said that, we've been working on a solution to this problem which 
enables Hadoop to launch different types of slots over the lifetime of a single 
job, meaning you could start with 5 maps and 1 reduce, and then end with 0 maps 
and 6 reduce. It's not perfect, but it's a decent optimisation if you still 
need to use Hadoop.




- https://github.com/mesos/hadoop/pull/33


You may also want to look into how large your executor URI is (the one 
containing the hadoop source that gets downloaded for each task tracker) and 
how long that takes to download.. it might be that the task trackers are taking 
a while to bootstrap.




# HA Hadoop JTs




The framework currently does not support a full HA setup, however that's not a 
huge issue. The JT will automatically restart jobs where they left off on its 
own when a failover occurs, but for the time being all the task trackers will 
be killed and new ones spawned. Depending on your setup, this could be a fairly 
negligible time.




# Multiple versions of hadoop on the cluster




This is totally fine, each JT configuration can be given its own hadoop tar.gz 
file with the right version in it, and they will all happily share the Mesos 
cluster.




I hope this makes sense! Ping me on irc (tarnfeld) if you run into anything 
funky on that branch for flexi trackers.




Tom.


--


Tom Arnfeld

Developer // DueDil

On Thu, Jan 29, 2015 at 4:09 PM, Alex alex.m.lis...@gmail.com wrote:

 Hi guys,
 I'm a Hadoop and Mesos n00b, so please be gentle. I'm trying to set up a
 Mesos cluster, and my ultimate goal is to introduce Mesos in my
 organization by showing off it's ability to run multiple Hadoop
 clusters, plus other stuff, on the same resources. I'd like to be able
 to do this with a HA configuration as close as possible to something we
 would run in production.
 I've successfully set up a Mesos cluster with 3 masters and 4 slaves,
 but I'm having trouble getting Hadoop jobs to run on top of it. I'm
 using Mesos 0.21.1 and Hadoop CDH 5.3.0. Initially I tried to follow the
 Mesosphere tutorial[1], but it looks like it is very outdated and I
 didn't get very far. Then I tried following the instructions in the
 github repo[2], but they're also less than ideal.
 I've managed to get a Hadoop jobtracker running on one of the masters, I
 can submit jobs to it and they eventually finish. The strange thing is
 that they take a really long time to start the reduce task, so much so
 that the first few times I thought it wasn't working at all. Here's part
 of the output for a simple wordcount example:
 15/01/29 16:37:58 INFO mapred.JobClient:  map 0% reduce 0%
 15/01/29 16:39:23 INFO mapred.JobClient:  map 25% reduce 0%
 15/01/29 16:39:31 INFO mapred.JobClient:  map 50% reduce 0%
 15/01/29 16:39:34 INFO mapred.JobClient:  map 75% reduce 0%
 15/01/29 16:39:37 INFO mapred.JobClient:  map 100% reduce 0%
 15/01/29 16:56:25 INFO mapred.JobClient:  map 100% reduce 100%
 15/01/29 16:56:29 INFO mapred.JobClient: Job complete: job_201501291533_0004
 Mesos started 3 task trackers which ran the map tasks pretty fast, but
 then it looks like it was stuck for quite a while before launching a
 fourth task tracker to run the reduce task. Is this normal, or is there
 something wrong here?
 More questions: my configuration file looks a lot like the example in
 the github repo, but that's listed as being representative of a
 pseudo-distributed configuration. What should it look like for a real
 distributed setup? How can I go about running multiple Hadoop clusters?
 Currently, all three masters have the same configuration file, so they
 all create a different framework. How should things be set up for a
 high-availability Hadoop framework that can

Re: mesos and coreos?

2015-01-18 Thread Tom Arnfeld
The way I see it, Mesos is an API and framework for building and running 
distributed systems. CoreOS is an API and framework for running them.

--


Tom Arnfeld

Developer // DueDil





(+44) 7525940046

25 Christopher Street, London, EC2A 2BS

On Sun, Jan 18, 2015 at 3:01 PM, Jason Giedymin jason.giedy...@gmail.com
wrote:

 The value of coreos that immediately comes to mind since I do much work with 
 these tools:
  - the small foot print, it is a minimal os, meant to run containers. So it 
 throws everything not needed for that out.
  - containers are the launch vehicle, thus deps are in container land. I can 
 run and test containers with ease, not having to worry about multiple OSes.
  - with etcd and fleet, coordinating the launch and modification of both 
 machines and cluster make it a breeze. Allowing you to do dynamic mesos 
 scaling up or down. I add nodes at will, across multiple cloud platforms, 
 ready to launch multitude of containers or just mesos.
  - security. There is a defined write strategy. You cannot write willy nilly 
 to any location.
  - all the above further allow auto OS updates, which is supported today on 
 all platforms that deploy coreos. This means more frequent updates since the 
 os is minimal, which should increase the security effectiveness when compared 
 to big box superstore OSes like Redhat or Ubuntu. Some platforms charge quite 
 a bit for managed updates of this frequency and level of testing.
 Coreos allows me to keep apps in a configured container that I trust, tested, 
 and works time and time again.
  
 I see coreos as a complement.
 As a fyi I'm available for questions, debugging, and client work in this area.
 Hope this helps some, from real world usage.
 Sent from my iPad
 On Jan 18, 2015, at 9:16 AM, Victor L vlyamt...@gmail.com wrote:
 
 I am confused: what's the value of mesos on the top of coreos cluster? Mesos 
 provides distributed resource management, fault tolerance, etc., but doesn't 
 coreos provides the same things already? 
 Thanks

Re: hadoop job stuck.

2015-01-14 Thread Tom Arnfeld
Hi Dan,




Can you look at the stdout/stderr logs in the task sandbox for me and share any 
errors here?




Also – What version of Hadoop are you using, and what version of the Hadoop on 
Mesos framework?




Thanks.



--


Tom Arnfeld

Developer // DueDil






On Wednesday, Jan 14, 2015 at 8:22 pm, Dan Dong dongda...@gmail.com, wrote:
Hi,
  When I run hadoop jobs on Mesos(0.21.0), the jobs are stuck for ever:
15/01/14 13:59:30 INFO mapred.FileInputFormat: Total input paths to process : 8
15/01/14 13:59:30 INFO mapred.JobClient: Running job: job_201501141358_0001
15/01/14 13:59:31 INFO mapred.JobClient:  map 0% reduce 0%

From jobtracker log I see:
2015-01-14 13:59:35,542 INFO org.apache.hadoop.mapred.ResourcePolicy: Launching 
task Task_Tracker_0 on http://centos-2.local:31911 with mapSlots=1 reduceSlots=0
2015-01-14 14:04:35,552 WARN org.apache.hadoop.mapred.MesosScheduler: Tracker 
http://centos-2.local:31911 failed to launch within 300 seconds, killing it

 I started manually namenode and jobtracker on master node and datanode on 
slave, but I could not see tasktracker started by mesos on slave. Note that if 
I ran hadoop directly without Mesos( of course the conf files are different and 
tasktracker will be started manually on slave), everything works fine. Any 
hints?

Cheers,
Dan

Re: Running services on all slaves

2015-01-08 Thread Tom Arnfeld
That's a great point Itamar, and something we discussed quite some time ago 
here but never implemented. These are the first few options that spring to mind 
that I can remember...




- Are you using docker containers for your tasks? Why not use containers 
pre-configured on the box for these services too?

- Build some custom init scripts for your services (perhaps systemd and the 
like can do this for you) that will drop your PIDs into cgroups after they 
launch, which would allow you to reserve those resources you need using the 
same resource system as the popular container tools.

- Do you need to actually reserve these resources? Perhaps if you're only 
concerned about memory, or CPU, you could just advertise your slaves as having 
less than the machine actually has (using the --resources flag to mesos-slave).




With any of these three approaches you are still going to need to modify the 
--resources flag on each slave to ensure fewer resources than are actually 
available are advertised to the cluster.
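
For example, on a box with 16 CPUs and 64GB of RAM you might start the slave with something along the lines of --resources='cpus:14;mem:61440', leaving a couple of CPUs and a few GB of memory for the slave process and the other system services (the exact split is illustrative).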




Maybe those options are of some use. If you do end up implementing something in 
this area for setting aside resources for these auxiliary services, I'd love 
to know how you end up doing it!




--


Tom Arnfeld

Developer // DueDil






On Thursday, Jan 8, 2015 at 7:32 am, Itamar Ostricher ita...@yowza3d.com, 
wrote:

Thanks everybody for all your insights!


I totally agree with the last response from Tom.

The per-node services definitely belong to the level that provisions the 
machine and the mesos-slave service itself (in our case, pre-configured GCE 
images).




So I guess the problem I wanted to solve is more general - how can I make sure 
there are resources reserved for all of the system-level stuff that are running 
outside of the mesos context?

To be more specific, if I have a machine with 16 CPUs, it is common that my 
framework will schedule 16 heavy number-crunching processes on it.

This can starve anything else that's running on the machine... (like the 
logging aggregation service, and the mesos-slave service itself)

(this probably explains phenomena of lost tasks we've been observing)

What's the best-practice solution for this situation?





On Wed, Jan 7, 2015 at 2:09 AM, Tom Arnfeld t...@duedil.com wrote:

I completely agree with Charles, though I think I can appreciate what you're 
trying to do here. Take the log aggregation service as an example, you want 
that on every slave to aggregate logs, but want to avoid using yet another 
layer of configuration management to deploy it.




I'm of the opinion that these kind of auxiliary services which all work 
together (the mesos-slave process included) to define what we mean by a slave 
are the responsibility of whoever/whatever is provisioning the mesos-slave 
process and possibly even the machine itself. In our case, that's Chef. IMO 
once a slave registers with the mesos cluster it's immediately ready to start 
doing work, and mesos will actually start offering that slave immediately.




If you continue down this path you're also going to run into a variety of 
interesting timing issues when these services fail, or when you want to upgrade 
them. I'd suggest taking a look at some kind of more advanced process monitor 
to run these aux services like M/Monit instead of mesos (via Marathon).




Think of it another way, would you want something running through mesos to 
install apt package updates once a day? That'd be super weird, so why would log 
aggregation be any different?


--


Tom Arnfeld

Developer // DueDil







On Tue, Jan 6, 2015 at 11:57 PM, Charles Baker cnob...@gmail.com wrote:



It seems like an 'anti-pattern' (for lack of a better term) to attempt to force 
locality on a bunch of dependency services launched through Marathon. I thought 
the whole idea of Mesos (and Marathon) was to treat the data center as one 
giant computer in which it fundamentally should not matter where your services 
are launched. Although I obviously don't know the details of the use-case and 
may be grossly misunderstanding what you are trying to do but to me it sounds 
like you are attempting to shoehorn a non-distributed application into a 
distributed architecture. If this is the case, you may want to revisit your 
implementation and try to decouple the application's requirement of node-level 
dependency locality. It is also a good opportunity to possibly redesign a 
monolithic application into a distributed one.



On Tue, Jan 6, 2015 at 12:53 PM, David Greenberg dsg123456...@gmail.com wrote:

Tom is absolutely correct--you also need to ensure that your special tasks 
run as a user which is assigned a role w/ a special reservation to ensure they 
can always launch.



On Tue, Jan 6, 2015 at 2:38 PM, Tom Arnfeld t...@duedil.com wrote:

I'm not sure if I'm fully aware of the use case but if you use a different 
framework (aka Marathon) to launch these services, should the service die and 
need to be re-launched (or even

Re: Running services on all slaves

2015-01-06 Thread Tom Arnfeld
I completely agree with Charles, though I think I can appreciate what you're 
trying to do here. Take the log aggregation service as an example, you want 
that on every slave to aggregate logs, but want to avoid using yet another 
layer of configuration management to deploy it.




I'm of the opinion that these kind of auxiliary services which all work 
together (the mesos-slave process included) to define what we mean by a slave 
are the responsibility of whoever/whatever is provisioning the mesos-slave 
process and possibly even the machine itself. In our case, that's Chef. IMO 
once a slave registers with the mesos cluster it's immediately ready to start 
doing work, and mesos will actually start offering that slave immediately.




If you continue down this path you're also going to run into a variety of 
interesting timing issues when these services fail, or when you want to upgrade 
them. I'd suggest taking a look at some kind of more advanced process monitor 
to run these aux services like M/Monit instead of mesos (via Marathon).




Think of it another way, would you want something running through mesos to 
install apt package updates once a day? That'd be super weird, so why would log 
aggregation be any different?


--


Tom Arnfeld

Developer // DueDil

On Tue, Jan 6, 2015 at 11:57 PM, Charles Baker cnob...@gmail.com wrote:

 It seems like an 'anti-pattern' (for lack of a better term) to attempt to
 force locality on a bunch of dependency services launched through Marathon.
 I thought the whole idea of Mesos (and Marathon) was to treat the data
 center as one giant computer in which it fundamentally should not matter
 where your services are launched. Although I obviously don't know the
 details of the use-case and may be grossly misunderstanding what you are
 trying to do but to me it sounds like you are attempting to shoehorn a
 non-distributed application into a distributed architecture. If this is the
 case, you may want to revisit your implementation and try to decouple the
 application's requirement of node-level dependency locality. It is also a
 good opportunity to possibly redesign a monolithic application into a
 distributed one.
 On Tue, Jan 6, 2015 at 12:53 PM, David Greenberg dsg123456...@gmail.com
 wrote:
 Tom is absolutely correct--you also need to ensure that your special
 tasks run as a user which is assigned a role w/ a special reservation to
 ensure they can always launch.

 On Tue, Jan 6, 2015 at 2:38 PM, Tom Arnfeld t...@duedil.com wrote:

 I'm not sure if I'm fully aware of the use case but if you use a
 different framework (aka Marathon) to launch these services, should the
 service die and need to be re-launched (or even the slave restarts) could
 you not be in a position where another framework has consumed all resources
 on that slave and your core tasks cannot launch?

 Maybe if you're just using Marathon it might provide a sort of priority
 to decide who gets what resources first, but with multiple frameworks you
 might need to look into the slave resource reservations and framework roles.

 FWIW We're configuring these things out of band (via Chef to be specific).

 Hope this helps!

 --

 Tom Arnfeld
 Developer // DueDil

 (+44) 7525940046
 25 Christopher Street, London, EC2A 2BS


 On Tue, Jan 6, 2015 at 9:05 AM, Itamar Ostricher ita...@yowza3d.com
 wrote:

 Hi,

 I was wondering if the best approach to do what I want is to use mesos
 itself, or other Linux system tools.

 There are a bunch of services that our framework assumes are running on
 all participating slaves (e.g. logging service, data-bridge service, etc.).
 One approach to do that is in the infrastructure level, making sure that
 slave nodes are configured correctly (e.g. with pre-configured images, or
 other provisioning systems).
 Another approach would be to use mesos itself (maybe with something like
 Marathon) to schedule these services on all slave nodes.

 The advantage of the mesos-based approach is that it becomes trivial to
 account for the resource consumption of said services (e.g. make sure
 there's always at least 1 CPU dedicated to this).
 I'm not sure how to achieve something similar with the system-approach.

 Anyone has any insights on this?





Re: Mesos Community Meetings

2015-01-05 Thread Tom Arnfeld
+1 also! Very interesting to hear what’s being discussed. +1 on the google 
hangouts if these meetings are happening in person so we can listen along.



--


Tom Arnfeld

Developer // DueDil






On Monday, Dec 29, 2014 at 4:12 pm, Chris Aniszczyk caniszc...@gmail.com, 
wrote:

+1 to opening up meetings! How about create a google calendar with the 
meetings, agenda and info? 


Also someone should take meeting minutes and publish them to the list after 
each meeting for those who can't attend (on top of making information more 
discoverable via search).



Another approach is to use IRC meetings which there's a bot to record meetings, 
but that lacks the visual aspect of GH (e.g., see IRC meeting notes from 
Aurora: 
http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201412.mbox/%3C20141201192131.4888419FD5%40urd.zones.apache.org%3E)




Anyways, glad to see this finally happening.





On Mon, Dec 29, 2014 at 7:46 AM, Niklas Nielsen nik...@mesosphere.io wrote:Hi 
everyone,


Mesosphere and Twitter have been meeting up regularly to brief and discuss

current joint efforts in the Mesos project.

While this has worked great for the engineering teams, it should be a

community wide meeting as we discuss our agendas, timelines etc. which is

useful for a broader audience.

Unfortunately, we cannot host people on-site, but we can open Google

hangouts for all upcoming meetings.


Any thoughts or suggestions?


Best regards,

Niklas






-- 
Cheers,

Chris Aniszczyk
http://aniszczyk.org
+1 512 961 6719

Re: [VOTE] Release Apache Mesos 0.21.1 (rc2)

2014-12-30 Thread Tom Arnfeld
+1

--


Tom Arnfeld

Developer // DueDil





(+44) 7525940046

25 Christopher Street, London, EC2A 2BS

On Wed, Dec 31, 2014 at 12:16 AM, Ankur Chauhan an...@malloc64.com
wrote:

 +1
 Sent from my iPhone
 On Dec 30, 2014, at 16:01, Tim Chen t...@mesosphere.io wrote:
 
 Hi all,
 
 Just a reminder the vote is up for another 2 hours, let me know if any of 
 you have any objections.
 
 Thanks,
 
 Tim
 
 On Mon, Dec 29, 2014 at 5:32 AM, Niklas Nielsen nik...@mesosphere.io 
 wrote:
 +1, Compiled and tested on Ubuntu Trusty, CentOS Linux 7 and Mac OS X
 
 Thanks guys!
 Niklas
 
 
 On 19 December 2014 at 22:02, Tim Chen t...@mesosphere.io wrote:
 Hi Ankur,
 
 Since MESOS-1711 is just a minor improvement I'm inclined to include it 
 for the next major release which shouldn't be too far away from this 
 release.
 
 If anyone else thinks otherwise please let me know.
 
 Tim
 
 On Fri, Dec 19, 2014 at 12:44 PM, Ankur Chauhan an...@malloc64.com 
 wrote:
 Sorry for a late join in can we get 
 https://issues.apache.org/jira/plugins/servlet/mobile#issue/MESOS-1711 in 
 too or is it too late?
 -- ankur 
 Sent from my iPhone
 
 On Dec 19, 2014, at 12:23, Tim Chen t...@mesosphere.io wrote:
 
 Hi all,
 
 Please vote on releasing the following candidate as Apache Mesos 0.21.1.
 
 
 0.21.1 includes the following:
 
 * This is a bug fix release.
 
 ** Bug
   * [MESOS-2047] Isolator cleanup failures shouldn't cause TASK_LOST.
   * [MESOS-2071] Libprocess generates invalid HTTP
   * [MESOS-2147] Large number of connections slows statistics.json 
 responses.
   * [MESOS-2182] Performance issue in libprocess SocketManager.
 
 ** Improvement
   * [MESOS-1925] Docker kill does not allow containers to exit gracefully
   * [MESOS-2113] Improve configure to find apr and svn libraries/headers 
 in OSX
 
 The CHANGELOG for the release is available at:
 https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.21.1-rc2
 
 
 The candidate for Mesos 0.21.1 release is available at:
 https://dist.apache.org/repos/dist/dev/mesos/0.21.1-rc2/mesos-0.21.1.tar.gz
 
 The tag to be voted on is 0.21.1-rc2:
 https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.21.1-rc2
 
 The MD5 checksum of the tarball can be found at:
 https://dist.apache.org/repos/dist/dev/mesos/0.21.1-rc2/mesos-0.21.1.tar.gz.md5
 
 The signature of the tarball can be found at:
 https://dist.apache.org/repos/dist/dev/mesos/0.21.1-rc2/mesos-0.21.1.tar.gz.asc
 
 The PGP key used to sign the release is here:
 https://dist.apache.org/repos/dist/release/mesos/KEYS
 
 The JAR is up in Maven in a staging repository here:
 https://repository.apache.org/content/repositories/orgapachemesos-1046
 
 Please vote on releasing this package as Apache Mesos 0.21.1!
 
 The vote is open until Tue Dec 23 18:00:00 PST 2014 and passes if a 
 majority of at least 3 +1 PMC votes are cast.
 
 [ ] +1 Release this package as Apache Mesos 0.21.1
 [ ] -1 Do not release this package because ...
 
 Thanks,
 
 Tim & Till
 

Re: Question about External Containerizer

2014-12-03 Thread Tom Arnfeld
Hi Diptanu,




That's correct, the ECP has the responsibility of updating the resources for a 
container, and it will do so as new tasks are launched and killed for an executor. 
Since docker doesn't support this, our containerizer (Deimos does the same) 
goes behind docker to the cgroup for the container and updates the resources in 
a very similar way to the mesos-slave. I believe this is also what the built-in 
Docker containerizer will do.




https://github.com/duedil-ltd/mesos-docker-containerizer/blob/master/containerizer/commands/update.py#L35
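
For anyone curious, here's a rough Python sketch of the kind of cgroup write
involved (purely illustrative, not the linked code; the cgroup mount point and
the docker/<container-id> layout are assumptions that vary by setup):

import os

def update_memory_limit(container_id, mem_bytes, cgroup_root="/sys/fs/cgroup"):
    # Assumed layout: <root>/memory/docker/<container-id>; adjust for your host.
    cgroup = os.path.join(cgroup_root, "memory", "docker", container_id)
    # Write the soft and hard limits, which is essentially what an ECP does
    # when the set of tasks (and therefore resources) for an executor changes.
    for filename in ("memory.soft_limit_in_bytes", "memory.limit_in_bytes"):
        with open(os.path.join(cgroup, filename), "w") as f:
            f.write(str(mem_bytes))

# e.g. update_memory_limit("1234abcd...", 1024 * 1024 * 1024)  # 1GB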





Tom.


--


Tom Arnfeld

Developer // DueDil

On Wed, Dec 3, 2014 at 10:45 AM, Diptanu Choudhury dipta...@gmail.com
wrote:

 Hi,
 I had a quick question about the external containerizer. I see that once
 the Task is launched, the ECP can receive the update calls, and the
 protobuf message passed to ECP with the update call is containerizer::Update
 .
 This protobuf has a Resources [list] field so does that mean Mesos might
 ask a running task to re-adjust the enforced resource limits?
 How would this work if the ECP was launching docker containers because
 Docker doesn't allow changing the resource limits once the container has
 been started?
 I am wondering how does Deimos and mesos-docker-containerizer handle this.
 -- 
 Thanks,
 Diptanu Choudhury
 Web - www.linkedin.com/in/diptanu
 Twitter - @diptanu http://twitter.com/diptanu

Re: Question about External Containerizer

2014-12-03 Thread Tom Arnfeld
When Mesos is asked to launch a task (with either a custom Executor or the 
built in CommandExecutor) it will first spawn the executor which _has_ to be a 
system process, launched via command. This process will be launched inside of a 
Docker container when using the previously mentioned containerizers.




Once the Executor registers with the slave, the slave will send it a number of 
launchTask calls based on the number of tasks queued up for that executor. The 
Executor can then do as it pleases with those tasks, whether it's just a 
sleep(1) or to spawn a subprocess and do some other work. Given it is possible 
for the framework to specify resources for both tasks and executors, and the 
only thing which _has_ to be a system process is the executor, the mesos slave 
will limit the resources of the executor process to the sum of 
(TaskInfo.Executor.Resources + TaskInfo.Resources).
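
To illustrate that summation with a toy example (illustrative numbers only,
not Mesos code):

def executor_limit(executor_resources, task_resources_list):
    # Effective limit = the executor's own resources plus the resources of
    # every task currently assigned to that executor.
    total = dict(executor_resources)
    for task in task_resources_list:
        for name, value in task.items():
            total[name] = total.get(name, 0) + value
    return total

# Executor: 0.1 cpus / 128MB, plus two tasks of 1 cpu / 512MB each.
print(executor_limit({"cpus": 0.1, "mem": 128},
                     [{"cpus": 1, "mem": 512}, {"cpus": 1, "mem": 512}]))
# -> cpus: 2.1, mem: 1152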





Mesos also has the ability to launch new tasks on an already running executor, 
so it's important that mesos is able to dynamically scale the resource limits 
up and down over time. Designing a framework around this idea can lead to some 
powerful workflows which would be a lot more complex to build 
without Mesos.




Just for an example... Spark.




1) User launches a job on spark to map over some data

2) Spark launches a first wave of tasks based on the offers it received (let's 
say T1 and T2)

3) Mesos launches executors for those tasks (let's say E1 and E2) on different 
slaves

4) Spark launches another wave of tasks based on offers, and tells mesos to use 
the same executor (E1 and E2)

5) Mesos will simply call launchTasks(T{3,4}) on the two already running 
executors




At point (3) mesos is going to launch a Docker container and execute your 
executor. However at (5) the executor is already running so the tasks will be 
handed to the already running executor. 




Mesos will guarantee you (I'm 99% sure) that the resources for your container 
have been updated to reflect the limits set on the tasks before handing the 
tasks to you.




I hope that makes some sense!


--


Tom Arnfeld

Developer // DueDil

On Wed, Dec 3, 2014 at 10:54 AM, Diptanu Choudhury dipta...@gmail.com
wrote:

 Thanks for the explanation Tom, yeah I just figured that out by reading
 your code! You're touching the memory.soft_limit_in_bytes and
 memory.limit_in_bytes directly.
 Still curious to understand in which situations Mesos Slave would call the
 external containerizer to update the resource limits of a container? My
 understanding was that once resource allocation happens for a task,
 resources are not taken away until the task exits [fails, crashes or
 finishes] or Mesos asks the slave to kill the task.
 On Wed, Dec 3, 2014 at 2:47 AM, Tom Arnfeld t...@duedil.com wrote:
 Hi Diptanu,

 That's correct, the ECP has the responsibility of updating the resources
 for a container, and it will do so as new tasks are launched and killed for an
 executor. Since docker doesn't support this, our containerizer (Deimos does
 the same) goes behind docker to the cgroup for the container and updates
 the resources in a very similar way to the mesos-slave. I believe this is
 also what the built-in Docker containerizer will do.


 https://github.com/duedil-ltd/mesos-docker-containerizer/blob/master/containerizer/commands/update.py#L35

 Tom.

 --

 Tom Arnfeld
 Developer // DueDil


 On Wed, Dec 3, 2014 at 10:45 AM, Diptanu Choudhury dipta...@gmail.com
 wrote:

 Hi,

 I had a quick question about the external containerizer. I see that once
 the Task is launched, the ECP can receive the update calls, and the
 protobuf message passed to ECP with the update call is
 containerizer::Update.

 This protobuf has a Resources [list] field so does that mean Mesos might
 ask a running task to re-adjust the enforced resource limits?

 How would this work if the ECP was launching docker containers because
 Docker doesn't allow changing the resource limits once the container has
 been started?

 I am wondering how does Deimos and mesos-docker-containerizer handle this.

 --
 Thanks,
 Diptanu Choudhury
 Web - www.linkedin.com/in/diptanu
 Twitter - @diptanu http://twitter.com/diptanu



 -- 
 Thanks,
 Diptanu Choudhury
 Web - www.linkedin.com/in/diptanu
 Twitter - @diptanu http://twitter.com/diptanu

Re: Timeline for 0.22.0?

2014-12-03 Thread Tom Arnfeld
I don't mind helping out by shepherding a release through for 0.21.1, though I 
don't have committer rights.


--


Tom Arnfeld

Developer // DueDil

On Tue, Dec 2, 2014 at 10:44 PM, Benjamin Mahler
benjamin.mah...@gmail.com wrote:

 If anyone is interested in driving a 0.21.1 bug fix release, we could get
 bug fixes out more quickly than waiting for 0.22.0.
 On Tue, Dec 2, 2014 at 2:28 PM, Tim Chen t...@mesosphere.io wrote:
 Hi Scott,

 The patch for MESOS-1925 is already merged into master, so you should be
 able to just grab master in the mean time.

 As for the 0.22.0 timeline, I don't think we've set one yet; usually we
 call an estimated time to release when we have enough to release a new
 version.

 Tim

 On Tue, Dec 2, 2014 at 2:08 PM, Scott Rankin sran...@motus.com wrote:

   Hi all,

  We’re very excited here for Mesos and are working on our first
 production deployment using Mesos/Marathon/Chronos.  One thing that we need
 for production readiness is MESOS-1925.  I see it’s been assigned to 0.22.0
 – I was wondering if there was any timeline yet for when that release will
 come out.  I can put together our own branch with 0.21.0 + the patch, but
 I’d rather wait for the release.

  Thanks!
 Scott





Re: Rocket

2014-12-01 Thread Tom Arnfeld
+1 Sounds exciting!


--


Tom Arnfeld

Developer // DueDil

On Mon, Dec 1, 2014 at 8:03 PM, Jie Yu yujie@gmail.com wrote:

 Sounds great Tim!
 Do you know if they have published an API for the rocket toolset? Are we
 gonna rely on the command line interface?
 - Jie
 On Mon, Dec 1, 2014 at 11:10 AM, Tim Chen t...@mesosphere.io wrote:
 Hi all,

 Per the announcement from CoreOS about Rocket (
 https://coreos.com/blog/rocket/) , it seems to be an exciting
 containerizer runtime that has composable isolation/components, better
 security and image specification/distribution.

 All of these design goals also fit very well into Mesos, where in Mesos
 we also have a pluggable isolators model and have been experiencing some
 pain points with our existing containerizers around image distribution and
 security as well.

 I'd like to propose to integrate Rocket into Mesos with a new Rocket
 containerizer, where I can see we can potentially integrate our existing
 isolators into Rocket runtime.

 Like to learn what you all think,

 Thanks!


Re: Master memory usage

2014-11-22 Thread Tom Arnfeld
I have and it doesn't seem to add up. That being said, the growth of the memory 
and number of tasks does seem to make sense given the issue you linked to.


I'll upgrade and see where that leaves the issue.




Thanks for your help!


--


Tom Arnfeld

Developer // DueDil





(+44) 7525940046

25 Christopher Street, London, EC2A 2BS

On Thu, Nov 20, 2014 at 11:06 PM, Benjamin Mahler
benjamin.mah...@gmail.com wrote:

 Have you done the math on number of tasks * size of task?
 We didn't wipe the .data field in 0.19.1:
 https://issues.apache.org/jira/browse/MESOS-1746
 On Thu, Nov 20, 2014 at 2:51 PM, Tom Arnfeld t...@duedil.com wrote:
 That's what I thought. There are around 2500 tasks launched with this master,
 most of which will be by our Hadoop JT. The Hadoop framework ships the
 configuration for the TT using the TaskInfo.data property, and that looks
 to be about 80K per task.

 Any debugging suggestions?

 --

 Tom Arnfeld
 Developer // DueDil

 (+44) 7525940046
 25 Christopher Street, London, EC2A 2BS


 On Thu, Nov 20, 2014 at 10:33 PM, Benjamin Mahler 
 benjamin.mah...@gmail.com wrote:

 It shouldn't be that high, especially with the size of the cluster I see
 in your stats.

 Which scheduler(s) are you running, and do they create large TaskInfo
 objects? Just a hunch, as I do not recall any leaks in 0.19.1.

 On Tue, Nov 18, 2014 at 1:00 AM, Tom Arnfeld t...@duedil.com wrote:

  I've noticed some strange memory usage behaviour of the Mesos master
 in a small cluster of ours. We have three master nodes in a quorum and are
 using ZK.

 The master in question has 12GB of RAM available, of which the
 mesos-master process is using 10GB (resident), which seems quite a lot.
 That being said I'm not sure what the memory profile of the master should
 look like...

 Here's a snapshot of our /stats.json endpoint.

 This cluster is running 0.19.1 so perhaps there are some memory leak
 fixes in a newer release that we need to take advantage of.

 Any help would be appreciated!

 -

 {activated_slaves:19,active_schedulers:1,active_tasks_gauge:1,cpus_percent:0.116618075801749,cpus_total:171.5,cpus_used:20,deactivated_slaves:0,disk_percent:0.0273684210526316,disk_total:972800,disk_used:26624,elected:1,failed_tasks:11,finished_tasks:2658,invalid_status_updates:2638,killed_tasks:1,lost_tasks:4,master/cpus_percent:0.116618075801749,master/cpus_total:171.5,master/cpus_used:20,master/disk_percent:0.0273684210526316,master/disk_total:972800,master/disk_used:26624,master/dropped_messages:16,master/elected:1,master/event_queue_size:0,master/frameworks_active:1,master/frameworks_inactive:0,master/invalid_framework_to_executor_messages:0,master/invalid_status_update_acknowledgements:0,master/invalid_status_updates:2638,master/mem_percent:0.279896013864818,master/mem_total:1181696,master/mem_used:330752,master/messages_authenticate:0,master/messages_deactivate_framework:0,master/messages_exited_executor:2667,master/messages_framework_to_executor:0,master/messages_kill_task:4397,master/messages_launch_tasks:838024,master/messages_reconcile_tasks:0,master/messages_register_framework:27,master/messages_register_slave:1,master/messages_reregister_framework:326788,master/messages_reregister_slave:31,master/messages_resource_request:0,master/messages_revive_offers:0,master/messages_status_update:8009,master/messages_status_update_acknowledgement:0,master/messages_unregister_framework:26,master/messages_unregister_slave:0,master/outstanding_offers:0,master/recovery_slave_removals:0,master/slave_registrations:1,master/slave_removals:0,master/slave_reregistrations:18,master/slaves_active:19,master/slaves_inactive:0,master/tasks_failed:11,master/tasks_finished:2658,master/tasks_killed:1,master/tasks_lost:4,master/tasks_running:1,master/tasks_staging:0,master/tasks_starting:0,master/uptime_secs:1411611.70786125,master/valid_framework_to_executor_messages:0,master/valid_status_update_acknowledgements:0,master/valid_status_updates:5371,mem_percent:0.279896013864818,mem_total:1181696,mem_used:330752,outstanding_offers:0,registrar/queued_operations:0,registrar/registry_size_bytes:4348,registrar/state_fetch_ms:95.591936,registrar/state_store_ms:48.622848,staged_tasks:2675,started_tasks:26,system/cpus_total:2,system/load_15min:0.05,system/load_1min:0.03,system/load_5min:0.04,system/mem_free_bytes:152408064,system/mem_total_bytes:12631490560,total_schedulers:1,uptime:1411611.27369318,valid_status_updates:5371}


 --

 Tom Arnfeld
 Developer // DueDil

 (+44) 7525940046
 25 Christopher Street, London, EC2A 2BS





Re: Master memory usage

2014-11-20 Thread Tom Arnfeld
That's what I thought. There are around 2500 tasks launched with this master, most 
of which will be by our Hadoop JT. The Hadoop framework ships the configuration 
for the TT using the TaskInfo.data property, and that looks to be about 80K per 
task.
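
As a back-of-envelope check on those numbers (rough figures only):

tasks = 2500
data_per_task = 80 * 1024  # ~80K of TaskInfo.data, in bytes
print("%.0f MB" % (tasks * data_per_task / 1024.0 / 1024.0))  # ~195 MB

So the .data payloads for those tasks alone shouldn't come anywhere near 10GB
resident.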




Any debugging suggestions?


--


Tom Arnfeld

Developer // DueDil





(+44) 7525940046

25 Christopher Street, London, EC2A 2BS

On Thu, Nov 20, 2014 at 10:33 PM, Benjamin Mahler
benjamin.mah...@gmail.com wrote:

 It shouldn't be that high, especially with the size of the cluster I see in
 your stats.
 Which scheduler(s) are you running, and do they create large TaskInfo
 objects? Just a hunch, as I do not recall any leaks in 0.19.1.
 On Tue, Nov 18, 2014 at 1:00 AM, Tom Arnfeld t...@duedil.com wrote:
  I've noticed some strange memory usage behaviour of the Mesos master in
 a small cluster of ours. We have three master nodes in a quorum and are
 using ZK.

 The master in question has 12GB of RAM available, of which the mesos-master
 process is using 10GB (resident), which seems quite a lot. That being
 said I'm not sure what the memory profile of the master should look like...

 Here's a snapshot of our /stats.json endpoint.

 This cluster is running 0.19.1 so perhaps there are some memory leak fixes
 in a newer release that we need to take advantage of.

 Any help would be appreciated!

 -

 {activated_slaves:19,active_schedulers:1,active_tasks_gauge:1,cpus_percent:0.116618075801749,cpus_total:171.5,cpus_used:20,deactivated_slaves:0,disk_percent:0.0273684210526316,disk_total:972800,disk_used:26624,elected:1,failed_tasks:11,finished_tasks:2658,invalid_status_updates:2638,killed_tasks:1,lost_tasks:4,master/cpus_percent:0.116618075801749,master/cpus_total:171.5,master/cpus_used:20,master/disk_percent:0.0273684210526316,master/disk_total:972800,master/disk_used:26624,master/dropped_messages:16,master/elected:1,master/event_queue_size:0,master/frameworks_active:1,master/frameworks_inactive:0,master/invalid_framework_to_executor_messages:0,master/invalid_status_update_acknowledgements:0,master/invalid_status_updates:2638,master/mem_percent:0.279896013864818,master/mem_total:1181696,master/mem_used:330752,master/messages_authenticate:0,master/messages_deactivate_framework:0,master/messages_exited_executor:2667,master/messages_framework_to_executor:0,master/messages_kill_task:4397,master/messages_launch_tasks:838024,master/messages_reconcile_tasks:0,master/messages_register_framework:27,master/messages_register_slave:1,master/messages_reregister_framework:326788,master/messages_reregister_slave:31,master/messages_resource_request:0,master/messages_revive_offers:0,master/messages_status_update:8009,master/messages_status_update_acknowledgement:0,master/messages_unregister_framework:26,master/messages_unregister_slave:0,master/outstanding_offers:0,master/recovery_slave_removals:0,master/slave_registrations:1,master/slave_removals:0,master/slave_reregistrations:18,master/slaves_active:19,master/slaves_inactive:0,master/tasks_failed:11,master/tasks_finished:2658,master/tasks_killed:1,master/tasks_lost:4,master/tasks_running:1,master/tasks_staging:0,master/tasks_starting:0,master/uptime_secs:1411611.70786125,master/valid_framework_to_executor_messages:0,master/valid_status_update_acknowledgements:0,master/valid_status_updates:5371,mem_percent:0.279896013864818,mem_total:1181696,mem_used:330752,outstanding_offers:0,registrar/queued_operations:0,registrar/registry_size_bytes:4348,registrar/state_fetch_ms:95.591936,registrar/state_store_ms:48.622848,staged_tasks:2675,started_tasks:26,system/cpus_total:2,system/load_15min:0.05,system/load_1min:0.03,system/load_5min:0.04,system/mem_free_bytes:152408064,system/mem_total_bytes:12631490560,total_schedulers:1,uptime:1411611.27369318,valid_status_updates:5371}


 --

 Tom Arnfeld
 Developer // DueDil

 (+44) 7525940046
 25 Christopher Street, London, EC2A 2BS


Re: Implementing an Executor

2014-11-19 Thread Tom Arnfeld
Hi Janet,




Great to hear you're using Mesos! It's not the best idea to block in callbacks 
from the mesos drivers, either in the Executor or the Framework. This is because 
you won't be notified correctly when other events happen (master failover, 
shutdown request, kill task request), as I believe the driver guarantees you 
only get one callback at a time and uses a blocking call into userland code.




The slave and master become aware of task status by the executor correctly 
sending the TaskStatus message with TASK_STARTING, TASK_RUNNING, TASK_LOST, 
TASK_FAILED and TASK_FINISHED.




Your executor is guaranteed to be alive for your task to be alive (unless there 
are any failure cases I'm not aware of), so it's easier to monitor your 
task from the outside (if it's a subprocess, for example). Your use of the 
docker containerizer will also contribute to this behaviour, as the mesos slave 
is going to kill the container ASAP after the executor disconnects.




Sending task status updates should do the trick for you here.
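
For reference, a minimal sketch of that pattern using the Python bindings of
this era (mesos.interface / mesos.native are assumed to be installed; the
details are illustrative rather than production-ready):

import threading

from mesos.interface import Executor, mesos_pb2
from mesos.native import MesosExecutorDriver


class MyExecutor(Executor):
    def launchTask(self, driver, task):
        def run():
            # Tell the slave/master the task is now running.
            status = mesos_pb2.TaskStatus()
            status.task_id.value = task.task_id.value
            status.state = mesos_pb2.TASK_RUNNING
            driver.sendStatusUpdate(status)

            # ... do the real work here, e.g. supervise a subprocess ...

            status = mesos_pb2.TaskStatus()
            status.task_id.value = task.task_id.value
            status.state = mesos_pb2.TASK_FINISHED
            driver.sendStatusUpdate(status)

        # Return from the callback immediately; never block in here.
        threading.Thread(target=run).start()


if __name__ == "__main__":
    MesosExecutorDriver(MyExecutor()).run()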




Tom.


--


Tom Arnfeld

Developer // DueDil





(+44) 7525940046

25 Christopher Street, London, EC2A 2BS

On Wed, Nov 19, 2014 at 10:16 PM, Janet Borschowa
janet.borsch...@codefutures.com wrote:

 Hi,
 I'm implementing an executor which is used by the mesos slave to launch
 tasks. The tasks are to launch a docker container - this is because I need
 more info about the launched container than what the docker containerizer
 returns.
 Is it OK to block in the executor's launchTask method until the task
 completes? If not, how does the framework discover when that task
 completes? I could spawn a process which notifies my executor when the task
 completes and then have my executor send a status update. Or is there some
 other recommended way to deal with this when the task could run for an
 indefinite period of time before completing its work?
 Thanks!
 Janet
 --
 Janet Borschowa
 CodeFutures Corporation

Re: Implementing an Executor

2014-11-19 Thread Tom Arnfeld
Hi Janet,




Oh sorry, my mistake. I didn't read your email correctly; I thought you were 
using the containerizer. What you're doing here is actually going to be quite 
difficult to do, the mesos docker containerizer has some quite complex logic 
implemented to ensure the slave stays in sync with the containers that are 
running, and kills anything that goes rogue.




It's going to be non-trivial for you to do that from the executor, though I 
guess you could make use of the docker events API or poll other endpoints in 
the API to check the status of your containers, and off the back of that send 
status updates to the cluster. Doing this, however, brings no guarantee that if 
your executor dies exceptionally (perhaps OOMd) the containers spawned will 
die... they'll keep running in the background and it'll be hard for you to know 
the state of your containers on the cluster.
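
A very rough sketch of the polling approach (the container name and interval
are illustrative, and it shells out to the docker CLI rather than the remote
API just for brevity):

import json
import subprocess
import time

def wait_for_container(name, interval=5):
    # Poll `docker inspect` until the container stops, then return its exit code.
    while True:
        out = subprocess.check_output(["docker", "inspect", name])
        state = json.loads(out)[0]["State"]
        if not state["Running"]:
            return state["ExitCode"]
        time.sleep(interval)

# exit_code = wait_for_container("my-task-container")
# ...then send TASK_FINISHED or TASK_FAILED depending on exit_code.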




You probably want to be aware (if you don't know already) that the resource 
limits assigned to your tasks aren't going to be enforced by mesos because 
docker is running outside of its control. You'll need to pass the correct 
CPU/Memory limit parameters to your docker containers to ensure this happens 
correctly.




Here are the docker API docs; 
https://docs.docker.com/reference/api/docker_remote_api_v1.15/




Something you might want to consider, if all you're trying to do is allow your 
container access to details about itself (e.g `docker inspect`) is to open up 
the docker remote API to be queried by your containers on the slave, and switch 
to using the mesos docker containerizer.


I hope that helps somewhat!




Tom.


--


Tom Arnfeld

Developer // DueDil





(+44) 7525940046

25 Christopher Street, London, EC2A 2BS

On Wed, Nov 19, 2014 at 10:16 PM, Janet Borschowa
janet.borsch...@codefutures.com wrote:

 Hi,
 I'm implementing an executor which is used by the mesos slave to launch
 tasks. The tasks are to launch a docker container - this is because I need
 more info about the launched container than what the docker containerizer
 returns.
 Is it OK to block in the executor's launchTask method until the task
 completes? If not, how does the framework discover when that task
 completes? I could spawn a process which notifies my executor when the task
 completes and then have my executor send a status update. Or is there some
 other recommended way to deal with this when the task could run for an
 indefinite period of time before completing its work?
 Thanks!
 Janet
 --
 Janet Borschowa
 CodeFutures Corporation

Master memory usage

2014-11-18 Thread Tom Arnfeld
I've noticed some strange memory usage behaviour of the Mesos master in a small 
cluster of ours. We have three master nodes in a quorum and are using ZK.

The master in question has 12GB of RAM available, of which the mesos-master 
process is using 10GB (resident), which seems quite a lot. That being said 
I'm not sure what the memory profile of the master should look like...


Here's a snapshot of our /stats.json endpoint.


This cluster is running 0.19.1 so perhaps there are some memory leak fixes in a 
newer release that we need to take advantage of.


Any help would be appreciated!


-


{activated_slaves:19,active_schedulers:1,active_tasks_gauge:1,cpus_percent:0.116618075801749,cpus_total:171.5,cpus_used:20,deactivated_slaves:0,disk_percent:0.0273684210526316,disk_total:972800,disk_used:26624,elected:1,failed_tasks:11,finished_tasks:2658,invalid_status_updates:2638,killed_tasks:1,lost_tasks:4,master/cpus_percent:0.116618075801749,master/cpus_total:171.5,master/cpus_used:20,master/disk_percent:0.0273684210526316,master/disk_total:972800,master/disk_used:26624,master/dropped_messages:16,master/elected:1,master/event_queue_size:0,master/frameworks_active:1,master/frameworks_inactive:0,master/invalid_framework_to_executor_messages:0,master/invalid_status_update_acknowledgements:0,master/invalid_status_updates:2638,master/mem_percent:0.279896013864818,master/mem_total:1181696,master/mem_used:330752,master/messages_authenticate:0,master/messages_deactivate_framework:0,master/messages_exited_executor:2667,master/messages_framework_to_executor:0,master/messages_kill_task:4397,master/messages_launch_tasks:838024,master/messages_reconcile_tasks:0,master/messages_register_framework:27,master/messages_register_slave:1,master/messages_reregister_framework:326788,master/messages_reregister_slave:31,master/messages_resource_request:0,master/messages_revive_offers:0,master/messages_status_update:8009,master/messages_status_update_acknowledgement:0,master/messages_unregister_framework:26,master/messages_unregister_slave:0,master/outstanding_offers:0,master/recovery_slave_removals:0,master/slave_registrations:1,master/slave_removals:0,master/slave_reregistrations:18,master/slaves_active:19,master/slaves_inactive:0,master/tasks_failed:11,master/tasks_finished:2658,master/tasks_killed:1,master/tasks_lost:4,master/tasks_running:1,master/tasks_staging:0,master/tasks_starting:0,master/uptime_secs:1411611.70786125,master/valid_framework_to_executor_messages:0,master/valid_status_update_acknowledgements:0,master/valid_status_updates:5371,mem_percent:0.279896013864818,mem_total:1181696,mem_used:330752,outstanding_offers:0,registrar/queued_operations:0,registrar/registry_size_bytes:4348,registrar/state_fetch_ms:95.591936,registrar/state_store_ms:48.622848,staged_tasks:2675,started_tasks:26,system/cpus_total:2,system/load_15min:0.05,system/load_1min:0.03,system/load_5min:0.04,system/mem_free_bytes:152408064,system/mem_total_bytes:12631490560,total_schedulers:1,uptime:1411611.27369318,valid_status_updates:5371}

--

Tom Arnfeld
Developer // DueDil


(+44) 7525940046
25 Christopher Street, London, EC2A 2BS

Re: hadoop-mesos error

2014-11-18 Thread Tom Arnfeld
Hi John,




Could you paste your JT configuration and the configuration that gets printed 
out by the executor?




Also, what version of Hadoop are you running, and what revision of the 
framework?




Cheers,




Tom.


--


Tom Arnfeld

Developer // DueDil





(+44) 7525940046

25 Christopher Street, London, EC2A 2BS

On Tue, Nov 18, 2014 at 8:27 PM, John Omernik j...@omernik.com wrote:

 Hey all, I updated some things on my cluster and it broke.  :)
 That said, I am at a loss, the JT spins up, however tasks fail right after
 the configuration listing with the error below, and am not sure how to get
 the debug information to troubleshoot this. Any pointers would be
 appreciated.
 Thanks!
 14/11/18 14:19:42 INFO mapred.TaskTracker: /tmp is not tmpfs or ramfs.
 Java Hotspot Instrumentation will be disabled by default
 14/11/18 14:19:42 INFO mapred.TaskTracker: Cleaning up config files
 from the job history folder
 java.lang.NumberFormatException: null
   at java.lang.Integer.parseInt(Integer.java:454)
   at java.lang.Integer.valueOf(Integer.java:582)
   at 
 org.apache.hadoop.mapred.TaskTracker.getResourceInfo(TaskTracker.java:2965)
   at org.apache.hadoop.mapred.TaskTracker.init(TaskTracker.java:2108)
   at 
 org.apache.hadoop.mapred.MesosExecutor.launchTask(MesosExecutor.java:80)
 Exception in thread Thread-1 I1118 14:19:42.040426  8716
 exec.cpp:413] Deactivating the executor libprocess

Re: [VOTE] Release Apache Mesos 0.21.0 (rc1)

2014-11-06 Thread Tom Arnfeld
+1




`make check` passed on Ubuntu 12.04 LTS (kernel 3.2.0-67)


--


Tom Arnfeld

Developer // DueDil





(+44) 7525940046

25 Christopher Street, London, EC2A 2BS

On Thu, Nov 6, 2014 at 8:43 PM, Ian Downes idow...@twitter.com.invalid
wrote:

 Apologies: I used support/tag.sh but had a local branch *and* local tag and
 it pushed the branch only.
 $ git ls-remote --tags origin-wip | grep 0.21.0
 a7733493dc9e6f2447f825671d8a745602c9bf7a refs/tags/0.21.0-rc1
 On Thu, Nov 6, 2014 at 8:11 AM, Tim St Clair tstcl...@redhat.com wrote:
 $ git tag -l | grep 21

 $ git branch -r
   origin/0.21.0-rc1

 It looks like you created a branch vs. tag ...?

 Cheers,
 Tim

 - Original Message -
  From: Ian Downes ian.dow...@gmail.com
  To: d...@mesos.apache.org, user@mesos.apache.org
  Sent: Wednesday, November 5, 2014 5:12:52 PM
  Subject: [VOTE] Release Apache Mesos 0.21.0 (rc1)
 
  Hi all,
 
  Please vote on releasing the following candidate as Apache Mesos 0.21.0.
 
 
  0.21.0 includes the following:
 
 
  State reconciliation for frameworks
  Support for Mesos modules
  Task status now includes source and reason
  A shared filesystem isolator
  A pid namespace isolator
 
  The CHANGELOG for the release is available at:
 
 https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.21.0-rc1
 
 
 
  The candidate for Mesos 0.21.0 release is available at:
 
 https://dist.apache.org/repos/dist/dev/mesos/0.21.0-rc1/mesos-0.21.0.tar.gz
 
  The tag to be voted on is 0.21.0-rc1:
 
 https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.21.0-rc1
 
  The MD5 checksum of the tarball can be found at:
 
 https://dist.apache.org/repos/dist/dev/mesos/0.21.0-rc1/mesos-0.21.0.tar.gz.md5
 
  The signature of the tarball can be found at:
 
 https://dist.apache.org/repos/dist/dev/mesos/0.21.0-rc1/mesos-0.21.0.tar.gz.asc
 
  The PGP key used to sign the release is here:
  https://dist.apache.org/repos/dist/release/mesos/KEYS
 
  The JAR is up in Maven in a staging repository here:
  https://repository.apache.org/content/repositories/orgapachemesos-1038
 
  Please vote on releasing this package as Apache Mesos 0.21.0!
 
  The vote is open until Sat Nov  8 15:09:48 PST 2014 and passes if a
  majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Mesos 0.21.0
  [ ] -1 Do not release this package because ...
 
  Thanks,
 
  Ian Downes
 

 --
 Cheers,
 Timothy St. Clair
 Red Hat Inc.


Re: CDH5.2.3 on mesos

2014-11-01 Thread Tom Arnfeld
We're running an HA job tracker (not deployed on top of mesos itself, though) 
with the mesos-hadoop framework being referenced here. This guide from Cloudera 
(CDH5) is pretty good for getting started: 
http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_hag_jt_ha_intro_config.html





This also explains how ZooKeeper can be used as a failover controller to enable 
automatic failover if one job tracker goes down.


--


Tom Arnfeld
Developer // DueDil




t...@duedil.com (+44) 7525940046


25 Christopher Street, London, EC2A 2BS
Company Number: 06999618

On Fri, Oct 31, 2014 at 2:07 AM, Ankur Chauhan an...@malloc64.com wrote:

 Thanks this is great. I actually ended up using the vanilla Hadoop 
 distribution and it worked just fine. I will try out your tutorial. Side 
 question, is there a solution around making the jobtracker ha?
 -- ankur 
 Sent from my iPhone
 On Oct 30, 2014, at 5:05 PM, Stratos Dimopoulos 
 stratos.dimopou...@gmail.com wrote:
 
 Hi Ankur,
 
 I recently went through the process of installing Hadoop on Mesos over 
 cdh5.1.2
 
 I created a post here - shouldn't be much different for your version: 
 http://strat0sphere.wordpress.com/2014/10/30/hadoop-on-mesos-installation-guide/
 
 You can also find another post, about configuring CDH5.1.2 specifically to 
 use with Mesos, useful: 
 http://strat0sphere.wordpress.com/2014/10/30/cloudera-hdfs-cdh5-installation-to-use-with-mesos/
 
 Have in mind that when using Mesos you don't need to start the jobtracker. 
 Mesos will do this for you. 
 You also mentioned that you are trying to start hadoop as ubuntu user. This 
 is not the right thing to do. Either add root to cloudera's root list or  
 (recommended) use the root user (mapred? hdfs?) that your cloudera version 
 considers as root - you have to check the documentation for this.
 
 Regarding the error you are seeing Does not contain a valid host:port 
 authority: local - I've seen this error when my worker version was 
 different than the jobtracker version (happened because I was using a 
 hadoop-on-mesos tar file compiled with a different version than the one my 
 cluster was using). To fix this you can do the obvious, which is making sure 
 the installed version is the same with the one you ship to the executors 
 through HDFS or you can hack this by adding the property 
 hadoop.skip.worker.version.check to True - In the later case I wish you good 
 luck... Neverhteless, I am not sure if this error can also appear in other 
 cases.
 
 Hope this helps.
 
 Stratos
 
 
 On Tue, Oct 28, 2014 at 12:30 PM, Ankur Chauhan an...@malloc64.com wrote:
 Anyone else have something to add on this?
 -- Ankur Chauhan
 
 On 28 Oct 2014, at 02:10, Ankur Chauhan an...@malloc64.com wrote:
 
 Hi tom,
 
 I was basically following the readme. This gist has the list of commands 
 how i am setting up things
 https://gist.github.com/ankurcha/a9504b0e423b1a40d756 so first of all if 
 possible if you could help me verify if my process of setting up 
 core-site, hdfs-site and mapred-site is correct. I was starting the node 
 with 

$ /opt/hadoop/bin/hadoop jobtracker
 
 There are two errors that i was working through. It seems that hadoop 
 doesn't like running as root (which is good) but despite starting the 
 process as ubuntu i kept getting 
 
  Does not contain a valid host:port authority: local
 
 -- Ankur
 
 On 28 Oct 2014, at 01:57, Tom Arnfeld t...@duedil.com wrote:
 
 Hi Ankur,
 
 There aren't any getting started resources other than the documentation 
 there as far as I know. Could you share your hadoop configuration and 
 perhaps a description of the problems you're having?
 
 Tom.
 
 
 
 On Tue, Oct 28, 2014 at 8:53 AM, Ankur Chauhan an...@malloc64.com 
 wrote:
 H,
 
 
 I was trying to setup mesos/hadoop with the latest CDH version (MR1) and 
 it seems like the instructions are sort of out of date and I also tried 
 the suggestions in https://github.com/mesos/hadoop/issues/25 but after 4 
 hours of flailing around I am still kind of stuck :-/
 
 It seems like the configuration/installation instructions aren't 
 complete and I am just too new to hadoop to figure out what's missing or 
 going wrong. Does anyone know of a good resource I can use to get going?
 
 -- Ankur
 

Re: CDH5.2.3 on mesos

2014-10-28 Thread Tom Arnfeld
Hi Ankur,


There aren't any getting started resources other than the documentation there as 
far as I know. Could you share your hadoop configuration and perhaps a 
description of the problems you're having?




Tom.

On Tue, Oct 28, 2014 at 8:53 AM, Ankur Chauhan an...@malloc64.com wrote:

 H,
 I was trying to setup mesos/hadoop with the latest CDH version (MR1) and it 
 seems like the instructions are sort of out of date and I also tried the 
 suggestions in https://github.com/mesos/hadoop/issues/25 
 https://github.com/mesos/hadoop/issues/25 but after 4 hours of flailing 
 around I am still kind of stuck :-/
 It seems like the configuration/installation instructions aren't complete and 
 I am just too new to hadoop to figure out what's missing or going wrong. Does 
 anyone know of a good resource I can use to get going?
 -- Ankur

Re: Problems with OOM

2014-09-26 Thread Tom Arnfeld
I'm not sure if this is at all related to the issue you're seeing, but we ran
into this fun issue (or at least this seems to be the cause) helpfully
documented on this blog article:
http://blog.nitrous.io/2014/03/10/stability-and-a-linux-oom-killer-bug.html.

TLDR: OOM killer getting into an infinite loop, causing the CPU to spin out
of control on our VMs.

More details in this commit message to the OOM killer earlier this year;
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=0c740d0afc3bff0a097ad03a1c8df92757516f5c

Hope this helps somewhat...

On 26 September 2014 14:15, Tomas Barton barton.to...@gmail.com wrote:

 Just to make sure, all slaves are running with:

 --isolation='cgroups/cpu,cgroups/mem'

 Is there something suspicious in mesos slave logs?

 On 26 September 2014 13:20, Stephan Erb stephan@blue-yonder.com
 wrote:

  Hi everyone,

 I am having issues with the cgroups isolation of Mesos. It seems like
 tasks are prevented from allocating more memory than their limit. However,
 they are never killed.

- My scheduled task allocates memory in a tight loop. According to
'ps', once its memory requirements are exceeded it is not killed, but ends
up in the state D (uninterruptible sleep (usually IO)).
- The task is still considered running by Mesos.
- There is no indication of an OOM in dmesg.
- There is neither an OOM notice nor any other output related to the
task in the slave log.
- According to htop, the system load is increased with a significant
portion of CPU time spend within the kernel. Commonly the load is so high
that all zookeeper connections time out.

 I am running Aurora and Mesos 0.20.1 using the cgroups isolation on
 Debian 7 (kernel 3.2.60-1+deb7u3). .

 Sorry for the somewhat unspecific error description. Still, anyone an
 idea what might be wrong here?

 Thanks and Best Regards,
 Stephan





Re: [VOTE] Release Apache Mesos 0.20.1 (rc3)

2014-09-19 Thread Tom Arnfeld
+1 (non-binding)

Make check on Ubuntu 12.04 with gcc 4.6.3

On 19 September 2014 17:37, Tim Chen t...@mesosphere.io wrote:

 +1 (non-binding)

 Make check on Centos 5.5, docker tests all passed too.

 Tim

 On Fri, Sep 19, 2014 at 9:17 AM, Jie Yu yujie@gmail.com wrote:

 +1 (binding)

 Make check on centos5 and centos6 (gcc48)

 On Thu, Sep 18, 2014 at 4:05 PM, Adam Bordelon a...@mesosphere.io
 wrote:

 Hi all,

 Please vote on releasing the following candidate as Apache Mesos 0.20.1.


 0.20.1 includes the following:

 
 Minor bug fixes for docker integration, network isolation, build, etc.

 The CHANGELOG for the release is available at:

 https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.20.1-rc3

 

 The candidate for Mesos 0.20.1 release is available at:

 https://dist.apache.org/repos/dist/dev/mesos/0.20.1-rc3/mesos-0.20.1.tar.gz

 The tag to be voted on is 0.20.1-rc3:
 https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.20.1-rc3

 The MD5 checksum of the tarball can be found at:

 https://dist.apache.org/repos/dist/dev/mesos/0.20.1-rc3/mesos-0.20.1.tar.gz.md5

 The signature of the tarball can be found at:

 https://dist.apache.org/repos/dist/dev/mesos/0.20.1-rc3/mesos-0.20.1.tar.gz.asc

 The PGP key used to sign the release is here:
 https://dist.apache.org/repos/dist/release/mesos/KEYS

 The JAR is up in Maven in a staging repository here:
 https://repository.apache.org/content/repositories/orgapachemesos-1036

 Please vote on releasing this package as Apache Mesos 0.20.1!

 The vote is open until Mon Sep 22 17:00:00 PDT 2014 and passes if a
 majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Mesos 0.20.1
 [ ] -1 Do not release this package because ...

 Thanks,
 Adam and Bhuvan






Re: Mesos task ordering guarantees

2014-09-17 Thread Tom Arnfeld
Thanks for taking a look, created a ticket
https://issues.apache.org/jira/browse/MESOS-1812

On 18 September 2014 02:30, Vinod Kone vinodk...@gmail.com wrote:

 Looked at the code in Slave::runTask() and indeed there is a bug that
 doesn't guarantee the order of task delivery to an executor. Mind filing a
 ticket?

 On Wed, Sep 17, 2014 at 6:22 PM, Tom Arnfeld t...@duedil.com wrote:

 Hey Vinod,

 For the most part I have indeed observed this to be the case. However
 every now and then the tasks are being launched out of order. Here's a
 slave log https://gist.github.com/tarnfeld/7a275e2ddffdc4da9e2f.

 You can see the slave is assigned the tasks in order, *Task_Tracker_10* first
 then *slots_Task_Tracker_10* which is the order they should be. Though
 they are queued for launching by the executor in the wrong order.

 On 14 September 2014 19:46, Tom Arnfeld t...@duedil.com wrote:

 That's great, thanks Vinod!


 On Sun, Sep 14, 2014 at 5:33 PM, Vinod Kone vinodk...@gmail.com wrote:

 Yes. The order is guaranteed.

 @vinodkone

 On Sep 14, 2014, at 5:28 AM, Tom Arnfeld t...@duedil.com wrote:

  Hey,

 I couldn't seem to find any documentation on this..

 If a framework responds to an offer with two tasks and they share the
 same executor (therefore leading to two invocations of *launchTask()* on
 the executor), does Mesos provide any guarantees around the order of those
 tasks being handed to the executor once it comes up?

 Given that the LaunchTasksMessage protobuf contains a repeated TaskInfo 
 tasks
 does this mean the order will be honoured?

 Thanks,

 Tom.







Mesos task ordering guarantees

2014-09-14 Thread Tom Arnfeld
Hey,

I couldn't seem to find any documentation on this..

If a framework responds to an offer with two tasks and they share the same
executor (therefore leading to two invocations of *launchTask()* on the
executor), does Mesos provide any guarantees around the order of those
tasks being handed to the executor once it comes up?

Given that the LaunchTasksMessage protobuf contains a repeated TaskInfo tasks
does this mean the order will be honoured?

Thanks,

Tom.


Re: Sandbox GC fails

2014-09-08 Thread Tom Arnfeld
That's useful to know, thanks Vinod. I'll try and dig deeper.

On Mon, Sep 8, 2014 at 5:33 AM, Vinod Kone vinodk...@gmail.com wrote:

 On Sat, Sep 6, 2014 at 8:23 AM, Tom Arnfeld t...@duedil.com wrote:
 If I try and manually remove the directory mentioned, it works fine. Is
 this a known issue, or should I do a little more debugging? I've not tried
 to reproduce it under specific conditions yet.


 This is surprising. GC does a recursive directory removal (see os::rmdir()
 in stout) using post-order traversal. Definitely some debugging is in order
 to see which directory failed and why. Does your sandbox contain any
 special files (other than directories and files) like mounts, devices etc?
 As a side note, should mesos perhaps have some kind of retry mechanism for
 GC? Also, will GC still run for an executor if the slave restarts after an
 executor terminates but before the GC process runs?

 I don't know what the error was above but I doubt a retry would've helped
 here. And yes GC runs for a terminated executor when slave restarts.

Sandbox GC fails

2014-09-06 Thread Tom Arnfeld
I've noticed the disk on my mesos slaves filling up when running tasks that
generate large amounts of data in their sandbox directories (~2-5GB). The
tasks don't last very long, and I can see that the mesos GC process is
trying to delete them, but failing. Here are some logs;

--

W0906 02:56:00.256515  1434 gc.cpp:139] Failed to delete
'/var/lib/mesos-slave/slaves/20140724-124232-33820844-5050-18432-4/frameworks/20140724-201017-50598060-5050-16139-0314/executors/0c5ce2f8-3564-11e4-a014-22000a48a50a/runs/d3e1342f-89d8-4930-a9ed-35e33920779c':
Directory not empty
W0906 02:56:00.258396  1434 gc.cpp:139] Failed to delete
'/var/lib/mesos-slave/slaves/20140724-124232-33820844-5050-18432-4/frameworks/20140724-201017-50598060-5050-16139-0314':
Directory not empty
W0906 02:56:00.259904  1434 gc.cpp:139] Failed to delete
'/var/lib/mesos-slave/slaves/20140724-124232-33820844-5050-18432-4/frameworks/20140724-201017-50598060-5050-16139-0314/executors/0c5ce2f8-3564-11e4-a014-22000a48a50a':
Directory not empty

--

If I try and manually remove the directory mentioned, it works fine. Is
this a known issue, or should I do a little more debugging? I've not tried
to reproduce it under specific conditions yet.

As a side note, should mesos perhaps have some kind of retry mechanism for
GC? Also, will GC still run for an executor if the slave restarts after an
executor terminates but before the GC process runs?

Tom.


Re: Mesos 0.20.0 with Docker registry availability

2014-09-05 Thread Tom Arnfeld
 You can tag each image with your commit hash that way Mesos will always
have to do a docker pull and you don't lose the fast iteration cycle in
development.

I mentioned this on one of the review requests the other day. The problem
here is that, say I want to iterate quickly on installing things for our
Hadoop on Mesos cluster, I now need to change all the hadoop configuration
on my Job Trackers to point to the new image, which means a restart of the
JT and jobs will die. This goes for pretty much every mesos framework that
isn't for launching long-running tasks.
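
For completeness, a minimal sketch of the commit-hash tagging idea under
discussion (the image name here is purely illustrative):

import subprocess

def image_with_commit_tag(image="registry.example.com/hadoop-tasktracker"):
    # Derive an immutable tag from the current git revision.
    sha = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"]).strip()
    if isinstance(sha, bytes):
        sha = sha.decode("ascii")
    return "%s:%s" % (image, sha)

# e.g. subprocess.check_call(["docker", "build", "-t", image_with_commit_tag(), "."])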


On 5 September 2014 11:14, Steve Domin st...@gocardless.com wrote:

 Hi Ryan,

 You can tag each image with your commit hash that way Mesos will always
 have to do a docker pull and you don't lose the fast iteration cycle in
 development.

 Steve

 On Friday, September 5, 2014, craig mcmillan mccraigmccr...@gmail.com
 wrote:

 hey ryan,

 there are two deployment use-cases i generally have :

 - production : i want to consider carefully what i deploy, and refer to
a specific image. a versioned tag works well here

 - development : i want to iterate quickly and something like a branch
 (movable tag) works really well here, à la heroku :
 git-push = commit-hook = build docker-image =
 curl-to-marathon

 it's the development use-case that pull on every launch supports best

 would an option in the ContainerInfo to pull on every launch be
 reasonable ?

 i'm happy to do a PR if that would be helpful !

 :craig

 On 5 Sep 2014, at 9:07, Ryan Thomas wrote:

  Whilst this is somewhat unrelated to the mesos implementation, I think it
 is generally good practice to have immutable tags on the images, this is
 something I dislike about docker :)

 Whilst the gc of old images will eventually become a problem, it will
 really
 only be the layer delta that is consumed with each new tag. But I think
 yes, there would need to be some mechanism to clear out the images in the
 local registry.

 ryan
 On 5 Sep 2014 18:03, mccraig mccraig mccraigmccr...@gmail.com wrote:

  ah, so i will have to use a different tag to update an app

 one immediate problem i can see is that it makes garbage collecting old
 docker images from slaves harder : currently i update the image
 associated
 with a tag and restart tasks to update the running app, then
 occasionally a
 cron job to remove all docker images with no tag

 if every updated image has a new tag it will be harder to figure out
 which
 images to remove... perhaps any with no running container, though that
 could lead to unnecessary pulls and slower restarts of failed tasks

 :craig

 On 5 Sep 2014, at 08:43, Ryan Thomas r.n.tho...@gmail.com wrote:

 Hey Craig,

 docker run will attempt a pull of the image if it cannot find a matching
 image and tag in its local repository.

 So it should only pull on the first run of a given tag.

 ryan
 On 5 Sep 2014 17:41, mccraig mccraig mccraigmccr...@gmail.com
 wrote:

  hi tim,

 if it doesn't pull on every run, when will it pull ?

 :craig

 On 5 Sep 2014, at 07:05, Tim Chen t...@mesosphere.io wrote:

 Hi Maxime,

 It is a very valid concern and that's why I've added a patch that
 should
 go out in 0.20.1 to not do a docker pull on every run anymore.

 Mesos will still try to docker pull when the image isn't available
 locally (via docker inspect), but only once.

 The downside of course is that you're not able to automatically get the
 latest tagged image, but I think it's a worthwhile price to pay to gain
 the
 benefits of not depending on the registry, being able to run local images and
 more.

 Tim


 On Thu, Sep 4, 2014 at 10:50 PM, Maxime Brugidou 
 maxime.brugi...@gmail.com wrote:

  Hi,

 The current Docker integration in 0.20 does a docker pull from the
 registry before running any task. This means that your entire Mesos
 cluster
 becomes unusable if the registry goes down.

 The docs allow you to configure a custom .dockercfg for your tasks to
 point to a private docker registry.

 However it is not easy to run an HA docker registry. The
 docker-registry
 project recommends using S3 storage but this is definitely not an
 option for
 some people.

 I know that for regular artifacts, Mesos can use HDFS storage and you
 can run your HDFS datanodes as Mesos tasks.

 So even if I attempt to have a docker registry storage in HDFS (which
 is
 not supported by docker-registry at the moment), I am stuck on a
 chicken
 and egg problem. I want to have as little services outside of Mesos as
 possible and it is hard to maintain HA services (especially outside of
 Mesos).

 Is there anyone running Mesos with Docker in production without S3? I
 am
 trying to make all the services outside of Mesos (the infra
 services that
 are necessary to run Mesos like DNS, Haproxy, Chef server... etc)
 either HA
 or not critical for the cluster to run. The docker registry is a new
 piece
 of infra outside of Mesos that is critical...

 Best,
 Maxime






Re: Introducing Portainer

2014-09-04 Thread Tom Arnfeld
That's a great question James. So for the past ~8 months we've been using a
jenkins master + (n) slaves setup to build images. We currently build
around ~20 different images for production, a few of which are also set up
separately for CI (building a GitHub pull request, tagging with a branch
name, running unit/integration tests and pushing the image on success). To
be honest, this actually does work pretty well. Docker is a stable bit of
kit, and most of the issues we ran into were environmental... though still
a problem. I think this approach is valid, and will work for most users -
and even better if you couple it with the Jenkins on Mesos framework.

I'd like to highlight a few of the problems we found with running that
infrastructure;

- Even though we have a fair number of images, people don't build
images 24/7. However, we have some big machines up 24/7 to serve the image
builds such that they complete in a timely manner when required... people
don't like to wait for things, and something like this can block a
developer unnecessarily.




On 4 September 2014 11:06, James Gray zaa...@gmail.com wrote:

 This is def interesting for us as this is a problem we are facing too.

 But just to clarify, what is the advantage of this approach over
 submitting Docker build jobs to either the Chronos or Jenkins
 schedulers?  For our use case, we were thinking about writing a
 scheduler for GoCD (which we use for CI/CD) to work much in the same
 way as the existing Jenkins scheduler.

 On Thu, Sep 4, 2014 at 12:29 AM, Joe Smith yasumo...@gmail.com wrote:
  +1 to a registry, stuff like NPM, vagrantcloud, and docker make it very
  clean to search (but seems like it takes more work to support and setup).
 
  Tom- this is rad! Also loving the use of Pesos- definitely looking
 forward
   to more contributors there :)
 
 
  On Wed, Sep 3, 2014 at 4:13 PM, Chris Aniszczyk z...@twitter.com wrote:
 
  Just an idea but I think we should strive to provide a better approach
  that is more scalable/searchable IMHO as the number of frameworks
 continue
  to grow. I created an issue here to discuss potential options and if
 people
  are interested in providing some type of framework registry:
  https://issues.apache.org/jira/browse/MESOS-1759
 
  I see a couple of options that could be interesting, either the more lax
  community driven approach that JenkinsCI does via a GitHub organization
 or
  building a web-based registry similar to what the docker/ansible folks
 have
  done.
 
 
  On Wed, Sep 3, 2014 at 6:03 PM, Vinod Kone vinodk...@gmail.com wrote:
 
  This is great Tom. Thanks for sharing. We do list Mesos frameworks on
 the
  website (
 http://mesos.apache.org/documentation/latest/mesos-frameworks/).
  Please send a PR or RB request.
 
 
  On Wed, Sep 3, 2014 at 3:50 PM, Tom Arnfeld t...@duedil.com wrote:
 
  @Ankur Wups! That's silly of me...
  http://github.com/duedil-ltd/portainer
 
 
  On 3 September 2014 23:45, Ankur Chauhan an...@malloc64.com wrote:
 
  Could you share a link to the repo?
 
 
  On Wed, Sep 3, 2014 at 3:20 PM, Tom Arnfeld t...@duedil.com wrote:
 
  Hey everyone,
 
  Thought it would be worth sharing this on the mailing list. We've
  recently open sourced a Mesos framework called Portainer, which is
 for
  building docker containers on top of your cluster.
 
  It is in working order, though very early stage... It supports all
  Dockerfile instructions (including ADD) and can build multiple
 images in
  parallel. It's written entirely in Python, and is also built upon
 the Pesos
  python framework API @wickman, @nekto0n and I have been working on,
 so
  there's no need to install libmesos to use the framework.
 
  I ended up trying out the idea because we've had a painful
 experience
  managing dedicated infrastructure for building all of our images,
 which I'm
  sure some of you can empathise with, and figured we could leverage
 the spare
  capacity on our new Mesos cluster to cut that out entirely.
 
  We'd love any feedback or suggestions, as well as any contributions!
  Looking forward to hearing what you all think. :-)
 
  Side note... It would be great if there were a place we could list
 all
  of the known frameworks for users to explore, maybe this already
 exists?
 
  Cheers,
 
  Tom (and the rest of the infra team at DueDil).
 
 
 
 
 
 
 
  --
  Cheers,
 
  Chris Aniszczyk | Open Source | Twitter, Inc.
  @cra | +1 512 961 6719
 
 



Re: Introducing Portainer

2014-09-04 Thread Tom Arnfeld
still very early on (though we are already using it for some builds)... so
I'm very keen to hear feedback and see contributions from those using it!

Thanks,

Tom.



On 4 September 2014 11:30, Tom Arnfeld t...@duedil.com wrote:

 That's a great question James. So for the past ~8 months we've been using
 a jenkins master + (n) slaves setup to build images. We currently build
 around ~20 different images for production, a few of which are also set up
 separately for CI (building a GitHub pull request, tagging with a branch
 name, running unit/integration tests and pushing the image on success). To
 be honest, this actually does work pretty well. Docker is a stable bit of
 kit, and most of the issues we ran into were environmental... though still
 a problem. I think this approach is valid, and will work for most users -
 and even better if you couple it with the Jenkins on Mesos framework.

 I'd like to highlight a few of the problems we found with running that
 infrastructure;

 - Even though we have a fair number of images, people don't build
 images 24/7. However, we have some big machines up 24/7 to serve the image
 builds such that they complete in a timely manner when required... people
 don't like to wait for things, and something like this can block a
 developer unnecessarily.




 On 4 September 2014 11:06, James Gray zaa...@gmail.com wrote:

 This is def interesting for us as this is a problem we are facing too.

 But just to clarify, what is the advantage of this approach over
 submitting Docker build jobs to either the Chronos or Jenkins
 schedulers?  For our use case, we were thinking about writing a
 scheduler for GoCD (which we use for CI/CD) to work much in the same
 way as the existing Jenkins scheduler.

 On Thu, Sep 4, 2014 at 12:29 AM, Joe Smith yasumo...@gmail.com wrote:
  +1 to a registry, stuff like NPM, vagrantcloud, and docker make it very
  clean to search (but seems like it takes more work to support and
 setup).
 
  Tom- this is rad! Also loving the use of Pesos- definitely looking
 forward
   to more contributors there :)
 
 
  On Wed, Sep 3, 2014 at 4:13 PM, Chris Aniszczyk z...@twitter.com wrote:
 
  Just an idea but I think we should strive to provide a better approach
  that is more scalable/searchable IMHO as the number of frameworks
 continue
  to grow. I created an issue here to discuss potential options and if
 people
  are interested in providing some type of framework registry:
  https://issues.apache.org/jira/browse/MESOS-1759
 
  I see a couple of options that could be interesting, either the more lax
  community driven approach that JenkinsCI does via a GitHub
 organization or
  building a web-based registry similar to what the docker/ansible folks
 have
  done.
 
 
  On Wed, Sep 3, 2014 at 6:03 PM, Vinod Kone vinodk...@gmail.com
 wrote:
 
  This is great Tom. Thanks for sharing. We do list Mesos frameworks on
 the
  website (
 http://mesos.apache.org/documentation/latest/mesos-frameworks/).
  Please send a PR or RB request.
 
 
  On Wed, Sep 3, 2014 at 3:50 PM, Tom Arnfeld t...@duedil.com wrote:
 
  @Ankur Wups! That's silly of me...
  http://github.com/duedil-ltd/portainer
 
 
  On 3 September 2014 23:45, Ankur Chauhan an...@malloc64.com wrote:
 
  Could you share a link to the repo?
 
 
  On Wed, Sep 3, 2014 at 3:20 PM, Tom Arnfeld t...@duedil.com wrote:
 
  Hey everyone,
 
  Thought it would be worth sharing this on the mailing list. We've
  recently open sourced a Mesos framework called Portainer, which is
 for
  building docker containers on top of your cluster.
 
  It is in working order, though very early stage... It supports all
  Dockerfile instructions (including ADD) and can build multiple
 images in
  parallel. It's written entirely in Python, and is also built upon
 the Pesos
  python framework API @wickman, @nekto0n and I have been working
 on, so
  there's no need to install libmesos to use the framework.
 
  I ended up trying out the idea because we've had a painful
 experience
  managing dedicated infrastructure for building all of our images,
 which I'm
  sure some of you can empathise with, and figured we could leverage
 the spare
  capacity on our new Mesos cluster to cut that out entirely.
 
  We'd love any feedback or suggestions, as well as any
 contributions!
  Looking forward to hearing what you all think. :-)
 
  Side note... It would be great if there were a place we could list
 all
  of the known frameworks for users to explore, maybe this already
 exists?
 
  Cheers,
 
  Tom (and the rest of the infra team at DueDil).
 
 
 
 
 
 
 
  --
  Cheers,
 
  Chris Aniszczyk | Open Source | Twitter, Inc.
  @cra | +1 512 961 6719
 
 





Introducing Portainer

2014-09-03 Thread Tom Arnfeld
Hey everyone,

Thought it would be worth sharing this on the mailing list. We've recently
open sourced a Mesos framework called Portainer, which is for *building*
docker containers on top of your cluster.

It is in working order, though very early stage... It supports all
*Dockerfile* instructions (including *ADD*) and can build multiple images
in parallel. It's written entirely in Python, and is also built upon the
Pesos python framework API @wickman, @nekto0n and I have been working on,
so there's no need to install libmesos to use the framework.

I ended up trying out the idea because we've had a painful experience
managing dedicated infrastructure for building all of our images, which I'm
sure some of you can empathise with, and figured we could leverage the
spare capacity on our new Mesos cluster to cut that out entirely.

We'd love any feedback or suggestions, as well as any contributions!
Looking forward to hearing what you all think. :-)

Side note... It would be great if there were a place we could list all of
the known frameworks for users to explore, maybe this already exists?

Cheers,

Tom (and the rest of the infra team at DueDil).


Re: Introducing Portainer

2014-09-03 Thread Tom Arnfeld
@Ankur Wups! That's silly of me... http://github.com/duedil-ltd/portainer


On 3 September 2014 23:45, Ankur Chauhan an...@malloc64.com wrote:

 Could you share a link to the repo?


 On Wed, Sep 3, 2014 at 3:20 PM, Tom Arnfeld t...@duedil.com wrote:

 Hey everyone,

 Thought it would be worth sharing this on the mailing list. We've
 recently open sourced a Mesos framework called Portainer, which is for
 *building* docker containers on top of your cluster.

 It is in working order, though very early stage... It supports all
 *Dockerfile* instructions (including *ADD*) and can build multiple
 images in parallel. It's written entirely in Python, and is also built upon
 the Pesos python framework API @wickman, @nekto0n and I have been working
 on, so there's no need to install libmesos to use the framework.

 I ended up trying out the idea because we've had a painful experience
 managing dedicated infrastructure for building all of our images, which I'm
 sure some of you can empathise with, and figured we could leverage the
 spare capacity on our new Mesos cluster to cut that out entirely.

 We'd love any feedback or suggestions, as well as any contributions!
 Looking forward to hearing what you all think. :-)

 Side note... It would be great if there were a place we could list all of
 the known frameworks for users to explore, maybe this already exists?

 Cheers,

 Tom (and the rest of the infra team at DueDil).





Re: Mesos slaves across network zones

2014-08-26 Thread Tom Arnfeld
Hey,

We've been running mesos slaves across sites, most in a private cloud off
site and using AWS EC2 for extra burst capacity when required, across a
Direct Connect link. We've found this model to work well on the mesos side,
though it's key to understand the interaction between tasks running across
multiple sites at the same time so you're aware of the effects of latency
and throughput (e.g. data transfer if you're using Hadoop).

All of our master nodes (and zookeeper quorum) are in the same site (though
not physical location), but this isn't an issue for us since we're not
using AWS as a mechanism for redundancy.

I'm not aware of any built-in resource unit for network latency or
throughput, but I don't see any reason you couldn't specify your own on
each slave and configure frameworks to take that into account when making
scheduling decisions. The recent addition of the network isolator (
http://mesos.apache.org/documentation/latest/network-monitoring/) might
also be of use to you here.

Very interested in what others are doing in this space.

Tom.


On 26 August 2014 09:19, Yaron Rosenbaum ya...@whatson-social.com wrote:

 Hi

 Here's a crazy idea:
 Is it possible / has anyone tried to run Mesos where the slaves are in
 radically different network zones? For example: A few slaves on Azure, a
 few slaves on AWS, and a bunch of other slaves on premises etc.

- Assuming it's possible, is it possible to define resource
requirements for tasks, in terms of 'access to network resource A with less
than X latency and throughput between i and m' for example?
- Masters would probably have to be 'close' to each other, to prevent
'brain-splits', true or not ?
   - If so, then how does one assure Master HA ?


 I've been thinking about this for a while, and can't find a reason 'why
 not'.

 Please share your thoughts on the subject.

 (Y)




Re: MesosCon attendee introduction thread

2014-08-17 Thread Tom Arnfeld
 ​Depending on the specific needs, different models are possible. Would be
great to chat.

I'd love to join in this conversation, in our recent deployment we've been
bridging Amazon EC2 with our private cloud to make use of the elasticity
AWS provides, and have been thinking of ways to better improve that model.
The first is what we're doing right now, and the attributes-model is
intriguing.

(Sorry, we should take these discussions off this thread!)


On 17 August 2014 06:20, Sharma Podila spod...@netflix.com wrote:


 On Sat, Aug 16, 2014 at 6:41 PM, Michael Babineau 
 michael.babin...@gmail.com wrote:


 I'm especially interested in multi-datacenter Mesos (either as one
 cluster or coordinating across clusters) -- if anyone has thoughts around
 this, I'd love to chat!


 ​Depending on the specific needs, different models are possible. Would be
 great to chat.

 1. One Mesos master with slaves added from different datacenters

 Slaves from each data center may be given different attributes to bias
 scheduling tasks based on latency, data locality and other characteristics.
 Depends on what framework is being used.

 ​2. Peer to peer model, a full Mesos cluster in each data center

 ​A layer written on top or in between them to broker available resources
 as a 'lease out'

 3. Hierarchical model, one datacenter is 'primary' and off-shores tasks to
 other datacenters based on load (possibly similar to spilling over to
 off-site cloud when on-premise datacenter is full)
 4. A ring of Mesos clusters

 Send task to the datacenter based on consistent hashing of task name/ID, a
 la Cassandra clusters' key hashing. Although, one of the previous 3 models
 may already achieve the objectives that this model attempts to.

 Some of these are easier than others. There's going to be other models, I
 am sure.




Re: MesosCon attendee introduction thread

2014-08-14 Thread Tom Arnfeld
Awesome!

Hey everyone,

I'm Tom Arnfeld, and I work as a data / infrastructure engineer at a
financial startup called DueDil, based in London. We're a data aggregator,
and are in the process of migrating to a new production deployment of
Mesos, upon which we're running Hadoop, Docker and experimenting with
Spark. We've been running Hadoop on EC2 for the past couple of years, and
are very much looking forward to seeing how Mesos works out for us.

I'll also be heading out with our lead developer, Owen Smith, who has
recently been working on building out our internal service discovery tools,
based on etcd and DNS-SD. We're incredibly keen to learn as much as
possible, so if you fancy a chat we'll be around until the following Monday
:-)

Looking forward to meeting everyone, and watching all of the interesting
talks scheduled for the day!

Twitter: @tarnfeld
Github: github.com/tarnfeld


On 15 August 2014 00:17, Steve Domin st...@gocardless.com wrote:

 Great idea Dave!

 Hi everyone,

 I'm Steve Domin, and I'm heading the WebOps team at GoCardless
 https://gocardless.com, a payment startup based in London. We're
 currently integrating Mesos and Marathon and we'll hopefully be using it
  in production by the end of Q3.

 Really looking forward to MesosCon and to meet people using Mesos (haven't
 met that many in London yet!).

 Twitter: @stevedomin



 On Fri, Aug 15, 2014 at 12:05 AM, Dave Lester daveles...@gmail.com
 wrote:

 Hi All,

 I thought it would be nice to kickoff a thread for folks to introduce
 themselves in advance of #MesosCon
 http://events.linuxfoundation.org/events/mesoscon, so here goes:

 My name is Dave Lester, and I am Open Source Advocate at Twitter. Twitter
 is an organizing sponsor for #MesosCon, and I've worked closely with Chris
 Aniszczyk, the Linux Foundation, and a great team of volunteers to
 hopefully make this an awesome community event.

 I'm interested in meeting more companies using Mesos that we can add to
 our #PoweredByMesos list
 http://mesos.apache.org/documentation/latest/powered-by-mesos/, and
 chatting with folks about Apache Aurora
 http://aurora.incubator.apache.org. Right now my Thursday and Friday
 evenings are free, so let's grab a beer and chat more.

 I'm also on Twitter: @davelester

 Next!





Re: Force a slave to garbage collect framework/executors

2014-08-01 Thread Tom Arnfeld
Crystal clear, thanks Ben!


On 1 August 2014 01:36, Benjamin Mahler benjamin.mah...@gmail.com wrote:

 Everything is scheduled for the garbage collection delay (1 week by
 default) from when it was last modified, but as the disk fills up we'll
 start pruning the older directories ahead of schedule.

 This means that things should be removed in the same order that they were
 scheduled.

 You can think of this as follows, everything gets scheduled for 1 week in
 the future, but we'll speed up the existing schedule when we need to make
 room. Make sense?
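
 A rough sketch of that policy in Python, just to illustrate the idea (the
 90% threshold and the helper names here are made up; this is not the actual
 slave code):

     import time, shutil

     GC_DELAY = 7 * 24 * 3600  # mirrors --gc_delay (one week by default)

     schedule = []  # (scheduled_removal_time, path), kept oldest-first

     def schedule_for_gc(path, now=time.time):
         # Called when an executor terminates, or is found terminal on recovery.
         schedule.append((now() + GC_DELAY, path))
         schedule.sort()

     def prune(disk_usage_fraction, now=time.time):
         # Remove anything whose time has come; while the disk is nearly full,
         # also remove the oldest-scheduled directories ahead of schedule.
         while schedule and (schedule[0][0] <= now() or disk_usage_fraction() > 0.90):
             _, path = schedule.pop(0)
             shutil.rmtree(path, ignore_errors=True)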


 On Thu, Jul 31, 2014 at 4:18 PM, Tom Arnfeld t...@duedil.com wrote:

 Yeah, specifically the docker issue was related to volumes not being
 removed with `docker rm` but that's a separate issue.

 So right now mesos won't remove older work directories to make room for
 new ones (old ones that have already been scheduled for removal in a few
 days time)? This means when the disk gets quite full, newer work
 directories will be removed much faster than older ones. Is that correct?



 On 31 July 2014 23:56, Benjamin Mahler benjamin.mah...@gmail.com wrote:

 Apologies for the lack of documentation, in the default setup, the slave
 will schedule the work directories for garbage collection when:

 (1) Executors terminate.
 (2) The slave recovers and discovers work directories for terminal
 executors.

 Sounds like the docker integration code you're using has a bug in this
 respect, either by not scheduling docker directories for garbage collection
 during (1) and/or (2).


 On Thu, Jul 31, 2014 at 3:40 PM, Tom Arnfeld t...@duedil.com wrote:

 I don't have them to hand now, but I recall it saying something in the
 high 90's and 0ns for the max allowed age. I actually found the root cause
  of the problem, docker related and out of mesos's control... though i'm
 still curious about the expected behaviour of the GC process. It doesn't
 seem to be well documented anywhere.

 Tom.


 On 31 July 2014 23:33, Benjamin Mahler benjamin.mah...@gmail.com
 wrote:

 What do the slave logs say?

 E.g.

 I0731 22:22:17.851347 23525 slave.cpp:2879] Current usage 7.84%. Max
 allowed age: 5.751197441470081days


 On Wed, Jul 30, 2014 at 8:55 AM, Tom Arnfeld t...@duedil.com wrote:

 I'm not sure if this is something already supported by mesos, and if
 so it'd be great if someone could point me in the right direction.

 Is there a way of asking a slave to garbage collect old executors
 manually?

 Maybe i'm misunderstanding things, but as each executor does (insert
 knowledge gap) mesos works out how long it is able to keep the sandbox 
 for
 and schedules it for garbage collection appropriately, also taking into
 account the command line

 The disk on one of my slaves is getting quite full (98%) and i'm
 curious how mesos is going to behave in this situation. Should it start
 clearing things up, given a task could launch that needs to use an amount
 of disk space, but that disk is being eaten up by old executor sandboxes.

 It may be worth noting i'm not specifying --gc_delay on any slave
 right now, perhaps I should be?

 Any input would be much appreciated.

 Cheers,

 Tom.









Re: Python bindings are changing!

2014-08-01 Thread Tom Arnfeld
Woah, this is really awesome Thomas! Especially the pip install ;-)

Looking forward to bringing pesos up to speed with this.


On 1 August 2014 21:30, Jie Yu yujie@gmail.com wrote:

 Thomas,

 Thank you for the heads-up. One question: what if mesos and python binding
 have different versions? For example, is it ok to use a 0.19.0 python
  binding with a 0.20.0 mesos? Same question for the reverse.

 - Jie


 On Fri, Aug 1, 2014 at 9:37 AM, Thomas Rampelberg tho...@saunter.org
 wrote:

 - What problem are we trying to solve?

 Currently, the python bindings group protobufs, stub implementations
 and compiled code into a single python package that cannot be
 distributed easily. This forces python projects using mesos to copy
  protobufs around and forces an onerous dependency on anyone who would
 like to do a pure python binding.

 - How was this problem solved?

 The current python package has been split into two separate packages:

 - mesos.interface (stub implementations and protobufs)
 - mesos.native (old _mesos module)

 These are python meta-packages and can be installed as separate
 pieces. The `mesos.interface` package will be hosted on pypi and can
  be installed via easy_install and pip.

 See https://issues.apache.org/jira/browse/MESOS-857 and
 https://reviews.apache.org/r/23224/.

 - Why should I care?

 These changes are not backwards compatible. With 0.20.0 you will need
 to change how you use the python bindings. Here's a quick overview:

  mesos.Scheduler -> mesos.interface.Scheduler
  mesos.mesos_pb2 -> mesos.interface.mesos_pb2
  mesos.MesosSchedulerDriver -> mesos.native.MesosSchedulerDriver

 For more details, you can take a look at the examples in
  `src/examples/python`.
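
 As a rough illustration of the new layout (the master address and framework
 name below are placeholders, not taken from the Mesos tree):

     from mesos.interface import Scheduler, mesos_pb2
     from mesos.native import MesosSchedulerDriver

     class NoopScheduler(Scheduler):
         def resourceOffers(self, driver, offers):
             # Decline everything; this only demonstrates the split imports.
             for offer in offers:
                 driver.declineOffer(offer.id)

     framework = mesos_pb2.FrameworkInfo()
     framework.user = ""  # let Mesos fill in the current user
     framework.name = "import-example"

     driver = MesosSchedulerDriver(NoopScheduler(), framework,
                                   "zk://localhost:2181/mesos")
     driver.run()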





Re: Force a slave to garbage collect framework/executors

2014-07-31 Thread Tom Arnfeld
I don't have them to hand now, but I recall it saying something in the high
90's and 0ns for the max allowed age. I actually found the root cause of
the problem, docker related and out of mesos's control... though i'm still
curious about the expected behaviour of the GC process. It doesn't seem to
be well documented anywhere.

Tom.


On 31 July 2014 23:33, Benjamin Mahler benjamin.mah...@gmail.com wrote:

 What do the slave logs say?

 E.g.

 I0731 22:22:17.851347 23525 slave.cpp:2879] Current usage 7.84%. Max
 allowed age: 5.751197441470081days


 On Wed, Jul 30, 2014 at 8:55 AM, Tom Arnfeld t...@duedil.com wrote:

 I'm not sure if this is something already supported by mesos, and if so
 it'd be great if someone could point me in the right direction.

 Is there a way of asking a slave to garbage collect old executors
 manually?

 Maybe i'm misunderstanding things, but as each executor does (insert
 knowledge gap) mesos works out how long it is able to keep the sandbox for
 and schedules it for garbage collection appropriately, also taking into
 account the command line

 The disk on one of my slaves is getting quite full (98%) and i'm curious
 how mesos is going to behave in this situation. Should it start clearing
 things up, given a task could launch that needs to use an amount of disk
 space, but that disk is being eaten up by old executor sandboxes.

 It may be worth noting i'm not specifying --gc_delay on any slave right
 now, perhaps I should be?

 Any input would be much appreciated.

 Cheers,

 Tom.





Force a slave to garbage collect framework/executors

2014-07-30 Thread Tom Arnfeld
I'm not sure if this is something already supported by mesos, and if so
it'd be great if someone could point me in the right direction.

Is there a way of asking a slave to garbage collect old executors manually?

Maybe i'm misunderstanding things, but as each executor does (insert
knowledge gap) mesos works out how long it is able to keep the sandbox for
and schedules it for garbage collection appropriately, also taking into
account the relevant command line flags.

The disk on one of my slaves is getting quite full (98%) and i'm curious
how mesos is going to behave in this situation. Should it start clearing
things up, given a task could launch that needs to use an amount of disk
space, but that disk is being eaten up by old executor sandboxes.

It may be worth noting i'm not specifying --gc_delay on any slave right
now, perhaps I should be?

Any input would be much appreciated.

Cheers,

Tom.


Re: Does Mesos support Hadoop MR V2

2014-07-28 Thread Tom Arnfeld
 modules.

 If someone wanted to take on the project of making a generic resource
  scheduler Interface for MRv2, that would be amazing :)
 On Jul 26, 2014 6:19 PM, Jie Yu yujie@gmail.com wrote:

 I am interested in investigating the idea of YARN on top of Mesos.
 One of the benefits I can think of is that we can get rid of the static
 resource allocation between YARN and Mesos clusters. In that way, Mesos 
 can
 allocate those resources that are not used by YARN to other Mesos
 frameworks like Aurora, Marathon, etc, to increase the resource 
 utilization
 of the entire data center. Also, we could avoid running each MRv2 job as 
 a
 framework which I think might cause some maintenance complexity (e.g. for
 framework rate limiting, etc). Finally, YARN currently does not have a 
 good
 isolation support. It only supports cpu isolation right now (using
 cgroups). By porting YARN on top of Mesos, we might be able to leverage 
 the
 existing Mesos containerizer strategy to provide better isolation between
 tasks. Maxime, I am curious why do you think it does not make sense to 
 run
 YARN over Mesos? Since I am not super familar with YARN, I might be 
 missing
 something.

 I have been thinking of making ResourceManager in YARN a Mesos
 framework and making NodeManager a Mesos executor. The NodeManager will
 launch containers using primitives provided by Mesos so that we have a
 consistent containerizer layer. I haven't fully figured out how this 
 could
 be done yet (e.g., nested containers, communication between NodeManager 
 and
 ResourceManager, etc.), but I would love to explore this direction. I 
 would
 like to hear about any feedback/suggestions you guys have about this
 direction.

 Thanks,
 - Jie


 On Fri, Jul 25, 2014 at 1:39 PM, Maxime Brugidou 
 maxime.brugi...@gmail.com wrote:

 We run both mesos and yarn in prod and it does not make sense to run
 yarn over mesos.

 However it would be interesting to find a way to run MRv2 jobs on
 mesos with some custom layer to swap yarn with mesos. Not sure how to 
 start
 though... MRv2 contains a yarn application master that needs to be
 rewritten as a mesos framework scheduler. This is probably doable. 
 However
 with MRv2 every map reduce job would be mapped as a new framework in 
 Mesos.
 Not sure how many frameworks mesos can run and scale up to. Especially
 short lived frameworks.
  On Jul 25, 2014 8:54 PM, Tom Arnfeld t...@duedil.com wrote:

 Hey Luyi,

 That's correct, the Hadoop framework currently only supports Hadoop
 2 MRv1. It also doesn't have great support for the HA jobtracker 
 available
 in newer versions of Hadoop, but I've been working on that the past few
 weeks.

 I'm not sure how Hadoop 2 would play with Mesos, but very
 interested to find out more. Am I correct in thinking MRv2 will only 
 run on
 top of YARN?

 I wonder if anyone else on the mailing list is running YARN on top
 of Mesos...

 Tom.

 On Friday, 25 July 2014, Luyi Wang wangluyi1...@gmail.com wrote:

 Checked the mesos github(https://github.com/mesos/hadoop). It
 listed support for MapReduce V1

 How about the MR V2?

 Right now we are using cloudera to manage hadoop clusters where
 uses MRV2. We are planning to migrate all our services to mesos(still 
 in
 the initial investigating stage).  Good suggestions, advice and 
 experiences
 are welcomed.

 Thanks a lot!


 -Luyi.









Re: Does Mesos support Hadoop MR V2

2014-07-25 Thread Tom Arnfeld
Hey Luyi,

That's correct, the Hadoop framework currently only supports Hadoop 2 MRv1.
It also doesn't have great support for the HA jobtracker available in newer
versions of Hadoop, but I've been working on that the past few weeks.

I'm not sure how Hadoop 2 would play with Mesos, but very interested to
find out more. Am I correct in thinking MRv2 will only run on top of YARN?

I wonder if anyone else on the mailing list is running YARN on top of
Mesos...

Tom.

On Friday, 25 July 2014, Luyi Wang wangluyi1...@gmail.com wrote:

 Checked the mesos github(https://github.com/mesos/hadoop). It listed
 support for MapReduce V1

 How about the MR V2?

 Right now we are using cloudera to manage hadoop clusters where uses MRV2.
 We are planning to migrate all our services to mesos(still in the initial
 investigating stage).  Good suggestions, advice and experiences are
 welcomed.

 Thanks a lot!


 -Luyi.






Re: Does Mesos support Hadoop MR V2

2014-07-25 Thread Tom Arnfeld
I've not seen any issues pertaining to running many short lived frameworks,
but that's not near the number of frameworks you'd see if each job was a
framework.

We've been pushing all our work on MRv1 High Availability JT upstream on
the github.com/mesos/hadoop repo, though there hasn't been much to it.

There's some outstanding work in regards to framework failover that needs
to be done (there's an issue on GH for this, right now if the JT fails over
the mesos framework will re-register and all task trackers relaunched
meaning running jobs restart from the beginning) and a couple of small bugs
we've found in relation to memory limits that we haven't debugged.

Maxime, it'd be cool to hear more about how feasible it would be to port
the MRv2 framework equivalent to Mesos. I'm not very familiar with the
internals of YARN itself.

Tom.

On Friday, 25 July 2014, Maxime Brugidou maxime.brugi...@gmail.com wrote:

 We run both mesos and yarn in prod and it does not make sense to run yarn
 over mesos.

 However it would be interesting to find a way to run MRv2 jobs on mesos
 with some custom layer to swap yarn with mesos. Not sure how to start
 though... MRv2 contains a yarn application master that needs to be
 rewritten as a mesos framework scheduler. This is probably doable. However
 with MRv2 every map reduce job would be mapped as a new framework in Mesos.
 Not sure how many frameworks mesos can run and scale up to. Especially
 short lived frameworks.
  On Jul 25, 2014 8:54 PM, Tom Arnfeld t...@duedil.com wrote:

 Hey Luyi,

 That's correct, the Hadoop framework currently only supports Hadoop 2
 MRv1. It also doesn't have great support for the HA jobtracker available in
 newer versions of Hadoop, but I've been working on that the past few weeks.

 I'm not sure how Hadoop 2 would play with Mesos, but very interested to
 find out more. Am I correct in thinking MRv2 will only run on top of YARN?

 I wonder if anyone else on the mailing list is running YARN on top of
 Mesos...

 Tom.

  On Friday, 25 July 2014, Luyi Wang wangluyi1...@gmail.com wrote:

 Checked the mesos github(https://github.com/mesos/hadoop). It listed
 support for MapReduce V1

 How about the MR V2?

 Right now we are using cloudera to manage hadoop clusters where uses
 MRV2. We are planning to migrate all our services to mesos(still in the
 initial investigating stage).  Good suggestions, advice and experiences are
 welcomed.

 Thanks a lot!


 -Luyi.






Re: Mesos language bindings in the wild

2014-07-23 Thread Tom Arnfeld
It seems pretty recent, would be interesting to compare notes with
pesos/compactor.


On 23 July 2014 19:46, Brian Wickman wick...@apache.org wrote:

 wow, no.  I was aware of dpark but I was unaware that they had rolled
 their own driver.  Interesting.


 On Wed, Jul 23, 2014 at 11:17 AM, Tom Arnfeld t...@duedil.com wrote:

 Did anyone know this existed
 https://github.com/douban/dpark/tree/master/dpark/pymesos ? Just came
 across that while googling...


 On 23 July 2014 18:28, Erik Erlandson e...@redhat.com wrote:



 - Original Message -
  -1 for git submodules. I am really not keen on those; worked with them
  while working on Chromium and it was, to be frank, a mess to handle,
 update
  and maintain.
 

 I've also found submodules disappointing, and been watching on the
 sidelines as the boost community discovers what a pita they are.

 A newer alternative is git subtree.  Full disclosure: I haven't actually
 worked with subtree, but it looks like a better system than submodules:

 http://blogs.atlassian.com/2013/05/alternatives-to-git-submodule-git-subtree/



  I am rooting for separate repos. Maybe worth a non-binding vote?
 
  Niklas
 
 
  On 17 July 2014 11:45, Tim St Clair tstcl...@redhat.com wrote:
 
   Inline -
  
   --
  
   *From: *Vladimir Vivien vladimir.viv...@gmail.com
   *To: *user@mesos.apache.org
   *Sent: *Tuesday, July 15, 2014 1:34:37 PM
  
   *Subject: *Re: Mesos language bindings in the wild
  
   Hi all,
Apologies for being super late to this thread.  To answer Niklas
 point at
   the start of the thread: Yes, I am thrilled to contribute in anyway
 I can.
The project is moving forward and making progress (slower than I
 want, but
   progress regardless).
  
   Going Native
   Implementing a native client for Mesos is an arduous process right
 now
   since there's little doc to guide developers.  Once I went through
 C++ code
   and a few emails, it became easy (even easier than I thought).  If
 the push
   is for more native client, at some point we will need basic
 internals to be
   documented.
  
   Mesos-Certified
   Maybe a Mesos test suite can be used to certify native clients.
  There are
   tons of unit tests in the code that already validate the source code.
Maybe some of those test logic can be pulled out / copied into a
 small
   stand-alone mesos test server that clients can communicate with to
 run a
   test suite (just an idea).  This along with some documentation would
 help
   with quality of native clients.
  
  
   +1.
  
  
   In or Out of Core
   Having native clients source hosted in core would be great since all
 code
    would be in one location. Go code can certainly co-exist as a
 subproject in
   Mesos.  Go's build workflow can be driven by Make. Go's dependency
   management can work with repo subdirectories (at least according to
 'go
   help importpath', I haven't tested that myself).  But, as Tom
 pointed out,
    the thing that raises a flag for me is project velocity.  If an author
 wants
   to move faster or slower than Mesos release cycles, there's no way
 to do so
   once the code is part of core.
  
    Anyway, I have gone on long enough.   Looking forward to feedback.
  
  
   I usually don't tread here, but perhaps a git-submodule works in this
   narrow case.
   Thoughts?
  
  
  
   On Tue, Jul 15, 2014 at 10:07 AM, Tim St Clair tstcl...@redhat.com
   wrote:
  
   Tom -
  
   I understand the desire to create bindings outside the core.  The
 point I
   was trying to make earlier around version semantics and testing was
 to
   'Hedge' the risk.  It basically creates a contract between core 
   framework+bindings writers.
  
   No one ever intends to break compatibility, but it happens all the
 time
   and usually in some very subtle ways at first.  A great example of
 this is
   a patch I recently submitted to Mesos where the cgroup code was
 writing an
   extra endln out.  Earlier versions of the kernel had no issue
 with this,
   but recent modifications would cause the cgroup code to fail.  Very
   subtle,
   and boom-goes-the-dynamite.
  
   Below was an email I sent a while back, that outlines a possible
   hedge/contract.  Please let me know what you think.
  
   --
   
Greetings!
   
I've conversed with folks about the idea of having a more
 formalized
   release
and branching strategy, such that others who are downstream can
 rely on
certain version semantics when planning upgrades, etc.  This
 becomes
   doubly
important as we start to trend towards a 1.0 release, and folks
 will
   depend
heavily on it for their core infrastructure, and APIs
 (Frameworks, and
   EC).
   
Therefore, I wanted to propose a more formalized branching and
 release
strategy, and see what others think.  I slightly modified this
 pattern
   from
    the Condor & Kernel projects, which have well established
 processes

Re: [VOTE] Release Apache Mesos 0.19.1 (rc1)

2014-07-16 Thread Tom Arnfeld
+1 (non binding)

- Tested on Mac OSX mavericks
- Tested on Ubuntu 12.04 LTS machines (spark and Hadoop run fine also)

On 15 Jul 2014, at 19:48, Niklas Nielsen nik...@mesosphere.io wrote:

 +1 (binding)
 
 Tested on:
 - OSX Mavericks w/ clang-503.0.40  LLVM 3.4
 - Ubuntu 13.10 w/ gcc-4.8.1 (LogZooKeeperTest.WriteRead is still flaky on 
 that VM)
 
 Thanks Ben!
 
 
 On 14 July 2014 21:39, Benjamin Hindman benjamin.hind...@gmail.com wrote:
 +1, thanks Ben!
 
 
 On Mon, Jul 14, 2014 at 6:20 PM, Vinod Kone vinodk...@gmail.com wrote:
 +1 (binding)
 
 Tested on OSX Mavericks w/ gcc-4.8
 
 
 On Mon, Jul 14, 2014 at 2:35 PM, Timothy Chen tnac...@gmail.com wrote:
 +1 (non-binding).
 
 Tim
 
 On Mon, Jul 14, 2014 at 2:32 PM, Benjamin Mahler
 benjamin.mah...@gmail.com wrote:
  Hi all,
 
  Please vote on releasing the following candidate as Apache Mesos 0.19.1.
 
 
  0.19.1 includes the following:
  Fixes a long standing critical bug in the JNI bindings that can lead to
  framework unregistration.
  Allows the mesos fetcher to handle 30X redirects.
  Fixes a CHECK failure during container destruction.
  Fixes a regression that prevented local runs from working correctly.
 
  The CHANGELOG for the release is available at:
  https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.19.1-rc1
 
  The candidate for Mesos 0.19.1 release is available at:
  https://dist.apache.org/repos/dist/dev/mesos/0.19.1-rc1/mesos-0.19.1.tar.gz
 
  The tag to be voted on is 0.19.1-rc1:
  https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.19.1-rc1
 
  The MD5 checksum of the tarball can be found at:
  https://dist.apache.org/repos/dist/dev/mesos/0.19.1-rc1/mesos-0.19.1.tar.gz.md5
 
  The signature of the tarball can be found at:
  https://dist.apache.org/repos/dist/dev/mesos/0.19.1-rc1/mesos-0.19.1.tar.gz.asc
 
  The PGP key used to sign the release is here:
  https://dist.apache.org/repos/dist/release/mesos/KEYS
 
  The JAR is up in Maven in a staging repository here:
  https://repository.apache.org/content/repositories/orgapachemesos-1025
 
  Please vote on releasing this package as Apache Mesos 0.19.1!
 
  The vote is open until Thu Jul 17 14:28:59 PDT 2014 and passes if a
  majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Mesos 0.19.1
  [ ] -1 Do not release this package because ...
 
  Thanks,
  Ben
 
 
 



Re: Mesos language bindings in the wild

2014-07-11 Thread Tom Arnfeld
Very exciting. I'd vote +1 for splitting them out. Especially if you
look at the common way of using Go imports, just stick the project on
GitHub and import it directly using github.com/mesos/mesos-go or
similar.

I guess one argument is that you have more fragmentation of the code
  (e.g. every library has its own copy of the protos) but I'm not sure
that's a bad thing.

Just my two cents. Looking forward to this!

 On 11 Jul 2014, at 16:59, Thomas Rampelberg tho...@saunter.org wrote:

 I've started preparing the python bindings to hopefully take this
 route ( https://reviews.apache.org/r/23224/ would love some reviews!
 ). In fact, there is already a native python implementation of both
 libprocess and the framework apis! (https://github.com/wickman/pesos/
 , https://github.com/wickman/compactor ).

 What are the benefits of bindings being part of the project source
 itself instead of having blessed implementations like mesos-python
 where the source and versioning becomes separate? I've been running
 into difficulties making automake and python's build tools play nicely
 together. It seems like there'd be more flexibility in general by
 splitting them out.


 On Thu, Jul 10, 2014 at 3:57 PM, Niklas Nielsen nik...@mesosphere.io wrote:
 I just wanted to clarify - native, meaning _no_ dependency to libmesos and
 native to its language (only Go, only Python and so on) i.e. use the
 low-level API.

 Sorry for the confusion,
 Niklas


 On 10 July 2014 15:55, Dominic Hamon dha...@twopensource.com wrote:

 In my dream world, we wouldn't need any native bindings. I can imagine
 having example frameworks or starter frameworks that use the low-level API
 (the wire protocol with protocol buffers for message passing), but nothing
 like we have that needs C or JNI, etc.




 On Thu, Jul 10, 2014 at 3:26 PM, Niklas Nielsen nik...@mesosphere.io
 wrote:

 Hi all,

 I wanted to start a discussion around the language bindings in the wild
 (Go, Haskell, native Python, Go, Java and so on) and possibly get to a
 strategy where we start bringing those into Mesos proper. As most things
 points towards, it will probably make sense to focus on the native
 bindings leveraging the low-level API. To name one candidate to start
 with, we are especially interested in getting Go native support in Mesos
 proper (and in a solid state). So Vladimir, we'd be super thrilled to
 start
 collaborating with you on your current work.

 We are interested to hear what thoughts you all might have on this.

 Thanks,
 Niklas



Python Celery on Mesos

2014-07-09 Thread Tom Arnfeld
Hey everyone,

Posted this on IRC earlier but thought i'd bring it up here. Has anyone given 
any thought to building an extension to Celery (a popular python framework for 
building task based apps http://www.celeryproject.org/) to allow it to behave 
as a Mesos framework and run tasks on Mesos? I'm not very familiar with the 
internals of Celery so it might be more complex to achieve than it looks from 
the surface, but something we'd be interested in working with.

Tom.

--
Tom Arnfeld
Developer // DueDil

t...@duedil.com
(+44) 7525940046




25 Christopher Street, London, EC2A 2BS
Company Number: 06999618




Re: Deimos / Marathon task stuck in staging

2014-07-08 Thread Tom Arnfeld
Hey,

This looked to me like a bug in the external containerizer code; I opened a 
JIRA issue about it a while back - 
https://issues.apache.org/jira/browse/MESOS-1462.

I don't think anyone has started looking at fixing it yet.

Tom.

On 8 Jul 2014, at 15:29, Aurélien Dehay aurel...@dehay.info wrote:

 Hello.
 
 I'm testing mesos 0.19 /marathon 0.6.0 and deimos 0.3.4
 
 On  External containerizer failed (status: 4) (like, for example, when my 
 docker:/// url is false, resulting in a 404), the tasks stay in STAGING in 
 mesos.
 
 Even if I destroy or suspend the task in Marathon, the mesos tasks stay in 
 STAGING and therefore the resources are not freed.
 
 I have 3 lines like this:
 I0708 16:11:57.519381  3208 master.cpp:2000] Asked to kill task 
 postgresql-4.18d61402-06a9-11e4-bb12-1266fe80fd27 of framework 
 20140701-122510-16842879-5050-2188-
 
 
 Is there any way to clean the staging tasks, is it worth opening a JIRA? I 
 did not find any related entries.
 
 Thanks.



Re: 0.19.1

2014-07-04 Thread Tom Arnfeld
Happy to. It surprised me that this wasn't supported, especially considering 
the fetcher is supposed to be able to download URIs from any URL using http(s). 
This is most useful (and in my opinion quite an important issue) for 
downloading executors from S3 in situations where a redirect is incurred, and more 
specifically, github tar archives which almost always go through a 301.
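
For context, the sort of thing we're fetching looks roughly like this (the
repo name, paths and executor name are made up, so treat it as a sketch):

    # Illustrative only: an executor shipped as a GitHub tarball. GitHub
    # answers these URLs with a 30X redirect, which is why MESOS-1448 matters.
    # The fetcher unpacks recognised archives into the sandbox before the
    # command runs.
    from mesos import mesos_pb2  # mesos.interface.mesos_pb2 on 0.20+ bindings

    executor = mesos_pb2.ExecutorInfo()
    executor.executor_id.value = "my-executor"
    executor.name = "my-executor"
    uri = executor.command.uris.add()
    uri.value = "https://github.com/example/my-executor/archive/master.tar.gz"
    executor.command.value = "cd my-executor-* && ./bin/run-executor"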

I don't mind it going into the next non-bugfix release if you don't agree it's
that important.

On 4 Jul 2014, at 20:48, Dominic Hamon dha...@twopensource.com wrote:

 Hi
 
 Can you give some background as to why this is a critical fix? We try to
 minimise what we include in bug fix releases to avoid feature creep.
 
 Thanks
 On Jul 4, 2014 12:31 PM, Tom Arnfeld t...@duedil.com wrote:
 
 Any chance we can get https://issues.apache.org/jira/browse/MESOS-1448
 too?
 
 On 3 Jul 2014, at 21:40, Vinod Kone vinodk...@gmail.com wrote:
 
 Hi,
 
 We are planning to release 0.19.1 (likely next week) which will be a bug
 fix release. Specifically, these are the fixes that we are planning to
 cherry pick.
 
 
 https://issues.apache.org/jira/issues/?filter=12326191jql=project%20%3D%20MESOS%20AND%20%22Target%20Version%2Fs%22%20%3D%200.19.1
 
 If there are other critical fixes that need to be backported to 0.19.1
 please reply here as soon as possible.
 
 Thanks,
 
 
 



Re: Docker support in Mesos core

2014-06-21 Thread Tom Arnfeld
Hey Everyone,

Excited to see discussions of this. Something I started playing around with 
just as the external containerizer was coming to life!

Diptanu – A few responses to your notes...

 a. Mesos understanding docker metrics which should be straightforward because 
 docker writes all its metrics in the following fashion for cpu, blkio, memory 
 etc - /sys/fs/cgroup/cpu/docker/containerid

Are these paths not operating system dependent? I'm not too familiar with 
cross-platform cgroups, so I am no doubt wrong here. These cgroup metrics are 
also the ones Mesos currently uses (both for usage statistics and for 
memory/cpu limits) so they can be pulled out much the same.
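
For what it's worth, pulling those numbers out is just reading a couple of
files, something like the sketch below (the exact controller mount points
vary between distros, e.g. cpu and cpuacct are often co-mounted, so treat
the paths as an example rather than gospel):

    def docker_container_stats(container_id):
        # cgroup v1 layout with the "docker" parent cgroup, as described above.
        cpu_path = "/sys/fs/cgroup/cpuacct/docker/%s/cpuacct.usage" % container_id
        mem_path = "/sys/fs/cgroup/memory/docker/%s/memory.usage_in_bytes" % container_id
        with open(cpu_path) as f:
            cpu_ns = int(f.read())     # cumulative CPU time, in nanoseconds
        with open(mem_path) as f:
            mem_bytes = int(f.read())  # current memory usage, in bytes
        return {"cpu_time_ns": cpu_ns, "mem_bytes": mem_bytes}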

 b. Easier way to map tasks to docker containers, probably the external 
 containerizer takes care of it to a large extent. It would be helpful if 
 there was a blog about it's API and internals by the core committers 
 explaining the design. Even a simple example in the mesos codebase using the 
 external containerizer would help.

That's an interesting one, a blog post would be awesome. The containerizers 
currently use the ContainerId string provided to them from Mesos (I 
believe this is just the TaskID but i'm not certain of that). This helps ensure 
consistency of how containerizers are implemented, and makes them much simpler.

 c. stdout and stderr of docker containers in the Mesos task stdout and stderr 
 logs. Does the external containerizer already takes care of it? I had to 
 write a service which runs on every slave for exposing the container logs to 
 an user.

The external containerizer itself doesn't help you with this. The logs from the 
containerizer calls are dumped into the sandbox, however it's up to the 
containerizer (e.g Deimos) to redirect the logs from the container it launches. 
Deimos does take care of this already (as seen here 
https://github.com/mesosphere/deimos/blob/master/deimos/containerizer/docker.py#L132).

 e. Translate all task constraints to docker run flags. This is probably the 
 easiest and I know it's super easy to implement with the external 
 containerizer.

The current Docker containerizer implementations both do this; they support 
CPU, memory and ports. Docker currently doesn't support changing these limits 
on a running container, so you have to go behind docker and write to the 
cgroup limits yourself. There's also no way to change the port mappings of a 
container that I know of.

Hope that answers some of your questions!

Tom.

On 21 Jun 2014, at 00:20, Diptanu Choudhury dipta...@gmail.com wrote:

 Great timing for this thread!
 
 I have been working on this for the past few months and here is what I am 
 doing and would be nice if docker was supported straight way in Mesos. So 
 here goes the features that I would personally love to see in Mesos Core from 
 the perspective of an user which I had to implement on my own -
 
 a. Mesos understanding docker metrics which should be straightforward because 
  docker writes all its metrics in the following fashion for cpu, blkio, memory 
  etc - /sys/fs/cgroup/cpu/docker/containerid
 I am sending all these metrics right now as a framework message back to my 
 framework/scheduler but it would be cool if Mesos took care of them.
 
 b. Easier way to map tasks to docker containers, probably the external 
 containerizer takes care of it to a large extent. It would be helpful if 
 there was a blog about it's API and internals by the core committers 
 explaining the design. Even a simple example in the mesos codebase using the 
 external containerizer would help.
 
 c. stdout and stderr of docker containers in the Mesos task stdout and stderr 
 logs. Does the external containerizer already takes care of it? I had to 
 write a service which runs on every slave for exposing the container logs to 
 an user.
 
 d. Mesos GC of tasks taking care of cleaning up docker containers which have 
 terminated. Right now the way I implemented this is that the service which 
 exposes the logs of a container also listens to docker events and when a 
 container exits, it knows that this has to be cleaned up and so removes it 
 after a fixed amount of time[configurable through a Rest API/config file]
 
 e. Translate all task constraints to docker run flags. This is probably the 
 easiest and I know it's super easy to implement with the external 
 containerizer.
 
 
 On Fri, Jun 20, 2014 at 3:40 PM, Tobias Knaup t...@knaup.me wrote:
 Hi all,
 
 We've got a lot of feedback from folks who use Mesos to run Dockers at scale 
 via Deimos, and the main wish was to make Docker a first class citizen in 
 Mesos, instead of a plugin that needs to be installed separately. Mesosphere 
 wants to contribute this and I already chatted with Ben H about what an 
 implementation could look like.
 
 I'd love for folks on here that are working with Docker to chime in!
 I created a JIRA here: https://issues.apache.org/jira/browse/MESOS-1524
 
 Cheers,
 
 Tobi
 
 
 
 -- 
 

Re: Apache Mesos 0.19.0 Released

2014-06-13 Thread Tom Arnfeld
Hey Dave (and the group),

I have to say for me it was a little fiddly to upgrade a 0.18.2
cluster to 0.19.0. Largely because of a requirement to bring
everything back up in a certain order (I had to lower the quorum count
to 1) otherwise mesos failed to get a majority vote to initialise the
log (I had 3 masters).

I'd also be very interested in a zookeeper implementation - and
perhaps some improved documentation around the log.

Cheers,

Tom.

 On 13 Jun 2014, at 08:17, Dick Davies d...@hellooperator.net wrote:

 I thought I read that there was going to be a registry implementation
 backed by zookeeper;
 does anyone know why that was dropped?

 Really excited to see the containerizer features rolling in, but the
 quorum looks at first glance
 to make Mesos a little harder to operate
 (This means adding or removing masters must be done carefully! ) - I
 understand the
 benefits but was hoping we could get by with the zookeeper registry.


 On 13 June 2014 03:49, Dave Lester daveles...@gmail.com wrote:
 Hi All,

 Below is a blog post that Ben Mahler wrote as release manager for Mesos
 0.19.0; it was published on the Mesos site today.

 I know that not everyone follows @ApacheMesos Twitter (even though you
 should!), so I wanted to make sure was also shared on the user@ list.

 Cheers,
 Dave


 Apache Mesos 0.19.0 Released

 The latest Mesos release, 0.19.0 is now available for download. This new
 version includes the following features and improvements:

 The master now persists the list of registered slaves in a durable
 replicated manner using the Registrar and the replicated log.
 Alpha support for custom container technologies has been added with the
 ExternalContainerizer.
 Metrics reporting has been overhauled and is now exposed on
 ip:port/metrics/snapshot.
 Slave Authentication: optionally, only authenticated slaves can register
 with the master.
 Numerous bug fixes and stability improvements.

 Full release notes are available on JIRA.

 Registrar

 Mesos 0.19.0 introduces the “Registrar”: the master now persists the list of
 registered slaves in a durable replicated manner. The previous lack of
 durable state was an intentional design decision that simplified failover
 and allowed masters to be run and migrated with ease. However, the stateless
 design had issues:

 In the event of a dual failure (slave fails while master is down), no lost
 task notifications are sent. This leads to a task running according to the
 framework but unknown to Mesos.
 When a new master is elected, we may allow rogue slaves to re-register with
 the master. This leads to tasks running on the slave that are not known to
 the framework.

 Persisting the list of registered slaves allows failed over masters to
 detect slaves that do not re-register, and notify frameworks accordingly. It
 also allows us to prevent rogue slaves from re-registering; terminating the
 rogue tasks in the process.

 The state is persisted using the replicated log (available since 0.9.0).

 External Containerization

 As alluded to during the containerization / isolation refactor in 0.18.0,
 the ExternalContainerizer has landed in this release. This provides alpha
 level support for custom containerization.

 Developers can implement their own external containerizers to provide
 support for custom container technologies. Initial Docker support is now
 available through some community driven external containerizers: Docker
 Containerizer for Mesos by Tom Arnfeld and Deimos by Jason Dusek. Please
 reach out on the mailing lists with questions!

 Metrics

 Previously, Mesos components had to use custom metrics code and custom HTTP
 endpoints for exposing metrics. This made it difficult to expose additional
 system metrics and often required having an endpoint for each libprocess
 Process (Actor) for which metrics were desired. Having metrics spread across
 endpoints was operationally complex.

 We needed a consistent, simple, and global way to expose metrics, which led
 to the creation of a metrics library within libprocess. All metrics are now
 exposed via /metrics/snapshot. The /stats.json endpoint remains for
 backwards compatibility.
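
 For example, a quick way to poll the new endpoint from Python (the host and
 port below are placeholders for wherever your master or slave is listening):

     import json
     import urllib2

     # Fetch the consolidated metrics snapshot from a master or slave.
     url = "http://localhost:5050/metrics/snapshot"
     snapshot = json.load(urllib2.urlopen(url))
     for name, value in sorted(snapshot.items()):
         print name, value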

 Upgrading

 For backwards compatibility, the “Registrar” will be enabled in a phased
 manner. By default, the “Registrar” is write-only in 0.19.0 and will be
 read/write in 0.20.0.

 If running in high-availability mode with ZooKeeper, operators must now
 specify the --work_dir for the master, along with the --quorum size of the
 ensemble of masters. This means adding or removing masters must be done
 carefully! The best practice is to only ever add or remove a single master
 at a time and to allow a small amount of time for the replicated log to
 catch up on the new master. Maintenance documentation will be added to
 reflect this.

 Please refer to the upgrades document, which details how to perform an
 upgrade from 0.18.x.

 Future Work

 Thanks to the Registrar, reconciliation primitives can now be provided

Mesos / Libprocess ENETUNREACH

2014-05-21 Thread Tom Arnfeld
Hey,

I’ve been testing out mesos for production recently and i’m having trouble 
registering frameworks over our VPN connection. I expect it’s a firewall issue, 
but i’m unsure as to what connectivity mesos requires for its frameworks.

When running the test java framework from one of the slaves everything runs 
fine; however, from my local machine across our VPN it just continues to 
re-register. With a little more debugging (GLOG_v=2) I narrowed it down to a 
specific socket error 101 - ENETUNREACH. That’s an error being thrown on 
3rdparty/libprocess/src/process.cpp line 1208.

Would anyone be able to describe the required connectivity between the mesos 
master(s) and the frameworks themselves? I’m a not very familiar with 
libprocess.

Thanks!

Tom.

Re: Where did 0.18.1 go? Suggesting 0.18.2

2014-05-15 Thread Tom Arnfeld
Definitely +1.

On 13 May 2014, at 18:54, Benjamin Hindman b...@eecs.berkeley.edu wrote:

 +1!
 
 
 On Tue, May 13, 2014 at 9:51 AM, Niklas Nielsen n...@qni.dk wrote:
 Hey everyone,
 
 First and foremost, I apologize for the radio silence on my part with regards 
 to the 0.18.1 release. We didn't announce it or make it public on the website.
  The reason is that a bug in the mesos-fetcher found its way in and would 
 render 0.18.1 not useful for production settings 
 (https://issues.apache.org/jira/browse/MESOS-1313)
 
 I suggest yet another bug-fix release 0.18.2 which cherry-pick 
 https://reviews.apache.org/r/21127/, expedite it and have it ready by EOW.
 
 I'd love some (quick) input before starting this release.
 
 Thanks,
 Niklas
 



Re: [DRAFT] Mesos community survey questions

2014-05-01 Thread Tom Arnfeld
I’d be quite interested in asking if people are running their own in-house 
built frameworks. I’m actually quite curious to know if people are just using 
Mesos mainly as a way of getting access to Hadoop+{insert data platform here} 
at the moment…

On 1 May 2014, at 21:00, Dave Lester daveles...@gmail.com wrote:

 Hi All,
 
 
 With #MesosCon http://events.linuxfoundation.org/events/mesoscon coming
 up in August, and tremendous growth of adopters on our #PoweredByMesos
  list (http://mesos.apache.org/documentation/latest/powered-by-mesos/) (now
 @ 30 companies!), I'd love to do a community survey to take the pulse
 of the community. A summary version of all results would be made available
 once the survey was completed.
 
 
 I've drafted a survey that is 11 questions long, and will take a few
 minutes to complete. Survey questions are pasted below. Matt Trifiro gave
 some input on this idea a few months back, and I'd love to have more eyes
 on it. Please respond to this thread with any suggestions, and if we have
 consensus then I'll go ahead and #shipit.
 
 
 Thanks,
 
 Dave
 
 
 
 
 1. What’s your name
 
 2. What’s your affiliation (company or organization)
 
 3. How long have you been using Mesos?
 
  [ ] not using it, just curious [if this is chosen, taken to last page of
  survey]

  [ ] 0-2 weeks

  [ ] 2-8 weeks

  [ ] 2-6 mo

  [ ] 6+ mo

 -- page break --
 
 4. How many machines are you running in your Mesos cluster? (dropdown)
 
 1
 
 2-5
 
 6-20
 
 21-100
 
 101+
 
 Cannot share this info
 
 5. Which Mesos frameworks are you using? (radio buttons for each option,
 ‘no’, ‘intend to try out', 'trying it out', 'production')
 
 [ ] Aurora
 
 [ ] Cassandra
 
 [ ] Chronos
 
 [ ] Hadoop
 
 [ ] Jenkins
 
 [ ] JobServer
 
 [ ] Marathon
 
 [ ] Spark
 
 [ ] Storm
 
 [ ] Other
 
 6. Where do you run Mesos?
 
 [ ] Private
 
 [ ] EC2
 
 [ ] Rackspace
 
  [ ] Other: ____
 
 7. How likely are you to recommend Mesos to your friends and colleagues?
 
 [scale, 1-10 (for determining net promoter score)]
 
 
  -- page break --
 
 8. What features do you think are missing from Mesos?
 
 9. Are you interested in contributing in any of the following ways:
 
 __ documentation
 
 __ organizing local meetups
 
 __ sponsoring a Mesos-related event
 
 __ patches to the core
 
 __ building and open sourcing custom frameworks for Mesos
 
 
 10. Email address
 
 11. Is your company listed on our #PoweredByMesos page? If not, can we list
 you?
 
 [ ] Yes, we’re on the list!
 
 [ ] No, please add us
 
 [ ] No, but please refrain for adding us at this time



System dependencies with Mesos

2014-02-04 Thread Tom Arnfeld
I’m investigating the possibility of using Mesos to solve the problem of 
resource allocation between a Hadoop cluster and set of Jenkins slaves (and I 
like the possibility of being able to easily deploy other frameworks). One of 
the biggest overhanging questions I can’t seem to find an answer to is how to 
manage system dependencies across a wide variety of frameworks, and jobs 
running within those frameworks.

I came across this thread 
(http://www.mail-archive.com/user@mesos.apache.org/msg00301.html) and caching 
executor files seems to be the running solution, though not implemented yet. I 
too would really like to avoid shipping system dependencies (c-deps for python 
packages, as an example) along with every single job, and i’m especially unsure 
how this would interact with the Hadoop/Jenkins mesos schedulers (as each 
hadoop job may require it’s own system dependencies).

More importantly, the architecture of the machine submitting the job is often 
different from the slaves so we can’t simply ship all the built dependencies 
with the task.

We’re solving this problem at the moment for Hadoop by installing all 
dependencies we require on every hadoop task tracker node, which is far from 
ideal. For jenkins, we’re using Docker to isolate execution of different types 
of jobs, and built all system dependencies for a suite of jobs into docker 
images.

I like the idea of continuing down the path of Docker for process isolation and 
system dependency management, but I don’t see any easy way for this to interact 
with the existing hadoop/jenkins/etc. schedulers. I guess it’d require us to 
build our own schedulers/executors that wrapped the process in a Docker 
container.

I’d love to hear how others are solving this problem… and/or whether Docker 
seems like the wrong way to go.

—

Tom Arnfeld
Developer // DueDil

Compiling Mesos on Mac OSX Mountain Lion 10.9

2014-01-30 Thread Tom Arnfeld
I’m trying to get going with Mesos to do a bit of exploration and i’m having 
trouble compiling any version of Mesos on Mac OSX. I’m only looking to use the 
python binding (not actually run a mesos master/slave) on OSX to talk to a 
remote Mesos/ZK cluster i’ve got setup.

I’ve sifted through a bunch of errors, and hit a wall with one I can’t seem to 
solve. I’m using GCC 4.2 (`brew install gcc4.2`) as there are issues with the 
GCC included in Xcode. I’ve also had to switch to using protobuf 2.5.0 in 
`mesos/3rdparty/libprocess/3rdparty`.

 git clone https://git-wip-us.apache.org/repos/asf/mesos.git
 cd mesos
 git checkout 0.16.0-rc4
 ./bootstrap
 CC=gcc-4.2 CXXFLAGS=-std=c++11 ./configure
 make



/bin/sh ./libtool  --tag=CXX   --mode=compile g++ -DPACKAGE_NAME=\"libprocess\" 
-DPACKAGE_TARNAME=\"libprocess\" -DPACKAGE_VERSION=\"0.0.1\" 
-DPACKAGE_STRING=\"libprocess\ 0.0.1\" -DPACKAGE_BUGREPORT=\"\" 
-DPACKAGE_URL=\"\" -DPACKAGE=\"libprocess\" -DVERSION=\"0.0.1\" 
-DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 
-DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 
-DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" 
-DHAVE_PTHREAD=1 -DHAVE_LIBZ=1 -I.  -I./include -I./3rdparty/stout/include 
-I3rdparty/boost-1.53.0 -I3rdparty/glog-0.3.3/src -I3rdparty/libev-4.15 
-I3rdparty/ry-http-parser-1c3624a -std=c++11 -g2 -O2 -MT 
libprocess_la-latch.lo -MD -MP -MF .deps/libprocess_la-latch.Tpo -c -o 
libprocess_la-latch.lo `test -f 'src/latch.cpp' || echo './'`src/latch.cpp
libtool: compile:  g++ -DPACKAGE_NAME=\"libprocess\" 
-DPACKAGE_TARNAME=\"libprocess\" -DPACKAGE_VERSION=\"0.0.1\" 
-DPACKAGE_STRING=\"libprocess 0.0.1\" -DPACKAGE_BUGREPORT=\"\" 
-DPACKAGE_URL=\"\" -DPACKAGE=\"libprocess\" -DVERSION=\"0.0.1\" 
-DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 
-DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 
-DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" 
-DHAVE_PTHREAD=1 -DHAVE_LIBZ=1 -I. -I./include -I./3rdparty/stout/include 
-I3rdparty/boost-1.53.0 -I3rdparty/glog-0.3.3/src -I3rdparty/libev-4.15 
-I3rdparty/ry-http-parser-1c3624a -std=c++11 -g2 -O2 -MT libprocess_la-latch.lo 
-MD -MP -MF .deps/libprocess_la-latch.Tpo -c src/latch.cpp  -fno-common -DPIC 
-o libprocess_la-latch.o
In file included from src/latch.cpp:3:

./include/process/process.hpp:10:10: fatal error: 'tr1/functional' file not found
#include <tr1/functional>
         ^
1 error generated.



The call to `make` fails with the above output. I managed to successfully 
install thrift version 0.9.1 using brew (though that’s unrelated), which makes 
me think it might not be an issue with my toolchain, but with the Mesos build 
process?
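
For reference, here’s a quick way to check which compiler the build actually 
ends up using for C++ (just a guess at the cause, since clang’s libc++ doesn’t 
ship the tr1/ headers at all):

 # Which g++ is on the PATH, and what is it really?
 which g++ && g++ --version
 gcc-4.2 --version
 # What did configure record as the C++ compiler?
 grep -m1 '^CXX ' Makefile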

I can attach my Makefile if that’s of any help. Thanks.

—
Tom Arnfeld
Developer // DueDil

Re: Compiling Mesos on Mac OSX Mountain Lion 10.9

2014-01-30 Thread Tom Arnfeld
I’ve removed that flag and re-run `make clean`, but now I’m back to the errors 
with Apple’s GCC compiler...

 make clean
  CC=gcc-4.2 ./configure
 make

As far as the protobuf upgrade goes, I noticed in JIRA that it was released in 
0.17.0 – but I can’t seem to find the source for that version anywhere?
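
(One way to check whether a 0.17.0 tag or branch exists in the ASF repo yet; 
this is just plain git, nothing Mesos-specific:)

 git fetch origin --tags
 git tag -l '0.17*'
 git branch -r | grep 0.17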

#

g++ -DHAVE_CONFIG_H -I. -I./src  -I./src  -D_THREAD_SAFE -Wall 
-Wwrite-strings -Woverloaded-virtual -Wno-sign-compare  -DNO_FRAME_POINTER  -g 
-O2 -MT symbolize_unittest-symbolize_unittest.o -MD -MP -MF 
.deps/symbolize_unittest-symbolize_unittest.Tpo -c -o 
symbolize_unittest-symbolize_unittest.o `test -f 'src/symbolize_unittest.cc' || 
echo './'`src/symbolize_unittest.cc
mv -f .deps/symbolize_unittest-symbolize_unittest.Tpo 
.deps/symbolize_unittest-symbolize_unittest.Po
/bin/sh ./libtool --tag=CXX   --mode=link g++ -D_THREAD_SAFE -Wall 
-Wwrite-strings -Woverloaded-virtual -Wno-sign-compare  -DNO_FRAME_POINTER  -g 
-O2 -D_THREAD_SAFE   -o symbolize_unittest  
symbolize_unittest-symbolize_unittest.o  libglog.la   -lpthread
libtool: link: g++ -D_THREAD_SAFE -Wall -Wwrite-strings -Woverloaded-virtual 
-Wno-sign-compare -DNO_FRAME_POINTER -g -O2 -D_THREAD_SAFE -o 
symbolize_unittest symbolize_unittest-symbolize_unittest.o -Wl,-bind_at_load  
./.libs/libglog.a -lpthread
g++ -DHAVE_CONFIG_H -I. -I./src  -I./src  -D_THREAD_SAFE -Wall 
-Wwrite-strings -Woverloaded-virtual -Wno-sign-compare  -DNO_FRAME_POINTER  -g 
-O2 -MT stl_logging_unittest-stl_logging_unittest.o -MD -MP -MF 
.deps/stl_logging_unittest-stl_logging_unittest.Tpo -c -o 
stl_logging_unittest-stl_logging_unittest.o `test -f 
'src/stl_logging_unittest.cc' || echo './'`src/stl_logging_unittest.cc
In file included from src/stl_logging_unittest.cc:34:
In file included from ./src/glog/stl_logging.h:54:
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../lib/c++/v1/ext/hash_set:202:2:
 warning: Use of the header <ext/hash_set> is deprecated. Migrate to 
<unordered_set> [-W#warnings]
#warning Use of the header <ext/hash_set> is deprecated.  Migrate to 
<unordered_set>
 ^
In file included from src/stl_logging_unittest.cc:34:
In file included from ./src/glog/stl_logging.h:55:
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../lib/c++/v1/ext/hash_map:209:2:
 warning: Use of the header <ext/hash_map> is deprecated. Migrate to 
<unordered_map> [-W#warnings]
#warning Use of the header <ext/hash_map> is deprecated.  Migrate to 
<unordered_map>
 ^
In file included from src/stl_logging_unittest.cc:34:
./src/glog/stl_logging.h:56:11: fatal error: 'ext/slist' file not found
# include <ext/slist>
          ^
2 warnings and 1 error generated.
make[7]: *** [stl_logging_unittest-stl_logging_unittest.o] Error 1

#

—
Tom Arnfeld
Developer // DueDil

On 30 Jan 2014, at 19:46, Benjamin Mahler benjamin.mah...@gmail.com wrote:

 I don't believe that we compile with C++11 on gcc 4.2, and C++11 support did 
 not land in 0.16.0 IIRC.
 
 You should remove your -std=c++11 flag. Let us know if that does not work.
 
 
 On Thu, Jan 30, 2014 at 11:40 AM, Tom Arnfeld t...@duedil.com wrote:
 I’m trying to get going with Mesos to do a bit of exploration and i’m having 
 trouble compiling any version of Mesos on Mac OSX. I’m only looking to use 
 the python binding (not actually run a mesos master/slave) on OSX to talk to 
 a remote Mesos/ZK cluster i’ve got setup.
 
 I’ve sifted through a bunch of errors, and hit a wall with one I can’t seem 
 to solve. I’m using GCC 4.2 (`brew install gcc4.2`) as there are issues with 
 the GCC included in Xcode. I’ve also had to switch to using protobuf 2.5.0 in 
 `mesos/3rdparty/libprocess/3rdparty.
 
  git clone https://git-wip-us.apache.org/repos/asf/mesos.git
  cd mesos
  git checkout git checkout 0.16.0-rc4
  ./bootstrap
  CC=gcc-4.2 CXXFLAGS=-std=c++11 ./configure
  make
 
 
 
 /bin/sh ./libtool  --tag=CXX   --mode=compile g++ 
 -DPACKAGE_NAME=\libprocess\ -DPACKAGE_TARNAME=\libprocess\ 
 -DPACKAGE_VERSION=\0.0.1\ -DPACKAGE_STRING=\libprocess\ 0.0.1\ 
 -DPACKAGE_BUGREPORT=\\ -DPACKAGE_URL=\\ -DPACKAGE=\libprocess\ 
 -DVERSION=\0.0.1\ -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 
 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 
 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 
 -DLT_OBJDIR=\.libs/\ -DHAVE_PTHREAD=1 -DHAVE_LIBZ=1 -I.  -I./include 
 -I./3rdparty/stout/include -I3rdparty/boost-1.53.0 -I3rdparty/glog-0.3.3/src 
 -I3rdparty/libev-4.15 -I3rdparty/ry-http-parser-1c3624a -std=c++11 -g2 
 -O2 -MT libprocess_la-latch.lo -MD -MP -MF .deps/libprocess_la-latch.Tpo -c 
 -o libprocess_la-latch.lo `test -f 'src/latch.cpp' || echo './'`src/latch.cpp
 libtool: compile:  g++ -DPACKAGE_NAME=\libprocess

Re: Compiling Mesos on Mac OSX Mountain Lion 10.9

2014-01-30 Thread Tom Arnfeld
You’re right. I also had to recompile the version of GCC I installed with 
Homebrew to include support for C++. Going to dump the install process below 
in case anyone else runs into this.

I still had to replace protobuf (but that’s an issue already resolved in JIRA).
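
In case it helps anyone else, the swap is roughly along these lines. Where you 
get the 2.5.0 tarball from and the exact build-file variable to bump are 
assumptions here, so adjust to whatever your checkout actually contains:

 cd mesos/3rdparty/libprocess/3rdparty
 # Drop the new tarball in alongside the bundled one...
 cp ~/Downloads/protobuf-2.5.0.tar.gz .
 # ...then point the build files at 2.5.0 (e.g. a PROTOBUF_VERSION-style
 # variable) and re-run ./bootstrap before configuring again.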

(Switched to 4.3 since it seems homebrew/versions doesn’t have 4.2)

 brew tap homebrew/versions
 brew install homebrew/versions/gcc4.3 --enable-cxx
 ./bootstrap
 CC="gcc-4.3" CXX="g++-4.3" ./configure
 make
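
(A quick smoke test that the rebuilt compiler really does C++ before re-running 
configure; nothing Mesos-specific, just a hypothetical check:)

 which gcc-4.3 g++-4.3
 g++-4.3 --version
 echo 'int main() { return 0; }' > /tmp/cxx-test.cpp
 g++-4.3 /tmp/cxx-test.cpp -o /tmp/cxx-test && echo "C++ support looks OK"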

Thanks for the pointers! :)

—
Tom Arnfeld
Developer // DueDil

On 30 Jan 2014, at 20:03, Benjamin Mahler benjamin.mah...@gmail.com wrote:

 Did you intend to set CC and CXX? CC is the C compiler, CXX is the C++ 
 compiler.
 
 E.g.
 
 CC=gcc-4.2 CXX=g++-4.2 ./configure
 
 
 On Thu, Jan 30, 2014 at 11:52 AM, Tom Arnfeld t...@duedil.com wrote:
 I’ve removed that flag and run `make clean` but now I get back to the errors 
 with Apple’s GCC compiler...
 
  make clean
   CC=gcc-4.2 ./configure
  make
 
 As far as the protobuf upgrade goes, I noticed in JIRA that was release into 
 0.17.0 – I can’t seem to find the source for that version anywhere?
 
 #
 
 g++ -DHAVE_CONFIG_H -I. -I./src  -I./src  -D_THREAD_SAFE -Wall 
 -Wwrite-strings -Woverloaded-virtual -Wno-sign-compare  -DNO_FRAME_POINTER  
 -g -O2 -MT symbolize_unittest-symbolize_unittest.o -MD -MP -MF 
 .deps/symbolize_unittest-symbolize_unittest.Tpo -c -o 
 symbolize_unittest-symbolize_unittest.o `test -f 'src/symbolize_unittest.cc' 
 || echo './'`src/symbolize_unittest.cc
 mv -f .deps/symbolize_unittest-symbolize_unittest.Tpo 
 .deps/symbolize_unittest-symbolize_unittest.Po
 /bin/sh ./libtool --tag=CXX   --mode=link g++ -D_THREAD_SAFE -Wall 
 -Wwrite-strings -Woverloaded-virtual -Wno-sign-compare  -DNO_FRAME_POINTER  
 -g -O2 -D_THREAD_SAFE   -o symbolize_unittest  
 symbolize_unittest-symbolize_unittest.o  libglog.la   -lpthread
 libtool: link: g++ -D_THREAD_SAFE -Wall -Wwrite-strings -Woverloaded-virtual 
 -Wno-sign-compare -DNO_FRAME_POINTER -g -O2 -D_THREAD_SAFE -o 
 symbolize_unittest symbolize_unittest-symbolize_unittest.o -Wl,-bind_at_load  
 ./.libs/libglog.a -lpthread
 g++ -DHAVE_CONFIG_H -I. -I./src  -I./src  -D_THREAD_SAFE -Wall 
 -Wwrite-strings -Woverloaded-virtual -Wno-sign-compare  -DNO_FRAME_POINTER  
 -g -O2 -MT stl_logging_unittest-stl_logging_unittest.o -MD -MP -MF 
 .deps/stl_logging_unittest-stl_logging_unittest.Tpo -c -o 
 stl_logging_unittest-stl_logging_unittest.o `test -f 
 'src/stl_logging_unittest.cc' || echo './'`src/stl_logging_unittest.cc
 In file included from src/stl_logging_unittest.cc:34:
 In file included from ./src/glog/stl_logging.h:54:
 /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../lib/c++/v1/ext/hash_set:202:2:
  warning: Use of the header ext/hash_set is deprecated. Migrate to 
 unordered_set [-W#warnings]
 #warning Use of the header ext/hash_set is deprecated.  Migrate to 
 unordered_set
  ^
 In file included from src/stl_logging_unittest.cc:34:
 In file included from ./src/glog/stl_logging.h:55:
 /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../lib/c++/v1/ext/hash_map:209:2:
  warning: Use of the header ext/hash_map is deprecated. Migrate to 
 unordered_map [-W#warnings]
 #warning Use of the header ext/hash_map is deprecated.  Migrate to 
 unordered_map
  ^
 In file included from src/stl_logging_unittest.cc:34:
 ./src/glog/stl_logging.h:56:11: fatal error: 'ext/slist' file not found
 # include ext/slist
   ^
 2 warnings and 1 error generated.
 make[7]: *** [stl_logging_unittest-stl_logging_unittest.o] Error 1
 
 #
 
 —
 Tom Arnfeld
 Developer // DueDil
 
 On 30 Jan 2014, at 19:46, Benjamin Mahler benjamin.mah...@gmail.com wrote:
 
 I don't believe that we compile with C++11 on gcc 4.2, and C++11 support did 
 not land in 0.16.0 IIRC.
 
 You should remove your -std=c++11 flag. Let us know if that does not work.
 
 
 On Thu, Jan 30, 2014 at 11:40 AM, Tom Arnfeld t...@duedil.com wrote:
 I’m trying to get going with Mesos to do a bit of exploration and i’m having 
 trouble compiling any version of Mesos on Mac OSX. I’m only looking to use 
 the python binding (not actually run a mesos master/slave) on OSX to talk to 
 a remote Mesos/ZK cluster i’ve got setup.
 
 I’ve sifted through a bunch of errors, and hit a wall with one I can’t seem 
 to solve. I’m using GCC 4.2 (`brew install gcc4.2`) as there are issues with 
 the GCC included in Xcode. I’ve also had to switch to using protobuf 2.5.0 
 in `mesos/3rdparty/libprocess/3rdparty.
 
  git clone https://git-wip-us.apache.org/repos/asf/mesos.git
  cd mesos
  git checkout git checkout 0.16.0-rc4
  ./bootstrap
  CC=gcc-4.2 CXXFLAGS=-std=c++11 ./configure
  make
 
 
 
 /bin/sh ./libtool