Re: ensuring a particular task is deployed to "all" Mesos Worker hosts

2017-07-01 Thread Erik Weathers
…idea, James! My team currently uses the Mesos containerizer, so the log-access-between-containers issue is not a problem like it might be with the Docker containerizer. - Erik

On Sat, Jul 1, 2017 at 4:20 PM James Peach <jor...@gmail.com> wrote:
> On Jul 1, 2017, at 11:14 AM, E…

Re: ensuring a particular task is deployed to "all" Mesos Worker hosts

2017-07-01 Thread Erik Weathers
> …n't guarantee it gets launched everywhere (the same is true for a bunch of other services as well, namely metrics). We don't have an exact timeline for when we will build this support yet, but we will certainly announce it once we start actively working on it.
…

Re: ensuring a particular task is deployed to "all" Mesos Worker hosts

2017-07-01 Thread Erik Weathers
> …d the 'maybe all hosts' constraint, but if it's 'up to one per host', it sounds like a CM issue to me.)

On 30 June 2017 at 23:58, Erik Weathers <eweath...@groupon.com> wrote:
> hi Mesos folks! My team is largely responsible for maintaining the Storm-on-Mesos…

ensuring a particular task is deployed to "all" Mesos Worker hosts

2017-06-30 Thread Erik Weathers
hi Mesos folks! My team is largely responsible for maintaining the Storm-on-Mesos framework. It suffers from a problem related to log retrieval: Storm has a process called the "logviewer" that is assumed to exist on every host, and the Storm UI provides links to contact this process to download

Re: Failure reason documentation

2016-12-04 Thread Erik Weathers
…apache.org/jira/browse/MESOS-6686 to track it.

On Mon, Dec 5, 2016 at 2:27 AM, Erik Weathers <eweath...@groupon.com> wrote:
> I think he's looking for documentation about what precisely each reason *means*. À la how there are comments beside the TaskState list in…

Re: Failure reason documentation

2016-12-04 Thread Erik Weathers
I think he's looking for documentation about what precisely each reason *means*, à la the comments beside the TaskState list in mesos.proto. - Erik

On Sun, Dec 4, 2016 at 10:07 AM haosdent wrote:
> Hi @Wil You could find them here…

Re: Using mesos' cfs limits on a docker container?

2016-08-14 Thread Erik Weathers
What was the problem and how did you overcome it? (Otherwise this would be a sad resolution to this thread for someone facing the same problem in the future.)

On Sunday, August 14, 2016, Mark Hammons wrote:
> I finally got this working after fiddling with it all…

Re: Programmatically retrieve stdout/stderr from a node

2016-08-10 Thread Erik Weathers
Just for completeness and to provide an alternative, you can also probably leverage the dcos command line tool (https://github.com/dcos/dcos-cli) to get all the info you would need in a JSON format. e.g., 1. set up ~/.dcos/config.toml for your cluster 2. DCOS_CONFIG=~/.dcos/config.toml dcos task

Re: Escaped characters in the 'env' argument passed to mesos-execute

2016-06-30 Thread Erik Weathers
…at 9:31 AM, Chris Baker <ch...@galacticfog.com> wrote:
> On a side note, requiring people to put JSON on the command line is a sadistic thing to do.

On Thu, Jun 30, 2016 at 12:28 PM Erik Weathers <eweath...@groupon.com> wrote:
> +1 I would wrap every…

Re: Escaped characters in the 'env' argument passed to mesos-execute

2016-06-30 Thread Erik Weathers
+1, I would wrap every string in quotes... otherwise your shell doesn't know what you mean. i.e., how is the shell supposed to know that you want this all to be one string value for the --command parameter? --command=/home/john/anaconda3/bin/python /home/john/mesos/error_msg.py read Similarly…
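To see what the shell actually does with each form, Python's `shlex` module (which follows POSIX shell word-splitting rules) can tokenize both variants. This is a small illustration, not part of the original thread; the paths are the ones from the question above:

```python
import shlex

# Unquoted: the shell splits on whitespace, so --command only receives
# the interpreter path; the script path and "read" become separate args.
unquoted = shlex.split(
    "--command=/home/john/anaconda3/bin/python /home/john/mesos/error_msg.py read"
)
print(unquoted)   # three separate tokens

# Quoted: the entire command line stays a single --command value.
quoted = shlex.split(
    '--command="/home/john/anaconda3/bin/python /home/john/mesos/error_msg.py read"'
)
print(quoted)     # one token, quotes stripped by the "shell"
```

Running this makes the splitting visible: the unquoted form yields three tokens, the quoted form exactly one.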

Re: New external dependency

2016-06-20 Thread Erik Weathers
@Kevin: FYI, it's best practice to use a commit SHA in GitHub links so that future readers see the content you intended. i.e., instead of: - https://github.com/NVIDIA/nvidia-docker/blob/master/tools/src/nvidia/volumes.go#L109 It's best to do: - …

Re: Mesos 0.28.2 does not start

2016-06-11 Thread Erik Weathers
I'm not 100% sure what is implied by "public" and "floating" IPs here. Are the IPs of 10.20.250.x the *actual* IPs on the hosts? i.e., if you run "ifconfig -a" or "ip a" do those IPs appear? If not then you cannot bind to them on the host. - Erik On Sat, Jun 11, 2016 at 10:32 AM, Stefano
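The "can you bind to them?" question above can also be checked programmatically. A minimal Python sketch; the `203.0.113.x` address comes from the reserved TEST-NET-3 documentation range, so it should never be assigned to a real host:

```python
import socket

def can_bind(ip: str, port: int = 0) -> bool:
    """Return True if this host actually owns `ip`, i.e. a socket can
    bind to it (port 0 lets the OS pick any free port)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((ip, port))
        return True
    except OSError:
        # e.g. EADDRNOTAVAIL: the IP is not configured on this host
        return False
    finally:
        s.close()

print(can_bind("127.0.0.1"))      # loopback is always local
print(can_bind("203.0.113.10"))   # documentation-range IP: not local
```

If the 10.20.250.x addresses fail a check like this on the Mesos hosts, the daemons cannot bind to them either.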

Re: Port Resource Offers

2016-03-29 Thread Erik Weathers
hi Pradeep, Yes, that would *definitely* be a problem. e.g., the Storm Framework could easily assign Storm Workers to use those unavailable ports, and then they would fail to come up since they wouldn't be able to bind to their assigned port. I've answered a similar question before:
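A framework could guard against handing out an unusable port by test-binding it first. This is an illustrative sketch under that assumption, not code from the Storm framework:

```python
import socket

def first_bindable_port(ports):
    """Return the first port in `ports` this host can actually bind,
    else None. A framework scheduler could run a check like this on the
    agent before assigning an offered port to a task."""
    for port in ports:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind(("", port))
            return port
        except OSError:
            # Port already in use (or otherwise unavailable): skip it.
            continue
        finally:
            s.close()
    return None
```

A port that some other process already holds would be skipped, avoiding the bind failure at task startup described above.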

Re: Reusing Task IDs

2016-02-22 Thread Erik Weathers
> Is there any special case that framework has to re-use the TaskID; if no special case, I think we should ask framework to avoid reuse TaskID.
>
> Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
…

Re: Reusing Task IDs

2016-02-21 Thread Erik Weathers
tl;dr: *Reusing TaskIDs clashes with the mesos-agent recovery feature.*

Adam Bordelon wrote:
> Reusing taskIds may work if you're guaranteed to never be running two instances of the same taskId simultaneously

I've encountered another scenario where reusing TaskIDs is dangerous, even if you meet…
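One simple way to avoid reuse entirely is to append a random suffix to the logical task name, so relaunches of the "same" task never collide with a stale ID the agent may still be recovering. A small sketch; the naming scheme here is hypothetical, not what storm-mesos actually does:

```python
import uuid

def make_task_id(logical_name: str) -> str:
    """Combine a human-readable logical name with a random hex suffix,
    so every launch gets a globally unique TaskID."""
    return f"{logical_name}-{uuid.uuid4().hex}"

first = make_task_id("storm-worker-topo1-31000")
relaunch = make_task_id("storm-worker-topo1-31000")
print(first)
print(relaunch)  # same logical name, different TaskID
```

The logical name stays greppable in logs while the suffix guarantees uniqueness across relaunches.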

Re: memory limit exceeded ==> KILL instead of TERM (first)

2016-02-12 Thread Erik Weathers
hey Harry, As Vinod said, the mesos-slave/agent will issue a status update about the OOM condition. This will be received by the scheduler of the framework. In the storm-mesos framework we just log the messages (see below), but you might consider somehow exposing these messages directly to the
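As an illustration of the kind of logging described, here is a hedged sketch of a status-update handler. The `TaskStatus` namedtuple is a stand-in for the real protobuf message, though the state and reason names (`TASK_FAILED`, `REASON_CONTAINER_LIMITATION_MEMORY`) do come from mesos.proto:

```python
from collections import namedtuple

# Minimal stand-in for the TaskStatus protobuf fields we care about.
TaskStatus = namedtuple("TaskStatus", "task_id state reason message")

def describe_status(update: TaskStatus) -> str:
    """Render a status update the way a scheduler might log it.  An OOM
    kill arrives as TASK_FAILED with REASON_CONTAINER_LIMITATION_MEMORY."""
    line = f"task {update.task_id}: {update.state} ({update.reason})"
    if update.message:
        line += f" - {update.message}"
    return line

oom = TaskStatus("worker-1", "TASK_FAILED",
                 "REASON_CONTAINER_LIMITATION_MEMORY",
                 "Memory limit exceeded")
print(describe_status(oom))
```

Instead of (or in addition to) logging, a framework could forward such messages to users, as suggested above.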

Re: Zookeeper cluster changes

2015-11-10 Thread Erik Weathers
Keep in mind that Mesos is designed to "fail fast": when there are problems (such as losing connectivity to the resolved ZooKeeper IP), the daemons (master & slave) die. Due to this design, we are all supposed to run the mesos daemons under "supervision", which means auto-restart after they…
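In production this supervision is normally done by systemd, monit, runit, or similar, but the essence of "run under supervision" is just a restart loop. A toy Python sketch of the idea, not a real supervisor:

```python
import subprocess
import sys
import time

def supervise(cmd, max_restarts=3, backoff=0.1):
    """Re-launch `cmd` each time it exits non-zero, up to `max_restarts`
    times, sleeping `backoff` seconds between attempts.  Returns the
    number of restarts performed."""
    restarts = 0
    while restarts <= max_restarts:
        rc = subprocess.run(cmd).returncode
        if rc == 0:
            return restarts          # clean exit: stop supervising
        restarts += 1
        time.sleep(backoff)
    return restarts

# A stand-in for a daemon that always "fails fast":
failing = [sys.executable, "-c", "raise SystemExit(1)"]
```

Real supervisors add exponential backoff and rate limits so a crash-looping daemon does not spin the CPU.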

Re: Mesos and Zookeeper TCP keepalive

2015-11-09 Thread Erik Weathers
It would really help if you (Jeremy) explained the *actual* problem you are facing. I'm *guessing* that it's a firewall timing out the sessions because there isn't activity on them for whatever the timeout of the firewall is? It seems likely to be unreasonably short, given that mesos has
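If a short firewall idle-timeout is indeed the culprit, enabling TCP keepalive on the client socket is one workaround, since it generates periodic probe traffic on otherwise idle connections. A Python sketch; the `TCP_KEEPIDLE`/`TCP_KEEPINTVL`/`TCP_KEEPCNT` knobs are Linux-specific, hence the `hasattr` guard:

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Turn keepalive on; the kernel will send probes on idle connections.
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
if hasattr(socket, "TCP_KEEPIDLE"):
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # idle secs before first probe
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # secs between probes
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)     # failed probes before drop
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))
```

With these (example) settings, an idle connection sees a probe every 60 seconds, which is typically well inside any firewall's session timeout.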

preventing registry failures from happening in mesos-master?

2015-05-07 Thread Erik Weathers
I know we're supposed to run the mesos daemons under supervision (i.e., bring them back up automatically if they fail). But I'm interested in not having the mesos-master fail at all, especially a failure in the registry / replicated_log, which I am already a little scared of. Situation: -

Re: docker based executor

2015-04-17 Thread Erik Weathers
hey Tyson, I've also worked a bit on simplifying the mesos-storm framework -- I spent the recent Mesosphere hackathon working with tnachen of Mesosphere on this. Nothing deliverable quite yet. We didn't look at dockerization at all; the hacking we did was around these goals: * Avoiding…