Re: ensuring a particular task is deployed to "all" Mesos Worker hosts

2017-07-01 Thread Dick Davies
If it _needs_ to be there always then I'd roll it out with whatever
automation you use to deploy the mesos workers; depending on
the scale you're running at, launching it as a task is likely to be less
reliable due to outages etc.

( I understand the 'maybe all hosts' constraint but if it's 'up to one per
host', it sounds like a CM issue to me. )
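
For what it's worth, a rough sketch of what I mean - this assumes
systemd hosts and a Storm install under /opt/storm (both made-up
details, adjust for your layout), shipped out by whatever builds the
workers:

# /etc/systemd/system/storm-logviewer.service  (hypothetical unit name/path)
[Unit]
Description=Storm logviewer, one per Mesos worker host
After=network.target

[Service]
# assumes the Storm distribution is unpacked under /opt/storm
ExecStart=/opt/storm/bin/storm logviewer
Restart=on-failure
User=storm

[Install]
WantedBy=multi-user.target

Then 'systemctl enable --now storm-logviewer' on each worker, and the
Storm UI links always have something to talk to.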

On 30 June 2017 at 23:58, Erik Weathers  wrote:
> hi Mesos folks!
>
> My team is largely responsible for maintaining the Storm-on-Mesos framework.
> It suffers from a problem related to log retrieval:  Storm has a process
> called the "logviewer" that is assumed to exist on every host, and the Storm
> UI provides links to contact this process to download logs (and other
> debugging artifacts).   Our team manually runs this process on each Mesos
> host, but it would be nice to launch it automatically onto any Mesos host
> where Storm work gets allocated. [0]
>
> I have read that Mesos has added support for Kubernetes-esque "pods" as of
> version 1.1.0, but that feature seems somewhat insufficient for implementing
> our desired behavior from my naive understanding.  Specifically, Storm only
> has support for connecting to 1 logviewer per host, so unless pods can have
> separate containers inside each pod [1], and also dynamically change the set
> of executors and tasks inside of the pod [2], then I don't see how we'd be
> able to use them.
>
> Is there any existing feature in Mesos that might help us accomplish our
> goal?  Or any upcoming features?
>
> Thanks!!
>
> - Erik
>
> [0] Thus the "all" in quotes in the subject of this email, because it
> *might* be all hosts, but it definitely would be all hosts where Storm gets
> work assigned.
>
> [1] The Storm-on-Mesos framework leverages separate containers for each
> topology's Supervisor and Worker processes, to provide isolation between
> topologies.
>
> [2] The assignment of Storm Supervisors (a Mesos Executor) + Storm Workers
> (a Mesos Task) onto hosts is ever changing in a given instance of a
> Storm-on-Mesos framework.  i.e., as topologies get launched and die, or have
> their worker processes die, the processes are dynamically distributed to the
> various Mesos Worker hosts.  So existing containers often have more tasks
> assigned into them (thus growing their footprint) or removed from them (thus
> shrinking the footprint).


Re: Mesos (and Marathon) port mapping

2017-03-29 Thread Dick Davies
I should say this was tested around Mesos 1.0, so they may have changed
things since - but yes, this is vanilla networking, no CNI or anything like that.

But I'm guessing if you're using BRIDGE networking and specifying a
hostPort: you're causing work for yourself (unless you actually care what
port the slave is using).

On 29 March 2017 at 10:22, Thomas HUMMEL  wrote:
>
>
> On 03/28/2017 06:53 PM, Tomek Janiszewski wrote:
>
> 1. Mentioned port range is the Mesos Agent resource setting, so if you don't
> explicitly define port range it would be used.
> https://github.com/apache/mesos/blob/1.2.0/src/slave/constants.hpp#L86
>
> 2. With ports mapping two or more applications could attach to same
> container port but will be exposed under different host port.
>
>
> Thanks for your answer.
>
> 1. So it's not network/portmapping isolator specific, right ? Even without
> it, non-ephemeral ports would be considered as part of the offer and would
> be chosen in this range by default ?
>
> 2. So containers, even with network/port_mapping isolation, *share* the
> non-ephemeral port range, although doc states "The agent assigns each
> container a non-overlapping range of the ports" which I first read as "each
> container gets it's own port range", right ?
>
> So I am a bit confused since what's described here
>
> http://mesos.apache.org/documentation/latest/port-mapping-isolator/
>
> in the "Configuring network ports" seems to be valid even without port
> mapping isolator.
>
> Am I getting this right this time ?
>
> Thanks.
>
> --
> Thomas HUMMEL
>


Re: Mesos (and Marathon) port mapping

2017-03-28 Thread Dick Davies
Try setting your hostPort to 0, to tell Mesos to select one
(which it will allocate out of the pool the mesos slave is set to use).

This works for me for redis:


{
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "redis",
      "network": "BRIDGE",
      "portMappings": [
        { "containerPort": 6379, "hostPort": 0, "protocol": "tcp" }
      ]
    }
  },
  "ports": [0],
  "instances": 1,
  "cpus": 0.1,
  "mem": 128,
  "uris": []
}

(caveat: haven't run marathon or mesos for a little while)
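
If it helps, this is roughly how I push that to Marathon - assuming
the JSON above is saved as redis.json with an "id" field added
(Marathon wants one), and that Marathon is on
marathon.example.com:8080 (made-up hostname):

curl -X POST http://marathon.example.com:8080/v2/apps \
     -H 'Content-Type: application/json' \
     -d @redis.json

The host port Mesos actually handed out then shows up against the
task, e.g. via GET /v2/apps/<app-id>/tasks.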

On 28 March 2017 at 17:53, Tomek Janiszewski  wrote:
> 1. Mentioned port range is the Mesos Agent resource setting, so if you don't
> explicitly define port range it would be used.
> https://github.com/apache/mesos/blob/1.2.0/src/slave/constants.hpp#L86
>
> 2. With ports mapping two or more applications could attach to same
> container port but will be exposed under different host port.
>
> 3. I'm not sure if ports mappings works in Host mode. Try with require ports
> option enabled.
> https://github.com/mesosphere/marathon/blob/v1.3.9/docs/docs/ports.md
>
> 4. Yes, service ports are only for Marathon and don't propagate to Mesos.
> http://stackoverflow.com/a/39468348/1387612
>
>
> On Tue, 28 Mar 2017 at 18:16, Thomas HUMMEL wrote:
>>
>> Hello,
>>
>> [Sorry if this post may seem more Marathon-oriented. It still contains
>> Mesos specific questions.]
>>
>> I'm in the process of discovering/testing/trying to understand Mesos and
>> Marathon.
>>
>> After having read some books and docs, I set up a small environment (9
>> linux
>> CentOS 7.3 VMs) consisting of :
>>
>>. 3 Mesos master - quorum = 2
>>. 3 Zookeepers servers running on the same host as the mesos servers
>>. 2 Mesos slaves
>>. 3 Marathon servers
>>. 1 HAproxy facing the Mesos servers
>>
>> Mesos has been installed from sources (1.2.0 version) and Marathon is
>> the 1.3.9
>> tarball comming from mesosphere
>>
>> I've deployed :
>>
>>. mesos-dns as a Marathon (not dockerized) application on one of the
>>  slaves (with a constraint) configured with my site DNS as resolvers
>> and only
>>  "host" as IPSources
>>
>>. marathon-lb as a Marathon dockerized app ("network": "HOST") with the
>>  simple (containerPort: 9090, hostPort: 9090, servicePort: 1)
>> portMapping,
>>  on the same slave using a constraint
>>
>> Everything works fine so far.
>> I've read :
>>
>>https://mesosphere.github.io/marathon/docs/ports.html
>>
>> and
>>
>>http://mesos.apache.org/documentation/latest/port-mapping-isolator/
>>
>> but I'm still quite confused by the following port-related questions :
>>
>> [Note : I'm not using "network/port_mapping" isolation for now. I sticked
>> to
>>
>>export MESOS_containerizers=docker,mesos]
>>
>> 1. for such a simple dockerized app :
>>
>> {
>>"id": "http-server",
>>"cmd": "python3 -m http.server 8080",
>>"cpus": 0.5,
>>"mem": 32.0,
>>"container": {
>>  "type": "DOCKER",
>>  "docker": {
>>"image": "python:3",
>>"network": "BRIDGE",
>>"portMappings": [
>>  { "containerPort": 8080, "hostPort": 31000, "servicePort": 5000 }
>>]
>>  }
>>},
>>"labels":{
>>  "HAPROXY_GROUP":"external"
>>}
>> }
>>
>> a) in HOST mode ("network": "HOST"), any hostPort seems to work (or at
>> least, let say 9090)
>>
>> b) in BRIDGE mode ("network": "BRIDGE"), the valid hostPort range seems
>> to be
>> [31000 - 32000], which seems to match the Mesos non-ephemeral port range
>> given
>> as en example in
>>
>>http://mesos.apache.org/documentation/latest/port-mapping-isolator/
>>
>> But I don't quite understand why since
>>
>>- I'm not using network/port_mapping isolation
>>- I didn't configured any port range anywhere in Mesos
>>
>> 2. Obviously in my setup, 2 apps on the same slave cannot have the same
>> hostPort. Would it be the same with network/port_mapping activated
>> since the
>> doc says : "he agent assigns each container a non-overlapping range
>> of the
>> ports"
>>
>> Am I correct assuming that a Marathon hostPort is to be understood
>> as taken among the non-ephemeral Mesos ports ?
>>
>> With network/port_mapping isolation, could 2 apps have the same
>> non-ephemeal port ? same question with ephemeral-port ? I doubt it but...
>> Is what is described in this doc valid for a dockerized container also
>> ?
>>
>> 3. the portMapping I configured for the dockerized ("network": "HOST")
>> marathon-lb app is
>>
>> "portMappings": [
>>{
>>  "containerPort": 9090,
>>  "hostPort": 9090,
>>  "servicePort": 1,
>>  "protocol": "tcp"
>>
>> on the slave I can verify :
>>
>># lsof -i :9090
>>COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
>>haproxy 29610 root    6u  IPv4 461745      0t0  TCP *:websm (LISTEN)
>> But Marathon tells that my app is running on :
>>
>>

Re: mirror of mesosphere's repo

2016-09-20 Thread Dick Davies
It's on s3 isn't it - maybe CloudFront?
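
If all you need is a local copy to serve inside your own network
(rather than an official mirror), something crude like this has
worked for me against other repos - the path here is a guess, check
what repos.mesosphere.com actually exposes for your distro:

# mirror the Ubuntu part of the repo into the current directory
wget --mirror --no-parent --no-host-directories \
     http://repos.mesosphere.com/ubuntu/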

On 20 September 2016 at 05:48, tommy xiao  wrote:
> Hi Team and Mesosphere's repo,
>
> can Mesosphere  provide a sync server way with http://repos.mesosphere.com/.
> it will help china's community to sync the package from mirror repo.
>
> --
> Deshi Xiao
> Twitter: xds2000
> E-mail: xiaods(AT)gmail.com


Re: Fetcher cache: caching even more while an executor is alive

2016-07-07 Thread Dick Davies
I'd try the Docker image approach.
We've done this in the past and used our CM tool to 'seed' all slaves
by running 'docker pull foo:v1'  across them all in advance, saved a
lot of startup time (although we were only dealing with a Gb or so of
dependencies).
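
The 'seed' step itself was nothing clever - our CM tool drove it, but
it boils down to something like this (slaves.txt and the image tag
are placeholders):

# pre-warm the docker cache on every slave; slaves.txt is one hostname per line
while read host; do
    ssh "$host" 'docker pull foo:v1'
done < slaves.txt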

On 5 July 2016 at 11:23, Kota UENISHI  wrote:
> Thanks, it looks promising to me - I was aware of persistent volumes
> because I thought the use case was different, like for databases. I'll
> try it on.
>
> As the document says
>
>> persistent volumes are associated with roles,
>
> this makes failure handling a little bit difficult - As my framework
> is not handling failure well enough, those volume IDs must be
> remembered during framework restart or failover, or must get recovered
> after.  Restarted framework must reuse or collect already reserved
> volumes or those volumes just gets leaking.
>
> Kota UENISHI
>
>
> On Tue, Jul 5, 2016 at 6:03 PM, Aaron Carey  wrote:
>> As you're writing the framework, have you looked at reserving persistent 
>> volumes? I think it might help in your use case:
>>
>> http://mesos.apache.org/documentation/latest/persistent-volume/
>>
>> Aaron
>>
>> --
>>
>> Aaron Carey
>> Production Engineer - Cloud Pipeline
>> Industrial Light & Magic
>> London
>> 020 3751 9150
>>
>> 
>> From: 上西康太 [ueni...@nautilus-technologies.com]
>> Sent: 05 July 2016 08:24
>> To: user@mesos.apache.org
>> Subject: Fetcher cache: caching even more while an executor is alive
>>
>> Hi,
>> I'm developing my own framework - that distributes >100 independent
>> tasks across the cluster and just run them arbitrarily. My problem is,
>> each task execution environment is a bit large tarball (2~6GB, mostly
>> application jar files) and task itself finishes within 1~200 seconds,
>> while tarball extraction takes like tens of seconds every time.
>> Extracting the same tarball again and again in all tasks is a wasteful
>> overhead that cannot be ignored.
>>
>> Fetcher cache is great, but in my case, fetcher cache isn't even
>> enough and I want to preserve all files extracted from the tarball
>> while my executor is alive. If Mesos could cache all files extracted
>> from the tarball by omitting not only download but extraction, I could
>> save more time.
>>
>> In "Fetcher Cache Internals" [1] or in "Fetcher Cache" [2] section in
>> the official document, such issues or future work is not mentioned -
>> how do you solve this kind of extraction overhead problem, when you
>> have rather large resource ?
>>
>> An option would be setting up an internal docker registry and let
>> slaves cache the docker image that includes our jar files and save
>> tarball extraction. But, I want to prevent our system from additional
>> moving parts as much as I can.
>>
>> Another option might be let fetcher fetch all jar files independently
>> in slaves, but I think it feasible, but I don't think it manageable in
>> production in an easy way.
>>
>> PS Mesos is great; it is helping us a lot - I want to appreciate all
>> the efforts by the community. Thank you so much!
>>
>> [1] http://mesos.apache.org/documentation/latest/fetcher-cache-internals/
>> [2] http://mesos.apache.org/documentation/latest/fetcher/
>>
>> Kota UENISHI


Re: Mesos 0.28.2 does not start

2016-06-13 Thread Dick Davies
My guess would be your networking is still wonky. Each master is
putting their IP into zookeeper,
and the other masters use that to find each other for elections.

You can poke around in zookeeper with zkCli.sh; that should give you
an idea which IP is ending
up there - or just check each master's log, which will normally give
similar information.
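
For example, something like this against one of your zookeepers
(assuming the default /mesos znode and port 2181 - the json.info_*
name below is made up, use whatever 'ls' shows on your cluster):

zkCli.sh -server 192.168.100.3:2181
# then at the zkCli prompt:
ls /mesos
# each json.info_* child holds the IP/port that master advertised, e.g.
get /mesos/json.info_0000000025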

TBH it sounds like you've made yourself a difficult setup with this
openstack thing :(

On 13 June 2016 at 13:06, Stefano Bianchi <jazzist...@gmail.com> wrote:
> Hi guys i don't know why but my three masters cannot determine the leader.
> How can i give you a log file do check?
>
> 2016-06-12 10:42 GMT+02:00 Dick Davies <d...@hellooperator.net>:
>>
>> Try putting the IP you're binding to (the actual IP on the master) in
>> /etc/mesos-*/ip , and the externally accessible IP in
>> /etc/mesos-*/hostname.
>>
>> On 12 June 2016 at 00:57, Stefano Bianchi <jazzist...@gmail.com> wrote:
>> > ok i guess i figured out.
>> > The reason for which i put floating IP on hostname and ip files is
>> > written
>> > here:https://open.mesosphere.com/getting-started/install/
>> >
>> > It says:
>> > If you're unable to resolve the hostname of the machine directly (e.g.,
>> > if
>> > on a different network or using a VPN), set /etc/mesos-slave/hostname to
>> > a
>> > value that you can resolve, for example, an externally accessible IP
>> > address
>> > or DNS hostname. This will ensure all links from the Mesos console work
>> > correctly.
>> >
>> > The problem, i guess, is that the set of floating iPs 10.250.0.xxx is
>> > not
>> > externally accessible.
>> > In my other deployment i have set the floating IPs in these files and
>> > all is
>> > perfectly working, but in that case i have used externally reachable
>> > IPs.
>> >
>> > 2016-06-11 22:51 GMT+02:00 Erik Weathers <eweath...@groupon.com>:
>> >>
>> >> It depends on your setup.  I would probably not set the hostname and
>> >> instead set the "--no-hostname_lookup" flag.  I'm not sure how you do
>> >> that
>> >> with the file-based configuration style you are using.
>> >>
>> >> % mesos-master --help
>> >> ...
>> >>
>> >>   --hostname=VALUE  The hostname the master should
>> >> advertise
>> >> in ZooKeeper.
>> >> If left unset, the
>> >> hostname is resolved from the IP address
>> >> that the slave binds
>> >> to;
>> >> unless the user explicitly prevents
>> >> that, using
>> >> `--no-hostname_lookup`, in which case the IP itself
>> >> is used.
>> >>
>> >> On Sat, Jun 11, 2016 at 1:27 PM, Stefano Bianchi <jazzist...@gmail.com>
>> >> wrote:
>> >>>
>> >>> So Erik do you suggest to use the 192.* IP in both
>> >>> /etc/mesos-master/hostname nad /etc/mesos-master/ip right?
>> >>>
>> >>> Il 11/giu/2016 22:15, "Erik Weathers" <eweath...@groupon.com> ha
>> >>> scritto:
>> >>>>
>> >>>> Yeah, so there is no 10.x address on the box.  Thus you cannot bind
>> >>>> Mesos to listen to that address.   You need to use one of the 192.*
>> >>>> IPs for
>> >>>> Mesos to bind to.  I'm not sure why you say you need to use the 10.x
>> >>>> addresses for the UI, that sounds like a problem you should tackle
>> >>>> *after*
>> >>>> getting Mesos up.
>> >>>>
>> >>>> - Erik
>> >>>>
>> >>>> P.S., when using gmail in chrome, you can avoid those extraneous
>> >>>> newlines when you paste by holding "Shift" along with the Command-V
>> >>>> (at
>> >>>> least on Mac OS X!).
>> >>>>
>> >>>> On Sat, Jun 11, 2016 at 1:06 PM, Stefano Bianchi
>> >>>> <jazzist...@gmail.com>
>> >>>> wrote:
>> >>>>>
>> >>>>> ifconfig -a
>> >>>>>
>> >>>>> eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1454
>> >>>>>
>> >>>>> i

Re: Mesos 0.28.2 does not start

2016-06-12 Thread Dick Davies
Try putting the IP you're binding to (the actual IP on the master) in
/etc/mesos-*/ip , and the externally accessible IP in
/etc/mesos-*/hostname.
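
With the mesosphere packages (which read one flag per file from those
directories) that's just something like the below - addresses made
up, and the same idea applies under /etc/mesos-slave/ on the agents:

echo 192.168.100.3 | sudo tee /etc/mesos-master/ip        # an address that exists on the box, to bind to
echo 203.0.113.10  | sudo tee /etc/mesos-master/hostname  # the address/name other hosts use to reach it
sudo service mesos-master restart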

On 12 June 2016 at 00:57, Stefano Bianchi  wrote:
> ok i guess i figured out.
> The reason for which i put floating IP on hostname and ip files is written
> here:https://open.mesosphere.com/getting-started/install/
>
> It says:
> If you're unable to resolve the hostname of the machine directly (e.g., if
> on a different network or using a VPN), set /etc/mesos-slave/hostname to a
> value that you can resolve, for example, an externally accessible IP address
> or DNS hostname. This will ensure all links from the Mesos console work
> correctly.
>
> The problem, i guess, is that the set of floating iPs 10.250.0.xxx is not
> externally accessible.
> In my other deployment i have set the floating IPs in these files and all is
> perfectly working, but in that case i have used externally reachable IPs.
>
> 2016-06-11 22:51 GMT+02:00 Erik Weathers :
>>
>> It depends on your setup.  I would probably not set the hostname and
>> instead set the "--no-hostname_lookup" flag.  I'm not sure how you do that
>> with the file-based configuration style you are using.
>>
>> % mesos-master --help
>> ...
>>
>>   --hostname=VALUE  The hostname the master should advertise
>> in ZooKeeper.
>> If left unset, the
>> hostname is resolved from the IP address
>> that the slave binds to;
>> unless the user explicitly prevents
>> that, using
>> `--no-hostname_lookup`, in which case the IP itself
>> is used.
>>
>> On Sat, Jun 11, 2016 at 1:27 PM, Stefano Bianchi 
>> wrote:
>>>
>>> So Erik do you suggest to use the 192.* IP in both
>>> /etc/mesos-master/hostname nad /etc/mesos-master/ip right?
>>>
>>> Il 11/giu/2016 22:15, "Erik Weathers"  ha scritto:

 Yeah, so there is no 10.x address on the box.  Thus you cannot bind
 Mesos to listen to that address.   You need to use one of the 192.* IPs for
 Mesos to bind to.  I'm not sure why you say you need to use the 10.x
 addresses for the UI, that sounds like a problem you should tackle *after*
 getting Mesos up.

 - Erik

 P.S., when using gmail in chrome, you can avoid those extraneous
 newlines when you paste by holding "Shift" along with the Command-V (at
 least on Mac OS X!).

 On Sat, Jun 11, 2016 at 1:06 PM, Stefano Bianchi 
 wrote:
>
> ifconfig -a
>
> eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1454
>
> inet 192.168.100.3  netmask 255.255.255.0  broadcast
> 192.168.100.255
>
> inet6 fe80::f816:3eff:fe1c:a3bf  prefixlen 64  scopeid
> 0x20
>
> ether fa:16:3e:1c:a3:bf  txqueuelen 1000  (Ethernet)
>
> RX packets 61258  bytes 4686426 (4.4 MiB)
>
> RX errors 0  dropped 0  overruns 0  frame 0
>
> TX packets 40537  bytes 3603100 (3.4 MiB)
>
> TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>
>
> lo: flags=73  mtu 65536
>
> inet 127.0.0.1  netmask 255.0.0.0
>
> inet6 ::1  prefixlen 128  scopeid 0x10
>
> loop  txqueuelen 0  (Local Loopback)
>
> RX packets 28468  bytes 1672684 (1.5 MiB)
>
> RX errors 0  dropped 0  overruns 0  frame 0
>
> TX packets 28468  bytes 1672684 (1.5 MiB)
>
> TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>
>
>
> ip addr:1: lo:  mtu 65536 qdisc noqueue state
> UNKNOWN
>
> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>
> inet 127.0.0.1/8 scope host lo
>
>valid_lft forever preferred_lft forever
>
> inet6 ::1/128 scope host
>
>valid_lft forever preferred_lft forever
>
> 2: eth0:  mtu 1454 qdisc pfifo_fast
> state UP qlen 1000
>
> link/ether fa:16:3e:1c:a3:bf brd ff:ff:ff:ff:ff:ff
>
> inet 192.168.100.3/24 brd 192.168.100.255 scope global dynamic eth0
>
>valid_lft 77537sec preferred_lft 77537sec
>
> inet6 fe80::f816:3eff:fe1c:a3bf/64 scope link
>
>valid_lft forever preferred_lft forever
>
>
> 2016-06-11 20:05 GMT+02:00 haosdent :
>>
>> As @Erik said, what is your `ifconfig` or `ip addr` command output?
>>
>> On Sun, Jun 12, 2016 at 2:00 AM, Stefano Bianchi
>>  wrote:
>>>
>>> the result of your command give this:
>>>
>>> [root@master ~]# nc 

Re: Mesos HA does not work (Failed to recover registrar)

2016-06-05 Thread Dick Davies
The extra zookeepers listed in the second argument will let your mesos
master process keep working if
its local zookeeper goes down for maintenance.

On 5 June 2016 at 13:55, Qian Zhang <zhq527...@gmail.com> wrote:
>> You need the 2nd command line (i.e. you have to specify all the zk
>> nodes on each master, it's
>> not like e.g. Cassandra where you can discover other nodes from the
>> first one you talk to).
>
>
> I have an Open DC/OS environment which is enabled master HA (there are 3
> master nodes) and works very well, and I see each Mesos master is started to
> only connect with local zk:
> $ cat /opt/mesosphere/etc/mesos-master | grep ZK
> MESOS_ZK=zk://127.0.0.1:2181/mesos
>
> So I think I do not have to specify all the zk on each master.
>
>
>
>
>
>
>
> Thanks,
> Qian Zhang
>
> On Sun, Jun 5, 2016 at 4:25 PM, Dick Davies <d...@hellooperator.net> wrote:
>>
>> OK, good - that part looks as expected, you've had a successful
>> election for a leader
>> (and yes that sounds like your zookeeper layer is ok).
>>
>> You need the 2nd command line (i.e. you have to specify all the zk
>> nodes on each master, it's
>> not like e.g. Cassandra where you can discover other nodes from the
>> first one you talk to).
>>
>> The error you were getting was about the internal registry /
>> replicated log, which is a mesos master level thing.
>> You could try what Sivaram suggested - stopping the mesos master
>> processes, wiping their
>> work_dirs and starting them back up.
>> Perhaps some wonky state got in there while you were trying various
>> options?
>>
>>
>> On 5 June 2016 at 00:34, Qian Zhang <zhq527...@gmail.com> wrote:
>> > Thanks Vinod and Dick.
>> >
>> > I think my 3 ZK servers have formed a quorum, each of them has the
>> > following
>> > config:
>> > $ cat conf/zoo.cfg
>> > server.1=192.168.122.132:2888:3888
>> > server.2=192.168.122.225:2888:3888
>> > server.3=192.168.122.171:2888:3888
>> > autopurge.purgeInterval=6
>> > autopurge.snapRetainCount=5
>> > initLimit=10
>> > syncLimit=5
>> > maxClientCnxns=0
>> > clientPort=2181
>> > tickTime=2000
>> > quorumListenOnAllIPs=true
>> > dataDir=/home/stack/packages/zookeeper-3.4.8/snapshot
>> > dataLogDir=/home/stack/packages/zookeeper-3.4.8/transactions
>> >
>> > And when I run "bin/zkServer.sh status" on each of them, I can see
>> > "Mode:
>> > leader" for one, and "Mode: follower" for the other two.
>> >
>> > I have already tried to manually start 3 masters simultaneously, and
>> > here is
>> > what I see in their log:
>> > In 192.168.122.171(this is the first master I started):
>> > I0605 07:12:49.418721  1187 detector.cpp:152] Detected a new leader:
>> > (id='25')
>> > I0605 07:12:49.419276  1186 group.cpp:698] Trying to get
>> > '/mesos/log_replicas/24' in ZooKeeper
>> > I0605 07:12:49.420013  1188 group.cpp:698] Trying to get
>> > '/mesos/json.info_25' in ZooKeeper
>> > I0605 07:12:49.423807  1188 zookeeper.cpp:259] A new leading master
>> > (UPID=master@192.168.122.171:5050) is detected
>> > I0605 07:12:49.423841 1186 network.hpp:461] ZooKeeper group PIDs: {
>> > log-replica(1)@192.168.122.171:5050 }
>> > I0605 07:12:49.424281 1187 master.cpp:1951] The newly elected leader
>> > is
>> > master@192.168.122.171:5050 with id cdc459d4-a05f-4f99-9bf4-1ee9a91d139b
>> > I0605 07:12:49.424895  1187 master.cpp:1964] Elected as the leading
>> > master!
>> >
>> > In 192.168.122.225 (second master I started):
>> > I0605 07:12:51.918702  2246 detector.cpp:152] Detected a new leader:
>> > (id='25')
>> > I0605 07:12:51.919983  2246 group.cpp:698] Trying to get
>> > '/mesos/json.info_25' in ZooKeeper
>> > I0605 07:12:51.921910  2249 network.hpp:461] ZooKeeper group PIDs: {
>> > log-replica(1)@192.168.122.171:5050 }
>> > I0605 07:12:51.925721 2252 replica.cpp:673] Replica in EMPTY status
>> > received a broadcasted recover request from (6)@192.168.122.225:5050
>> > I0605 07:12:51.927891  2246 zookeeper.cpp:259] A new leading master
>> > (UPID=master@192.168.122.171:5050) is detected
>> > I0605 07:12:51.928444 2246 master.cpp:1951] The newly elected leader
>> > is
>> > master@192.

Re: Mesos HA does not work (Failed to recover registrar)

2016-06-05 Thread Dick Davies
OK, good - that part looks as expected, you've had a successful
election for a leader
(and yes that sounds like your zookeeper layer is ok).

You need the 2nd command line (i.e. you have to specify all the zk
nodes on each master, it's
not like e.g. Cassandra where you can discover other nodes from the
first one you talk to).

The error you were getting was about the internal registry /
replicated log, which is a mesos master level thing.
You could try what Sivaram suggested - stopping the mesos master
processes, wiping their
work_dirs and starting them back up.
Perhaps some wonky state got in there while you were trying various options?
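
i.e. roughly this on each master, using the work_dir from the command
line you posted (stop/start however you normally run the masters -
mine are under service management, yours are launched by hand):

sudo service mesos-master stop        # or kill the mesos-master.sh you started manually
sudo rm -rf /var/lib/mesos/master/*   # wipes the registrar / replicated log state
sudo service mesos-master start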


On 5 June 2016 at 00:34, Qian Zhang <zhq527...@gmail.com> wrote:
> Thanks Vinod and Dick.
>
> I think my 3 ZK servers have formed a quorum, each of them has the following
> config:
> $ cat conf/zoo.cfg
> server.1=192.168.122.132:2888:3888
> server.2=192.168.122.225:2888:3888
> server.3=192.168.122.171:2888:3888
> autopurge.purgeInterval=6
> autopurge.snapRetainCount=5
> initLimit=10
> syncLimit=5
> maxClientCnxns=0
> clientPort=2181
> tickTime=2000
> quorumListenOnAllIPs=true
> dataDir=/home/stack/packages/zookeeper-3.4.8/snapshot
> dataLogDir=/home/stack/packages/zookeeper-3.4.8/transactions
>
> And when I run "bin/zkServer.sh status" on each of them, I can see "Mode:
> leader" for one, and "Mode: follower" for the other two.
>
> I have already tried to manually start 3 masters simultaneously, and here is
> what I see in their log:
> In 192.168.122.171(this is the first master I started):
> I0605 07:12:49.418721  1187 detector.cpp:152] Detected a new leader:
> (id='25')
> I0605 07:12:49.419276  1186 group.cpp:698] Trying to get
> '/mesos/log_replicas/24' in ZooKeeper
> I0605 07:12:49.420013  1188 group.cpp:698] Trying to get
> '/mesos/json.info_25' in ZooKeeper
> I0605 07:12:49.423807  1188 zookeeper.cpp:259] A new leading master
> (UPID=master@192.168.122.171:5050) is detected
> I0605 07:12:49.423841  1186 network.hpp:461] ZooKeeper group PIDs: {
> log-replica(1)@192.168.122.171:5050 }
> I0605 07:12:49.424281  1187 master.cpp:1951] The newly elected leader is
> master@192.168.122.171:5050 with id cdc459d4-a05f-4f99-9bf4-1ee9a91d139b
> I0605 07:12:49.424895  1187 master.cpp:1964] Elected as the leading
> master!
>
> In 192.168.122.225 (second master I started):
> I0605 07:12:51.918702  2246 detector.cpp:152] Detected a new leader:
> (id='25')
> I0605 07:12:51.919983  2246 group.cpp:698] Trying to get
> '/mesos/json.info_25' in ZooKeeper
> I0605 07:12:51.921910  2249 network.hpp:461] ZooKeeper group PIDs: {
> log-replica(1)@192.168.122.171:5050 }
> I0605 07:12:51.925721  2252 replica.cpp:673] Replica in EMPTY status
> received a broadcasted recover request from (6)@192.168.122.225:5050
> I0605 07:12:51.927891  2246 zookeeper.cpp:259] A new leading master
> (UPID=master@192.168.122.171:5050) is detected
> I0605 07:12:51.928444  2246 master.cpp:1951] The newly elected leader is
> master@192.168.122.171:5050 with id cdc459d4-a05f-4f99-9bf4-1ee9a91d139b
>
> In 192.168.122.132 (last master I started):
> I0605 07:12:53.553949 16426 detector.cpp:152] Detected a new leader:
> (id='25')
> I0605 07:12:53.555179 16429 group.cpp:698] Trying to get
> '/mesos/json.info_25' in ZooKeeper
> I0605 07:12:53.560045 16428 zookeeper.cpp:259] A new leading master
> (UPID=master@192.168.122.171:5050) is detected
>
> So right after I started these 3 masters, the first one (192.168.122.171)
> was successfully elected as leader, but after 60s, 192.168.122.171 failed
> with the error mentioned in my first mail, and then 192.168.122.225 was
> elected as leader, but it failed with the same error too after another 60s,
> and the same thing happened to the last one (192.168.122.132). So after
> about 180s, all my 3 master were down.
>
> I tried both:
> sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos --quorum=2
> --work_dir=/var/lib/mesos/master
> and
> sudo ./bin/mesos-master.sh
> --zk=zk://192.168.122.132:2181,192.168.122.171:2181,192.168.122.225:2181/mesos
> --quorum=2 --work_dir=/var/lib/mesos/master
> And I see the same error for both.
>
> 192.168.122.132, 192.168.122.225 and 192.168.122.171 are 3 VMs which are
> running on a KVM hypervisor host.
>
>
>
>
> Thanks,
> Qian Zhang
>
> On Sun, Jun 5, 2016 at 3:47 AM, Dick Davies <d...@hellooperator.net> wrote:
>>
>> You told the master it needed a quorum of 2 and it's the only one
>> online, so it's bombing out.
>> That's the expected behaviour.
>>
>> You 

Re: Mesos HA does not work (Failed to recover registrar)

2016-06-04 Thread Dick Davies
You told the master it needed a quorum of 2 and it's the only one
online, so it's bombing out.
That's the expected behaviour.

You need to start at least 2 zookeepers before it will be a functional
group, same for the masters.

You haven't mentioned how you set up your zookeeper cluster, so I'm
assuming that's working
correctly (3 nodes, all aware of the other 2 in their config). If not,
you need to sort that out first.


Also I think your zk URL is wrong - you want to list all 3 zookeeper
nodes like this:

sudo ./bin/mesos-master.sh
--zk=zk://host1:2181,host2:2181,host3:2181/mesos --quorum=2
--work_dir=/var/lib/mesos/master

when you've run that command on 2 hosts things should start working,
you'll want all 3 up for
redundancy.

On 4 June 2016 at 16:42, Qian Zhang  wrote:
> Hi Folks,
>
> I am trying to set up a Mesos HA env with 3 nodes, each of nodes has a
> Zookeeper running, so they form a Zookeeper cluster. And then when I started
> the first Mesos master in one node with:
> sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos --quorum=2
> --work_dir=/var/lib/mesos/master
>
> I found it will hang here for 60 seconds:
>   I0604 23:39:56.488219 15330 zookeeper.cpp:259] A new leading master
> (UPID=master@192.168.122.132:5050) is detected
>   I0604 23:39:56.489080 15337 master.cpp:1951] The newly elected leader is
> master@192.168.122.132:5050 with id 40d387a6-4d61-49d6-af44-51dd41457390
>   I0604 23:39:56.489791 15337 master.cpp:1964] Elected as the leading
> master!
>   I0604 23:39:56.490401 15337 master.cpp:1651] Recovering from registrar
>   I0604 23:39:56.491706 15330 registrar.cpp:332] Recovering registrar
>   I0604 23:39:56.496448 15332 log.cpp:524] Attempting to start the writer
>
> And after 60s, master will fail:
> F0604 23:40:56.499596 15337 master.cpp:1640] Recovery failed: Failed to
> recover registrar: Failed to perform fetch within 1mins
> *** Check failure stack trace: ***
> @ 0x7f4b81372f4e  google::LogMessage::Fail()
> @ 0x7f4b81372e9a  google::LogMessage::SendToLog()
> @ 0x7f4b8137289c  google::LogMessage::Flush()
> @ 0x7f4b813757b0  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f4b8040eea0  mesos::internal::master::fail()
> @ 0x7f4b804dbeb3
> _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi16__callIvJS1_EJLm0ELm1T_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
> @ 0x7f4b804ba453
> _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1clIJS1_EvEET0_DpOT_
> @ 0x7f4b804898d7
> _ZZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvRKSsS6_EPKcSt12_PlaceholderILi1vEERKS2_OT_NS2_6PreferEENUlS6_E_clES6_
> @ 0x7f4b804dbf80
> _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvS1_S1_EPKcSt12_PlaceholderILi1vEERKS6_OT_NS6_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
> @   0x49d257  std::function<>::operator()()
> @   0x49837f
> _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
> @   0x493024  process::Future<>::fail()
> @ 0x7f4b8015ad20  process::Promise<>::fail()
> @ 0x7f4b804d9295  process::internal::thenf<>()
> @ 0x7f4b8051788f
> _ZNSt5_BindIFPFvRKSt8functionIFN7process6FutureI7NothingEERKN5mesos8internal8RegistryEEERKSt10shared_ptrINS1_7PromiseIS3_EEERKNS2_IS7_EEESB_SH_St12_PlaceholderILi16__callIvISM_EILm0ELm1ELm2T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
> @ 0x7f4b8050fa3b  std::_Bind<>::operator()<>()
> @ 0x7f4b804f94e3  std::_Function_handler<>::_M_invoke()
> @ 0x7f4b8050fc69  std::function<>::operator()()
> @ 0x7f4b804f9609
> _ZZNK7process6FutureIN5mesos8internal8RegistryEE5onAnyIRSt8functionIFvRKS4_EEvEES8_OT_NS4_6PreferEENUlS8_E_clES8_
> @ 0x7f4b80517936
> _ZNSt17_Function_handlerIFvRKN7process6FutureIN5mesos8internal8RegistryZNKS5_5onAnyIRSt8functionIS8_EvEES7_OT_NS5_6PreferEEUlS7_E_E9_M_invokeERKSt9_Any_dataS7_
> @ 0x7f4b8050fc69  std::function<>::operator()()
> @ 0x7f4b8056b1b4  process::internal::run<>()
> @ 0x7f4b80561672  process::Future<>::fail()
> @ 0x7f4b8059bf5f  std::_Mem_fn<>::operator()<>()
> @ 0x7f4b8059757f
> _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi16__callIbIS8_EILm0ELm1T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
> @ 0x7f4b8058fad1
> _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1clIJS8_EbEET0_DpOT_
> @ 0x7f4b80585a41
> _ZZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS4_FbRKSsEES4_St12_PlaceholderILi1bEERKS4_OT_NS4_6PreferEENUlS9_E_clES9_
> @ 0x7f4b80597605
> 

Re: How to add other file systems to an agent

2016-05-03 Thread Dick Davies
I'd imagine it's reporting whatever partition the --work_dir argument
on the slave is set to (sandboxes live under that directory).
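
If the space you care about lives on another filesystem, the usual
workarounds are to point --work_dir at it or to declare the disk size
yourself with --resources - a sketch, figures and paths made up, and
I don't think the stock agent will spread resources across several
filesystems for you:

mesos-slave --master=zk://zk1.example.com:2181/mesos \
            --work_dir=/data/mesos \
            --resources='disk:500000'   # MB; overrides what the agent detects from the work_dir partition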

On 3 May 2016 at 12:21, Rinaldo Digiorgio  wrote:
> Hi,
>
> I have a configuration with a root file system and other file 
> systems. When I start an agent, the agent only reports the disk on the root 
> file system.  Is there a way to specify a list of file systems to include as 
> resources of the agent when it starts? I checked the agent options.
>
>
> Rinaldo


Re: Setting ulimits on mesos-slave

2016-04-25 Thread Dick Davies
Hi June

are you running Mesos as root, or a non-privileged user? A non-root
user won't be able to raise its own ulimits very far
(sorry, not an upstart expert as RHEL's is laughably incomplete).
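
Either way it's worth checking what the running slave actually ended
up with, rather than what upstart thinks it set - something like:

# effective limits of the running mesos-slave process
cat /proc/$(pgrep -o -f mesos-slave)/limits

If 'Max file size' in there isn't what your upstart stanza says, the
limit is being clamped or reset somewhere else along the way.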

On 25 April 2016 at 19:15, June Taylor  wrote:
> What I'm saying is even putting them within the upstart script, per the
> Mesos documentation, isn't working for the file block limit. We're still
> getting 8MB useable, and as a result executors fail when attempting to write
> larger files.
>
>
> Thanks,
> June Taylor
> System Administrator, Minnesota Population Center
> University of Minnesota
>
> On Mon, Apr 25, 2016 at 11:53 AM, haosdent  wrote:
>>
>> If you set in your upstart script, it isn't system wide and only effective
>> in that session. I think need change /etc/security/limits.conf and
>> /etc/sysctl.conf to make your ulimit work globally.
>>
>> On Tue, Apr 26, 2016 at 12:43 AM, June Taylor  wrote:
>>>
>>> Somewhere an 8MB maximum file size is being applied on just one of our
>>> slaves, for example.
>>>
>>>
>>> Thanks,
>>> June Taylor
>>> System Administrator, Minnesota Population Center
>>> University of Minnesota
>>>
>>> On Mon, Apr 25, 2016 at 11:42 AM, June Taylor  wrote:

 We are operating a 6-node cluster running on Ubuntu, and have noticed
 that the ulimit settings within the slave context are difficult to set and
 predict.

 The documentation is a bit unclear on this point, as well.

 We have had some luck adding a configuration line to
 /etc/init/mesos-slave.conf as follows:
 limit nofile 2 2
 limit fsize unlimited unlimited

 The nofile limit seems to be respected, however the fsize limit does
 not.

 It is also mysterious that the system-wide limits are not inherited by
 the slave process. We would prefer to set all of these system-wide and have
 mesos-slave observe them.

 Can you please advise where you are setting your ulimits for the
 mesos-slave if it is working for you?

 Thanks,
 June Taylor
 System Administrator, Minnesota Population Center
 University of Minnesota
>>>
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Haosdent Huang
>
>


Re: removed slace "ID": (131.154.96.172): health check timed out

2016-04-18 Thread Dick Davies
On our network a lot of the hosts have multiple interfaces, which let
some asymmetric routing
issues creep in that prevented our masters from replying to slaves, which
reminded me of your symptoms.

So we set an IP address in /etc/mesos-slave/ip and
/etc/mesos-master/ip so that they only listen
on one interface, and then check connectivity between those IPs.

The Ansible repo we use to build the stack now has a 'signoff'
playbook to check network connectivity
is correct between the services it deploys to a new environment.

It won't be much use to you on its own I'm afraid, but
here's a checklist cribbed from that playbook (ports might be
different in your setup).

You can SSH to the servers and check reachability between them with
netcat or telnet.


zookeepers:

- need to be able to reach each other on the election port (usually tcp/3888)

masters:

- must be able to reach zookeepers on tcp/2181
- must be able to reach each other on tcp/5050
- must be able to reach slaves on tcp/5051

mesos slaves:

- must be able to reach masters on tcp/5050
- must be able to reach zookeepers on tcp/2181
- any other connectivity to services your application needs
(databases, caches, whatever)

I think that's it.
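
The actual checks are just one-liners, e.g. (hostnames made up):

# from a slave: can I reach a master and a zookeeper?
nc -zv master1.example.com 5050
nc -zv zk1.example.com 2181
# from a master: can I reach the slave back?
nc -zv slave1.example.com 5051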

On 18 April 2016 at 20:39, Stefano Bianchi <jazzist...@gmail.com> wrote:
> Hi Dick Davies
>
> Could you please share your solution?
> How did you set up mesos/Zookeeper to interconnect masters and slaves among
> networks?
>
> Thanks a lot!
>
> 2016-04-18 20:56 GMT+02:00 Dick Davies <d...@hellooperator.net>:
>>
>> +1 for that theory, we had some screwy issues when we tried to span
>> subnets until we set every slave and master
>> to listen on a specific IP so we could tie down routing correctly.
>>
>> Saw very similar symptoms that have been described.
>>
>> On 18 April 2016 at 18:35, Alex Rukletsov <a...@mesosphere.com> wrote:
>> > I believe it's because slaves are able to connect to the master, but the
>> > master is not able to connect to the slaves. That's why you see them
>> > connected for some time and gone afterwards.
>> >
>> > On Mon, Apr 18, 2016 at 6:47 PM, Stefano Bianchi <jazzist...@gmail.com>
>> > wrote:
>> >>
>> >> Indeed, i dont know why, i am not able to reach all the machines from a
>> >> network to the other, just some machines can interconnect with some
>> >> others
>> >> among the networks.
>> >> On mesos i see that all the slaves at a certain time are all connected,
>> >> then disconnected and after a while connected again, it seems like they
>> >> are
>> >> able to connect for a while.
>> >> However is an openstack issue i guess.
>> >>
>> >> Does this also happen when master3 is leading? My guess is that you're
>> >> not
>> >> allowong incoming connections from master1 and master2 to slave3.
>> >> Generally,
>> >> masters should be able to connect to slaves, not just respond to their
>> >> requests.
>> >>
>> >> On 18 Apr 2016 13:17, "Stefano Bianchi" <jazzist...@gmail.com> wrote:
>> >>>
>> >>> Hi
>> >>> On openstack i plugged two virtual networks to the same virtual router
>> >>> so
>> >>> that the hosts on the 2 networks can communicate each other.
>> >>> this is my topology:
>> >>>
>> >>> ---internet---
>> >>> |
>> >>>Router1
>> >>> |
>> >>> 
>> >>> | |
>> >>> Net1            Net2
>> >>> Master1 Master2 Master3
>> >>> Slave1 slave2  Slave3
>> >>>
>> >>> I have set zookeeper in with this line:
>> >>>
>> >>> zk://Master1_IP:2181,Master2_IP:2181,Master3_IP:2181/mesos
>> >>>
>> >>> The 3 masters, even though on 2 separated networks, elect the leader
>> >>> correclty.
>> >>> Now i have started the slaves, and in a first time i see all 3
>> >>> correctly
>> >>> registered, but after a while the slave 3, independently form who is
>> >>> the
>> >>> master, disconnects.
>> >>> I saw in the log and i get the message in the object.
>> >>> Can you help me to solve this problem?
>> >>>
>> >>>
>> >>> Thanks to all.
>> >
>> >
>
>


Re: removed slace "ID": (131.154.96.172): health check timed out

2016-04-18 Thread Dick Davies
+1 for that theory, we had some screwy issues when we tried to span
subnets until we set every slave and master
to listen on a specific IP so we could tie down routing correctly.

Saw very similar symptoms that have been described.

On 18 April 2016 at 18:35, Alex Rukletsov  wrote:
> I believe it's because slaves are able to connect to the master, but the
> master is not able to connect to the slaves. That's why you see them
> connected for some time and gone afterwards.
>
> On Mon, Apr 18, 2016 at 6:47 PM, Stefano Bianchi 
> wrote:
>>
>> Indeed, i dont know why, i am not able to reach all the machines from a
>> network to the other, just some machines can interconnect with some others
>> among the networks.
>> On mesos i see that all the slaves at a certain time are all connected,
>> then disconnected and after a while connected again, it seems like they are
>> able to connect for a while.
>> However is an openstack issue i guess.
>>
>> Does this also happen when master3 is leading? My guess is that you're not
>> allowong incoming connections from master1 and master2 to slave3. Generally,
>> masters should be able to connect to slaves, not just respond to their
>> requests.
>>
>> On 18 Apr 2016 13:17, "Stefano Bianchi"  wrote:
>>>
>>> Hi
>>> On openstack i plugged two virtual networks to the same virtual router so
>>> that the hosts on the 2 networks can communicate each other.
>>> this is my topology:
>>>
>>> ---internet---
>>> |
>>>Router1
>>> |
>>> 
>>> | |
>>> Net1            Net2
>>> Master1 Master2 Master3
>>> Slave1 slave2  Slave3
>>>
>>> I have set zookeeper in with this line:
>>>
>>> zk://Master1_IP:2181,Master2_IP:2181,Master3_IP:2181/mesos
>>>
>>> The 3 masters, even though on 2 separated networks, elect the leader
>>> correclty.
>>> Now i have started the slaves, and in a first time i see all 3 correctly
>>> registered, but after a while the slave 3, independently form who is the
>>> master, disconnects.
>>> I saw in the log and i get the message in the object.
>>> Can you help me to solve this problem?
>>>
>>>
>>> Thanks to all.
>
>


Re: libmesos on alpine linux?

2016-04-17 Thread Dick Davies
Thanks - I'll give that a whirl. MESOS-4507 sounds like Mesos is
starting to use
Alpine in their test suites, so hopefully the glibc/musl
incompatibilities will start to get
ironed out.

My (very basic) Spark testing has hit issues with big images
(Spark loads images on demand, but anything too large triggers
timeouts, so Alpine's
sizes are pretty appealing). In my experience, Docker's caching isn't
as effective as it's
made out to be, so I'm all for a smaller image.

Spark is the first framework I've used that needs a libmesos in the
container image;
I'm still not clear why.
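
For the record, the direction I'm poking at looks roughly like this -
very much a sketch, assuming a libmesos.so built against glibc
somewhere else, and that Spark picks it up via MESOS_NATIVE_JAVA_LIBRARY
as its Mesos docs describe (you'd still need a JRE and the Spark
distribution on top):

FROM frolvlad/alpine-glibc
# libmesos.so built against glibc outside this image; the target path is my own choice
COPY libmesos.so /usr/local/lib/libmesos.so
ENV MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so
# ...add a JRE and the Spark distribution here...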

On 17 April 2016 at 03:17, Sargun Dhillon <sar...@sargun.me> wrote:
> A word of warning about musl. Alpine ships with musl as its default
> libc implementation. Its DNS resolver tends to act very differently
> than glibc. This can prove problematic in certain types of
> applications where you may be interacting with slow DNS resolvers, or
> relying on glibc's behaviour.
>
> Fortunately, Alpine actually supports glibc, and there are examples of
> using it in a Docker container:
> https://hub.docker.com/r/frolvlad/alpine-glibc/~/dockerfile/ -- the
> image clocks in at about 12MB.
>
> On Sat, Apr 16, 2016 at 6:52 PM, Shuai Lin <linshuai2...@gmail.com> wrote:
>> Take a look at
>> http://stackoverflow.com/questions/35614923/errors-compiling-mesos-on-alpine-linux
>> , this guy has successfully patched an older version of the mesos to build
>> on alpine linux.
>>
>> On Sun, Apr 17, 2016 at 3:19 AM, Dick Davies <d...@hellooperator.net> wrote:
>>>
>>> Has anyone been able to build libmesos (0.28.x ideally) on Alpine Linux
>>> yet?
>>>
>>> I'm trying to get a smaller spark docker image and though that was
>>> straightforward, the docs say I need libmesos in the image to be able
>>> to use it (which I find a bit surprising, but it seems to be correct).
>>
>>


libmesos on alpine linux?

2016-04-16 Thread Dick Davies
Has anyone been able to build libmesos (0.28.x ideally) on Alpine Linux yet?

I'm trying to get a smaller spark docker image and though that was
straightforward, the docs say I need libmesos in the image to be able
to use it (which I find a bit surprising, but it seems to be correct).


Re: Prometheus Exporters on Marathon

2016-04-15 Thread Dick Davies
You are probably building on an older version of Go - the Timeout
field was added to http.Client in Go 1.3, so anything older won't know about it.

On 15 April 2016 at 13:56, June Taylor  wrote:
> David,
>
> Thanks for the assistance. How did you get the mesos-exporter installed?
> When I tried the instructions from github.com/mesosphere/mesos-exporter, I
> got this error:
>
> june@-cluster:~$ go get github.com/mesosphere/mesos-exporter
> # github.com/mesosphere/mesos-exporter
> gosrc/src/github.com/mesosphere/mesos-exporter/common.go:46: unknown
> http.Client field 'Timeout' in struct literal
> gosrc/src/github.com/mesosphere/mesos-exporter/master_state.go:73: unknown
> http.Client field 'Timeout' in struct literal
> gosrc/src/github.com/mesosphere/mesos-exporter/slave_monitor.go:56: unknown
> http.Client field 'Timeout' in struct literal
>
>
> Thanks,
> June Taylor
> System Administrator, Minnesota Population Center
> University of Minnesota
>
> On Fri, Apr 15, 2016 at 4:29 AM, David Keijser 
> wrote:
>>
>> Sure. there is not a lot to it though.
>>
>> So we have simple service file like this
>>
>> /usr/lib/systemd/system/mesos_exporter.service
>> ```
>> [Unit]
>> Description=Prometheus mesos exporter
>>
>> [Service]
>> EnvironmentFile=-/etc/sysconfig/mesos_exporter
>> ExecStart=/usr/bin/mesos_exporter $OPTIONS
>> Restart=on-failure
>> ```
>>
>> and the sysconfig is just a simple
>>
>> /etc/sysconfig/mesos_exporter
>> ```
>> OPTIONS=-master=http://10.4.72.253:5050
>> ```
>>
>> - or -
>>
>> /etc/sysconfig/mesos_exporter
>> ```
>> OPTIONS=-slave=http://10.4.72.177:5051
>> ```
>>
>> On Thu, Apr 14, 2016 at 12:22:56PM -0500, June Taylor wrote:
>> > David,
>> >
>> > Thanks for the reply. Would you be able to share your configs for
>> > starting
>> > up the exporters?
>> >
>> >
>> > Thanks,
>> > June Taylor
>> > System Administrator, Minnesota Population Center
>> > University of Minnesota
>> >
>> > On Thu, Apr 14, 2016 at 11:27 AM, David Keijser
>> > 
>> > wrote:
>> >
>> > > We run the mesos exporter [1] and the node_exporter on each host
>> > > directly
>> > > managed by systemd. For other application specific exporters we have
>> > > so far
>> > > been baking them into the docker image of the application which is
>> > > being
>> > > run by marathon.
>> > >
>> > > 1) https://github.com/mesosphere/mesos_exporter
>> > >
>> > > On Thu, 14 Apr 2016 at 18:20 June Taylor  wrote:
>> > >
>> > >> Is anyone else running Prometheus exporters on their cluster? I am
>> > >> stuck
>> > >> because I can't get a working "go build" environment right now.
>> > >>
>> > >> Is anyone else running this directly on their nodes and masters? Or,
>> > >> via
>> > >> Marathon?
>> > >>
>> > >> If so, please share your setup specifics.
>> > >>
>> > >> Thanks,
>> > >> June Taylor
>> > >> System Administrator, Minnesota Population Center
>> > >> University of Minnesota
>> > >>
>> > >
>
>


Re: Mesos Task History

2016-04-14 Thread Dick Davies
We just grab them with collectd's mesos plugin and log to Graphite,
which gives us long-term trend details.

https://github.com/rayrod2030/collectd-mesos

Haven't used this one but it supposedly does per-task metric collection:

https://github.com/bobrik/collectd-mesos-tasks

On 14 April 2016 at 13:37, June Taylor  wrote:
> Adam,
>
> Is there a way to keep this history?
>
>
> Thanks,
> June Taylor
> System Administrator, Minnesota Population Center
> University of Minnesota
>
> On Wed, Apr 13, 2016 at 4:32 PM, Adam Bordelon  wrote:
>>
>> Yes, these counters are only kept in-memory, so any time a Mesos master
>> starts, its counters are initialized to 0.
>>
>> On Wed, Apr 13, 2016 at 9:38 AM, June Taylor  wrote:
>>>
>>> We have a single master at the moment. Does the task history get cleared
>>> when the mesos-master restarts?
>>>
>>>
>>> Thanks,
>>> June Taylor
>>> System Administrator, Minnesota Population Center
>>> University of Minnesota
>>>
>>> On Wed, Apr 13, 2016 at 11:33 AM, Pradeep Chhetri
>>>  wrote:

 Yes, they get cleaned up whenever the mesos master leader failover
 happens.

 On Wed, Apr 13, 2016 at 3:32 PM, June Taylor  wrote:
>
> I am noticing that recently our Completed Tasks and Terminated
> Frameworks lists are empty. Where are these stored, and do they get
> automatically cleared out at some interval?
>
> Thanks,
> June Taylor
> System Administrator, Minnesota Population Center
> University of Minnesota




 --
 Regards,
 Pradeep Chhetri
>>>
>>>
>>
>


Re: [Proposal] Remove the default value for agent work_dir

2016-04-13 Thread Dick Davies
Oh please yes!

On 13 April 2016 at 08:00, Sam  wrote:
> +1
>
> Sent from my iPhone
>
> On Apr 13, 2016, at 12:44 PM, Avinash Sridharan 
> wrote:
>
> +1
>
> On Tue, Apr 12, 2016 at 9:31 PM, Jie Yu  wrote:
>>
>> +1
>>
>> On Tue, Apr 12, 2016 at 9:29 PM, James Peach  wrote:
>>
>> >
>> > > On Apr 12, 2016, at 3:58 PM, Greg Mann  wrote:
>> > >
>> > > Hey folks!
>> > > A number of situations have arisen in which the default value of the
>> > Mesos agent `--work_dir` flag (/tmp/mesos) has caused problems on
>> > systems
>> > in which the automatic cleanup of '/tmp' deletes agent metadata. To
>> > resolve
>> > this, we would like to eliminate the default value of the agent
>> > `--work_dir` flag. You can find the relevant JIRA here.
>> > >
>> > > We considered simply changing the default value to a more appropriate
>> > location, but decided against this because the expected filesystem
>> > structure varies from platform to platform, and because it isn't
>> > guaranteed
>> > that the Mesos agent would have access to the default path on a
>> > particular
>> > platform.
>> > >
>> > > Eliminating the default `--work_dir` value means that the agent would
>> > exit immediately if the flag is not provided, whereas currently it
>> > launches
>> > successfully in this case. This will break existing infrastructure which
>> > relies on launching the Mesos agent without specifying the work
>> > directory.
>> > I believe this is an acceptable change because '/tmp/mesos' is not a
>> > suitable location for the agent work directory except for short-term
>> > local
>> > testing, and any production scenario that is currently using this
>> > location
>> > should be altered immediately.
>> >
>> > +1 from me too. Defaulting to /tmp just helps people shoot themselves in
>> > the foot.
>> >
>> > J
>
>
>
>
> --
> Avinash Sridharan, Mesosphere
> +1 (323) 702 5245


Re: Slaves not getting registered

2016-04-13 Thread Dick Davies
erminated
>
>
>
> root@master1:/var/log/mesos# tail -f mesos-master.WARNING
>
> Log file created at: 2016/04/12 11:01:49
>
> Running on machine: master1
>
> Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
>
> W0412 11:01:49.024226  3712 authenticator.cpp:511] No credentials provided,
> authentication requests will be refused
>
>
>
> root@master1:/var/log/mesos# tail -f
> mesos-master.master1.invalid-user.log.INFO.20160412-11014
>
> tail: cannot open
> ‘mesos-master.master1.invalid-user.log.INFO.20160412-11014’ for reading: No
> such file or directory
>
> root@master1:/var/log/mesos# tail -f
> mesos-master.master1.invalid-user.log.INFO.20160412-11014
>
> mesos-master.master1.invalid-user.log.INFO.20160412-110143.3651
> mesos-master.master1.invalid-user.log.INFO.20160412-110148.3712
>
> root@master1:/var/log/mesos# tail -f
> mesos-master.master1.invalid-user.log.INFO.20160412-110143.3651
>
> I0412 11:01:46.424433  3676 replica.cpp:673] Replica in EMPTY status
> received a broadcasted recover request from (5)@30.30.30.53:5050
>
> I0412 11:01:47.068586  3675 replica.cpp:673] Replica in EMPTY status
> received a broadcasted recover request from (8)@30.30.30.53:5050
>
> I0412 11:01:47.592926  3677 replica.cpp:673] Replica in EMPTY status
> received a broadcasted recover request from (11)@30.30.30.53:5050
>
> I0412 11:01:48.188248  3680 replica.cpp:673] Replica in EMPTY status
> received a broadcasted recover request from (14)@30.30.30.53:5050
>
> I0412 11:01:48.887104  3678 group.cpp:460] Lost connection to ZooKeeper,
> attempting to reconnect ...
>
> I0412 11:01:48.887177  3674 group.cpp:460] Lost connection to ZooKeeper,
> attempting to reconnect ...
>
> I0412 11:01:48.887229  3677 group.cpp:460] Lost connection to ZooKeeper,
> attempting to reconnect ...
>
> I0412 11:01:48.919545  3675 group.cpp:519] ZooKeeper session expired
>
> I0412 11:01:48.919848  3680 detector.cpp:154] Detected a new leader: None
>
> I0412 11:01:48.919922  3680 master.cpp:1710] The newly elected leader is
> None
>
>
>
>
>
> root@slave1:/var/log/mesos# tail -f
> mesos-slave.slave1.invalid-user.log.INFO.20160412-110554.1696
>
> I0413 03:12:54.532676  1711 group.cpp:519] ZooKeeper session expired
>
> I0413 03:12:58.757953  1715 slave.cpp:4304] Current disk usage 6.44%. Max
> allowed age: 5.848917453828577days
>
> W0413 03:13:04.539577  1715 group.cpp:503] Timed out waiting to connect to
> ZooKeeper. Forcing ZooKeeper session (sessionId=0) expiration
>
> I0413 03:13:04.539798  1715 group.cpp:519] ZooKeeper session expired
>
> W0413 03:13:14.542245  1713 group.cpp:503] Timed out waiting to connect to
> ZooKeeper. Forcing ZooKeeper session (sessionId=0) expiration
>
> I0413 03:13:14.542434  1713 group.cpp:519] ZooKeeper session expired
>
>
>
> root@slave1:/var/log/mesos# tail -f mesos-slave.WARNING
>
> W0413 03:12:24.512336  1715 group.cpp:503] Timed out waiting to connect to
> ZooKeeper. Forcing ZooKeeper session (sessionId=0) expiration
>
> W0413 03:12:34.519641  1710 group.cpp:503] Timed out waiting to connect to
> ZooKeeper. Forcing ZooKeeper session (sessionId=0) expiration
>
> W0413 03:12:44.521181  1713 group.cpp:503] Timed out waiting to connect to
> ZooKeeper. Forcing ZooKeeper session (sessionId=0) expiration
>
> W0413 03:12:54.532501  1711 group.cpp:503] Timed out waiting to connect to
> ZooKeeper. Forcing ZooKeeper session (sessionId=0) expiration
>
>
>
> Thank you.
>
>
>
>
>
> From: June Taylor [mailto:j...@umn.edu]
> Sent: 12 April 2016 18:06
> To: user@mesos.apache.org
> Subject: Re: Slaves not getting registered
>
>
>
> Try looking in /var/log/mesos/ at these files: mesos-slave.WARNING,
> mesos-slave.INFO, mesos-slave.ERROR
>
>
>
>
> Thanks,
>
> June Taylor
>
> System Administrator, Minnesota Population Center
>
> University of Minnesota
>
>
>
> On Tue, Apr 12, 2016 at 4:36 AM, Dick Davies <d...@hellooperator.net> wrote:
>
> There's no mention of a slave there, have a look at the logs on the
> slaves filesystem and see if it is giving any errors.
>
>
> On 12 April 2016 at 10:17,  <aishwarya.adyanth...@accenture.com> wrote:
>> The GUI log shows like this:
>>
>>
>>
>> I0412 08:45:51.379609  3616 master.cpp:3673] Processing DECLINE call for
>> offers: [ 74f33592-fc48-4066-a59c-977818b4c13c-O282 ] for framework
>> 74f33592-fc48-4066-a59c-977818b4c13c-0001 (chronos-2.4.0) at
>> scheduler-15022696-44ec-43d2-b193-a3cc4021d20e@30.30.30.48:42208
>>
>> I0412 08:45:54.637461  3612 http.cpp:501] HTTP GET for /master/state.json
>> from 10.211.203.

Re: Slaves not getting registered

2016-04-12 Thread Dick Davies
There's no mention of a slave there, have a look at the logs on the
slaves filesystem and see if it is giving any errors.

On 12 April 2016 at 10:17,   wrote:
> The GUI log shows like this:
>
>
>
> I0412 08:45:51.379609  3616 master.cpp:3673] Processing DECLINE call for
> offers: [ 74f33592-fc48-4066-a59c-977818b4c13c-O282 ] for framework
> 74f33592-fc48-4066-a59c-977818b4c13c-0001 (chronos-2.4.0) at
> scheduler-15022696-44ec-43d2-b193-a3cc4021d20e@30.30.30.48:42208
>
> I0412 08:45:54.637461  3612 http.cpp:501] HTTP GET for /master/state.json
> from 10.211.203.147:59463 with User-Agent='Mozilla/5.0 (Windows NT 6.0;
> WOW64; rv:43.0) Gecko/20100101 Firefox/43.0'
>
> I0412 08:45:57.376288  3619 master.cpp:5350] Sending 1 offers to framework
> 74f33592-fc48-4066-a59c-977818b4c13c-0001 (chronos-2.4.0) at
> scheduler-15022696-44ec-43d2-b193-a3cc4021d20e@30.30.30.48:42208
>
> I0412 08:45:57.385325  3613 master.cpp:3673] Processing DECLINE call for
> offers: [ 74f33592-fc48-4066-a59c-977818b4c13c-O283 ] for framework
> 74f33592-fc48-4066-a59c-977818b4c13c-0001 (chronos-2.4.0) at
> scheduler-15022696-44ec-43d2-b193-a3cc4021d20e@30.30.30.48:42208
>
> I0412 08:46:03.383728  3614 master.cpp:5350] Sending 1 offers to framework
> 74f33592-fc48-4066-a59c-977818b4c13c-0001 (chronos-2.4.0) at
> scheduler-15022696-44ec-43d2-b193-a3cc4021d20e@30.30.30.48:42208
>
> I0412 08:46:03.396531  3612 master.cpp:3673] Processing DECLINE call for
> offers: [ 74f33592-fc48-4066-a59c-977818b4c13c-O284 ] for framework
> 74f33592-fc48-4066-a59c-977818b4c13c-0001 (chronos-2.4.0) at
> scheduler-15022696-44ec-43d2-b193-a3cc4021d20e@30.30.30.48:42208
>
> I0412 08:46:04.665582  3612 http.cpp:501] HTTP GET for /master/state.json
> from 10.211.203.147:59464 with User-Agent='Mozilla/5.0 (Windows NT 6.0;
> WOW64; rv:43.0) Gecko/20100101 Firefox/43.0'
>
> I0412 08:46:09.389493  3616 master.cpp:5350] Sending 1 offers to framework
> 74f33592-fc48-4066-a59c-977818b4c13c-0001 (chronos-2.4.0) at
> scheduler-15022696-44ec-43d2-b193-a3cc4021d20e@30.30.30.48:42208
>
>
>
>
>
> Is there a way to find out the number of masters that are present in the
> environment together through CLI/GUI?
>
>
>
>
>
>
>
> From: haosdent [mailto:haosd...@gmail.com]
> Sent: 12 April 2016 13:37
> To: user 
> Subject: Re: Slaves not getting registered
>
>
>
>>but am unable to get it registered.
>
> Hi, @aishwarya Could you post master and slave log to provide more details?
> Usually it is because of network problem.
>
>
>
> On Tue, Apr 12, 2016 at 4:02 PM,  wrote:
>
> Hi,
>
>
>
> I’m unable to get the slave registered with the master node. I’ve configured
> both the masters and slave machines but am unable to get it registered.
>
>
>
> Thank you.
>
>
>
> 
>
>
> This message is for the designated recipient only and may contain
> privileged, proprietary, or otherwise confidential information. If you have
> received it in error, please notify the sender immediately and delete the
> original. Any other use of the e-mail by you is prohibited. Where allowed by
> local law, electronic communications with Accenture and its affiliates,
> including e-mail and instant messaging (including content), may be scanned
> by our systems for the purposes of information security and assessment of
> internal compliance with Accenture policy.
> __
>
> www.accenture.com
>
>
>
>
>
> --
>
> Best Regards,
>
> Haosdent Huang


Re: How to kill tasks when memory exceeds the cgroup limit?

2016-03-19 Thread Dick Davies
On 18 March 2016 at 20:58, Benjamin Mahler <bmah...@apache.org> wrote:
> Interesting, why does it take down the slaves?

This was a good while back, but when swap gets low our slaves kernel
OOM killer tended to mess things up.

> Because a lot of organizations run with swap disabled (e.g. for more
> deterministic performance), we originally did not set the swap limit at all.
> When we introduced the '--cgroups_limit_swap' flag we had to make it default
> to false initially in case any users were depending on the original behavior
> of no swap limit. Now that it's been available for some time, we can
> consider moving the default to true. This is actually reflected in the TODO
> alongside the flag:
>
> https://github.com/apache/mesos/blob/0.28.0/src/slave/flags.cpp#L331-L336
>
> Want to send a patch? We'd need to communicate this change to the default
> behavior in the CHANGELOG and specify how users can keep the original
> behaviour.

I'll see if I can get time - just about to finish a consulting gig and
was going to take a break,
so it might be an option.

Thanks for the explanation, I *knew* there'd be a reason :)


> Also, there's more we would need to do in the long term for use cases that
> desire swapping. The only support today is (1) no memory limits (2) memory
> limit and no swap limit (3) both memory and swap limits. You can imagine
> scenarios where users may want to control how much they're allowed to swap,
> or maybe we want to swap for non-latency sensitive containers. However, it's
> more complicated (the user and operator have to co-operate more, there are
> more ways to run things, etc), and so the general advice is to disable swap
> to keep things simple and deterministic.
>
> On Fri, Mar 18, 2016 at 11:34 AM, Dick Davies <d...@hellooperator.net>
> wrote:
>>
>> Great!
>>
>> I'm not really sure why mesos even allows RSS limiting without VMEM,
>> it takes down slaves like the Black Death
>> if you accidentally deploy a 'leaker'. I'm sure there's a use case I'm
>> not seeing :)
>>
>> On 18 March 2016 at 16:27, Shiyao Ma <i...@introo.me> wrote:
>> > Thanks. The limit_swap works.
>
>


Re: How to kill tasks when memory exceeds the cgroup limit?

2016-03-19 Thread Dick Davies
Great!

I'm not really sure why mesos even allows RSS limiting without VMEM,
it takes down slaves like the Black Death
if you accidentally deploy a 'leaker'. I'm sure there's a use case I'm
not seeing :)

On 18 March 2016 at 16:27, Shiyao Ma  wrote:
> Thanks. The limit_swap works.


Re: How to kill tasks when memory exceeds the cgroup limit?

2016-03-19 Thread Dick Davies
Last time I tried (not on the latest release) I also had to have
cgroups set to limit swap, otherwise
as soon as the process hit the RAM limit it would just start to consume swap.

try adding --cgroups_limit_swap to the slaves startup flags.
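
For reference, a minimal agent command line with both isolators and the swap
limit enabled might look like this (the ZK addresses and work dir are
placeholders, not taken from the original post):

    mesos-slave --master=zk://zk1:2181,zk2:2181,zk3:2181/mesos \
                --isolation=cgroups/cpu,cgroups/mem \
                --cgroups_limit_swap \
                --work_dir=/var/lib/mesos

With the swap limit in place the container should hit its own
memory.memsw.limit_in_bytes and be OOM-killed inside its cgroup rather than
eating into host swap.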

On 17 March 2016 at 16:21, Shiyao Ma  wrote:
> Hi,
>
>
> For the slave side:
> export MESOS_RESOURCES='cpus:4;mem:180'
> export MESOS_ISOLATION='cgroups/cpu,cgroups/mem'
>
> For the framework,
> It accepts the offer from the slave and sends tasks with memory spec less
> than offered.
>
>
> However, the task actually *deliberately* asks for an arbitrary large memory
> during runtime.
>
> My assumption is that the slave will kill the task.  However, it doesn't.
>
> So here goes my question. How does slave handle the 'runtime memory
> exceeding cgroup limit' behavior? Will any handlers be invoked?
>
>
>
> Regards.


Re: rkt / appc support

2016-03-16 Thread Dick Davies
Thanks ! will keep a close eye on that ticket.

On 16 March 2016 at 09:47, Guangya Liu <gyliu...@gmail.com> wrote:
> Hi Dick,
>
> This is new functionality, you can refer to
> https://issues.apache.org/jira/browse/MESOS-2840 for more detail, there are
> also some design document link append in the JIRA ticket.
>
> Thanks,
>
> Guangya
>
> On Wed, Mar 16, 2016 at 5:24 PM, Dick Davies <d...@hellooperator.net> wrote:
>>
>> Quick question - what versions of Mesos (if any) support rkt/appc?
>>
>> Saw the announcement of the Unified Containerizer
>>
>> ( http://mesos.apache.org/documentation/container-image/ )
>>
>> but I wasn't clear if this was a refactoring of existing support, or
>> new functionality.
>
>


rkt / appc support

2016-03-16 Thread Dick Davies
Quick question - what versions of Mesos (if any) support rkt/appc?

Saw the announcement of the Unified Containerizer

( http://mesos.apache.org/documentation/container-image/ )

but I wasn't clear if this was a refactoring of existing support, or
new functionality.


Re: AW: Feature request: move in-flight containers w/o stopping them

2016-02-19 Thread Dick Davies
Agreed, vMotion always struck me as something for those monolithic
apps with a lot of local state.

The industry seems to be moving away from that as fast as its little
legs will carry it.

On 19 February 2016 at 11:35, Jason Giedymin  wrote:
> Food for thought:
>
> One should refrain from monolithic apps. If they're small and stateless you
> should be doing rolling upgrades.
>
> If you find yourself with one container and you can't easily distribute that
> work load by just scaling and load balancing then you have a monolith. Time
> to enhance it.
>
> Containers should not be treated like VMs.
>
> -Jason
>
> On Feb 19, 2016, at 6:05 AM, Mike Michel  wrote:
>
> Question is if you really need this when you are moving in the world of
> containers/microservices where it is about building stateless 12factor apps
> except databases. Why moving a service when you can just kill it and let the
> work be done by 10 other containers doing the same? I remember a talk on
> dockercon about containers and live migration. It was like: „And now where
> you know how to do it, don’t do it!“
>
>
>
> Von: Avinash Sridharan [mailto:avin...@mesosphere.io]
> Gesendet: Freitag, 19. Februar 2016 05:48
> An: user@mesos.apache.org
> Betreff: Re: Feature request: move in-flight containers w/o stopping them
>
>
>
> One problem with implementing something like vMotion for Mesos is to address
> seamless movement of network connectivity as well. This effectively requires
> moving the IP address of the container across hosts. If the container shares
> host network stack, this won't be possible since this would imply moving the
> host IP address from one host to another. When a container has its network
> namespace, attached to the host, using a bridge, moving across L2 segments
> might be a possibility. To move across L3 segments you will need some form
> of overlay (VxLAN maybe ?) .
>
>
>
> On Thu, Feb 18, 2016 at 7:34 PM, Jay Taylor  wrote:
>
> Is this theoretically feasible with Linux checkpoint and restore, perhaps
> via CRIU? http://criu.org/Main_Page
>
>
> On Feb 18, 2016, at 4:35 AM, Paul Bell  wrote:
>
> Hello All,
>
>
>
> Has there ever been any consideration of the ability to move in-flight
> containers from one Mesos host node to another?
>
>
>
> I see this as analogous to VMware's "vMotion" facility wherein VMs can be
> moved from one ESXi host to another.
>
>
>
> I suppose something like this could be useful from a load-balancing
> perspective.
>
>
>
> Just curious if it's ever been considered and if so - and rejected - why
> rejected?
>
>
>
> Thanks.
>
>
>
> -Paul
>
>
>
>
>
>
>
>
>
> --
>
> Avinash Sridharan, Mesosphere
>
> +1 (323) 702 5245


Re: make slaves not getting tasks anymore

2015-12-30 Thread Dick Davies
It sounds like you want to use checkpointing, that should keep the
tasks alive as you update
the mesos slave process itself.
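
As a rough sketch (flag names as I remember them from the 0.2x docs - check
`mesos-slave --help` and your framework's options for the exact spelling on
your versions): the framework has to register with checkpointing enabled, and
the agent is restarted in recovery mode after the upgrade:

    # scheduler side, e.g. Marathon
    marathon --checkpoint ...

    # on the agent being upgraded, after installing the new package
    mesos-slave --recover=reconnect --strict ...

While the agent process is down the executors keep running against the
checkpointed state and reconnect when it comes back.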

On 30 December 2015 at 11:43, Mike Michel  wrote:
> Hi,
>
>
>
> i need to update slaves from time to time and looking for a way to take them
> out of the cluster but without killing the running tasks. I need to wait
> until all tasks are done and during this time no new tasks should be started
> on this slave. My first idea was to set a constraint „status:online“ for
> every task i start and then change the attribute of the slave to „offline“,
> restart slave process while executer still runs the tasks but it seems if
> you change the attributes of a slave it can not connect to the cluster
> without rm -rf /tmp before which will kill all tasks.
>
>
>
> Also the maintenance mode seems not to be an option:
>
>
>
> „When maintenance is triggered by the operator, all agents on the machine
> are told to shutdown. These agents are subsequently removed from the master
> which causes tasks to be updated as TASK_LOST. Any agents from machines in
> maintenance are also prevented from registering with the master.“
>
>
>
> Is there another way?
>
>
>
>
>
> Cheers
>
>
>
> Mike


Re: Mesos masters and zookeeper running together?

2015-12-24 Thread Dick Davies
zookeeper really wants a dedicated cluster IMO; preferably with SSD
under it - if zookeeper
starts to run slow then everything else will start to bog down. I've
co-hosted it with mesos masters
before now for demo purposes etc. but for production it's probably
worth choosing dedicated hosts.
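
For what it's worth, a dedicated three-node ensemble is only a handful of
lines of zoo.cfg per host (hostnames and paths below are placeholders):

    tickTime=2000
    initLimit=10
    syncLimit=5
    dataDir=/var/lib/zookeeper
    clientPort=2181
    server.1=zk1.example.com:2888:3888
    server.2=zk2.example.com:2888:3888
    server.3=zk3.example.com:2888:3888

plus a matching myid file (1, 2 or 3) in dataDir on each host - and putting
dataDir on SSD is what keeps the write latency predictable.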

On 24 December 2015 at 20:36, Rodrick Brown  wrote:
> With our design we end up building out a stand alone zookeeper cluster 3
> nodes.  Zookeeper seems to be the default dumping ground for many Apache
> based products these days. You will eventually see many services and
> frameworks require a zk instance for leader election, coordination, Kv store
> etc.. I've seen situations where the masters can become extremely busy and
> cause performance problem with Zk which can be huge issue for mesos.
>
> Sent from Outlook Mobile
>
>
>
>
> On Thu, Dec 24, 2015 at 8:01 AM -0800, "Ron Lipke"  wrote:
>
>> Hello, I've been working on setting up a mesos cluster for eventual
>> production use and I have a question on configuring zookeeper alongside
>> the mesos masters.
>> Is it best practice to run zookeeper/exhibitor as a separate cluster (in
>> our case, three nodes) or on the same machines as the mesos masters?  I
>> understand the drawbacks of increased cost for compute resources that
>> will just be running a single service and most of the reference docs
>> have them running together, but just wondering if it's beneficial to
>> have them uncoupled.
>>
>> Thanks in advance for any input.
>>
>> Ron Lipke
>> @neverminding
>
>
> NOTICE TO RECIPIENTS: This communication is confidential and intended for
> the use of the addressee only. If you are not an intended recipient of this
> communication, please delete it immediately and notify the sender by return
> email. Unauthorized reading, dissemination, distribution or copying of this
> communication is prohibited. This communication does not constitute an offer
> to sell or a solicitation of an indication of interest to purchase any loan,
> security or any other financial product or instrument, nor is it an offer to
> sell or a solicitation of an indication of interest to purchase any products
> or services to any persons who are prohibited from receiving such
> information under applicable law. The contents of this communication may not
> be accurate or complete and are subject to change without notice. As such,
> Orchard App, Inc. (including its subsidiaries and affiliates, "Orchard")
> makes no representation regarding the accuracy or completeness of the
> information contained herein. The intended recipient is advised to consult
> its own professional advisors, including those specializing in legal, tax
> and accounting matters. Orchard does not provide legal, tax or accounting
> advice.


Re: what's the best way to monitor mesos cluster

2015-11-11 Thread Dick Davies
+1 for the collectd plugin. been using that for about 9 months
and it does the job nicely.

On 11 November 2015 at 06:59, Du, Fan  wrote:

> Hi Mesos experts
>
> There is server and client snapshot metrics in jason format provided by
> Mesos itself.
> but more often we want to extend the metrics a bit more than that.
>
> I have been looking for this for a couple of days, while
> https://collectd.org/ comes
> to my sight, it also has a mesos plugin
> https://github.com/rayrod2030/collectd-mesos.
>
> Is there any recommended such open source project to do this task?
> Thanks.
>


Re: Cluster Maintenance

2015-10-29 Thread Dick Davies
You might want to look at the maintenance primitives feature in 0.25.0:

https://mesos.apache.org/blog/mesos-0-25-0-released/
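
Very roughly, scheduling a window is a POST of a JSON schedule to the leading
master (endpoint and schema as per the 0.25.0 maintenance docs; the hostname,
IP and timestamps below are made-up examples):

    curl -s -X POST http://leader:5050/master/maintenance/schedule \
         -H 'Content-Type: application/json' -d '{
      "windows": [{
        "machine_ids": [{"hostname": "agent1.example.com", "ip": "10.0.0.11"}],
        "unavailability": {
          "start":    {"nanoseconds": 1446000000000000000},
          "duration": {"nanoseconds": 3600000000000000}
        }
      }]
    }'

The master then sends inverse offers for that agent, and /master/machine/down
takes it out of service once it has drained - though, as noted below,
frameworks have to understand inverse offers for the drain to be graceful.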


On 29 October 2015 at 18:19, John Omernik  wrote:

> I am wondering if there are some easy ways to take a healthy slave/agent
> and start a process to bleed processes out.
>
> Basically, without having to do something where every framework would
> support it, I'd like the option to
>
> 1. Stop offering resources to new frameworks. I.e. no new resources would
> be offered, but existing jobs/tasks continue to run.
> 2.  Offer the ability, especially in the UI, but potentially in API as
> well to "kill" a task.  This would cause a failure that force the framework
> to respond. For example, if it was a docker container running in marathon,
> if I said "please kill this task" it would, marathon would recognize the
> failure and try to restart the container. Since our agent (in point 1) is
> not offering resources, then that task would not fall on the agent in
> question.
>
>
> The reason for this manual bleeding is to say run updates on a node or
> pull it out of service for other reasons (memory upgrades etc) and do so in
> a manual way.  You may want to address what's running on the node manually,
> thus a whole scale "kill everything" while it SHOULD be doable, may not
> always be feasible. In addition, the inverse offers thing seems neat, but
> frameworks have to support it.
>
> So, is there any thing like that now and I am just missing it in the
> documentation?  I am curious to hear how others are handling this situation
> in their environments.
>
> John
>
>
>
>


Re: How production un-ready are Mesos Cassandra, Spark and Kafka Frameworks?

2015-10-12 Thread Dick Davies
Hi Chris



Spark is a Mesos native, I'd have no hesitation running it on Mesos.

Cassandra not so much -
that's not to disparage the work people are putting in there, I think
it's really interesting. But personally with complex beasts like Cassandra
I want to be running as 'stock' as possible, as it makes it easier to learn
from other peoples experiences.

On 12 October 2015 at 17:47, Chris Elsmore 
wrote:

> Hi all,
>
> Have just got back from a brilliant MesosCon Europe in Dublin, I learnt a
> huge amount and a big thank-you for putting on a great conference to all
> involved!
>
>
> I am looking to deploy a small (maybe 5 max) Cassandra & Spark cluster to
> do some data analysis at my current employer, and am a little unsure of the
> current status of the frameworks this would need to run on Mesos- both the
> mesosphere docs (which I’m guessing use the frameworks of the same name
> hosted on Github) and the Github ReadMes mention that these are not
> production ready, and the rough timeline of Q1 2016.
>
> I’m just wondering how production un-ready these are!? I am looking at
> using Mesos to deploy future stateless services in the next 6 months or so,
> and so I like the idea of adding to that system and the look of the
> configuration that is handled for you to bind nodes together in these
> frameworks. However it feels like for a smallish cluster of production
> ready machines it might be better to deploy them standalone and stay
> observant on the status of such things in the near future, and the
> configuration wins are not that large especially for a small cluster.
>
>
> Any experience and advice on the above would be greatly received!
>
>
> Chris
>
>
>
>


Re: Java detector for mesos masters and leader

2015-07-07 Thread Dick Davies
The active master has a flag set in  /metrics/snapshot  :
master/elected which is 1 for the active
master and 0 otherwise, so it's easy enough to only load the metrics
from the active master.

(I use the collectd plugin and push data rather than poll, but the
same principle should apply).
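
A quick way to see which master is active (hostnames are placeholders) is to
hit each one's /metrics/snapshot and look at that flag:

    for m in master1 master2 master3; do
      echo -n "$m: "
      curl -s http://$m:5050/metrics/snapshot | grep -o '"master/elected":[0-9.]*'
    done

Only the host reporting master/elected as 1 is worth scraping for the
cluster-wide figures; the standbys report 0.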

On 7 July 2015 at 14:02, Donald Laidlaw donlaid...@me.com wrote:
 Has anyone ever developed Java code to detect the mesos masters and leader, 
 given a zookeeper connection?

 The reason I ask is because I would like to monitor mesos to report various 
 metrics reported by the master. This requires detecting and tracking the 
 leading master to query its /metrics/snapshot REST endpoint.

 Thanks,
 -Don


Re: Thoughts and opinions in physically building a cluster

2015-06-25 Thread Dick Davies
That doesn't sound too bad (it's a fairly typical setup e.g. on an Amazon VPC).
You probably want to avoid NAT or similar things between master and
slaves to avoid
a lot of LIBPROCESS_IP tricks so same switch sounds good.

Personally I quite like the master/slave distinction.

I wouldn't want a runaway set of  tasks to bog down the masters and
operationally we'd alert
if we're starting to lose masters whereas the slaves are 'cattle' and
we can just spin up more as
they die if need be (it's a little more tricky to scale out masters
and zookeepers so they get treated
as though they were a bit less expendable).

I co-locate the zookeeper ensemble on the masters on smaller clusters
to save VM count,
but that's more personal taste than anything.

On 25 June 2015 at 17:12, Daniel Gaston daniel.gas...@dal.ca wrote:
 So this may be another relatively noob question, but when designing a mesos 
 cluster, is it basically as simple as the nodes connected by a switch? Since 
 any of the nodes can be master nodes or acting as both master and slave, I 
 am guessing there is no need for another head node as you would have with a 
 traditional cluster design. But would each of the nodes then have to be 
 connected to the external/institutional network?

 My rough idea was for this small cluster to not be connected to the main 
 institutional network but for my workstation to be connected to both the 
 cluster's network as well as to the institutional network


 
 From: CCAAT cc...@tampabay.rr.com
 Sent: June-19-15 4:57 PM
 To: user@mesos.apache.org
 Cc: cc...@tampabay.rr.com
 Subject: Re: Thoughts and opinions in physically building a cluster

 On 06/19/2015 01:28 PM, Daniel Gaston wrote:

 On 19/06/2015 18:38, Oliver Nicholas wrote:
 Unless you have some true HA requirements, it seems intuitively
 wasteful to have 3 masters and 2 slaves (unless the cost of 5 nodes is
 inconsequential to you and you hate the environment).
 Any particular reason not to have three nodes which are acting both as
 master and slaves?

 None at all. I'm not a cluster or networking guru, and have only played with 
 mesos in
 cloud-based settings so I wasn't sure how this would work. But it makes 
 sense, that way
 the 'standby' masters are still participating in the zookeeper quorum while 
 still being
 available to do real work as slave nodes.

 Daniel. There is no such thing as a 'cluster guru'. It's all 'seat of
 the pants' flying right now; so you are fine what you are doing and
 propose. If codes do not exist to meet your specific needs and goals,
 they can  (should?) be created.


 I'm working on an architectural expansion where nodes (virtual, actual
 or bare metal) migrate from master -- entrepreneur -- worker -- slave
 -- embedded (bare metal or specially attached hardware). I'm proposing
 to do all of this with the Autonomy_Function and decisions being made
 bottom_up as opposed to the current top_down dichotomy. I'm prolly going
 to have to 'fork codes' for a while to get things stable and then
 hope they are included; when other minds see the validity of the ideas.


 Surely one box can be set up as both master and slave. Moving slaves
 to masters, should be an automatic function and will prolly will be
 address in the future codes of mesos.


 PS: Keep pushing your ideas and do not take no for an answer!
 Mesos belongs to everybody.

 hth,
 James



Re: How to upgrade mesos version from a running mesos cluster

2015-06-19 Thread Dick Davies
Do the masters first, as described at the link.

On 19 June 2015 at 10:17, tommy xiao xia...@gmail.com wrote:
 Thanks Alex Rukletsov. In my earlier try, the newer mesos slave ( version
 0.21.1) can't connect to mesos master (version 0.20.0), So it annoies to me.
 anyway, i will test again, let me clarify the concern.

 2015-06-19 17:06 GMT+08:00 Alex Rukletsov a...@mesosphere.com:

 Tommy, you should be able to upgrade Mesos without stopping the cluster.
 If you cannot, please describe the issue you face and we'll try to figure
 out why.

 On Fri, Jun 19, 2015 at 7:58 AM, tommy xiao xia...@gmail.com wrote:

 Hi Jeff,

 Thanks for your remind. my concerns is how about upgrade without stop the
 mesos cluster. I found the mesos cluster can't support rolling upgrade
 feature.

 2015-06-19 12:20 GMT+08:00 Jeff Schroeder jeffschroe...@computer.org:

 Hello Tommy, have you read the documentation? If not, please take a look
 and then follow up with any specific questions here:

 http://mesos.apache.org/documentation/latest/upgrades/


 On Thursday, June 18, 2015, tommy xiao xia...@gmail.com wrote:

 Hi,

 I have a question on upgrade strategy:
 How about upgrade mesos cluster seamlessly from a production cluster.
 Do we have some best practice on it?

 --
 Deshi Xiao
 Twitter: xds2000
 E-mail: xiaods(AT)gmail.com



 --
 Text by Jeff, typos by iPhone




 --
 Deshi Xiao
 Twitter: xds2000
 E-mail: xiaods(AT)gmail.com





 --
 Deshi Xiao
 Twitter: xds2000
 E-mail: xiaods(AT)gmail.com


Re: cluster confusion after zookeeper blip

2015-05-18 Thread Dick Davies
Thanks Nikolay - I checked the frameworkid in zookeeper
(/marathon/state/frameworkId) matched the
one attached to the running tasks, gave the old marathon leader a
restart and everything reconnected ok

(we did have to disable our service discovery pieces to avoid getting
empty JSON back when marathon
first booted, but other than that everything is peachy).
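
(For anyone hitting the same thing: the check itself is just a read from the
ensemble with the stock ZooKeeper CLI - the path below is Marathon's default
zk node and may differ if you changed --zk:

    zkCli.sh -server zk1:2181 get /marathon/state/frameworkId

the value should match the framework id shown against the running tasks in
the Mesos slave UI.)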


On 18 May 2015 at 15:31, Nikolay Borodachev nbo...@adobe.com wrote:
 Have you tried to restart Marathon and Mesos processes? Once you restart them 
 they should pick zookeepers, elect leaders, etc.
 If you're using Docker containers, they should reattach themselves to the 
 respective slaves.

 Thanks
 Nikolay

 -Original Message-
 From: rasput...@gmail.com [mailto:rasput...@gmail.com] On Behalf Of Dick 
 Davies
 Sent: Monday, May 18, 2015 5:26 AM
 To: user@mesos.apache.org
 Subject: cluster confusion after zookeeper blip

 We run a 3 node marathon cluster on top of 3 mesos masters + 6 slaves.
 (mesos 0.21.0, marathon 0.7.5)

 This morning we had a network outage long enough for everything to lose 
 zookeeper.
 Now our marathon UI is empty (all 3 marathons think someone else is a master, 
 and marathons 'proxy to leader' feature means the REST API is toast).

 Odd thing is, at the mesos level, the
 mesos master UI shows no tasks running (logs mention orphaned tasks), but if 
 i click into the 'slaves' tab and dig down, the slave view details tasks that 
 are in fact active.

 Any way to bring order to this without needing to kill those tasks? we have 
 no actual outage from a user point of view, but the cluster itself is pretty 
 confused and our service discovery relies on the marathon API which is timing 
 out.

 Although mesos has checkpointing enabled, marathon isn't running with 
 checkpointing on (it's the default now but doesn't apply to existing 
 frameworks apparently, and we started this around marathon 0.4.x)

 Would enabling checkpointing help with this kind of issue? If so, how do i 
 enable it for an existing framework?


cluster confusion after zookeeper blip

2015-05-18 Thread Dick Davies
We run a 3 node marathon cluster on top of 3 mesos masters + 6 slaves.
(mesos 0.21.0, marathon 0.7.5)

This morning we had a network outage long enough for everything to
lose zookeeper.
Now our marathon UI is empty (all 3 marathons think someone else is a
master, and
marathons 'proxy to leader' feature means the REST API is toast).

Odd thing is, at the mesos level, the
mesos master UI shows no tasks running (logs mention orphaned tasks),
but if i click into the 'slaves' tab and dig down, the slave view details tasks
that are in fact active.

Any way to bring order to this without needing to kill those tasks? we
have no actual outage from a user point of view, but the cluster
itself is pretty confused and our service discovery relies on the
marathon API which is timing out.

Although mesos has checkpointing enabled, marathon isn't running with
checkpointing on (it's the default now but doesn't apply to existing
frameworks apparently, and we started this around marathon 0.4.x)

Would enabling checkpointing help with this kind of issue? If so, how
do i enable it for an existing framework?


group memory limits are always 'soft' . how do I ensure info->pid.isNone() ?

2015-04-28 Thread Dick Davies
Been banging my head against this  for a while now.

mesos 0.21.0 , marathon 0.7.5, centos 6 servers.

When I enable cgroups (flags are : --cgroups_limit_swap
--isolation=cgroups/cpu,cgroups/mem ) the memory limits I'm setting
are reflected in memory.soft_limit_in_bytes but not in

memory.limit_in_bytes or memory.memsw.limit_in_bytes.


Upshot is our runaway task eats all RAM and swap on the server
until the OOM steps in and starts firing into the crowd.

This line of code seems to never lower a hard limit:

https://github.com/apache/mesos/blob/master/src/slave/containerizer/isolators/cgroups/mem.cpp#L382

which means both of those tests must be true, right?

the current limit is insanely high (8192 PB if i'm reading it right) - how would
I make info->pid.isNone() be true ?

Have tried restarting the slave, scaling the marathon apps to 0 tasks
then back. Bit stumped.
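
For anyone wanting to poke at the same thing, the per-container limits end up
under the mesos memory cgroup on the slave (the mount point and layout below
are what I'd expect on a stock setup - adjust to taste):

    cdir=$(ls -d /sys/fs/cgroup/memory/mesos/*/ | head -1)  # one dir per container
    cat "$cdir"memory.soft_limit_in_bytes
    cat "$cdir"memory.limit_in_bytes
    cat "$cdir"memory.memsw.limit_in_bytes
    cat "$cdir"cgroup.procs                                 # pids in the container

In the failure described above only the soft limit reflects the task's memory
allocation; the other two sit at the kernel's 'unlimited' default.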


Re: group memory limits are always 'soft' . how do I ensure info->pid.isNone() ?

2015-04-28 Thread Dick Davies
Thanks Ian.

Digging around the cgroup there are 3 processes in there;

* the mesos-executor
* the shell script marathon starts the app with
* the actual command to run the task ( a perl app in this case)

The line of code you mention is never run in our case, because it's
wrapped in the conditional
I'm talking about!

All I see is cpu.shares being set and then mem.soft_limit_in_bytes.


On 28 April 2015 at 17:47, Ian Downes idow...@twitter.com wrote:
 The line of code you cite is so the hard limit is not decreased on a running
 container because we can't (easily) reclaim anonymous memory from running
 processes. See the comment above the code.

 The info->pid.isNone() is for when cgroup is being configured (see the
 update() call at the end of MemIsolatorProcess::prepare()), i.e., before any
 processes are added to the cgroup.

 The limit > currentLimit.get() ensures the limit is only increased.

 The memory limit defaults to the maximum for the data type, I guess that's
 the ridiculous 8 EB. It should be set to what the initial memory allocation
 was for the container so this is not expected. Can you look in the slave
 logs for when the container was created for the log line on:
 https://github.com/apache/mesos/blob/master/src/slave/containerizer/isolators/cgroups/mem.cpp#L393

 Ian

 On Tue, Apr 28, 2015 at 7:42 AM, Dick Davies d...@hellooperator.net wrote:

 Been banging my head against this  for a while now.

 mesos 0.21.0 , marathon 0.7.5, centos 6 servers.

 When I enable cgroups (flags are : --cgroups_limit_swap
 --isolation=cgroups/cpu,cgroups/mem ) the memory limits I'm setting
 are reflected in memory.soft_limit_in_bytes but not in

 memory.limit_in_bytes or memory.memsw.limit_in_bytes.


 Upshot is our runaway task eats all RAM and swap on the server
 until the OOM steps in and starts firing into the crowd.

 This line of code seems to never lower a hard limit:


 https://github.com/apache/mesos/blob/master/src/slave/containerizer/isolators/cgroups/mem.cpp#L382

 which means both of those tests must be true, right?

 the current limit is insanely high (8192 PB if i'm reading it right) - how
 would
 I make info->pid.isNone() be true ?

 Have tried restarting the slave, scaling the marathon apps to 0 tasks
 then back. Bit stumped.




Re: group memory limits are always 'soft' . how do I ensure info->pid.isNone() ?

2015-04-28 Thread Dick Davies
That's what led me into reading the code - neither mem.limit_in_bytes
or mem.memsw.limit_in_bytes
are ever set down from the (insanely high) defaults. I know that
second conditional is false, so the first
must be too, right?

It's likely I'm reading the wrong branch; we're running the 0.21.0
release - but I don't see any commits
that would change this ordering.

Just to confirm - we are using the default containerizer (not docker
or anything else) - that shouldn't make
any difference though, should it?

I'm offsite til morning now (UK time), but I'll post the full slave
logs when I can get to them.

On 28 April 2015 at 18:18, Ian Downes idow...@twitter.com wrote:
 The control flow in the Mesos containerizer to launch a container is:

 1. Call prepare() on each isolator
 2. Then fork the executor
 3. Then isolate(executor_pid) on each isolator

 The last part of (1) will also call Isolator::update() to set the initial
 memory limits (see line 288). This is done *before* the executor is in the
 cgroup, i.e., info->pid.isNone() will be true and that block of code should
 *always* be executed when a container starts. The LOG(INFO) line at 393
 should be present in your logs. Can you verify this? It should be shortly
 after the LOG(INFO) on line 358.

 Ian


 On Tue, Apr 28, 2015 at 9:54 AM, Dick Davies d...@hellooperator.net wrote:

 Thanks Ian.

 Digging around the cgroup there are 3 processes in there;

 * the mesos-executor
 * the shell script marathon starts the app with
 * the actual command to run the task ( a perl app in this case)

 The line of code you mention is never run in our case, because it's
 wrapped in the conditional
 I'm talking about!

 All I see is cpu.shares being set and then mem.soft_limit_in_bytes.


 On 28 April 2015 at 17:47, Ian Downes idow...@twitter.com wrote:
  The line of code you cite is so the hard limit is not decreased on a
  running
  container because we can't (easily) reclaim anonymous memory from
  running
  processes. See the comment above the code.
 
  The info->pid.isNone() is for when cgroup is being configured (see the
  update() call at the end of MemIsolatorProcess::prepare()), i.e., before
  any
  processes are added to the cgroup.
 
  The limit > currentLimit.get() ensures the limit is only increased.
 
  The memory limit defaults to the maximum for the data type, I guess
  that's
  the ridiculous 8 EB. It should be set to what the initial memory
  allocation
  was for the container so this is not expected. Can you look in the slave
  logs for when the container was created for the log line on:
 
  https://github.com/apache/mesos/blob/master/src/slave/containerizer/isolators/cgroups/mem.cpp#L393
 
  Ian
 
  On Tue, Apr 28, 2015 at 7:42 AM, Dick Davies d...@hellooperator.net
  wrote:
 
  Been banging my head against this  for a while now.
 
  mesos 0.21.0 , marathon 0.7.5, centos 6 servers.
 
  When I enable cgroups (flags are : --cgroups_limit_swap
  --isolation=cgroups/cpu,cgroups/mem ) the memory limits I'm setting
  are reflected in memory.soft_limit_in_bytes but not in
 
  memory.limit_in_bytes or memory.memsw.limit_in_bytes.
 
 
  Upshot is our runaway task eats all RAM and swap on the server
  until the OOM steps in and starts firing into the crowd.
 
  This line of code seems to never lower a hard limit:
 
 
 
  https://github.com/apache/mesos/blob/master/src/slave/containerizer/isolators/cgroups/mem.cpp#L382
 
  which means both of those tests must be true, right?
 
  the current limit is insanely high (8192 PB if i'm reading it right) -
  how
  would
  I make info->pid.isNone() be true ?
 
  Have tried restarting the slave, scaling the marathon apps to 0 tasks
  then back. Bit stumped.
 
 




Re: group memory limits are always 'soft' . how do I ensure info->pid.isNone() ?

2015-04-28 Thread Dick Davies
You may very well be right, but I'd like to keep this specific thread
focussed on figuring
out why the expected/implemented behaviour isn't happening in my case
if that's ok.

On 28 April 2015 at 19:26, CCAAT cc...@tampabay.rr.com wrote:

 I really hate to be the 'old fashion computer scientist' in this group,
 but, I think that the role of and usage of 'cgroups' is going to have to
 be expanded greatly as a solution to the dynamic memory management needs of
 both the cluster(s) and the frameworks. This problem is not going away and I
 see no other serious solution to cgroup use expansion.


Re: [RESULT][VOTE] Release Apache Mesos 0.22.0 (rc4)

2015-03-25 Thread Dick Davies
Thanks Craig, that's really handy!

Dumb question for the list: are there any plans to support multiple
isolation flags somehow?
I need cgroups, but would really like the disk quota feature too (and
network isolation come to that.
And a pony).

On 25 March 2015 at 01:00, craig w codecr...@gmail.com wrote:
 Congrats, I was working on a quick post summarizing what's new (based on
 jira and the video from niklas) which I just posted (great timing)

 http://craigwickesser.com/2015/03/mesos-022-release/

 On Tue, Mar 24, 2015 at 8:30 PM, Paul Otto p...@ottoops.com wrote:

 This is awesome! Thanks for all the hard work you all have put into this!
 I am really excited to update to the latest stable version of Apache Mesos!

 Regards,
 Paul


 Paul Otto
 Principal DevOps Architect, Co-founder
 Otto Ops LLC | OttoOps.com
 970.343.4561 office
 720.381.2383 cell

 On Tue, Mar 24, 2015 at 6:04 PM, Niklas Nielsen nik...@mesosphere.io
 wrote:

 Hi all,

 The vote for Mesos 0.22.0 (rc4) has passed with the
 following votes.

 +1 (Binding)
 --
 Ben Mahler
 Tim St Clair
 Adam Bordelon
 Brenden Matthews

 +1 (Non-binding)
 --
 Alex Rukletsov
 Craig W
 Ben Whitehead
 Elizabeth Lingg
 Dario Rexin
 Jeff Schroeder
 Michael Park
 Alexander Rojas
 Andrew Langhorn

 There were no 0 or -1 votes.

 Please find the release at:
 https://dist.apache.org/repos/dist/release/mesos/0.22.0

 It is recommended to use a mirror to download the release:
 http://www.apache.org/dyn/closer.cgi

 The CHANGELOG for the release is available at:

 https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.22.0

 The mesos-0.22.0.jar has been released to:
 https://repository.apache.org

 The website (http://mesos.apache.org) will be updated shortly to reflect
 this release.

 Thanks,
 Niklas





 --

 https://github.com/mindscratch
 https://www.google.com/+CraigWickesser
 https://twitter.com/mind_scratch
 https://twitter.com/craig_links


Re: [RESULT][VOTE] Release Apache Mesos 0.22.0 (rc4)

2015-03-25 Thread Dick Davies
Ah ok - config page at
http://mesos.apache.org/documentation/latest/configuration/ gave
me the impression this was an either/or.

I'm happy now, thanks a lot!

On 25 March 2015 at 08:47, Tim Chen t...@mesosphere.io wrote:
 Hi there,

 You can already pass in multiple values separated by comma
 (cgroups/cpu,cgroups/mem,posix/disk)

 Tim

 On Wed, Mar 25, 2015 at 12:46 AM, Dick Davies d...@hellooperator.net
 wrote:

 Thanks Craig, that's really handy!

 Dumb question for the list: are there any plans to support multiple
 isolation flags somehow?
 I need cgroups, but would really like the disk quota feature too (and
 network isolation come to that.
 And a pony).

 On 25 March 2015 at 01:00, craig w codecr...@gmail.com wrote:
  Congrats, I was working on a quick post summarizing what's new (based on
  jira and the video from niklas) which I just posted (great timing)
 
  http://craigwickesser.com/2015/03/mesos-022-release/
 
  On Tue, Mar 24, 2015 at 8:30 PM, Paul Otto p...@ottoops.com wrote:
 
  This is awesome! Thanks for all the hard work you all have put into
  this!
  I am really excited to update to the latest stable version of Apache
  Mesos!
 
  Regards,
  Paul
 
 
  Paul Otto
  Principal DevOps Architect, Co-founder
  Otto Ops LLC | OttoOps.com
  970.343.4561 office
  720.381.2383 cell
 
  On Tue, Mar 24, 2015 at 6:04 PM, Niklas Nielsen nik...@mesosphere.io
  wrote:
 
  Hi all,
 
  The vote for Mesos 0.22.0 (rc4) has passed with the
  following votes.
 
  +1 (Binding)
  --
  Ben Mahler
  Tim St Clair
  Adam Bordelon
  Brenden Matthews
 
  +1 (Non-binding)
  --
  Alex Rukletsov
  Craig W
  Ben Whitehead
  Elizabeth Lingg
  Dario Rexin
  Jeff Schroeder
  Michael Park
  Alexander Rojas
  Andrew Langhorn
 
  There were no 0 or -1 votes.
 
  Please find the release at:
  https://dist.apache.org/repos/dist/release/mesos/0.22.0
 
  It is recommended to use a mirror to download the release:
  http://www.apache.org/dyn/closer.cgi
 
  The CHANGELOG for the release is available at:
 
 
  https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.22.0
 
  The mesos-0.22.0.jar has been released to:
  https://repository.apache.org
 
  The website (http://mesos.apache.org) will be updated shortly to
  reflect
  this release.
 
  Thanks,
  Niklas
 
 
 
 
 
  --
 
  https://github.com/mindscratch
  https://www.google.com/+CraigWickesser
  https://twitter.com/mind_scratch
  https://twitter.com/craig_links




Re: mesos-collectd-plugin

2015-03-11 Thread Dick Davies
Hi Dan

I can see a couple of things that could be wrong
(NB: not a collectd expert, but these are differences I see from
my working config).

1. Is /opt/collectd/etc/collectd.conf your main collectd config file?

otherwise, it's not being read at all by collectd.

2. I configure the plugin in that file i.e. the

<Module mesos-master>

block should be in  /opt/collectd/etc/collectd.conf , not tucked down
in the python module path
directory.

3. Are you sure your master listens on localhost? Mine doesn't, I
needed to set that Host line
to match the IP I set that master to listen on ( e.g. in /etc/mesos-master/ip ).

Pretty sure one of those will do the trick
(NB: you'll only get metrics from the elected master; the 'standby'
masters still get polled
but collectd will ignore any data from them unless they're the primary)
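
To be concrete, a working layout keeps only the .py files in the plugin
directory and puts one block like this in the main collectd.conf (everything
below is a sketch - adjust the path, IP and version to your install):

    <LoadPlugin python>
      Globals true
    </LoadPlugin>

    <Plugin python>
      ModulePath "/opt/collectd/lib/collectd/plugins/python/"
      Import "mesos-master"
      <Module mesos-master>
        Host "192.168.0.10"
        Port 5050
        Verbose false
        Version "0.21.0"
      </Module>
    </Plugin>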

On 11 March 2015 at 19:52, Dan Dong dongda...@gmail.com wrote:
 Hi, Dick,
   I put the plugin under:
 $ ls -l /opt/collectd/lib/collectd/plugins/python/
 total 504
 -rw-r--r-- 1 root root345 Mar 10 19:40 mesos-master.conf
 -rw-r--r-- 1 root root  1 Mar 10 15:06 mesos-master.py
 -rw-r--r-- 1 root root322 Mar 10 19:44 mesos-slave.conf
 -rw-r--r-- 1 root root   6808 Mar 10 15:06 mesos-slave.py
 -rw-r--r-- 1 root root 288892 Mar 10 19:35 python.a
 -rwxr-xr-x 1 root root969 Mar 10 19:35 python.la
 -rwxr-xr-x 1 root root 188262 Mar 10 19:35 python.so

 And in /opt/collectd/etc/collectd.conf, I set:

 <LoadPlugin python>
 Globals true
 </LoadPlugin>
 .

 <Plugin python>
 ModulePath /opt/collectd/lib/collectd/plugins/python/
 LogTraces true
 </Plugin>

 $ cat /opt/collectd/lib/collectd/plugins/python/mesos-master.conf
 <LoadPlugin python>
 Globals true
 </LoadPlugin>

 <Plugin python>
 ModulePath /opt/collectd/lib/collectd/plugins/python/
 Import mesos-master
 <Module mesos-master>
 Host localhost
 Port 5050
 Verbose false
 Version 0.21.0
 </Module>
 </Plugin>

 Anything wrong with the above settings?

 Cheers,
 Dan



 2015-03-10 17:21 GMT-05:00 Dick Davies d...@hellooperator.net:

 Hi Dan

 The .py files (the plugin) live in the collectd python path,
 it sounds like maybe you're not loading the plugin .conf file into
 your collectd config?

 The output will depend on what your collectd is set to write to, I use
 it with write_graphite.

 On 10 March 2015 at 20:41, Dan Dong dongda...@gmail.com wrote:
  Hi, All,
Does anybody use this mesos-collectd-plugin:
  https://github.com/rayrod2030/collectd-mesos
 
  I have installed collectd and this plugin, then configured it as
  instructions and restarted the collectd daemon, why seems nothing
  happens on
  the mesos:5050 web UI( python plugin has been turned on in
  collectd.conf).
 
  My question is:
  1. Should I install collectd and this mesos-collectd-plugin on each
  master
  and slave nodes and restart collectd daemon? (This is what I have done.)
  2. Should the config file mesos-master.conf only configured on master
  node
  and
  mesos-slave.conf only configured on slave node?(This is what I have
  done.)
  Or both of them should only appear on master node?
  3. Is there an example( or a figure) of what output one is expected to
  see
  by this plugin?
 
  Cheers,
  Dan
 




Re: Question on Monitoring a Mesos Cluster

2015-03-07 Thread Dick Davies
Yeah, that confused me too - I think that figure is specific to the
master/slave polled
(and that'll just be the active one since you're only reporting when
master/elected
is true).

I'm using this one https://github.com/rayrod2030/collectd-mesos  , not
sure if that's
the same as yours?


On 7 March 2015 at 18:56, Jeff Schroeder jeffschroe...@computer.org wrote:
 Responses inline

 On Sat, Mar 7, 2015 at 12:48 PM, CCAAT cc...@tampabay.rr.com wrote:

 ... snip ...

 After getting everything working, I built a few dashboards, one of which
 displays these stats from http://master:5051/metrics/snapshot:

 master/disk_percent
 master/cpus_percent
 master/mem_percent

 I had assumed that this was something like aggregate cluster
 utilization, but this seems incorrect in practice. I have a small
 cluster with ~1T of memory, ~25T of Disks, and ~540 CPU cores. I had a
 dozen or so small tasks running, and launched 500 tasks with 1G of
 memory and 1 CPU each.

 Now I'd expect to se the disk/cpu/mem percentage metrics above go up
 considerably. I did notice that cpus_percent went to around 0.94.

 What is the correct way to measure overall cluster utilization for
 capacity planning? We can have the NOC watch this and simply add more
 hardware when the number starts getting low.


 Boy, I cannot wait to read the tidbits of wisdom here. Maybe the
 development group has more accurate information if not some vague roadmap on
 resource/process monitoring. Sooner or later, this is going to become a
 quintessential need; so I hope the deep thinkers are all over this need
 both in the user and dev groups.

 In fact the monitoring can easily create a significant loading on the
 cluster/cloud, if one is not judicious in how this is architected, implemented
 and dynamically tuned.




 Monitoring via passive metrics gathering and application telemetry is one
 of the best ways to do it. That is how I've implemented things



 The beauty of the rest api is that it isn't heavyweight, and every master
 has it on port 5050 (by default) and every slave has it on port 5051 (by
 default). Since I'm throwing this all into graphite (well technically
 cassandra fronted by cyanite fronted by graphite-api... but same
 difference), I found a reasonable way to do capacity planning. Collectd will
 poll the master/slave on each mesos host every 10 seconds (localhost:5050 on
 masters and localhost:5151 on slaves). This gets put into graphite via
 collectd's write_graphite plugin. These 3 graphite targets give me
 percentages of utilization for nice graphs:

 alias(asPercent(collectd.mesos.clustername.gauge-master_cpu_used,
 collectd.mesos.clustername.gauge-master_cpu_total), Total CPU Usage)
 alias(asPercent(collectd.mesos.clustername.gauge-master_mem_used,
 collectd.mesos.clustername.gauge-master_mem_total), Total Memory Usage)
 alias(asPercent(collectd.mesos.clustername.gauge-master_disk_used,
 collectd.mesos.clustername.gauge-master_disk_total), Total Disk Usage)

 With that data, you can have your monitoring tools such as nagios/icinga
 poll graphite. Using the native graphite render api, you can do things like:

 * if the cpu usage is over 80% for 24 hours, send a warning event
 * if the cpu usage is over 95% for 6 hours, send a critical event

 This allows mostly no-impact monitoring since the monitoring tools are
 hitting graphite.

 Anyways, back to the original questions:

 How does everyone do proper monitoring and capacity planning for large mesos
 clusters? I expect my cluster to grow beyond what it currently is by quite a
 bit.

 --
 Jeff Schroeder

 Don't drink and derive, alcohol and analysis don't mix.
 http://www.digitalprognosis.com


Re: Mesosphere on Centos 6.6

2015-02-05 Thread Dick Davies
This is due to the upstart scripts shipped with the RPM.
mesos has been shipping these since at least 0.17.x (as
that's when we started using it).

Where's the repo to send a PR to correct the docs?

On 5 February 2015 at 09:48, Chengwei Yang chengwei.yang...@gmail.com wrote:
 On Mon, Feb 02, 2015 at 04:58:43PM -0800, Viswanathan Ramachandran wrote:
 Hi,

 I followed instructions to setup multi-node mesos cluster on CentOS 6.6 using
 http://mesosphere.com/docs/getting-started/datacenter/install/

 I found that I was able to install zookeeper, mesos and marathon using yum
 without any issues. No errors during install.

 However, there was no service mesos-master or mesos-slave or marathon
 installed. Any restart command issued would result in unrecognized service.

 Try # start mesos-master/mesos-slave instead.

 --
 Thanks,
 Chengwei


 That said the binaries were all in tact.

 I used ubuntu-trusty VMs instead, and was able to install as per 
 instructions.

 Any updated instructions for CentOS 6 available?

 Thanks
 Vish



Re: Is mesos spamming me?

2015-02-01 Thread Dick Davies
The offer is only for 455 Mb of RAM. You can check that in the slave UI,
but it looks like you have other tasks running that are using some of that
1863Mb.

On 2 February 2015 at 05:11, Hepple, Robert rhep...@tnsi.com wrote:

 Yeah but ... the slave is reporting 1863Mb RAM and 2 CPUS - so how come
 that is rejected by jenkins which is asking for the default 0.1 cpu and
 512Mb RAM???


 Thanks


 Bob


Re: Slave cannot be registered while masters keep switching to another one.

2015-01-28 Thread Dick Davies
Be careful, there's now nothing stopping those 2 masters from forming
2 clusters.
Add a third asap.



On 28 January 2015 at 08:25, xiaokun xiaokun...@gmail.com wrote:
 hi, I changed the quorum to 1. Slave can be displayed now!

 Thanks!

 2015-01-28 16:19 GMT+08:00 xiaokun xiaokun...@gmail.com:

 Thanks for your reply. I will try to modify quorum to 1.
 Here is log from server side. Attachment is added.
 I0128 03:15:36.608562 15350 replica.cpp:638] Replica in VOTING status
 received a broadcasted recover request
 I0128 03:15:37.552141 15346 replica.cpp:638] Replica in VOTING status
 received a broadcasted recover request
 I0128 03:15:38.479542 15345 network.hpp:424] ZooKeeper group memberships
 changed
 I0128 03:15:38.479799 15345 group.cpp:659] Trying to get
 '/mesos/log_replicas/002270' in ZooKeeper
 I0128 03:15:38.480613 15345 group.cpp:659] Trying to get
 '/mesos/log_replicas/002271' in ZooKeeper
 I0128 03:15:38.481050 15345 group.cpp:659] Trying to get
 '/mesos/log_replicas/002272' in ZooKeeper
 I0128 03:15:38.481679 15345 network.hpp:466] ZooKeeper group PIDs: {
 log-replica(1)@10.27.17.135:5050, log-replica(1)@10.27.16.214:5050 }
 I0128 03:15:38.621351 15345 replica.cpp:638] Replica in VOTING status
 received a broadcasted recover request
 I0128 03:15:39.544558 15345 replica.cpp:638] Replica in VOTING status
 received a broadcasted recover request
 I0128 03:15:40.072347 15343 replica.cpp:638] Replica in VOTING status
 received a broadcasted recover request
 I0128 03:15:41.025926 15345 replica.cpp:638] Replica in VOTING status
 received a broadcasted recover request
 I0128 03:15:41.695303 15349 replica.cpp:638] Replica in VOTING status
 received a broadcasted recover request
 I0128 03:15:42.493906 15345 replica.cpp:638] Replica in VOTING status
 received a broadcasted recover request
 I0128 03:15:43.086762 15343 replica.cpp:638] Replica in VOTING status
 received a broadcasted recover request
 I0128 03:15:43.831442 15346 replica.cpp:638] Replica in VOTING status
 received a broadcasted recover request
 I0128 03:15:44.787384 15343 replica.cpp:638] Replica in VOTING status
 received a broadcasted recover request
 I0128 03:15:45.527914 15345 replica.cpp:638] Replica in VOTING status
 received a broadcasted recover request
 I0128 03:15:46.005728 15349 detector.cpp:138] Detected a new leader:
 (id='2272')
 I0128 03:15:46.005892 15349 group.cpp:659] Trying to get
 '/mesos/info_002272' in ZooKeeper
 I0128 03:15:46.006530 15349 detector.cpp:433] A new leading master
 (UPID=master@10.27.16.214:5050) is detected
 I0128 03:15:46.006624 15349 master.cpp:1263] The newly elected leader is
 master@10.27.16.214:5050 with id 20150128-031430-3591379722-5050-15326
 I0128 03:15:46.006664 15349 master.cpp:1276] Elected as the leading
 master!




Re: how to create rpm package

2015-01-26 Thread Dick Davies
Those RPMs are built for CentOS 6 i think.

For testing, you can get it to start up by just dropping in a symlink :

/lib64/libsasl2.so.2  ->  /lib64/libsasl2.so.3
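
i.e. something along the lines of (assuming the library really does live at
that path on your box):

    ln -s /lib64/libsasl2.so.3 /lib64/libsasl2.so.2

Rebuilding the RPM against the CentOS 7 cyrus-sasl libraries is the cleaner
fix, but the symlink is enough to get mesos-slave started for a test.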

On 26 January 2015 at 01:33, Yu Wenhua s...@yuwh.net wrote:
 [root@zone1_0 ~]# uname -a

 Linux zone1_0 3.10.0-123.el7.x86_64 #1 SMP Mon Jun 30 12:09:22 UTC 2014
 x86_64 x86_64 x86_64 GNU/Linux



 I use CentOS7 and install the rpm from

https://mesosphere.com/2014/07/17/mesosphere-package-repositories/

 then got this error message:

 mesos-slave --master=192.168.3.5:5050

 mesos-slave: error while loading shared libraries: libsasl2.so.2: cannot
 open shared object file: No such file or directory



 [root@zone1_0 ~]# ls /usr/lib64/libsasl2.so* -l

 lrwxrwxrwx. 1 root root 17 Nov 28 04:31 /usr/lib64/libsasl2.so.3 ->
 libsasl2.so.3.0.0

 -rwxr-xr-x. 1 root root 121296 Jun 10  2014 /usr/lib64/libsasl2.so.3.0.0

 [root@zone1_0 ~]#



 Maybe   I have to build a rpm from the src file. Right?





 From: Tim St Clair [mailto:tstcl...@redhat.com]
 Sent: 2015年1月23日 22:59
 To: user@mesos.apache.org
 Subject: Re: how to create rpm package



 What's your distro+version?



 Cheers,

 Tim



 

 From: Yu Wenhua s...@yuwh.net
 To: user@mesos.apache.org
 Sent: Friday, January 23, 2015 3:27:36 AM
 Subject: how to create rpm package



 Hi,

   Can anyone tell me how to build a mesos rpm package? So I can deploy it to
 slave node easily

 Thanks.



 Yu.





 --

 Cheers,
 Timothy St. Clair
 Red Hat Inc.


Re: cluster wide init

2015-01-23 Thread Dick Davies
On 23 January 2015 at 21:20, Sharma Podila spod...@netflix.com wrote:


 Here's one possible scenario:
 A DataCenter runs Databases, Webservers, MicroServices, Hadoop or other
 batch jobs, stream processing jobs, etc. There's 1000s, if not 100s, of
 systems available for all of this. Ideally, systems running Databases are
 configured to run different services in their init than one running batch
 jobs. However, because one would want to achieve elasticity of different
 services (#systems running DBs vs. batch, for example) within a single Mesos
 cluster, Mesos would have to determine what services run on the system at
 the current time. It's like a newly installed system comes up and connects
 into Mesos and says, hello there, I am an 8-core 48GB 1Gb eth system ready
 for service, what would you like me to do?. Mesos can choose to make it run
 any one or more of the services which would determine the set of init
 services to launch. And that may change over time as DC traffic changes.

What you're describing here is essentially the value proposition of
Mesos+marathon.

But there are still many services you need to provide in a datacenter
that aren't as
elastic as we'd like, and don't necessarily benefit from the
flexibility you're describing.

It's easy enough to lay those out with the same automation you'd use
to setup your
mesos processes under a more conventional init system.
Someone mentioned Ansible earlier, that's worked out really well in my
experience.

There's a simple Vagrant based playbook here if anyone's interested.

https://github.com/rasputnik/mesos-centos

The nice thing about Ansible is this scales up to hundreds of servers
easily, simply
by changing the inventory file.
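
As a sketch, the inventory is just an INI file of groups (hostnames and group
names below are placeholders - use whatever your playbook expects):

    [mesos_masters]
    master1.example.com
    master2.example.com
    master3.example.com

    [mesos_slaves]
    slave[01:50].example.com

Growing the cluster is then a case of adding lines (or widening the range)
and re-running the play.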


Re: how to create rpm package

2015-01-23 Thread Dick Davies
There's an RPM repo, see documentation at:

   https://mesosphere.com/2014/07/17/mesosphere-package-repositories/

On 23 January 2015 at 09:27, Yu Wenhua s...@yuwh.net wrote:
 Hi,

   Can anyone tell me how to build a mesos rpm package? So I can deploy it to
 slave node easily

 Thanks.



 Yu.


Re: hadoop job stuck.

2015-01-16 Thread Dick Davies
To view the slaves logs, you need to be able to connect to that URL
from your browser, not the master
(the data is read directly from the slave by your browser, it doesn't
go via the master).
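
A quick sanity check is to run the same fetch the browser would make, from the
machine the browser is on (hostname as in the error above):

    curl -s http://centos-2.local:5051/state.json | head -c 300

If that times out or the name doesn't resolve from your workstation, that's
the problem - reachability from the master isn't enough.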


On 15 January 2015 at 21:42, Dan Dong dongda...@gmail.com wrote:
 Hi, All,
   Now sandbox could be viewed on mesos UI, I see the following info( The
 same error appears on every slave sandbox.):

 Failed to connect to slave '20150115-144719-3205108908-5050-4552-S0' on
 'centos-2.local:5051'.

 Potential reasons:

 The slave's hostname, 'centos-2.local', is not accessible from your network
 The slave's port, '5051', is not accessible from your network


 I checked that:
 slave centos-2.local can be login from any machine in the cluster without
 password by ssh centos-2.local ;
 port 5051 on slave centos-2.local could be connected from master by telnet
 centos-2.local 5051

 Confused what's the problem here?

 Cheers,
 Dan



 2015-01-14 15:33 GMT-06:00 Brenden Matthews brenden.matth...@airbnb.com:

 Would need the task logs from the slave which the TaskTracker was launched
 on, to debug this further.

 On Wed, Jan 14, 2015 at 1:28 PM, Dan Dong dongda...@gmail.com wrote:

 Checked /etc/hosts is correct, master and slave can ssh login each other
 by hostname without password, and hadoop runs well without mesos, but it
 stucks when running on mesos.

 Cheers,
 Dan

 2015-01-14 15:02 GMT-06:00 Brenden Matthews
 brenden.matth...@airbnb.com:

 At a first glance, it looks like `/etc/hosts` might be set incorrectly
 and it cannot resolve the hostname of the worker.

 See here for more: https://wiki.apache.org/hadoop/UnknownHost

 On Wed, Jan 14, 2015 at 12:32 PM, Vinod Kone vinodk...@apache.org
 wrote:

 What do the master logs say?

 On Wed, Jan 14, 2015 at 12:21 PM, Dan Dong dongda...@gmail.com wrote:

 Hi,
   When I run hadoop jobs on Mesos(0.21.0), the jobs are stuck for
 ever:
 15/01/14 13:59:30 INFO mapred.FileInputFormat: Total input paths to
 process : 8
 15/01/14 13:59:30 INFO mapred.JobClient: Running job:
 job_201501141358_0001
 15/01/14 13:59:31 INFO mapred.JobClient:  map 0% reduce 0%

 From jobtracker log I see:
 2015-01-14 13:59:35,542 INFO org.apache.hadoop.mapred.ResourcePolicy:
 Launching task Task_Tracker_0 on http://centos-2.local:31911 with 
 mapSlots=1
 reduceSlots=0
 2015-01-14 14:04:35,552 WARN org.apache.hadoop.mapred.MesosScheduler:
 Tracker http://centos-2.local:31911 failed to launch within 300 seconds,
 killing it

  I started manually namenode and jobtracker on master node and
 datanode on slave, but I could not see tasktracker started by mesos on
 slave. Note that if I ran hadoop directly without Mesos( of course the 
 conf
 files are different and tasktracker will be started manually on slave),
 everything works fine. Any hints?

 Cheers,
 Dan








Re: conf files location of mesos.

2015-01-07 Thread Dick Davies
Might be worth getting a packaged release for your OS, especially
if you're new to this.

On 7 January 2015 at 16:53, Dan Dong dongda...@gmail.com wrote:

 Hi, Brian,
   It's not there:
 ls /etc/default/mesos
 ls: cannot access /etc/default/mesos: No such file or directory

 I installed mesos from source tar ball by configure;make;make install as
 normal user.

 Cheers,
 Dan


 2015-01-07 10:43 GMT-06:00 Brian Devins brian.dev...@dealer.com:

  Try ls /etc/default/mesos instead

   From: Dan Dong dongda...@gmail.com
 Reply-To: user@mesos.apache.org user@mesos.apache.org
 Date: Wednesday, January 7, 2015 at 11:38 AM
 To: user@mesos.apache.org user@mesos.apache.org
 Subject: Re: conf files location of mesos.

Hi, All,
 Thanks for your help, I'm using version 0.21.0 of mesos. But I do not
  see any of the dirs 'etc' or 'var' under my build directory (or any
  subdirs). What is the default conf file location for mesos 0.21.0?

 ls ~/mesos-0.21.0/build/
 3rdparty  bin  config.log  config.lt  config.status  ec2  include  lib
 libexec  libtool  Makefile  mesos.pc  mpi  sbin  share  src

Cheers,
Dan

 2015-01-07 9:47 GMT-06:00 Tomas Barton barton.to...@gmail.com:

 Hi Dan,

  this depends on your distribution. Mesosphere package comes with
 wrapper script which uses configuration
 placed in /etc/default/mesos and /etc/mesos-master, /etc/mesos-slave


 https://github.com/mesosphere/mesos-deb-packaging/blob/master/mesos-init-wrapper

  which distribution do you use?

  Tomas

 On 7 January 2015 at 16:23, Dan Dong dongda...@gmail.com wrote:

   Hi,
After installation of mesos on my cluster, where could I find the
 location of configuration files?
  E.g: mesos.conf, masters, slaves etc. I could not find any of them
 under the prefix dir and subdirs (configure
 --prefix=/home/dan/mesos-0.21.0/build/). Are there examples for the conf
 files? Thanks!

  Cheers,
  Dan





 Brian Devins | Java Developer
 brian.dev...@dealer.com






Re: Problems of running mesos-0.20.0 with zookeeper

2014-11-06 Thread Dick Davies
The quorum flag is for the number of mesos masters, not zookeepers.

if you only have one master, it's going to have trouble reaching a
quorum of 2 :)

either set --quorum=1 or spin up more masters.
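
To be concrete, with a single master that just means changing the flag on the
command you already posted, i.e. something like:

  ./mesos-master.sh --ip=master-ip \
    --zk=zk://zoo1:2181,zoo2:2181,zoo3:2181/mesos --quorum=1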

On 6 November 2014 21:01, sujinzhao sujinz...@gmail.com wrote:
 Hi,all,

 I set up zookeeper service with three machines zoo1, zoo2, zoo3, and also 
 installed 1 mesos master and 2 slaves on another three nodes, I tried to run 
 master and slaves with:
 ./mesos-master.sh --ip=master-ip 
 --zk=zk://zoo1:2181,zoo2:2181,zoo3:2181/mesos --quorum=2

 ./mesos-slave.sh --ip=slave-ip 
 --master=zk://zoo1:2181,zoo2:2181,zoo3:2181/mesos

 I also created the /mesos znode before running the above commands, but I got 
 the following error:

 Recovering from registrar
 Recovering registrar
 Recovery failed: Failed to recover registrar: Failed to perform fetch within 
 1mins
 *** Check failure stack trace: ***
 @  0x7f3c1ea105cd google::LogMessage::Fail()
 ...

 after reading the master log, I found that before causing error, master has 
 already been elected successfully, but the leader failed in recovering from 
 registrar, so I guess this error has little relationship with zookeeper.

 after googling I found that other people also encountered this problem, but 
 with no solution, I also exclude the possible reason of ssh between 
 master/slave and zookeeper servers with no password.

 So, could somebody kindly tell me how to solve this error? Any 
 suggestions will be appreciated.

 THANKS.


Re: Problems of running mesos-0.20.0 with zookeeper

2014-11-06 Thread Dick Davies
Golden Rule: don't use an even number of members with quorum systems.

You need a quorum to function so with 2 masters and quorum=2, you can't
ever take a member down. With 2 masters and quorum=1, you're asking
for split brain.

(this is exactly the same with zookeeper by the way, it's also a quorum system)

If you have 1 master, quorum=1
if you have 3 masters, quorum=2
if you have 5 masters, quorum=3

and so on. Try that and see if it helps.


On 7 November 2014 09:42, sujinzhao sujinz...@gmail.com wrote:
 In fact, I also tried with launching 2 masters on two separate machines, at
 first, one of them was successfully elected as a leader, and both of them
 printed several lines of messages:

 Replica in EMPTY status received a broadcasted recover request
 Received a recover response from a replica in EMPTY status

 then the leader master aborted after outputing errors:

 Recovery failed: Failed to recover registrar: Failed to perform fetch within
 1mins
 *** Check failure stack trace: ***
 @ 0x7f3c1ea105cd google::LogMessage::Fail()
 ..

 and next, the second master became the new leader; it also tried to recover
 from the registrar, but also failed and printed errors before aborting:

 Recovery failed: Failed to recover registrar: Failed to perform fetch within
 1mins
 *** Check failure stack trace: ***
 @ 0x7f3c1ea105cd google::LogMessage::Fail()
 ...

 So I guess that's not a zookeeper problem; it's that the elected leader can not
 recover from the registrar. Could somebody kindly illustrate some principles
 of the mesos registry, or give me some suggestions?

 THANKS.

 david.j.palaitis david.j.palai...@gmail.com编写:


 With a single master,  you should not set quorum=2


  Original message 
 From: sujinzhao sujinz...@gmail.com
 Date:11/06/2014 4:01 PM (GMT-05:00)
 To: user@mesos.apache.org
 Cc:
 Subject: Problems of running mesos-0.20.0 with zookeeper

 Hi,all,

 I set up zookeeper service with three machines zoo1, zoo2, zoo3, and also
 installed 1 mesos master and 2 slaves on another three nodes, I tried to run
 master and slaves with:
 ./mesos-master.sh --ip=master-ip
 --zk=zk://zoo1:2181,zoo2:2181,zoo3:2181/mesos --quorum=2

 ./mesos-slave.sh --ip=slave-ip
 --master=zk://zoo1:2181,zoo2:2181,zoo3:2181/mesos

 I also created the /mesos znode before running the above commands, but I got
 the following error:

 Recovering from registrar
 Recovering registrar
 Recovery failed: Failed to recover registrar: Failed to perform fetch within
 1mins
 *** Check failure stack trace: ***
 @  0x7f3c1ea105cd google::LogMessage::Fail()
 ...

 after reading the master log, I found that before causing error, master has
 already been elected successfully, but the leader failed in recovering from
 registrar, so I guess this error has little relationship with zookeeper.

 after googling I found that other people also encountered this problem, but
 with no solution, I also exclude the possible reason of ssh between
 master/slave and zookeeper servers with no password.

 So, could somebody kindly tell me how to solve this error? Any
 suggestions will be appreciated.

 THANKS.


Re: Do i really need HDFS?

2014-10-22 Thread Dick Davies
Be interested to know what that is, if you don't mind sharing.

We're thinking of deploying a Ceph cluster for another project anyway;
it seems to remove some of the chokepoints/points of failure HDFS suffers from,
but I've no idea how well it can interoperate with the usual HDFS clients
(Spark in my particular case, but I'm trying to keep this general).

On 21 October 2014 13:16, David Greenberg dsg123456...@gmail.com wrote:
 We use spark without HDFS--in our case, we just use ansible to copy the
 spark executors onto all hosts at the same path. We also load and store our
 spark data from non-HDFS sources.

 On Tue, Oct 21, 2014 at 4:57 AM, Dick Davies d...@hellooperator.net wrote:

 I think Spark needs a way to send jobs to/from the workers - the Spark
 distro itself
 will pull down the executor ok, but in my (very basic) tests I got
 stuck without HDFS.

 So basically it depends on the framework. I think in Sparks case they
 assume most
 users are migrating from an existing Hadoop deployment, so HDFS is
 sort of assumed.


 On 20 October 2014 23:18, CCAAT cc...@tampabay.rr.com wrote:
  On 10/20/14 11:46, Steven Schlansker wrote:
 
 
  We are running Mesos entirely without HDFS with no problems.  We use
  Docker to distribute our
  application to slave nodes, and keep no state on individual nodes.
 
 
 
  Background: I'm building up a 3 node cluster to run mesos and spark. No
  legacy Hadoop needed or wanted. I am using btrfs for the local file
  system,
  with (2) drives set up for raid1 on each system.
 
  So you  are suggesting that I can install mesos + spark + docker
  and not a DFS on these (3) machines?
 
 
  Will I need any other softwares? My application is a geophysical
  fluid simulator, so scala, R, and all sorts of advanced math will
  be required on the cluster for the Finite Element Methods.
 
 
  James
 
 




Re: Cassandra Mesos Framework Issue

2014-10-19 Thread Dick Davies
Issue seems to be with how the tasks are asking for port resources -
I'd guess whichever
tutorial you're using may be using an old/invalid syntax.

What tutorial are you working from?

On 18 October 2014 15:08, David Palaitis david.palai...@twosigma.com wrote:
 I am having trouble getting Cassandra Mesos to work in a simple test
 environment. The framework connects, but tasks get lost with the following
 error.



 215872 [Thread-113] INFO mesosphere.cassandra.CassandraScheduler  - Got new
 resource offers ArrayBuffer(abc.def.ghi.com)

 215875 [Thread-113] INFO mesosphere.cassandra.CassandraScheduler  -
 resources offered: List((cpus,32.0), (mem,127877.0), (disk,2167529.0),
 (ports,0.0))

 215875 [Thread-113] INFO mesosphere.cassandra.CassandraScheduler  -
 resources required: List((cpus,1.0), (mem,2048.0), (ports,0.0),
 (disk,1000.0))

 215877 [Thread-113] INFO mesosphere.cassandra.CassandraScheduler  - Accepted
 offer: abc.def.ghi.com

 215889 [Thread-114] INFO mesosphere.cassandra.CassandraScheduler  - Received
 status update for task task1413640484861: TASK_LOST (Task uses invalid
 resources: ports(*):0)



 I tried configuring a port resource in the slave and restarting but still
 get the same error e.g.



 ${INSTALL_DIR}/sbin/mesos-slave \

 --master=zk://abc.def.ghi.com:2181/mesos \

 --resources='mem:245760;ports(*):[31000-32000]'



 Any leads?








Re: Staging docker task KILLED after 1 minute

2014-10-16 Thread Dick Davies
One gotcha - the marathon timeout is in seconds, so pass '300' in your case.

Let us know if it works; I spotted this the other day and anecdotally it
addresses the issue for some users, so it would be good to get more feedback.
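
For reference, with the values from this thread that works out to something
like the following (I'm going from memory on the marathon flag name/unit, so
double-check it against the docs linked below):

  # slave side, as you already have it:
  mesos-slave ... --executor_registration_timeout=5mins

  # marathon side:
  marathon ... --task_launch_timeout 300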

On 16 October 2014 09:49, Grzegorz Graczyk gregor...@gmail.com wrote:
 Make sure you have --task_launch_timeout in marathon set to same value as
 executor_registration_timeout.
 https://github.com/mesosphere/marathon/blob/master/docs/docs/native-docker.md#configure-marathon

 On 16 October 2014 10:37, Nils De Moor nils.de.m...@gmail.com wrote:

 Hi,

 Environment:
 - Clean vagrant install, 1 master, 1 slave (same behaviour on production
 cluster with 3 masters, 6 slaves)
 - Mesos 0.20.1
 - Marathon 0.7.3
 - Docker 1.2.0

 Slave config:
 - containerizers: docker,mesos
 - executor_registration_timeout: 5mins

  When I start docker container tasks, they start being pulled from the
  HUB, but after 1 minute mesos kills them.
 In the background though the pull is still finishing and when everything
 is pulled in the docker container is started, without mesos knowing about
 it.
 When I start the same task in mesos again (after I know the pull of the
 image is done), they run normally.

 So this leaves slaves with 'dirty' docker containers, as mesos has no
 knowledge about them.

 From the logs I get this:
 ---
 I1009 15:30:02.990291  1414 slave.cpp:1002] Got assigned task
 test-app.23755452-4fc9-11e4-839b-080027c4337a for framework
 20140904-160348-185204746-5050-27588-
 I1009 15:30:02.990979  1414 slave.cpp:1112] Launching task
 test-app.23755452-4fc9-11e4-839b-080027c4337a for framework
 20140904-160348-185204746-5050-27588-
 I1009 15:30:02.993341  1414 slave.cpp:1222] Queuing task
 'test-app.23755452-4fc9-11e4-839b-080027c4337a' for executor
 test-app.23755452-4fc9-11e4-839b-080027c4337a of framework
 '20140904-160348-185204746-5050-27588-
 I1009 15:30:02.995818  1409 docker.cpp:743] Starting container
 '25ac3310-71e4-4d10-8a4b-38add4537308' for task
 'test-app.23755452-4fc9-11e4-839b-080027c4337a' (and executor
 'test-app.23755452-4fc9-11e4-839b-080027c4337a') of framework
 '20140904-160348-185204746-5050-27588-'

 I1009 15:31:07.033287  1413 slave.cpp:1278] Asked to kill task
 test-app.23755452-4fc9-11e4-839b-080027c4337a of framework
 20140904-160348-185204746-5050-27588-
 I1009 15:31:07.034742  1413 slave.cpp:2088] Handling status update
 TASK_KILLED (UUID: a8ec88a1-1809-4108-b2ed-056a725ecd41) for task
 test-app.23755452-4fc9-11e4-839b-080027c4337a of framework
 20140904-160348-185204746-5050-27588- from @0.0.0.0:0
 W1009 15:31:07.034881  1413 slave.cpp:1354] Killing the unregistered
 executor 'test-app.23755452-4fc9-11e4-839b-080027c4337a' of framework
 20140904-160348-185204746-5050-27588- because it has no tasks
 E1009 15:31:07.034945  1413 slave.cpp:2205] Failed to update resources for
 container 25ac3310-71e4-4d10-8a4b-38add4537308 of executor
 test-app.23755452-4fc9-11e4-839b-080027c4337a running task
 test-app.23755452-4fc9-11e4-839b-080027c4337a on status update for terminal
 task, destroying container: No container found
 I1009 15:31:07.035133  1413 status_update_manager.cpp:320] Received status
 update TASK_KILLED (UUID: a8ec88a1-1809-4108-b2ed-056a725ecd41) for task
 test-app.23755452-4fc9-11e4-839b-080027c4337a of framework
 20140904-160348-185204746-5050-27588-
 I1009 15:31:07.035210  1413 status_update_manager.cpp:373] Forwarding
 status update TASK_KILLED (UUID: a8ec88a1-1809-4108-b2ed-056a725ecd41) for
 task test-app.23755452-4fc9-11e4-839b-080027c4337a of framework
 20140904-160348-185204746-5050-27588- to master@10.0.10.11:5050
 I1009 15:31:07.046167  1408 status_update_manager.cpp:398] Received status
 update acknowledgement (UUID: a8ec88a1-1809-4108-b2ed-056a725ecd41) for task
 test-app.23755452-4fc9-11e4-839b-080027c4337a of framework
 20140904-160348-185204746-5050-27588-

 I1009 15:35:02.993736  1414 slave.cpp:3010] Terminating executor
 test-app.23755452-4fc9-11e4-839b-080027c4337a of framework
 20140904-160348-185204746-5050-27588- because it did not register within
 5mins
 ---

 I already posted my question on the marathon board, as I first thought it
 was an issue on marathon's end:
 https://groups.google.com/forum/#!topic/marathon-framework/NT7_YIZnNoY


 Kind regards,
 Nils




Re: Multiple disks with Mesos

2014-10-08 Thread Dick Davies
To answer point 2) - yes, your executors will create their 'sandboxes'
under work_dir.


On 8 October 2014 00:13, Arunabha Ghosh arunabha...@gmail.com wrote:
 Thanks Steven !

 On Tue, Oct 7, 2014 at 4:08 PM, Steven Schlansker
 sschlans...@opentable.com wrote:


 On Oct 7, 2014, at 4:06 PM, Arunabha Ghosh arunabha...@gmail.com wrote:

  Hi,
   I would like to run Mesos slaves on machines that have multiple
  disks. According to the Mesos configuration page I can specify a work_dir
  argument to the slaves.
 
  1) Can the work_dir argument contain multiple directories ?
 
  2) Is the work_dir where Mesos will place all of its data ? So If I
  started a task on Mesos, would the slave place the task's data (stderr,
  stdout, task created directories) inside work_dir ?

 We stitch our disks together before Mesos gets its hands on it using a
 technology such as LVM or btrfs, so that the work_dir is spread across the
 multiple disks transparently.




Re: Orphaned Docker containers in Mesos 0.20.1

2014-10-02 Thread Dick Davies
One thing to check - have you upped

--executor_registration_timeout

from the default of 1min? A docker pull can easily take longer than that.
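
i.e. something along the lines of:

  mesos-slave ... --executor_registration_timeout=5mins

(the value is just an example - pick something comfortably longer than your
worst-case docker pull).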

On 2 October 2014 22:18, Michael Babineau michael.babin...@gmail.com wrote:
 I'm seeing an issue where tasks are being marked as killed but remain
 running. The tasks all run via the native Docker containerizer and are
 started from Marathon.

 The net result is additional, orphaned Docker containers that must be
 stopped/removed manually.

 Versions:
 - Mesos 0.20.1
 - Marathon 0.7.1
 - Docker 1.2.0
 - Ubuntu 14.04

 Environment:
 - 3 ZK nodes, 3 Mesos Masters, and 3 Mesos Slaves (all separate instances)
 on EC2

 Here's the task in the Mesos UI:

 (note that stderr continues to update with the latest container output)

 Here's the still-running Docker container:
 $ docker ps|grep 1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f
 3d451b8213ea
 docker.thefactory.com/ace-serialization:f7aa1d4f46f72d52f5a20ef7ae8680e4acf88bc0
 \/bin/sh -c 'java26 minutes ago  Up 26 minutes   9990/tcp
 mesos-1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f

 Here are the Mesos logs associated with the task:
 $ grep eda431d7-4a74-11e4-a320-56847afe9799 /var/log/mesos/mesos-slave.INFO
 I1002 20:44:39.176024  1528 slave.cpp:1002] Got assigned task
 serialization.eda431d7-4a74-11e4-a320-56847afe9799 for framework
 20140919-224934-1593967114-5050-1518-
 I1002 20:44:39.176257  1528 slave.cpp:1112] Launching task
 serialization.eda431d7-4a74-11e4-a320-56847afe9799 for framework
 20140919-224934-1593967114-5050-1518-
 I1002 20:44:39.177287  1528 slave.cpp:1222] Queuing task
 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' for executor
 serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework
 '20140919-224934-1593967114-5050-1518-
 I1002 20:44:39.191769  1528 docker.cpp:743] Starting container
 '1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f' for task
 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' (and executor
 'serialization.eda431d7-4a74-11e4-a320-56847afe9799') of framework
 '20140919-224934-1593967114-5050-1518-'
 I1002 20:44:43.707033  1521 slave.cpp:1278] Asked to kill task
 serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework
 20140919-224934-1593967114-5050-1518-
 I1002 20:44:43.707811  1521 slave.cpp:2088] Handling status update
 TASK_KILLED (UUID: 4f5bd9f9-0625-43de-81f6-2c3423b1ce12) for task
 serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework
 20140919-224934-1593967114-5050-1518- from @0.0.0.0:0
 W1002 20:44:43.708273  1521 slave.cpp:1354] Killing the unregistered
 executor 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' of framework
 20140919-224934-1593967114-5050-1518- because it has no tasks
 E1002 20:44:43.708375  1521 slave.cpp:2205] Failed to update resources for
 container 1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f of executor
 serialization.eda431d7-4a74-11e4-a320-56847afe9799 running task
 serialization.eda431d7-4a74-11e4-a320-56847afe9799 on status update for
 terminal task, destroying container: No container found
 I1002 20:44:43.708524  1521 status_update_manager.cpp:320] Received status
 update TASK_KILLED (UUID: 4f5bd9f9-0625-43de-81f6-2c3423b1ce12) for task
 serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework
 20140919-224934-1593967114-5050-1518-
 I1002 20:44:43.708709  1521 status_update_manager.cpp:373] Forwarding status
 update TASK_KILLED (UUID: 4f5bd9f9-0625-43de-81f6-2c3423b1ce12) for task
 serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework
 20140919-224934-1593967114-5050-1518- to master@10.2.0.182:5050
 I1002 20:44:43.728991  1526 status_update_manager.cpp:398] Received status
 update acknowledgement (UUID: 4f5bd9f9-0625-43de-81f6-2c3423b1ce12) for task
 serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework
 20140919-224934-1593967114-5050-1518-
 I1002 20:47:05.904324  1527 slave.cpp:2538] Monitoring executor
 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' of framework
 '20140919-224934-1593967114-5050-1518-' in container
 '1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f'
 I1002 20:47:06.311027  1525 slave.cpp:1733] Got registration for executor
 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' of framework
 20140919-224934-1593967114-5050-1518- from executor(1)@10.2.1.34:29920

 I'll typically see a barrage of these in association with a Marathon app
 update (which deploys new tasks). Eventually, one container sticks and we
 get a RUNNING task instead of a KILLED one.

 Where else can I look?


Re: Build on Amazon Linux

2014-09-26 Thread Dick Davies
What version of docker does that give you, out of interest?


mainline EL7 is still shipping a pre-1.0 that won't work with mesos
(although since docker is just a static Go binary, it's trivial to overwrite
/usr/bin/docker and get everything to work).


On 25 September 2014 20:23, John Mickey j...@pithoslabs.com wrote:
 Thanks to all for the help

 Tim - thanks for pointing out the obvious
 CCAAT - Great article

 Here are the instructions for getting Mesos to run on Amazon Linux
 amzn-ami-hvm-2014.03.1.x86_64-ebs (ami-383a5008) (us-west-2)

 On a single instance, as root, proof of concept setup

 Install Docker
  yum -y install docker
 service docker start

 Install Tools to build Mesos (From Apache Mesos documentation)
  yum -y groupinstall "Development Tools"
 yum -y install python-devel java-1.7.0-openjdk-devel zlib-devel
 libcurl-devel openssl-devel cyrus-sasl-devel cyrus-sasl-md5
 wget 
 http://mirror.nexcess.net/apache/maven/maven-3/3.0.5/binaries/apache-maven-3.0.5-bin.tar.gz
 tar -zxf apache-maven-3.0.5-bin.tar.gz -C /opt/
 ln -s /opt/apache-maven-3.0.5/bin/mvn /usr/bin/mvn

 Install Oracle Java (Amazon Linux ships with OpenJDK)
  wget --no-check-certificate --no-cookies \
    --header "Cookie: oraclelicense=accept-securebackup-cookie" \
    http://download.oracle.com/otn-pub/java/jdk/7u67-b01/jdk-7u67-linux-x64.rpm
 rpm -i jdk-7u67-linux-x64.rpm
 export JAVA_HOME=/usr/java/jdk1.7.0_67
 export PATH=$PATH:/usr/java/jdk1.7.0_67/bin
 alternatives --install /usr/bin/java java /usr/java/jdk1.7.0_67/bin/java 2
 alternatives --config java
 java -version

 Build Mesos
 wget http://mirror.olnevhost.net/pub/apache/mesos/0.20.1/mesos-0.20.1.tar.gz
 tar -zxf mesos-0.20.1.tar.gz
 cd mesos
 mkdir build
 cd build
 ../configure
 make
  make check (this will fail on a cgroups issue, see earlier in this thread)
 make install

 Run Mesos Master and Slave
 /usr/local/sbin/mesos-master --work_dir=/tmp/mesos
 --zk=zk://localhost:2181/mesos --quorum=1 --ip=1.2.3.4
 /usr/local/sbin/mesos-slave --master=zk://localhost:2181/mesos
 --containerizers=docker,mesos

 On Thu, Sep 25, 2014 at 1:56 PM, Tim St Clair tstcl...@redhat.com wrote:
 It looks like docker-daemon isn't running.

 Cheers,
 Tim

 - Original Message -
 From: John Mickey j...@pithoslabs.com
 To: user@mesos.apache.org
 Sent: Thursday, September 25, 2014 10:33:42 AM
 Subject: Re: Build on Amazon Linux

 I tried the --help options before replying in my previous post, but
 did not do a good job of explaining what I was seeing

 --isolation=VALUE  Isolation mechanisms to use, e.g.,
 'posix/cpu,posix/mem', or

 'cgroups/cpu,cgroups/mem', or network/port_mapping
  (configure with flag:
 --with-network-isolator to enable),
  or 'external'. (default:
 posix/cpu,posix/mem)

 If I run this (Master is running)
 $ /usr/local/sbin/mesos-slave --master=zk://localhost:2181/mesos
 --containerizers=docker,mesos --isolation=posix/cpu,posix/mem

 Slave will not start with this message
 $ I0925 15:26:19.118268 18604 main.cpp:128] Version: 0.20.0 Failed to
 create a containerizer: Could not create DockerContainerizer: Failed
 to find a mounted cgroups hierarchy for the 'cpu' subsystem; you
 probably need to mount cgroups manually!

 The default is posix/cpu,posix/mem

 Any ideas why it is still trying to use cgroups?

 Once I get this working, I will post the steps for Amazon Linux.
 Thank you again for the help.


 On Wed, Sep 24, 2014 at 4:31 PM, Tim St Clair tstcl...@redhat.com wrote:
 
  $ mesos-slave --isolation='posix/cpu,posix/mem' ...
 
  for ref:
 
  $ mesos-slave --help
 
  ...
 
  --isolation=VALUE  Isolation mechanisms to use,
  e.g., 'posix/cpu,posix/mem', or
   'cgroups/cpu,cgroups/mem', or
   network/port_mapping
   (configure with flag:
   --with-network-isolator to
   enable),
  ...
 
  Cheers,
  Tim
 
  - Original Message -
  From: John Mickey j...@pithoslabs.com
  To: user@mesos.apache.org
  Sent: Wednesday, September 24, 2014 4:03:37 PM
  Subject: Re: Build on Amazon Linux
 
  Thank you again for the responses.  What is the option to remove
  cgroups isolation from the slave start?
 
  I ran /usr/local/sbin/mesos-slave --help and do not see an option to
  remove cgroups isolation from the slave start
 
  On Wed, Sep 24, 2014 at 1:48 PM, Tim St Clair tstcl...@redhat.com 
  wrote:
   You likely have a systemd problem, and you can edit your slave startup
   to
   remove cgroups isolation until 0.21.0 is released.
  
   # systemd cgroup integration, *only* enable on master/0.21.0  
   #export MESOS_isolation='cgroups/cpu,cgroups/mem'
   #export MESOS_cgroups_root='system.slice/mesos-slave.service'
   #export 

Re: Running mesos-slave in Docker container

2014-09-23 Thread Dick Davies
The master is advertising itself as being on 127.0.0.1  - try running
it with an --ip flag.


On 23 September 2014 11:10, Grzegorz Graczyk gregor...@gmail.com wrote:
 Thanks for your response!

 Mounting /sys did the job, cgroups are working, but now mesos-slave is just
 crashing after detecting the new master or so (there's nothing useful in the
 logs - is there a way to make them more verbose?)

 Last lines of logs from mesos-slave:
 I0923 10:03:24.07985910 detector.cpp:426] A new leading master
 (UPID=master@127.0.0.1:5050) is detected
 I0923 10:03:26.076053 9 slave.cpp:3195] Finished recovery
 I0923 10:03:26.076505 9 slave.cpp:589] New master detected at
 master@127.0.0.1:5050
 I0923 10:03:26.076732 9 slave.cpp:625] No credentials provided.
 Attempting to register without authentication
 I0923 10:03:26.076812 9 slave.cpp:636] Detecting new master
 I0923 10:03:26.076864 9 status_update_manager.cpp:167] New master
 detected at master@127.0.0.1:5050

 There's no problem in running mesos-master in the container(at least there
 wasn't any in my case, for now)




 On 23 September 2014 09:41, Tim Chen t...@mesosphere.io wrote:

 Hi Grzegorz,

 To run Mesos master|slave in a docker container is not straight forward
 because we utilize kernel features therefore you need to explicitly test out
 the features you like to use with Mesos with slave/master in Docker.

 Gabriel during the Mesosphere hackathon has got master and slave running
 in docker containers, and he can probably share his Dockerfile and run
 command.

 I believe one work around to get cgroups working with Docker run is to
 mount /sys into the container (mount -v /sys:/sys).

 Gabriel do you still have the command you used to run slave/master with
 Docker?

 Tim



 On Tue, Sep 23, 2014 at 12:24 AM, Grzegorz Graczyk gregor...@gmail.com
 wrote:

 I'm trying to run mesos-slave inside Docker container, but it can't start
 due to problem with mounting cgroups.

 I'm using:
 Kernel Version: 3.13.0-32-generic
 Operating System: Ubuntu 14.04.1 LTS
 Docker: 1.2.0(commit fa7b24f)
 Mesos: 0.20.0

 Following error appears:
 I0923 07:11:20.92147519 main.cpp:126] Build: 2014-08-22 05:04:26 by
 root
 I0923 07:11:20.92160819 main.cpp:128] Version: 0.20.0
 I0923 07:11:20.92162019 main.cpp:131] Git tag: 0.20.0
 I0923 07:11:20.92162819 main.cpp:135] Git SHA:
 f421ffdf8d32a8834b3a6ee483b5b59f65956497
 Failed to create a containerizer: Could not create DockerContainerizer:
 Failed to find a mounted cgroups hierarchy for the 'cpu' subsystem; you
 probably need to mount cgroups manually!

 I'm running docker container with command:
 docker run --name mesos-slave --privileged --net=host -v
 /var/run/docker.sock:/var/run/docker.sock -v /var/lib/docker:/var/lib/docker
 -v /usr/local/bin/docker:/usr/local/bin/docker gregory90/mesos-slave
 --containerizers=docker,mesos --master=zk://localhost:2181/mesos
 --ip=127.0.0.1

 Everything is running on single machine.
 Everything works as expected when mesos-slave is run outside docker
 container.

 I'd appreciate some help.





Re: [VOTE] Release Apache Mesos 0.20.1 (rc2)

2014-09-18 Thread Dick Davies
Don't suppose there's any chance of a fix for

https://issues.apache.org/jira/browse/MESOS-1195

is there?

(I'll settle for a workaround to get mesos running on EL7 somehow, mind)


On 18 September 2014 18:18, Adam Bordelon a...@mesosphere.io wrote:
 Great. I'll roll that into an rc3 today. Any other patch requests for rc3?

 On Thu, Sep 18, 2014 at 2:36 AM, Benjamin Hindman
 benjamin.hind...@gmail.com wrote:

 I've committed Tim's fix, we can cut another release candidate and restart
 the vote.

 On Wed, Sep 17, 2014 at 11:07 PM, Tim Chen t...@mesosphere.io wrote:

  -1
 
  The docker test failed when I removed the image, and found a problem
  from
  the docker pull implementation.
  I've created a reviewboard for a fix: https://reviews.apache.org/r/25758
 
  Will like to get this fixed before releasing it.
 
  Tim
 
  On Wed, Sep 17, 2014 at 9:10 PM, Vinod Kone vinodk...@gmail.com wrote:
 
  +1 (binding)
 
  make check passes on CentOS 5.5 w/ gcc 4.8.2.
 
 
 
  On Wed, Sep 17, 2014 at 7:42 PM, Adam Bordelon a...@mesosphere.io
  wrote:
 
  Update: The vote is open until Mon Sep 22 10:00:00 PDT 2014 and passes
  if a majority of at least 3 +1 PMC votes are cast.
 
  On Wed, Sep 17, 2014 at 6:27 PM, Adam Bordelon a...@mesosphere.io
  wrote:
 
  Hi all,
 
  Please vote on releasing the following candidate as Apache Mesos
  0.20.1.
 
 
  0.20.1 includes the following:
 
 
  
  Minor bug fixes for docker integration, network isolation, etc.
 
  The CHANGELOG for the release is available at:
 
 
  https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.20.1-rc2
 
 
  
 
  The candidate for Mesos 0.20.1 release is available at:
 
 
  https://dist.apache.org/repos/dist/dev/mesos/0.20.1-rc2/mesos-0.20.1.tar.gz
 
  The tag to be voted on is 0.20.1-rc2:
 
 
  https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.20.1-rc2
 
  The MD5 checksum of the tarball can be found at:
 
 
  https://dist.apache.org/repos/dist/dev/mesos/0.20.1-rc2/mesos-0.20.1.tar.gz.md5
 
  The signature of the tarball can be found at:
 
 
  https://dist.apache.org/repos/dist/dev/mesos/0.20.1-rc2/mesos-0.20.1.tar.gz.asc
 
  The PGP key used to sign the release is here:
  https://dist.apache.org/repos/dist/release/mesos/KEYS
 
  The JAR is up in Maven in a staging repository here:
 
  https://repository.apache.org/content/repositories/orgapachemesos-1034
 
  Please vote on releasing this package as Apache Mesos 0.20.1!
 
  The vote is open until  and passes if a majority of at least 3 +1 PMC
  votes are cast.
 
  [ ] +1 Release this package as Apache Mesos 0.20.1
  [ ] -1 Do not release this package because ...
 
  Thanks,
  -Adam-
 
 
 
 
 




Re: Sandbox Log Links

2014-09-04 Thread Dick Davies
I don't think that's the issue - I have a custom work_dir too and can
see the logs fine.

Don't they still get served up from the slaves themselves (port 5051)?
Maybe you've got a firewall blocking that from where you're viewing the
Mesos UI?

On 4 September 2014 23:58, John Omernik j...@omernik.com wrote:
 Thanks Tim. Some testing showed that when I moved to 0.20, I set up the
 slaves to use a specific log directory rather than just defaulting to /tmp.
 Basically, if you specify a custom work_dir for the slave, the master
 doesn't know (I am guessing?) where to find the logs? This seems like
 something that should work (if you change the work_dir, it should update the
 master with where to look for logs in the gui).  Thoughts?


 On Thu, Sep 4, 2014 at 5:34 PM, Tim Chen t...@mesosphere.io wrote:

 Hi John,

 Take a look at the slave log and see if your task failed, what was the
 failure message that was part of your task failure.

 Tim


 On Thu, Sep 4, 2014 at 3:24 PM, John Omernik j...@omernik.com wrote:

 Hey all, I upgraded to 0.20 and when I click on sandbox, the link is
 good, but there are not futher links for logs (i.e. standard err, out etc)
 like there was in 0.19. I have changed my log location, but it should still
 work... Curious on what I can look at to troubleshoot.

 Thanks!

 John





Re: Mesos 0.19 registrar upgrade

2014-07-22 Thread Dick Davies
On 22 July 2014 10:40, Tomas Barton barton.to...@gmail.com wrote:

 I have 4 Mesos masters, which would mean that quorum > 2, i.e. quorum=3, right?

Yes, that's right. 2 won't be enough.


 quorum=1, mesos-masters=1
 quorum=2, mesos-masters=3
 quorum=3, mesos-masters=5
 quorum=4, mesos-masters=7

 Is it possible to have a non-even number of Mesos masters? Or is it just a bad
 idea?

Yes, it's a bad idea since this change - it's always been a bad idea
to run an even
number of zookeepers and now that extends to the mesos masters.

4 masters gives you no extra redundancy over 3, and your likelihood of node loss
increases slightly (as you now have an extra server to potentially break).


Re: how to update master cluster

2014-07-16 Thread Dick Davies
For provisioning, yes; for ad-hoc maintenance tasks it won't help at all.
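
Ansible's ad-hoc mode is handy for exactly that sort of thing though - e.g.
something like this restarts slaves one box at a time (the group and service
names are whatever you've used in your inventory and packaging):

  ansible mesos_slaves -i hosts -m service \
    -a "name=mesos-slave state=restarted" --forks 1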

On 16 July 2014 11:29, Nayeem Syed nay...@cronycle.com wrote:
 Thanks for those! I will give it a try to get some deployment through
 ansible.

 I was also wondering if Cloudformation might be good for this? As it clears
 up the things very cleanly when you remove the formation? Though I find
 their JSON file very difficult to navigate and their Update Feature doesnt
 seem to work too well..


 On Wed, Jul 16, 2014 at 10:46 AM, Dick Davies d...@hellooperator.net
 wrote:

 I'd like to show you my playbooks, but unfortunately they're for a client
 - I can vouch for it being very easy to add nodes to a cluster etc. if you
 just have to edit an 'inventory' file and add IPs into the correct groups.

 (NB: puppet and chef will automate your infrastructure too, it's just
 they're
 not as useful for things like rolling deployments in my experience because
 they're agent based, so it's harder to control when each server will
 update and
 restart services).

 A quick Google found:


 http://blog.michaelhamrah.com/2014/06/setting-up-a-multi-node-mesos-cluster-running-docker-haproxy-and-marathon-with-ansible/

 which might be useful.

 The play books linked from that post are for bootstrapping a cluster, but
 it's
 pretty simple to add a second playbook to manage rolling deploys etc.
 There's some Ansible examples of rolling deploys (not Mesos specific)
 at :

 http://docs.ansible.com/guide_rolling_upgrade.html


 On 15 July 2014 14:41, Nayeem Syed nay...@cronycle.com wrote:
  thanks!
 
  do you have some examples of how you are using it with ansible? i dont
  have
  specific preferences, whatever works really.
 
 
  On Tue, Jul 15, 2014 at 2:35 PM, Dick Davies d...@hellooperator.net
  wrote:
 
  You want a rolling restart i'd guess, unless you want downtime for some
  reason.
 
  We use Ansible, it's pretty nice.
 
  On 15 July 2014 10:47, Nayeem Syed nay...@cronycle.com wrote:
   whats the best way to update mesos master instances. eg I want to
   update
   things in there, install new frameworks, but at the moment I am
   ssh'ing
   to
   the instances and installing them one by one. that feels wrong,
   shouldnt
   it
   be done in parallel to all the instances?
  
   what do people currently do to keep all the masters in sync?
 
 




Re: mesos isolation

2014-07-11 Thread Dick Davies
Are you using cgroups, or the default (posix) isolation?



On 11 July 2014 17:06, Asim linka...@gmail.com wrote:
 Hi,

 I am running a job on a few machines in my Linux cluster. Each machine is an
 Intel 8-core (with 32 threads). I see a total of 32 CPUs in /proc/cpuinfo and
 within mesos web interface. When I launch a job using mesos, I see that all
 CPUs are used equally and not just the number of CPUs I specify for that
 task.

 Furthermore, I also see that the average per task running time within a
 single machine, with 5 tasks/machine is 1/2 as much as that with 10
 tasks/machine. Within mesos, each task has 1 CPU assigned and it is
 completely CPU bound (no dataset, no file access). As per mesos, the 5 tasks
 job uses 5 CPUs while the 10 task job uses 10 CPUs (so average task run
 times should be same unlike what I am seeing). Also, when I monitor CPU
 utilization, I see that all CPUs are used equally.  I am really confused. Is
 this how mesos/container isolation is supposed to work?

 Thanks,
 Asim



number of masters and quorum

2014-07-01 Thread Dick Davies
I might be wrong but doesn't the new quorum setting mean
it only makes sense to run an odd number of masters
(a la zookeepers)?

i.e. 4 masters is no more resilient than 3 (in fact less so, since
you increase your chance of a node failure as number of nodes
increases).


Re: Docker support in Mesos core

2014-06-21 Thread Dick Davies
That's fantastic news, really good to see some integration happening
between chocolate and peanut butter
here. Deimos has been pretty difficult for us to deploy on our
platforms (largely down to the python implementation,
which has problems on the ancient python EL6 ships with).



On 20 June 2014 23:40, Tobias Knaup t...@knaup.me wrote:
 Hi all,

 We've got a lot of feedback from folks who use Mesos to run Dockers at scale
 via Deimos, and the main wish was to make Docker a first class citizen in
 Mesos, instead of a plugin that needs to be installed separately. Mesosphere
 wants to contribute this and I already chatted with Ben H about what an
 implementation could look like.

 I'd love for folks on here that are working with Docker to chime in!
 I created a JIRA here: https://issues.apache.org/jira/browse/MESOS-1524

 Cheers,

 Tobi


Re: Failed to perform recovery: Incompatible slave info detected

2014-06-19 Thread Dick Davies
Fair enough, appreciate the explanation (and that you've clearly
thought hard about this in the design).

The cluster I hit this on was in the process of being built and had no
tasks deployed, it just violated
my Principle of Least Astonishment that dropping some more cores into
the slaves seemed to kill them
off.

I can see there must be cases where this design choice is the right thing to
do; now we know we can work around it easily enough - so thanks for the
lesson :)
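
For the archive, the two workarounds boil down to something like this (using
the CLI from earlier in the thread, so treat the exact paths as examples):

  # keep recovery: pin the hostname to whatever it was before the upgrade
  /usr/local/sbin/mesos-slave --master=zk://10.10.10.105:2181/mesos \
    --ip=10.10.10.101 --hostname=<the old FQDN> \
    --log_dir=/var/log/mesos --work_dir=/var/mesos

  # or give up on recovery: wipe the slave metadata and start fresh
  rm -rf /var/mesos/meta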

On 19 June 2014 18:43, Vinod Kone vinodk...@gmail.com wrote:
 Yes. The idea behind storing the whole slave info is to provide safety.

 Imagine, the slave resources were reduced on a restart. What does this mean
 for already running tasks that are using more resources than the newly
 configured resources? Should the slave kill them? If yes, which ones?
 Similarly what happens when the slave attributes are changed (e.g., secure
 to unsecure)? Is it safe to keep running the existing tasks?

 As you can see, reconciliation of slave info is a complex problem. While
 there are some smarts we can add to the slave (e.g., increase of resources
 is OK while decrease is not) we haven't really seen a need for it yet.


 On Thu, Jun 19, 2014 at 3:03 AM, Dick Davies d...@hellooperator.net wrote:

 Fab, thanks Vinod. Turns out that feature (different FQDN to serve the ui
 up on)
 might well be really useful for us, so every cloud has a silver lining :)

 back to the metadata feature though - do you know why just the 'id' of
 the slaves isn't used?
 As it stands adding disk storage, cores or RAM to a slave will cause
 it to drop out of cluster -
 does checking the whole metadata provide any benefit vs. checking the id?

 On 18 June 2014 19:46, Vinod Kone vinodk...@gmail.com wrote:
  Filed https://issues.apache.org/jira/browse/MESOS-1506 for fixing
  flags/documentation.
 
 
  On Wed, Jun 18, 2014 at 11:33 AM, Dick Davies d...@hellooperator.net
  wrote:
 
  Thanks, it might be worth correcting the docs in that case then.
  This URL says it'll use the system hostname, not the reverse DNS of
  the ip argument:
 
  http://mesos.apache.org/documentation/latest/configuration/
 
  re: the CFS thing - this was while running Docker on the slaves - that
  also uses cgroups
  so maybe resources were getting split with mesos or something? (I'm
  still reading up on
  cgroups) - definitely wasn't the case until cfs was enabled.
 
 
  On 18 June 2014 18:34, Vinod Kone vinodk...@gmail.com wrote:
   Hey Dick,
  
   Regarding slave recovery, any changes in the SlaveInfo (see
   mesos.proto)
   are
   considered as a new slave and hence recovery doesn't proceed forward.
   This
   is because Master caches SlaveInfo and it is quite complex to
   reconcile
   the
   differences in SlaveInfo. So we decided to fail on any SlaveInfo
   changes
   for
   now.
  
   In your particular case,
   https://issues.apache.org/jira/browse/MESOS-672
   was
   committed in 0.18.0 which fixed redirection
of WebUI. Included in this fix is
   https://reviews.apache.org/r/17573/
   which
   changed how SlaveInfo.hostname is calculated. Since you are not
   providing a
   hostname via --hostname flag, slave now deduces the hostname from
   --ip
   flag. Looks like in your cluster the hostname corresponding to that
   ip
   is
   different than what 'os::hostname()' gives.
  
   Couple of options to move forward. If you want slave recovery,
   provide
   --hostname that matches the previous hostname. If you don't care
   above
   recovery, just remove the meta directory (rm -rf /var/mesos/meta)
   so
   that
   the slave starts as a fresh one (since you are not using cgroups, you
   will
   have to manually kill any old executors/tasks that are still alive on
   the
   slave).
  
   Not sure about your comment on CFS. Enabling CFS shouldn't change how
   much
   memory the slave sees as available. More details/logs would help
   diagnose
   the issue.
  
   HTH,
  
  
  
   On Wed, Jun 18, 2014 at 4:26 AM, Dick Davies d...@hellooperator.net
   wrote:
  
   Should have said, the CLI for this is :
  
   /usr/local/sbin/mesos-slave --master=zk://10.10.10.105:2181/mesos
   --log_dir=/var/log/mesos --ip=10.10.10.101 --work_dir=/var/mesos
  
   (note IP is specified, hostname is not - docs indicated hostname arg
   will default to the fqdn of host, but it appears to be using the
   value
   passed as 'ip' instead.)
  
   On 18 June 2014 12:00, Dick Davies d...@hellooperator.net wrote:
Hi, we recently bumped 0.17.0 - 0.18.2 and the slaves
now show their IPs rather than their FQDNs on the mesos UI.
   
This broke slave recovery with the error:
   
Failed to perform recovery: Incompatible slave info detected
   
   
cpu, mem, disk, ports are all the same. so is the 'id' field.
   
the only thing that's changed is are the 'hostname' and
webui_hostname
arguments
(the CLI we're passing in is exactly the same as it was on 0.17.0,
so
presumably this is down

Re: Failed to perform recovery: Incompatible slave info detected

2014-06-18 Thread Dick Davies
Thanks, it might be worth correcting the docs in that case then.
This URL says it'll use the system hostname, not the reverse DNS of
the ip argument:

http://mesos.apache.org/documentation/latest/configuration/

re: the CFS thing - this was while running Docker on the slaves - that
also uses cgroups
so maybe resources were getting split with mesos or something? (I'm
still reading up on
cgroups) - definitely wasn't the case until cfs was enabled.


On 18 June 2014 18:34, Vinod Kone vinodk...@gmail.com wrote:
 Hey Dick,

 Regarding slave recovery, any changes in the SlaveInfo (see mesos.proto) are
 considered as a new slave and hence recovery doesn't proceed forward. This
 is because Master caches SlaveInfo and it is quite complex to reconcile the
 differences in SlaveInfo. So we decided to fail on any SlaveInfo changes for
 now.

 In your particular case, https://issues.apache.org/jira/browse/MESOS-672 was
 committed in 0.18.0 which fixed redirection
  of WebUI. Included in this fix is https://reviews.apache.org/r/17573/ which
 changed how SlaveInfo.hostname is calculated. Since you are not providing a
 hostname via --hostname flag, slave now deduces the hostname from --ip
 flag. Looks like in your cluster the hostname corresponding to that ip is
 different than what 'os::hostname()' gives.

 Couple of options to move forward. If you want slave recovery, provide
 --hostname that matches the previous hostname. If you don't care above
 recovery, just remove the meta directory (rm -rf /var/mesos/meta) so that
 the slave starts as a fresh one (since you are not using cgroups, you will
 have to manually kill any old executors/tasks that are still alive on the
 slave).

 Not sure about your comment on CFS. Enabling CFS shouldn't change how much
 memory the slave sees as available. More details/logs would help diagnose
 the issue.

 HTH,



 On Wed, Jun 18, 2014 at 4:26 AM, Dick Davies d...@hellooperator.net wrote:

 Should have said, the CLI for this is :

 /usr/local/sbin/mesos-slave --master=zk://10.10.10.105:2181/mesos
 --log_dir=/var/log/mesos --ip=10.10.10.101 --work_dir=/var/mesos

 (note IP is specified, hostname is not - docs indicated hostname arg
 will default to the fqdn of host, but it appears to be using the value
 passed as 'ip' instead.)

 On 18 June 2014 12:00, Dick Davies d...@hellooperator.net wrote:
  Hi, we recently bumped 0.17.0 - 0.18.2 and the slaves
  now show their IPs rather than their FQDNs on the mesos UI.
 
  This broke slave recovery with the error:
 
  Failed to perform recovery: Incompatible slave info detected
 
 
  cpu, mem, disk, ports are all the same. so is the 'id' field.
 
  the only thing that's changed is are the 'hostname' and webui_hostname
  arguments
  (the CLI we're passing in is exactly the same as it was on 0.17.0, so
  presumably this is down to a change in mesos conventions).
 
  I've had similar issues enabling CFS in test environments (slaves show
  less free memory and refuse to recover).
 
  is the 'id' field not enough to uniquely identify a slave?




n00b isolation docs?

2014-06-09 Thread Dick Davies
So we're running with default isolation (posix)
and thinking about enabling cgroups (mesos 0.17.0
right now but the upgrade to 0.18.2 was seamless
in dev. so that'll probably happen too).

I just need to justify the effort and extra complexity,
so can someone explain briefly

* what cgroup isolation provides over stock posix / process isolation
* the configuration required to set up cgroups
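
(For context, the switch I'm asking about is basically going from the default
to something like:

  mesos-slave ... --isolation=cgroups/cpu,cgroups/mem

versus the stock posix/cpu,posix/mem.)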

Thanks!


Re: Log managment

2014-05-16 Thread Dick Davies
I'd try a newer version before you file bugs - but to be honest log rotation
is logrotate's job; it's really not very hard to set up.

In our stack we run under upstart, so things make it into syslog and we
don't have to worry about rotation - scales better too as it's easier to
centralize.
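
As a sketch (untested, adjust the glob and retention to taste - copytruncate
is the bit that avoids restarting anything):

  # /etc/logrotate.d/mesos
  /var/log/mesos/* {
    daily
    rotate 7
    compress
    missingok
    copytruncate
  }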

On 14 May 2014 09:46, Damien Hardy dha...@viadeoteam.com wrote:
 Hello,

 Logs in mesos are problematic for me so far.
 We are used to the log4j facility in the java world, which permits a lot of things.

 Mainly I would like log rotation (ideally with the logrotate tool, to be
 homogeneous with other things) without restarting processes, because in
 my experience it loses history (mesos 0.16.0 so far)

 Best regards,

 --
 Damien HARDY
 IT Infrastructure Architect
 Viadeo - 30 rue de la Victoire - 75009 Paris - France
 PGP : 45D7F89A



Re: how does the web UI get sandbox logs?

2014-05-16 Thread Dick Davies
Won't that also set the IP the slave will advertise for tasks?

( Might not be a problem but thought it was worth pointing out, since Mike said
that was currently on the internal IP. )

On 13 May 2014 18:38, Adam Bordelon a...@mesosphere.io wrote:
 mesos-slave --hostname=foo
 The hostname the slave should report.
 If left unset, system hostname will be used (recommended).


 On Tue, May 13, 2014 at 8:41 AM, Mike Milner m...@immun.io wrote:

 I just ran into something similar myself with mesos on EC2. I can reach
 the master just fine using the master's public dns name but when I go to the
 sandbox it's trying to connect to the slaves private internal DNS name.

 Is there a configuration option on the slave to manually specify the
 hostname that should be used in the web UI? I couldn't find anything on
 http://mesos.apache.org/documentation/latest/configuration/

 Thanks!
 Mike


 On Mon, May 12, 2014 at 4:27 PM, Ross Allen r...@mesosphere.io wrote:

 For example, a particular slave's webUI (forwarded through master) can
 be reached at:
 http://localhost:5050/#/slaves/201405120912-16777343-5050-23673-0


 Though it looks like the requests are being proxied through the master,
 your browser is talking directly to the slave for any slave data. Your
 browser first gets HTML, CSS, and JavaScript from the master and then sends
 requests directly to the slave's webserver via JSONP for any slave data
 shown in the UI.

 Ross Allen
 r...@mesosphere.io


 On 12 May 2014 09:21, Adam Bordelon a...@mesosphere.io wrote:

  Does each slave expose a webserver ...?
 Yes. Each slave hosts a webserver not just for the sandbox, but also for
 the slave's own webUI and RESTful API
 For example, a particular slave's webUI (forwarded through master) can
 be reached at:
 http://localhost:5050/#/slaves/201405120912-16777343-5050-23673-0


 On Thu, May 8, 2014 at 9:21 AM, Dick Davies d...@hellooperator.net
 wrote:

 I've found the sandbox logs to be very useful in debugging
 misbehaving frameworks, typos, etc.  - the usual n00b stuff I suppose.

 I've got a vagrant stack running quite nicely. If i port forward I can
 view marathon and mesos UIs nicely from my host, but I can't get
 the sandbox logs because 'slaveN' isn't resolving from outside the
 Vagrant stack.

 I was a bit surprised because I didn't expect to need to reach the
 slaves directly.

 Does each slave expose a webserver to serve up
 sandbox logs or something? Just trying to work out how/if I can
 map things so that UI can be tunnelled easily.







Re: protecting mesos from fat fingers

2014-05-02 Thread Dick Davies
Not quite - it looks to me like the mesos slave disks filled with failed jobs
(because marathon continued to throw a broken .zip into them) and with /tmp on
the root fs the servers became unresponsive. Tobi mentions there's a way to
set that at deploy time, but in this case the guy who can't type 'hello world'
correctly would have been responsible for setting the rate limits too (that's
me by the way!), so in itself that's not protection from pilot error.

I'm not sure if GC was able to clear /var any better (I doubt it very much; my
impression was that's on the order of days). I think it's more that the deploy
could have been cancelled better while the system was still functioning
(speculation - I'm still in the early stages of learning the internals of this).
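
For anyone finding this in the archives: the field Tobi mentions goes into
the app JSON you POST to Marathon. Something like the sketch below (field
name as he gives it, everything else illustrative - I haven't dug any deeper
than that):

  curl -X POST -H "Content-Type: application/json" \
    http://<marathon host>:8080/v2/apps -d '{
      "id": "my-app",
      "cmd": "./start.sh",
      "instances": 4,
      "cpus": 0.5,
      "mem": 256,
      "taskRateLimit": 1.0
    }'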

On 30 April 2014 22:08, Vinod Kone vinodk...@gmail.com wrote:
 Dick, I've also briefly skimmed at your original email to marathon mailing
 list and it sounded like executor sandboxes were not getting garbage
 collected (a mesos feature) when the slave work directory was rooted in /tmp
 vs /var? Did I understand that right? If yes, I would love to see some logs.


 On Wed, Apr 30, 2014 at 1:51 PM, Tobias Knaup t...@knaup.me wrote:

 In Marathon you can specify taskRateLimit (max number of tasks to start
 per second) as part of your app definition.


 On Wed, Apr 30, 2014 at 11:30 AM, Dick Davies d...@hellooperator.net
 wrote:

 Managed to take out a mesos slave today with a typo while launching
 a marathon app, and wondered if there are throttles/limits that can be
 applied to repeated launches to limit the risk of such mistakes in the
 future.

 I started a thread on the marathon list
  (
 https://groups.google.com/forum/?hl=en#!topic/marathon-framework/4iWLqTYTvgM
 )

 [ TL:DR: marathon throws an app that will never deploy correctly at
 slaves
 until the disk fills with debris and the slave dies ]

 but I suppose this could be something available in mesos itself.

 I can't find a lot of advice about operational aspects of Mesos admin;
 could others here provide some good advice about their experience in
 preventing failed task deploys from causing trouble on their clusters?

 Thanks!





Re: How about disable the irc ASFBot to flood the irc channel?

2014-04-17 Thread Dick Davies
Can't you just '/ignore' the IRC bot if it bothers you?

On 17 April 2014 03:01, Chengwei Yang chengwei.yang...@gmail.com wrote:
 Hi All,

 I am an irc guy, maybe like you. However, I found that there are two
 bots for JIRA: one for the mesos-dev mailing list, one for the irc
 channel.

 I generally think the bot for the mailing list is fine, since it pushes a JIRA
 msg into a mail thread, so it has full context and is readable and easy to
 understand.

 However, the irc channel is a room for human beings to chat with
 each other, and I think it's not suitable for it to be flooded by the JIRA
 bot. I find it very hard to figure out what human beings are talking about
 amid the ASFBot flooding.

 Could we just kill ASFBot for the irc channel, but leave the one for the
 mesos-dev mailing list?

 --
 Thanks,
 Chengwei


 footnote: I have to Cc to myself otherwise I found Gmail doesn't mark
 that email as unread, so I can't pull it into my mutt mbox.