Re: ensuring a particular task is deployed to "all" Mesos Worker hosts
If it _needs_ to be there always then I'd roll it out with whatever automation you use to deploy the mesos workers ; depending on the scale you're running at launching it as a task is likely to be less reliable due to outages etc. ( I understand the 'maybe all hosts' constraint but if it's 'up to one per host', it sounds like a CM issue to me. ) On 30 June 2017 at 23:58, Erik Weatherswrote: > hi Mesos folks! > > My team is largely responsible for maintaining the Storm-on-Mesos framework. > It suffers from a problem related to log retrieval: Storm has a process > called the "logviewer" that is assumed to exist on every host, and the Storm > UI provides links to contact this process to download logs (and other > debugging artifacts). Our team manually runs this process on each Mesos > host, but it would be nice to launch it automatically onto any Mesos host > where Storm work gets allocated. [0] > > I have read that Mesos has added support for Kubernetes-esque "pods" as of > version 1.1.0, but that feature seems somewhat insufficient for implementing > our desired behavior from my naive understanding. Specifically, Storm only > has support for connecting to 1 logviewer per host, so unless pods can have > separate containers inside each pod [1], and also dynamically change the set > of executors and tasks inside of the pod [2], then I don't see how we'd be > able to use them. > > Is there any existing feature in Mesos that might help us accomplish our > goal? Or any upcoming features? > > Thanks!! > > - Erik > > [0] Thus the "all" in quotes in the subject of this email, because it > *might* be all hosts, but it definitely would be all hosts where Storm gets > work assigned. > > [1] The Storm-on-Mesos framework leverages separate containers for each > topology's Supervisor and Worker processes, to provide isolation between > topologies. > > [2] The assignment of Storm Supervisors (a Mesos Executor) + Storm Workers > (a Mesos Task) onto hosts is ever changing in a given instance of a > Storm-on-Mesos framework. i.e., as topologies get launched and die, or have > their worker processes die, the processes are dynamically distributed to the > various Mesos Worker hosts. So existing containers often have more tasks > assigned into them (thus growing their footprint) or removed from them (thus > shrinking the footprint).
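If scheduling it does turn out to be the answer, the usual workaround in Marathon (which has no true per-host daemon concept) is a UNIQUE hostname constraint with instances set to the number of agents. A rough sketch only; the id, port and resource numbers here are assumptions:

```
{
  "id": "storm-logviewer",
  "cmd": "storm logviewer",
  "cpus": 0.25,
  "mem": 512,
  "instances": 6,
  "constraints": [["hostname", "UNIQUE"]],
  "ports": [8000],
  "requirePorts": true
}
```

You'd still have to bump instances whenever agents are added, which is part of why the CM route tends to be simpler.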
Re: Mesos (and Marathon) port mapping
I should say this was tested around mesos 1.0, they may have changed things - but yes this is vanilla networking, no CNI or anything like that. But I'm guessing if you're using BRIDGE networking and specifying a hostPort: you're causing work for yourself (unless you actually care what port the slave is using). On 29 March 2017 at 10:22, Thomas HUMMELwrote: > > > On 03/28/2017 06:53 PM, Tomek Janiszewski wrote: > > 1. Mentioned port range is the Mesos Agent resource setting, so if you don't > explicitly define port range it would be used. > https://github.com/apache/mesos/blob/1.2.0/src/slave/constants.hpp#L86 > > 2. With ports mapping two or more applications could attach to same > container port but will be exposed under different host port. > > > Thanks for your answer. > > 1. So it's not network/portmapping isolator specific, right ? Even without > it, non-ephemeral ports would be considered as part of the offer and would > be chosen in this range by default ? > > 2. So containers, even with network/port_mapping isolation, *share* the > non-ephemeral port range, although doc states "The agent assigns each > container a non-overlapping range of the ports" which I first read as "each > container gets it's own port range", right ? > > So I am a bit confused since what's described here > > http://mesos.apache.org/documentation/latest/port-mapping-isolator/ > > in the "Configuring network ports" seems to be valid even without port > mapping isolator. > > Am I getting this right this time ? > > Thanks. > > -- > Thomas HUMMEL >
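For anyone hitting the same confusion: the non-ephemeral range the agent offers is just its 'ports' resource, and you can set it explicitly rather than relying on the built-in [31000-32000] default. A minimal sketch (paths follow the stock package layout, adjust to taste):

```
# on the command line
mesos-slave --resources='ports:[31000-32000]' ...

# or with the file-based configuration the packages use
echo 'ports:[31000-32000]' > /etc/mesos-slave/resources
```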
Re: Mesos (and Marathon) port mapping
Try setting your hostPort to 0, to tell Mesos to select one (which it will allocate out of the pool the mesos slave is set to use). This works for me for redis: { "container": { "type": "DOCKER", "docker": { "image": "redis", "network": "BRIDGE", "portMappings": [ { "containerPort": 6379, "hostPort": 0, "protocol": "tcp"} ] } }, "ports": [0], "instances": 1, "cpus": 0.1, "mem": 128, "uris": [] } (caveat: haven't run marathon or mesos for a little while) On 28 March 2017 at 17:53, Tomek Janiszewskiwrote: > 1. Mentioned port range is the Mesos Agent resource setting, so if you don't > explicitly define port range it would be used. > https://github.com/apache/mesos/blob/1.2.0/src/slave/constants.hpp#L86 > > 2. With ports mapping two or more applications could attach to same > container port but will be exposed under different host port. > > 3. I'm not sure if ports mappings works in Host mode. Try with require ports > option enabled. > https://github.com/mesosphere/marathon/blob/v1.3.9/docs/docs/ports.md > > 4. Yes, service ports are only for Marathon and don't propagate to Mesos. > http://stackoverflow.com/a/39468348/1387612 > > > wt., 28.03.2017, 18:16 użytkownik Thomas HUMMEL > napisał: >> >> Hello, >> >> [Sorry if this post may seem more Marathon-oriented. It still contains >> Mesos specific questions.] >> >> I'm in the process of discovering/testing/trying to understand Mesos and >> Marathon. >> >> After having read some books and docs, I set up a small environment (9 >> linux >> CentOS 7.3 VMs) consisting of : >> >>. 3 Mesos master - quorum = 2 >>. 3 Zookeepers servers running on the same host as the mesos servers >>. 2 Mesos slaves >>. 3 Marathon servers >>. 1 HAproxy facing the Mesos servers >> >> Mesos has been installed from sources (1.2.0 version) and Marathon is >> the 1.3.9 >> tarball comming from mesosphere >> >> I've deployed : >> >>. mesos-dns as a Marathon (not dockerized) application on one of the >> slaves (with a constraint) configured with my site DNS as resolvers >> and only >> "host" as IPSources >> >>. marathon-lb as a Marathon dockerized app ("network": "HOST") with the >> simple (containerPort: 9090, hostPort: 9090, servicePort: 1) >> portMapping, >> on the same slave using a constraint >> >> Everything works fine so far. >> I've read : >> >>https://mesosphere.github.io/marathon/docs/ports.html >> >> and >> >>http://mesos.apache.org/documentation/latest/port-mapping-isolator/ >> >> but I'm still quite confused by the following port-related questions : >> >> [Note : I'm not using "network/port_mapping" isolation for now. I sticked >> to >> >>export MESOS_containerizers=docker,mesos] >> >> 1. 
for such a simple dockerized app : >> >> { >>"id": "http-server", >>"cmd": "python3 -m http.server 8080", >>"cpus": 0.5, >>"mem": 32.0, >>"container": { >> "type": "DOCKER", >> "docker": { >>"image": "python:3", >>"network": "BRIDGE", >>"portMappings": [ >> { "containerPort": 8080, "hostPort": 31000, "servicePort": 5000 } >>] >> } >>}, >>"labels":{ >> "HAPROXY_GROUP":"external" >>} >> } >> >> a) in HOST mode ("network": "HOST"), any hostPort seems to work (or at >> least, let say 9090) >> >> b) in BRIDGE mode ("network": "BRIDGE"), the valid hostPort range seems >> to be >> [31000 - 32000], which seems to match the Mesos non-ephemeral port range >> given >> as en example in >> >>http://mesos.apache.org/documentation/latest/port-mapping-isolator/ >> >> But I don't quite understand why since >> >>- I'm not using network/port_mapping isolation >>- I didn't configured any port range anywhere in Mesos >> >> 2. Obviously in my setup, 2 apps on the same slave cannot have the same >> hostPort. Would it be the same with network/port_mapping activated >> since the >> doc says : "he agent assigns each container a non-overlapping range >> of the >> ports" >> >> Am I correct assuming that a Marathon hostPort is to be understood >> as taken among the non-ephemeral Mesos ports ? >> >> With network/port_mapping isolation, could 2 apps have the same >> non-ephemeal port ? same question with ephemeral-port ? I doubt it but... >> Is what is described in this doc valid for a dockerized container also >> ? >> >> 3. the portMapping I configured for the dockerized ("network": "HOST") >> marathon-lb app is >> >> "portMappings": [ >>{ >> "containerPort": 9090, >> "hostPort": 9090, >> "servicePort": 1, >> "protocol": "tcp" >> >> on the slave I can verify : >> >># lsof -i :9090 >>COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME >>haproxy 29610 root6u IPv4 461745 0t0 TCP *:websm (LISTEN) >> But Marathon tells that my app is running on : >> >>
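Applying the hostPort 0 suggestion above to the quoted http-server app would look roughly like this (untested sketch; Marathon picks a host port out of the agent's offered range):

```
"portMappings": [
  { "containerPort": 8080, "hostPort": 0, "servicePort": 5000 }
]
```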
Re: mirror of mesosphere's repo
It's on s3 isn't it - maybe CloudFront? On 20 September 2016 at 05:48, tommy xiao wrote: > Hi Team and Mesosphere's repo, > > can Mesosphere provide a sync server way with http://repos.mesosphere.com/. > it will help china's community to sync the package from mirror repo. > > -- > Deshi Xiao > Twitter: xds2000 > E-mail: xiaods(AT)gmail.com
Re: Fetcher cache: caching even more while an executor is alive
I'd try the Docker image approach. We've done this in the past and used our CM tool to 'seed' all slaves by running 'docker pull foo:v1' across them all in advance, saved a lot of startup time (although we were only dealing with a Gb or so of dependencies). On 5 July 2016 at 11:23, Kota UENISHIwrote: > Thanks, it looks promising to me - I was aware of persistent volumes > because I thought the use case was different, like for databases. I'll > try it on. > > As the document says > >> persistent volumes are associated with roles, > > this makes failure handling a little bit difficult - As my framework > is not handling failure well enough, those volume IDs must be > remembered during framework restart or failover, or must get recovered > after. Restarted framework must reuse or collect already reserved > volumes or those volumes just gets leaking. > > Kota UENISHI > > > On Tue, Jul 5, 2016 at 6:03 PM, Aaron Carey wrote: >> As you're writing the framework, have you looked at reserving persistent >> volumes? I think it might help in your use case: >> >> http://mesos.apache.org/documentation/latest/persistent-volume/ >> >> Aaron >> >> -- >> >> Aaron Carey >> Production Engineer - Cloud Pipeline >> Industrial Light & Magic >> London >> 020 3751 9150 >> >> >> From: 上西康太 [ueni...@nautilus-technologies.com] >> Sent: 05 July 2016 08:24 >> To: user@mesos.apache.org >> Subject: Fetcher cache: caching even more while an executor is alive >> >> Hi, >> I'm developing my own framework - that distributes >100 independent >> tasks across the cluster and just run them arbitrarily. My problem is, >> each task execution environment is a bit large tarball (2~6GB, mostly >> application jar files) and task itself finishes within 1~200 seconds, >> while tarball extraction takes like tens of seconds every time. >> Extracting the same tarball again and again in all tasks is a wasteful >> overhead that cannot be ignored. >> >> Fetcher cache is great, but in my case, fetcher cache isn't even >> enough and I want to preserve all files extracted from the tarball >> while my executor is alive. If Mesos could cache all files extracted >> from the tarball by omitting not only download but extraction, I could >> save more time. >> >> In "Fetcher Cache Internals" [1] or in "Fetcher Cache" [2] section in >> the official document, such issues or future work is not mentioned - >> how do you solve this kind of extraction overhead problem, when you >> have rather large resource ? >> >> An option would be setting up an internal docker registry and let >> slaves cache the docker image that includes our jar files and save >> tarball extraction. But, I want to prevent our system from additional >> moving parts as much as I can. >> >> Another option might be let fetcher fetch all jar files independently >> in slaves, but I think it feasible, but I don't think it manageable in >> production in an easy way. >> >> PS Mesos is great; it is helping us a lot - I want to appreciate all >> the efforts by the community. Thank you so much! >> >> [1] http://mesos.apache.org/documentation/latest/fetcher-cache-internals/ >> [2] http://mesos.apache.org/documentation/latest/fetcher/ >> >> Kota UENISHI
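The 'seed all slaves' step is nothing fancy; a minimal sketch with placeholder hostnames and image name (any CM or parallel-ssh tool does the same job):

```
# pre-pull the dependency image on every slave so task startup skips the download
for host in slave{01..06}; do
  ssh "$host" 'docker pull registry.example.com/app-deps:v1' &
done
wait
```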
Re: Mesos 0.28.2 does not start
My guess would be your networking is still wonky. Each master is putting their IP into zookeeper, and the other masters use that to find each other for elections. you can poke around in zookeeper with zkCli.sh, that should give you an idea which IP is ending up there - or just check each masters log, that'll normally give similar information. TBH it sounds like you've made yourself a difficult setup with this openstack thing :( On 13 June 2016 at 13:06, Stefano Bianchi <jazzist...@gmail.com> wrote: > Hi guys i don't know why but my three masters cannot determine the leader. > How can i give you a log file do check? > > 2016-06-12 10:42 GMT+02:00 Dick Davies <d...@hellooperator.net>: >> >> Try putting the IP you're binding to (the actual IP on the master) in >> /etc/mesos-*/ip , and the externally accessible IP in >> /etc/mesos-*/hostname. >> >> On 12 June 2016 at 00:57, Stefano Bianchi <jazzist...@gmail.com> wrote: >> > ok i guess i figured out. >> > The reason for which i put floating IP on hostname and ip files is >> > written >> > here:https://open.mesosphere.com/getting-started/install/ >> > >> > It says: >> > If you're unable to resolve the hostname of the machine directly (e.g., >> > if >> > on a different network or using a VPN), set /etc/mesos-slave/hostname to >> > a >> > value that you can resolve, for example, an externally accessible IP >> > address >> > or DNS hostname. This will ensure all links from the Mesos console work >> > correctly. >> > >> > The problem, i guess, is that the set of floating iPs 10.250.0.xxx is >> > not >> > externally accessible. >> > In my other deployment i have set the floating IPs in these files and >> > all is >> > perfectly working, but in that case i have used externally reachable >> > IPs. >> > >> > 2016-06-11 22:51 GMT+02:00 Erik Weathers <eweath...@groupon.com>: >> >> >> >> It depends on your setup. I would probably not set the hostname and >> >> instead set the "--no-hostname_lookup" flag. I'm not sure how you do >> >> that >> >> with the file-based configuration style you are using. >> >> >> >> % mesos-master --help >> >> ... >> >> >> >> --hostname=VALUE The hostname the master should >> >> advertise >> >> in ZooKeeper. >> >> If left unset, the >> >> hostname is resolved from the IP address >> >> that the slave binds >> >> to; >> >> unless the user explicitly prevents >> >> that, using >> >> `--no-hostname_lookup`, in which case the IP itself >> >> is used. >> >> >> >> On Sat, Jun 11, 2016 at 1:27 PM, Stefano Bianchi <jazzist...@gmail.com> >> >> wrote: >> >>> >> >>> So Erik do you suggest to use the 192.* IP in both >> >>> /etc/mesos-master/hostname nad /etc/mesos-master/ip right? >> >>> >> >>> Il 11/giu/2016 22:15, "Erik Weathers" <eweath...@groupon.com> ha >> >>> scritto: >> >>>> >> >>>> Yeah, so there is no 10.x address on the box. Thus you cannot bind >> >>>> Mesos to listen to that address. You need to use one of the 192.* >> >>>> IPs for >> >>>> Mesos to bind to. I'm not sure why you say you need to use the 10.x >> >>>> addresses for the UI, that sounds like a problem you should tackle >> >>>> *after* >> >>>> getting Mesos up. >> >>>> >> >>>> - Erik >> >>>> >> >>>> P.S., when using gmail in chrome, you can avoid those extraneous >> >>>> newlines when you paste by holding "Shift" along with the Command-V >> >>>> (at >> >>>> least on Mac OS X!). 
>> >>>> >> >>>> On Sat, Jun 11, 2016 at 1:06 PM, Stefano Bianchi >> >>>> <jazzist...@gmail.com> >> >>>> wrote: >> >>>>> >> >>>>> ifconfig -a >> >>>>> >> >>>>> eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1454 >> >>>>> >> >>>>> i
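For the 'poke around in zookeeper' check suggested above, zkCli.sh is enough; the znode names below are illustrative (they carry sequence numbers and will differ on your cluster), and the output is trimmed:

```
$ zkCli.sh -server 192.168.100.3:2181
[zk: 192.168.100.3:2181(CONNECTED) 0] ls /mesos
[json.info_0000000025, log_replicas]
[zk: 192.168.100.3:2181(CONNECTED) 1] get /mesos/json.info_0000000025
{"address":{"hostname":"master1","ip":"192.168.100.3","port":5050}, ...}
```

If the ip in that JSON isn't one the other masters can actually route to, that's your problem.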
Re: Mesos 0.28.2 does not start
Try putting the IP you're binding to (the actual IP on the master) in /etc/mesos-*/ip , and the externally accessible IP in /etc/mesos-*/hostname. On 12 June 2016 at 00:57, Stefano Bianchiwrote: > ok i guess i figured out. > The reason for which i put floating IP on hostname and ip files is written > here:https://open.mesosphere.com/getting-started/install/ > > It says: > If you're unable to resolve the hostname of the machine directly (e.g., if > on a different network or using a VPN), set /etc/mesos-slave/hostname to a > value that you can resolve, for example, an externally accessible IP address > or DNS hostname. This will ensure all links from the Mesos console work > correctly. > > The problem, i guess, is that the set of floating iPs 10.250.0.xxx is not > externally accessible. > In my other deployment i have set the floating IPs in these files and all is > perfectly working, but in that case i have used externally reachable IPs. > > 2016-06-11 22:51 GMT+02:00 Erik Weathers : >> >> It depends on your setup. I would probably not set the hostname and >> instead set the "--no-hostname_lookup" flag. I'm not sure how you do that >> with the file-based configuration style you are using. >> >> % mesos-master --help >> ... >> >> --hostname=VALUE The hostname the master should advertise >> in ZooKeeper. >> If left unset, the >> hostname is resolved from the IP address >> that the slave binds to; >> unless the user explicitly prevents >> that, using >> `--no-hostname_lookup`, in which case the IP itself >> is used. >> >> On Sat, Jun 11, 2016 at 1:27 PM, Stefano Bianchi >> wrote: >>> >>> So Erik do you suggest to use the 192.* IP in both >>> /etc/mesos-master/hostname nad /etc/mesos-master/ip right? >>> >>> Il 11/giu/2016 22:15, "Erik Weathers" ha scritto: Yeah, so there is no 10.x address on the box. Thus you cannot bind Mesos to listen to that address. You need to use one of the 192.* IPs for Mesos to bind to. I'm not sure why you say you need to use the 10.x addresses for the UI, that sounds like a problem you should tackle *after* getting Mesos up. - Erik P.S., when using gmail in chrome, you can avoid those extraneous newlines when you paste by holding "Shift" along with the Command-V (at least on Mac OS X!). 
On Sat, Jun 11, 2016 at 1:06 PM, Stefano Bianchi wrote: > > ifconfig -a > > eth0: flags=4163 mtu 1454 > > inet 192.168.100.3 netmask 255.255.255.0 broadcast > 192.168.100.255 > > inet6 fe80::f816:3eff:fe1c:a3bf prefixlen 64 scopeid > 0x20 > > ether fa:16:3e:1c:a3:bf txqueuelen 1000 (Ethernet) > > RX packets 61258 bytes 4686426 (4.4 MiB) > > RX errors 0 dropped 0 overruns 0 frame 0 > > TX packets 40537 bytes 3603100 (3.4 MiB) > > TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 > > > lo: flags=73 mtu 65536 > > inet 127.0.0.1 netmask 255.0.0.0 > > inet6 ::1 prefixlen 128 scopeid 0x10 > > loop txqueuelen 0 (Local Loopback) > > RX packets 28468 bytes 1672684 (1.5 MiB) > > RX errors 0 dropped 0 overruns 0 frame 0 > > TX packets 28468 bytes 1672684 (1.5 MiB) > > TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 > > > > ip addr:1: lo: mtu 65536 qdisc noqueue state > UNKNOWN > > link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 > > inet 127.0.0.1/8 scope host lo > >valid_lft forever preferred_lft forever > > inet6 ::1/128 scope host > >valid_lft forever preferred_lft forever > > 2: eth0: mtu 1454 qdisc pfifo_fast > state UP qlen 1000 > > link/ether fa:16:3e:1c:a3:bf brd ff:ff:ff:ff:ff:ff > > inet 192.168.100.3/24 brd 192.168.100.255 scope global dynamic eth0 > >valid_lft 77537sec preferred_lft 77537sec > > inet6 fe80::f816:3eff:fe1c:a3bf/64 scope link > >valid_lft forever preferred_lft forever > > > 2016-06-11 20:05 GMT+02:00 haosdent : >> >> As @Erik said, what is your `ifconfig` or `ip addr` command output? >> >> On Sun, Jun 12, 2016 at 2:00 AM, Stefano Bianchi >> wrote: >>> >>> the result of your command give this: >>> >>> [root@master ~]# nc
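Concretely, with the interfaces shown above, that advice boils down to something like this on each master (the externally reachable address is a placeholder; the same files exist under /etc/mesos-slave/ on the slaves):

```
# address Mesos actually binds to (must exist on the box)
echo 192.168.100.3 > /etc/mesos-master/ip

# address advertised in ZooKeeper and used in web UI links
echo 203.0.113.10 > /etc/mesos-master/hostname
```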
Re: Mesos HA does not work (Failed to recover registrar)
The extra zookeepers listed in the second argument will let you mesos master process keep working if its local zookeeper goes down for maintenance. On 5 June 2016 at 13:55, Qian Zhang <zhq527...@gmail.com> wrote: >> You need the 2nd command line (i.e. you have to specify all the zk >> nodes on each master, it's >> not like e.g. Cassandra where you can discover other nodes from the >> first one you talk to). > > > I have an Open DC/OS environment which is enabled master HA (there are 3 > master nodes) and works very well, and I see each Mesos master is started to > only connect with local zk: > $ cat /opt/mesosphere/etc/mesos-master | grep ZK > MESOS_ZK=zk://127.0.0.1:2181/mesos > > So I think I do not have to specify all the zk on each master. > > > > > > > > Thanks, > Qian Zhang > > On Sun, Jun 5, 2016 at 4:25 PM, Dick Davies <d...@hellooperator.net> wrote: >> >> OK, good - that part looks as expected, you've had a successful >> election for a leader >> (and yes that sounds like your zookeeper layer is ok). >> >> You need the 2nd command line (i.e. you have to specify all the zk >> nodes on each master, it's >> not like e.g. Cassandra where you can discover other nodes from the >> first one you talk to). >> >> The error you were getting was about the internal registry / >> replicated log, which is a mesos master level thing. >> You could try when Sivaram suggested - stopping the mesos master >> processes, wiping their >> work_dirs and starting them back up. >> Perhaps some wonky state got in there while you were trying various >> options? >> >> >> On 5 June 2016 at 00:34, Qian Zhang <zhq527...@gmail.com> wrote: >> > Thanks Vinod and Dick. >> > >> > I think my 3 ZK servers have formed a quorum, each of them has the >> > following >> > config: >> > $ cat conf/zoo.cfg >> > server.1=192.168.122.132:2888:3888 >> > server.2=192.168.122.225:2888:3888 >> > server.3=192.168.122.171:2888:3888 >> > autopurge.purgeInterval=6 >> > autopurge.snapRetainCount=5 >> > initLimit=10 >> > syncLimit=5 >> > maxClientCnxns=0 >> > clientPort=2181 >> > tickTime=2000 >> > quorumListenOnAllIPs=true >> > dataDir=/home/stack/packages/zookeeper-3.4.8/snapshot >> > dataLogDir=/home/stack/packages/zookeeper-3.4.8/transactions >> > >> > And when I run "bin/zkServer.sh status" on each of them, I can see >> > "Mode: >> > leader" for one, and "Mode: follower" for the other two. >> > >> > I have already tried to manually start 3 masters simultaneously, and >> > here is >> > what I see in their log: >> > In 192.168.122.171(this is the first master I started): >> > I0605 07:12:49.418721 1187 detector.cpp:152] Detected a new leader: >> > (id='25') >> > I0605 07:12:49.419276 1186 group.cpp:698] Trying to get >> > '/mesos/log_replicas/24' in ZooKeeper >> > I0605 07:12:49.420013 1188 group.cpp:698] Trying to get >> > '/mesos/json.info_25' in ZooKeeper >> > I0605 07:12:49.423807 1188 zookeeper.cpp:259] A new leading master >> > (UPID=master@192.168.122.171:5050) is detected >> > I0605 07:12:49.423841 1186 network.hpp:461] ZooKeeper group PIDs: { >> > log-replica(1)@192.168.122.171:5050 } >> > I0605 07:12:49.424281 1187 master.cpp:1951] The newly elected leader >> > is >> > master@192.168.122.171:5050 with id cdc459d4-a05f-4f99-9bf4-1ee9a91d139b >> > I0605 07:12:49.424895 1187 master.cpp:1964] Elected as the leading >> > master! 
>> > >> > In 192.168.122.225 (second master I started): >> > I0605 07:12:51.918702 2246 detector.cpp:152] Detected a new leader: >> > (id='25') >> > I0605 07:12:51.919983 2246 group.cpp:698] Trying to get >> > '/mesos/json.info_25' in ZooKeeper >> > I0605 07:12:51.921910 2249 network.hpp:461] ZooKeeper group PIDs: { >> > log-replica(1)@192.168.122.171:5050 } >> > I0605 07:12:51.925721 2252 replica.cpp:673] Replica in EMPTY status >> > received a broadcasted recover request from (6)@192.168.122.225:5050 >> > I0605 07:12:51.927891 2246 zookeeper.cpp:259] A new leading master >> > (UPID=master@192.168.122.171:5050) is detected >> > I0605 07:12:51.928444 2246 master.cpp:1951] The newly elected leader >> > is >> > master@192.
Re: Mesos HA does not work (Failed to recover registrar)
OK, good - that part looks as expected, you've had a successful election for a leader (and yes that sounds like your zookeeper layer is ok). You need the 2nd command line (i.e. you have to specify all the zk nodes on each master, it's not like e.g. Cassandra where you can discover other nodes from the first one you talk to). The error you were getting was about the internal registry / replicated log, which is a mesos master level thing. You could try when Sivaram suggested - stopping the mesos master processes, wiping their work_dirs and starting them back up. Perhaps some wonky state got in there while you were trying various options? On 5 June 2016 at 00:34, Qian Zhang <zhq527...@gmail.com> wrote: > Thanks Vinod and Dick. > > I think my 3 ZK servers have formed a quorum, each of them has the following > config: > $ cat conf/zoo.cfg > server.1=192.168.122.132:2888:3888 > server.2=192.168.122.225:2888:3888 > server.3=192.168.122.171:2888:3888 > autopurge.purgeInterval=6 > autopurge.snapRetainCount=5 > initLimit=10 > syncLimit=5 > maxClientCnxns=0 > clientPort=2181 > tickTime=2000 > quorumListenOnAllIPs=true > dataDir=/home/stack/packages/zookeeper-3.4.8/snapshot > dataLogDir=/home/stack/packages/zookeeper-3.4.8/transactions > > And when I run "bin/zkServer.sh status" on each of them, I can see "Mode: > leader" for one, and "Mode: follower" for the other two. > > I have already tried to manually start 3 masters simultaneously, and here is > what I see in their log: > In 192.168.122.171(this is the first master I started): > I0605 07:12:49.418721 1187 detector.cpp:152] Detected a new leader: > (id='25') > I0605 07:12:49.419276 1186 group.cpp:698] Trying to get > '/mesos/log_replicas/24' in ZooKeeper > I0605 07:12:49.420013 1188 group.cpp:698] Trying to get > '/mesos/json.info_25' in ZooKeeper > I0605 07:12:49.423807 1188 zookeeper.cpp:259] A new leading master > (UPID=master@192.168.122.171:5050) is detected > I0605 07:12:49.423841 1186 network.hpp:461] ZooKeeper group PIDs: { > log-replica(1)@192.168.122.171:5050 } > I0605 07:12:49.424281 1187 master.cpp:1951] The newly elected leader is > master@192.168.122.171:5050 with id cdc459d4-a05f-4f99-9bf4-1ee9a91d139b > I0605 07:12:49.424895 1187 master.cpp:1964] Elected as the leading > master! 
> > In 192.168.122.225 (second master I started): > I0605 07:12:51.918702 2246 detector.cpp:152] Detected a new leader: > (id='25') > I0605 07:12:51.919983 2246 group.cpp:698] Trying to get > '/mesos/json.info_25' in ZooKeeper > I0605 07:12:51.921910 2249 network.hpp:461] ZooKeeper group PIDs: { > log-replica(1)@192.168.122.171:5050 } > I0605 07:12:51.925721 2252 replica.cpp:673] Replica in EMPTY status > received a broadcasted recover request from (6)@192.168.122.225:5050 > I0605 07:12:51.927891 2246 zookeeper.cpp:259] A new leading master > (UPID=master@192.168.122.171:5050) is detected > I0605 07:12:51.928444 2246 master.cpp:1951] The newly elected leader is > master@192.168.122.171:5050 with id cdc459d4-a05f-4f99-9bf4-1ee9a91d139b > > In 192.168.122.132 (last master I started): > I0605 07:12:53.553949 16426 detector.cpp:152] Detected a new leader: > (id='25') > I0605 07:12:53.555179 16429 group.cpp:698] Trying to get > '/mesos/json.info_25' in ZooKeeper > I0605 07:12:53.560045 16428 zookeeper.cpp:259] A new leading master > (UPID=master@192.168.122.171:5050) is detected > > So right after I started these 3 masters, the first one (192.168.122.171) > was successfully elected as leader, but after 60s, 192.168.122.171 failed > with the error mentioned in my first mail, and then 192.168.122.225 was > elected as leader, but it failed with the same error too after another 60s, > and the same thing happened to the last one (192.168.122.132). So after > about 180s, all my 3 master were down. > > I tried both: > sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos --quorum=2 > --work_dir=/var/lib/mesos/master > and > sudo ./bin/mesos-master.sh > --zk=zk://192.168.122.132:2181,192.168.122.171:2181,192.168.122.225:2181/mesos > --quorum=2 --work_dir=/var/lib/mesos/master > And I see the same error for both. > > 192.168.122.132, 192.168.122.225 and 192.168.122.171 are 3 VMs which are > running on a KVM hypervisor host. > > > > > Thanks, > Qian Zhang > > On Sun, Jun 5, 2016 at 3:47 AM, Dick Davies <d...@hellooperator.net> wrote: >> >> You told the master it needed a quorum of 2 and it's the only one >> online, so it's bombing out. >> That's the expected behaviour. >> >> You
Re: Mesos HA does not work (Failed to recover registrar)
You told the master it needed a quorum of 2 and it's the only one online, so it's bombing out. That's the expected behaviour. You need to start at least 2 zookeepers before it will be a functional group, same for the masters. You haven't mentioned how you setup your zookeeper cluster, so i'm assuming that's working correctly (3 nodes, all aware of the other 2 in their config). If not, you need to sort that out first. Also I think your zk URL is wrong - you want to list all 3 zookeeper nodes like this: sudo ./bin/mesos-master.sh --zk=zk://host1:2181,host2:2181,host3:2181/mesos --quorum=2 --work_dir=/var/lib/mesos/master when you've run that command on 2 hosts things should start working, you'll want all 3 up for redundancy. On 4 June 2016 at 16:42, Qian Zhangwrote: > Hi Folks, > > I am trying to set up a Mesos HA env with 3 nodes, each of nodes has a > Zookeeper running, so they form a Zookeeper cluster. And then when I started > the first Mesos master in one node with: > sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos --quorum=2 > --work_dir=/var/lib/mesos/master > > I found it will hang here for 60 seconds: > I0604 23:39:56.488219 15330 zookeeper.cpp:259] A new leading master > (UPID=master@192.168.122.132:5050) is detected > I0604 23:39:56.489080 15337 master.cpp:1951] The newly elected leader is > master@192.168.122.132:5050 with id 40d387a6-4d61-49d6-af44-51dd41457390 > I0604 23:39:56.489791 15337 master.cpp:1964] Elected as the leading > master! > I0604 23:39:56.490401 15337 master.cpp:1651] Recovering from registrar > I0604 23:39:56.491706 15330 registrar.cpp:332] Recovering registrar > I0604 23:39:56.496448 15332 log.cpp:524] Attempting to start the writer > > And after 60s, master will fail: > F0604 23:40:56.499596 15337 master.cpp:1640] Recovery failed: Failed to > recover registrar: Failed to perform fetch within 1mins > *** Check failure stack trace: *** > @ 0x7f4b81372f4e google::LogMessage::Fail() > @ 0x7f4b81372e9a google::LogMessage::SendToLog() > @ 0x7f4b8137289c google::LogMessage::Flush() > @ 0x7f4b813757b0 google::LogMessageFatal::~LogMessageFatal() > @ 0x7f4b8040eea0 mesos::internal::master::fail() > @ 0x7f4b804dbeb3 > _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi16__callIvJS1_EJLm0ELm1T_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE > @ 0x7f4b804ba453 > _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1clIJS1_EvEET0_DpOT_ > @ 0x7f4b804898d7 > _ZZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvRKSsS6_EPKcSt12_PlaceholderILi1vEERKS2_OT_NS2_6PreferEENUlS6_E_clES6_ > @ 0x7f4b804dbf80 > _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvS1_S1_EPKcSt12_PlaceholderILi1vEERKS6_OT_NS6_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_ > @ 0x49d257 std::function<>::operator()() > @ 0x49837f > _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_ > @ 0x493024 process::Future<>::fail() > @ 0x7f4b8015ad20 process::Promise<>::fail() > @ 0x7f4b804d9295 process::internal::thenf<>() > @ 0x7f4b8051788f > _ZNSt5_BindIFPFvRKSt8functionIFN7process6FutureI7NothingEERKN5mesos8internal8RegistryEEERKSt10shared_ptrINS1_7PromiseIS3_EEERKNS2_IS7_EEESB_SH_St12_PlaceholderILi16__callIvISM_EILm0ELm1ELm2T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE > @ 0x7f4b8050fa3b std::_Bind<>::operator()<>() > @ 0x7f4b804f94e3 std::_Function_handler<>::_M_invoke() > @ 0x7f4b8050fc69 std::function<>::operator()() > @ 0x7f4b804f9609 > _ZZNK7process6FutureIN5mesos8internal8RegistryEE5onAnyIRSt8functionIFvRKS4_EEvEES8_OT_NS4_6PreferEENUlS8_E_clES8_ > @ 
0x7f4b80517936 > _ZNSt17_Function_handlerIFvRKN7process6FutureIN5mesos8internal8RegistryZNKS5_5onAnyIRSt8functionIS8_EvEES7_OT_NS5_6PreferEEUlS7_E_E9_M_invokeERKSt9_Any_dataS7_ > @ 0x7f4b8050fc69 std::function<>::operator()() > @ 0x7f4b8056b1b4 process::internal::run<>() > @ 0x7f4b80561672 process::Future<>::fail() > @ 0x7f4b8059bf5f std::_Mem_fn<>::operator()<>() > @ 0x7f4b8059757f > _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi16__callIbIS8_EILm0ELm1T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE > @ 0x7f4b8058fad1 > _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1clIJS8_EbEET0_DpOT_ > @ 0x7f4b80585a41 > _ZZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS4_FbRKSsEES4_St12_PlaceholderILi1bEERKS4_OT_NS4_6PreferEENUlS9_E_clES9_ > @ 0x7f4b80597605 >
Re: How to add other file systems to an agent
I'd imagine it's reporting whatever partition the --work_dir argument on the slave is set to (sandboxes live under that directory). On 3 May 2016 at 12:21, Rinaldo Digiorgio wrote: > Hi, > > I have a configuration with a root file system and other file > systems. When I start an agent, the agent only reports the disk on the root > file system. Is there a way to specify a list of file systems to include as > resources of the agent when it starts? I checked the agent options. > > > Rinaldo
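Two knobs are relevant here: point --work_dir at the filesystem you actually want sandboxes on, and/or declare the disk resource explicitly (the value is in MB). A sketch with made-up numbers:

```
mesos-slave --work_dir=/data/mesos \
            --resources='disk:500000' \
            --master=zk://zk1:2181/mesos
```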
Re: Setting ulimits on mesos-slave
Hi June are you running Mesos as root, or a non-privileged user? Non-root won't be able to up their own ulimit too high (sorry, not an upstart expert as RHELs is laughably incomplete). On 25 April 2016 at 19:15, June Taylorwrote: > What I'm saying is even putting them within the upstart script, per the > Mesos documentation, isn't working for the file block limit. We're still > getting 8MB useable, and as a result executors fail when attempting to write > larger files. > > > Thanks, > June Taylor > System Administrator, Minnesota Population Center > University of Minnesota > > On Mon, Apr 25, 2016 at 11:53 AM, haosdent wrote: >> >> If you set in your upstart script, it isn't system wide and only effective >> in that session. I think need change /etc/security/limits.conf and >> /etc/sysctl.conf to make your ulimit work globally. >> >> On Tue, Apr 26, 2016 at 12:43 AM, June Taylor wrote: >>> >>> Somewhere an 8MB maximum file size is being applied on just one of our >>> slaves, for example. >>> >>> >>> Thanks, >>> June Taylor >>> System Administrator, Minnesota Population Center >>> University of Minnesota >>> >>> On Mon, Apr 25, 2016 at 11:42 AM, June Taylor wrote: We are operating a 6-node cluster running on Ubuntu, and have noticed that the ulimit settings within the slave context are difficult to set and predict. The documentation is a bit unclear on this point, as well. We have had some luck adding a configuration line to /etc/init/mesos-slave.conf as follows: limit nofile 2 2 limit fsize unlimited unlimited The nofile limit seems to be respected, however the fsize limit does not. It is also mysterious that the system-wide limits are not inherited by the slave process. We would prefer to set all of these system-wide and have mesos-slave observe them. Can you please advise where you are setting your ulimits for the mesos-slave if it is working for you? Thanks, June Taylor System Administrator, Minnesota Population Center University of Minnesota >>> >>> >> >> >> >> -- >> Best Regards, >> Haosdent Huang > >
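Whichever init system ends up applying the limits, it's worth checking what the running slave (and therefore anything it forks) actually got, rather than trusting the config:

```
# effective limits of the running mesos-slave process
cat /proc/$(pgrep -f mesos-slave | head -1)/limits | grep -i 'file size'
```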
Re: removed slace "ID": (131.154.96.172): health check timed out
On our network a lot of the hosts have multiple interfaces, which let some asymmetric routing issues creep in that prevented our masters replying to slaves, which reminded me of your symptoms. So we set an IP address in /etc/mesos-slave/ip and /etc/mesos-master/ip so that they only listen on one interface, and then check connectivity between those IPs. The Ansible repo we use to build the stack now has a 'signoff' playbook to check network connectivity is correct between the services it deploys to a new environment. It won't be much use to you on its own I'm afraid, but here's a checklist cribbed from that playbook (ports might be different in your setup). You can SSH to the servers and check reachability between them with netcat or telnet. zookeepers: - need to be able to reach each other on the election port (usually tcp/3888) masters: * must be able to reach zookeepers on tcp/2181 * must be able to reach each other on tcp/5050 * must be able to reach slaves on tcp/5051 mesos slaves: - must be able to reach masters on tcp/5050 - must be able to reach zookeepers on tcp/2181 - another other connectivity to services your application needs (database, caches, whatever) I think that's it. On 18 April 2016 at 20:39, Stefano Bianchi <jazzist...@gmail.com> wrote: > Hi Dick Davies > > Could you please share your solution? > How did you set up mesos/Zookeeper to interconnect masters and slaves among > networks? > > Thanks a lot! > > 2016-04-18 20:56 GMT+02:00 Dick Davies <d...@hellooperator.net>: >> >> +1 for that theory, we had some screwy issues when we tried to span >> subnets until we set every slave and master >> to listen on a specific IP so we could tie down routing correctly. >> >> Saw very similar symptoms that have been described. >> >> On 18 April 2016 at 18:35, Alex Rukletsov <a...@mesosphere.com> wrote: >> > I believe it's because slaves are able to connect to the master, but the >> > master is not able to connect to the slaves. That's why you see them >> > connected for some time and gone afterwards. >> > >> > On Mon, Apr 18, 2016 at 6:47 PM, Stefano Bianchi <jazzist...@gmail.com> >> > wrote: >> >> >> >> Indeed, i dont know why, i am not able to reach all the machines from a >> >> network to the other, just some machines can interconnect with some >> >> others >> >> among the networks. >> >> On mesos i see that all the slaves at a certain time are all connected, >> >> then disconnected and after a while connected again, it seems like they >> >> are >> >> able to connect for a while. >> >> However is an openstack issue i guess. >> >> >> >> Does this also happen when master3 is leading? My guess is that you're >> >> not >> >> allowong incoming connections from master1 and master2 to slave3. >> >> Generally, >> >> masters should be able to connect to slaves, not just respond to their >> >> requests. >> >> >> >> On 18 Apr 2016 13:17, "Stefano Bianchi" <jazzist...@gmail.com> wrote: >> >>> >> >>> Hi >> >>> On openstack i plugged two virtual networks to the same virtual router >> >>> so >> >>> that the hosts on the 2 networks can communicate each other. >> >>> this is my topology: >> >>> >> >>> ---internet--- >> >>> | >> >>>Router1 >> >>> | >> >>> >> >>> | | >> >>> Net1Net2 >> >>> Master1 Master2 Master3 >> >>> Slave1 slave2 Slave3 >> >>> >> >>> I have set zookeeper in with this line: >> >>> >> >>> zk://Master1_IP:2181,Master2_IP:2181,Master3_IP:2181/mesos >> >>> >> >>> The 3 masters, even though on 2 separated networks, elect the leader >> >>> correclty. 
>> >>> Now i have started the slaves, and in a first time i see all 3 >> >>> correctly >> >>> registered, but after a while the slave 3, independently form who is >> >>> the >> >>> master, disconnects. >> >>> I saw in the log and i get the message in the object. >> >>> Can you help me to solve this problem? >> >>> >> >>> >> >>> Thanks to all. >> > >> > > >
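Those checks translate into one-liners you can run by hand while debugging (the IPs are placeholders for your masters, slaves and zookeepers):

```
# from a slave: masters and zookeepers reachable?
nc -zv 192.0.2.10 5050
nc -zv 192.0.2.10 2181

# from a master: slave reachable?
nc -zv 192.0.2.20 5051

# between zookeepers: election port
nc -zv 192.0.2.11 3888
```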
Re: removed slace "ID": (131.154.96.172): health check timed out
+1 for that theory, we had some screwy issues when we tried to span subnets until we set every slave and master to listen on a specific IP so we could tie down routing correctly. Saw very similar symptoms that have been described. On 18 April 2016 at 18:35, Alex Rukletsovwrote: > I believe it's because slaves are able to connect to the master, but the > master is not able to connect to the slaves. That's why you see them > connected for some time and gone afterwards. > > On Mon, Apr 18, 2016 at 6:47 PM, Stefano Bianchi > wrote: >> >> Indeed, i dont know why, i am not able to reach all the machines from a >> network to the other, just some machines can interconnect with some others >> among the networks. >> On mesos i see that all the slaves at a certain time are all connected, >> then disconnected and after a while connected again, it seems like they are >> able to connect for a while. >> However is an openstack issue i guess. >> >> Does this also happen when master3 is leading? My guess is that you're not >> allowong incoming connections from master1 and master2 to slave3. Generally, >> masters should be able to connect to slaves, not just respond to their >> requests. >> >> On 18 Apr 2016 13:17, "Stefano Bianchi" wrote: >>> >>> Hi >>> On openstack i plugged two virtual networks to the same virtual router so >>> that the hosts on the 2 networks can communicate each other. >>> this is my topology: >>> >>> ---internet--- >>> | >>>Router1 >>> | >>> >>> | | >>> Net1Net2 >>> Master1 Master2 Master3 >>> Slave1 slave2 Slave3 >>> >>> I have set zookeeper in with this line: >>> >>> zk://Master1_IP:2181,Master2_IP:2181,Master3_IP:2181/mesos >>> >>> The 3 masters, even though on 2 separated networks, elect the leader >>> correclty. >>> Now i have started the slaves, and in a first time i see all 3 correctly >>> registered, but after a while the slave 3, independently form who is the >>> master, disconnects. >>> I saw in the log and i get the message in the object. >>> Can you help me to solve this problem? >>> >>> >>> Thanks to all. > >
Re: libmesos on alpine linux?
Thanks - I'll give that a whirl. MESOS-4507 sounds like mesos are starting to use Alpine in their test suites, so hopefully the glibc/musl incompatibilities will start to get ironed out. My (very basic) Spark testing has hit issues with big images (Spark load images on demand, but anything too large triggers timeouts, so Alpines sizes are pretty appealing). In my experience, dockers caching isn't as effective as it's made out to be so I'm all for a smaller image. Spark is the first framework I've used that needs a libmesos in the container image, I'm still not clear why. On 17 April 2016 at 03:17, Sargun Dhillon <sar...@sargun.me> wrote: > A word of warning about musl. Alpine ships with musl as its default > libc implementation. Its DNS resolver tends to act very differently > than glibc. This can prove problematic in certain types of > applications where you may be interacting with slow DNS resolvers, or > relying on glibc's behaviour. > > Fortunately, Alpine actually supports glibc, and there are examples of > using it in a Docker container: > https://hub.docker.com/r/frolvlad/alpine-glibc/~/dockerfile/ -- the > image clocks in at about 12MB. > > On Sat, Apr 16, 2016 at 6:52 PM, Shuai Lin <linshuai2...@gmail.com> wrote: >> Take a look at >> http://stackoverflow.com/questions/35614923/errors-compiling-mesos-on-alpine-linux >> , this guy has successfully patched an older version of the mesos to build >> on alpine linux. >> >> On Sun, Apr 17, 2016 at 3:19 AM, Dick Davies <d...@hellooperator.net> wrote: >>> >>> Has anyone been able to build libmesos (0.28.x ideally) on Alpine Linux >>> yet? >>> >>> I'm trying to get a smaller spark docker image and though that was >>> straightforward, the docs say I need libmesos in the image to be able >>> to use it (which I find a bit suprising, but it seems to be correct). >> >>
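For anyone heading down the same road, a very rough, untested Dockerfile sketch of the glibc-on-Alpine idea; it assumes you have a glibc-built libmesos.so to copy in, and in practice it will likely need a few more shared libraries (libcurl, libsvn and friends) sitting next to it:

```
FROM frolvlad/alpine-glibc
RUN apk add --no-cache bash curl openjdk8-jre-base
# libmesos built elsewhere against glibc; Spark picks it up via this env var
COPY libmesos.so /usr/local/lib/libmesos.so
ENV MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so
```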
libmesos on alpine linux?
Has anyone been able to build libmesos (0.28.x ideally) on Alpine Linux yet? I'm trying to get a smaller spark docker image and thought that was straightforward, but the docs say I need libmesos in the image to be able to use it (which I find a bit surprising, but it seems to be correct).
Re: Prometheus Exporters on Marathon
You are probably building on an older version of Golang - I think the Timeout attribute was added to http.Client around 1.5 or 1.6? On 15 April 2016 at 13:56, June Taylorwrote: > David, > > Thanks for the assistance. How did you get the mesos-exporter installed? > When I tried the instructions from github.com/mesosphere/mesos-exporter, I > got this error: > > june@-cluster:~$ go get github.com/mesosphere/mesos-exporter > # github.com/mesosphere/mesos-exporter > gosrc/src/github.com/mesosphere/mesos-exporter/common.go:46: unknown > http.Client field 'Timeout' in struct literal > gosrc/src/github.com/mesosphere/mesos-exporter/master_state.go:73: unknown > http.Client field 'Timeout' in struct literal > gosrc/src/github.com/mesosphere/mesos-exporter/slave_monitor.go:56: unknown > http.Client field 'Timeout' in struct literal > > > Thanks, > June Taylor > System Administrator, Minnesota Population Center > University of Minnesota > > On Fri, Apr 15, 2016 at 4:29 AM, David Keijser > wrote: >> >> Sure. there is not a lot to it though. >> >> So we have simple service file like this >> >> /usr/lib/systemd/system/mesos_exporter.service >> ``` >> [Unit] >> Description=Prometheus mesos exporter >> >> [Service] >> EnvironmentFile=-/etc/sysconfig/mesos_exporter >> ExecStart=/usr/bin/mesos_exporter $OPTIONS >> Restart=on-failure >> ``` >> >> and the sysconfig is just a simple >> >> /etc/sysconfig/mesos_exporter >> ``` >> OPTIONS=-master=http://10.4.72.253:5050 >> ``` >> >> - or - >> >> /etc/sysconfig/mesos_exporter >> ``` >> OPTIONS=-slave=http://10.4.72.177:5051 >> ``` >> >> On Thu, Apr 14, 2016 at 12:22:56PM -0500, June Taylor wrote: >> > David, >> > >> > Thanks for the reply. Would you be able to share your configs for >> > starting >> > up the exporters? >> > >> > >> > Thanks, >> > June Taylor >> > System Administrator, Minnesota Population Center >> > University of Minnesota >> > >> > On Thu, Apr 14, 2016 at 11:27 AM, David Keijser >> > >> > wrote: >> > >> > > We run the mesos exporter [1] and the node_exporter on each host >> > > directly >> > > managed by systemd. For other application specific exporters we have >> > > so far >> > > been baking them into the docker image of the application which is >> > > being >> > > run by marathon. >> > > >> > > 1) https://github.com/mesosphere/mesos_exporter >> > > >> > > On Thu, 14 Apr 2016 at 18:20 June Taylor wrote: >> > > >> > >> Is anyone else running Prometheus exporters on their cluster? I am >> > >> stuck >> > >> because I can't get a working "go build" environment right now. >> > >> >> > >> Is anyone else running this directly on their nodes and masters? Or, >> > >> via >> > >> Marathon? >> > >> >> > >> If so, please share your setup specifics. >> > >> >> > >> Thanks, >> > >> June Taylor >> > >> System Administrator, Minnesota Population Center >> > >> University of Minnesota >> > >> >> > > > >
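For reference, the field the build is choking on is http.Client.Timeout, which landed in Go 1.3; anything older reports exactly that 'unknown http.Client field' error. A small standalone check:

```
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// http.Client.Timeout was added in Go 1.3; older compilers fail with
	// "unknown http.Client field 'Timeout' in struct literal".
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get("http://127.0.0.1:5050/metrics/snapshot")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```

`go version` on the build box will tell you quickly whether that's the problem.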
Re: Mesos Task History
We just grab them with collectd's mesos plugin and log to Graphite, which gives us long term trend details. https://github.com/rayrod2030/collectd-mesos Haven't used this one but it supposedly does per-task metric collection: https://github.com/bobrik/collectd-mesos-tasks On 14 April 2016 at 13:37, June Taylor wrote: > Adam, > > Is there a way to keep this history? > > > Thanks, > June Taylor > System Administrator, Minnesota Population Center > University of Minnesota > > On Wed, Apr 13, 2016 at 4:32 PM, Adam Bordelon wrote: >> >> Yes, these counters are only kept in-memory, so any time a Mesos master >> starts, its counters are initialized to 0. >> >> On Wed, Apr 13, 2016 at 9:38 AM, June Taylor wrote: >>> >>> We have a single master at the moment. Does the task history get cleared >>> when the mesos-master restarts? >>> >>> >>> Thanks, >>> June Taylor >>> System Administrator, Minnesota Population Center >>> University of Minnesota >>> >>> On Wed, Apr 13, 2016 at 11:33 AM, Pradeep Chhetri >>> wrote: Yes, they get cleaned up whenever the mesos master leader failover happens. On Wed, Apr 13, 2016 at 3:32 PM, June Taylor wrote: > > I am noticing that recently our Completed Tasks and Terminated > Frameworks lists are empty. Where are these stored, and do they get > automatically cleared out at some interval? > > Thanks, > June Taylor > System Administrator, Minnesota Population Center > University of Minnesota -- Regards, Pradeep Chhetri >>> >>> >> >
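For anyone curious what those plugins actually scrape, the counters live on the master's metrics endpoint, so you can eyeball them directly:

```
curl -s http://127.0.0.1:5050/metrics/snapshot \
  | jq '.["master/tasks_finished"], .["master/tasks_failed"]'
```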
Re: [Proposal] Remove the default value for agent work_dir
Oh please yes! On 13 April 2016 at 08:00, Samwrote: > +1 > > Sent from my iPhone > > On Apr 13, 2016, at 12:44 PM, Avinash Sridharan > wrote: > > +1 > > On Tue, Apr 12, 2016 at 9:31 PM, Jie Yu wrote: >> >> +1 >> >> On Tue, Apr 12, 2016 at 9:29 PM, James Peach wrote: >> >> > >> > > On Apr 12, 2016, at 3:58 PM, Greg Mann wrote: >> > > >> > > Hey folks! >> > > A number of situations have arisen in which the default value of the >> > Mesos agent `--work_dir` flag (/tmp/mesos) has caused problems on >> > systems >> > in which the automatic cleanup of '/tmp' deletes agent metadata. To >> > resolve >> > this, we would like to eliminate the default value of the agent >> > `--work_dir` flag. You can find the relevant JIRA here. >> > > >> > > We considered simply changing the default value to a more appropriate >> > location, but decided against this because the expected filesystem >> > structure varies from platform to platform, and because it isn't >> > guaranteed >> > that the Mesos agent would have access to the default path on a >> > particular >> > platform. >> > > >> > > Eliminating the default `--work_dir` value means that the agent would >> > exit immediately if the flag is not provided, whereas currently it >> > launches >> > successfully in this case. This will break existing infrastructure which >> > relies on launching the Mesos agent without specifying the work >> > directory. >> > I believe this is an acceptable change because '/tmp/mesos' is not a >> > suitable location for the agent work directory except for short-term >> > local >> > testing, and any production scenario that is currently using this >> > location >> > should be altered immediately. >> > >> > +1 from me too. Defaulting to /tmp just helps people shoot themselves in >> > the foot. >> > >> > J > > > > > -- > Avinash Sridharan, Mesosphere > +1 (323) 702 5245
Re: Slaves not getting registered
erminated > > > > root@master1:/var/log/mesos# tail -f mesos-master.WARNING > > Log file created at: 2016/04/12 11:01:49 > > Running on machine: master1 > > Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg > > W0412 11:01:49.024226 3712 authenticator.cpp:511] No credentials provided, > authentication requests will be refused > > > > root@master1:/var/log/mesos# tail -f > mesos-master.master1.invalid-user.log.INFO.20160412-11014 > > tail: cannot open > ‘mesos-master.master1.invalid-user.log.INFO.20160412-11014’ for reading: No > such file or directory > > root@master1:/var/log/mesos# tail -f > mesos-master.master1.invalid-user.log.INFO.20160412-11014 > > mesos-master.master1.invalid-user.log.INFO.20160412-110143.3651 > mesos-master.master1.invalid-user.log.INFO.20160412-110148.3712 > > root@master1:/var/log/mesos# tail -f > mesos-master.master1.invalid-user.log.INFO.20160412-110143.3651 > > I0412 11:01:46.424433 3676 replica.cpp:673] Replica in EMPTY status > received a broadcasted recover request from (5)@30.30.30.53:5050 > > I0412 11:01:47.068586 3675 replica.cpp:673] Replica in EMPTY status > received a broadcasted recover request from (8)@30.30.30.53:5050 > > I0412 11:01:47.592926 3677 replica.cpp:673] Replica in EMPTY status > received a broadcasted recover request from (11)@30.30.30.53:5050 > > I0412 11:01:48.188248 3680 replica.cpp:673] Replica in EMPTY status > received a broadcasted recover request from (14)@30.30.30.53:5050 > > I0412 11:01:48.887104 3678 group.cpp:460] Lost connection to ZooKeeper, > attempting to reconnect ... > > I0412 11:01:48.887177 3674 group.cpp:460] Lost connection to ZooKeeper, > attempting to reconnect ... > > I0412 11:01:48.887229 3677 group.cpp:460] Lost connection to ZooKeeper, > attempting to reconnect ... > > I0412 11:01:48.919545 3675 group.cpp:519] ZooKeeper session expired > > I0412 11:01:48.919848 3680 detector.cpp:154] Detected a new leader: None > > I0412 11:01:48.919922 3680 master.cpp:1710] The newly elected leader is > None > > > > > > root@slave1:/var/log/mesos# tail -f > mesos-slave.slave1.invalid-user.log.INFO.20160412-110554.1696 > > I0413 03:12:54.532676 1711 group.cpp:519] ZooKeeper session expired > > I0413 03:12:58.757953 1715 slave.cpp:4304] Current disk usage 6.44%. Max > allowed age: 5.848917453828577days > > W0413 03:13:04.539577 1715 group.cpp:503] Timed out waiting to connect to > ZooKeeper. Forcing ZooKeeper session (sessionId=0) expiration > > I0413 03:13:04.539798 1715 group.cpp:519] ZooKeeper session expired > > W0413 03:13:14.542245 1713 group.cpp:503] Timed out waiting to connect to > ZooKeeper. Forcing ZooKeeper session (sessionId=0) expiration > > I0413 03:13:14.542434 1713 group.cpp:519] ZooKeeper session expired > > > > root@slave1:/var/log/mesos# tail -f mesos-slave.WARNING > > W0413 03:12:24.512336 1715 group.cpp:503] Timed out waiting to connect to > ZooKeeper. Forcing ZooKeeper session (sessionId=0) expiration > > W0413 03:12:34.519641 1710 group.cpp:503] Timed out waiting to connect to > ZooKeeper. Forcing ZooKeeper session (sessionId=0) expiration > > W0413 03:12:44.521181 1713 group.cpp:503] Timed out waiting to connect to > ZooKeeper. Forcing ZooKeeper session (sessionId=0) expiration > > W0413 03:12:54.532501 1711 group.cpp:503] Timed out waiting to connect to > ZooKeeper. Forcing ZooKeeper session (sessionId=0) expiration > > > > Thank you. 
> > > > > > From: June Taylor [mailto:j...@umn.edu] > Sent: 12 April 2016 18:06 > To: user@mesos.apache.org > Subject: Re: Slaves not getting registered > > > > Try looking in /var/log/mesos/ at these files: mesos-slave.WARNING, > mesos-slave.INFO, mesos-slave.ERROR > > > > > Thanks, > > June Taylor > > System Administrator, Minnesota Population Center > > University of Minnesota > > > > On Tue, Apr 12, 2016 at 4:36 AM, Dick Davies <d...@hellooperator.net> wrote: > > There's no mention of a slave there, have a look at the logs on the > slaves filesystem and see if it is giving any errors. > > > On 12 April 2016 at 10:17, <aishwarya.adyanth...@accenture.com> wrote: >> The GUI log shows like this: >> >> >> >> I0412 08:45:51.379609 3616 master.cpp:3673] Processing DECLINE call for >> offers: [ 74f33592-fc48-4066-a59c-977818b4c13c-O282 ] for framework >> 74f33592-fc48-4066-a59c-977818b4c13c-0001 (chronos-2.4.0) at >> scheduler-15022696-44ec-43d2-b193-a3cc4021d20e@30.30.30.48:42208 >> >> I0412 08:45:54.637461 3612 http.cpp:501] HTTP GET for /master/state.json >> from 10.211.203.
Re: Slaves not getting registered
There's no mention of a slave there, have a look at the logs on the slaves filesystem and see if it is giving any errors. On 12 April 2016 at 10:17,wrote: > The GUI log shows like this: > > > > I0412 08:45:51.379609 3616 master.cpp:3673] Processing DECLINE call for > offers: [ 74f33592-fc48-4066-a59c-977818b4c13c-O282 ] for framework > 74f33592-fc48-4066-a59c-977818b4c13c-0001 (chronos-2.4.0) at > scheduler-15022696-44ec-43d2-b193-a3cc4021d20e@30.30.30.48:42208 > > I0412 08:45:54.637461 3612 http.cpp:501] HTTP GET for /master/state.json > from 10.211.203.147:59463 with User-Agent='Mozilla/5.0 (Windows NT 6.0; > WOW64; rv:43.0) Gecko/20100101 Firefox/43.0' > > I0412 08:45:57.376288 3619 master.cpp:5350] Sending 1 offers to framework > 74f33592-fc48-4066-a59c-977818b4c13c-0001 (chronos-2.4.0) at > scheduler-15022696-44ec-43d2-b193-a3cc4021d20e@30.30.30.48:42208 > > I0412 08:45:57.385325 3613 master.cpp:3673] Processing DECLINE call for > offers: [ 74f33592-fc48-4066-a59c-977818b4c13c-O283 ] for framework > 74f33592-fc48-4066-a59c-977818b4c13c-0001 (chronos-2.4.0) at > scheduler-15022696-44ec-43d2-b193-a3cc4021d20e@30.30.30.48:42208 > > I0412 08:46:03.383728 3614 master.cpp:5350] Sending 1 offers to framework > 74f33592-fc48-4066-a59c-977818b4c13c-0001 (chronos-2.4.0) at > scheduler-15022696-44ec-43d2-b193-a3cc4021d20e@30.30.30.48:42208 > > I0412 08:46:03.396531 3612 master.cpp:3673] Processing DECLINE call for > offers: [ 74f33592-fc48-4066-a59c-977818b4c13c-O284 ] for framework > 74f33592-fc48-4066-a59c-977818b4c13c-0001 (chronos-2.4.0) at > scheduler-15022696-44ec-43d2-b193-a3cc4021d20e@30.30.30.48:42208 > > I0412 08:46:04.665582 3612 http.cpp:501] HTTP GET for /master/state.json > from 10.211.203.147:59464 with User-Agent='Mozilla/5.0 (Windows NT 6.0; > WOW64; rv:43.0) Gecko/20100101 Firefox/43.0' > > I0412 08:46:09.389493 3616 master.cpp:5350] Sending 1 offers to framework > 74f33592-fc48-4066-a59c-977818b4c13c-0001 (chronos-2.4.0) at > scheduler-15022696-44ec-43d2-b193-a3cc4021d20e@30.30.30.48:42208 > > > > > > Is there a way to find out the number of masters that are present in the > environment together through CLI/GUI? > > > > > > > > From: haosdent [mailto:haosd...@gmail.com] > Sent: 12 April 2016 13:37 > To: user > Subject: Re: Slaves not getting registered > > > >>but am unable to get it registered. > > Hi, @aishwarya Could you post master and slave log to provide more details? > Usually it is because of network problem. > > > > On Tue, Apr 12, 2016 at 4:02 PM, wrote: > > Hi, > > > > I’m unable to get the slave registered with the master node. I’ve configured > both the masters and slave machines but am unable to get it registered. > > > > Thank you. > > > > > > > This message is for the designated recipient only and may contain > privileged, proprietary, or otherwise confidential information. If you have > received it in error, please notify the sender immediately and delete the > original. Any other use of the e-mail by you is prohibited. Where allowed by > local law, electronic communications with Accenture and its affiliates, > including e-mail and instant messaging (including content), may be scanned > by our systems for the purposes of information security and assessment of > internal compliance with Accenture policy. > __ > > www.accenture.com > > > > > > -- > > Best Regards, > > Haosdent Huang
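Given the 'Timed out waiting to connect to ZooKeeper' lines in the slave log quoted further up the thread, a sanity check worth running from the slave is plain reachability of the zookeeper ensemble (assuming zookeeper is on the master hosts, per the 30.30.30.x addresses in the logs):

```
nc -zv 30.30.30.53 2181

# ask zookeeper for its health directly; a healthy node answers "imok"
echo ruok | nc 30.30.30.53 2181
```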
Re: How to kill tasks when memory exceeds the cgroup limit?
On 18 March 2016 at 20:58, Benjamin Mahler <bmah...@apache.org> wrote: > Interesting, why does it take down the slaves? This was a good while back, but when swap gets low our slaves kernel OOM killer tended to mess things up. > Because a lot of organizations run with swap disabled (e.g. for more > deterministic performance), we originally did not set the swap limit at all. > When we introduced the '--cgroups_limit_swap' flag we had to make it default > to false initially in case any users were depending on the original behavior > of no swap limit. Now that it's been available for some time, we can > consider moving the default to true. This is actually reflected in the TODO > alongside the flag: > > https://github.com/apache/mesos/blob/0.28.0/src/slave/flags.cpp#L331-L336 > > Want to send a patch? We'd need to communicate this change to the default > behavior in the CHANGELOG and specify how users can keep the original > behaviour. I'll see if I can get time - just about to finish a consulting gig and was going to take a break, so it might be an option. Thanks for the explanation, I *knew* there'd be a reason :) > Also, there's more we would need to do in the long term for use cases that > desire swapping. The only support today is (1) no memory limits (2) memory > limit and no swap limit (3) both memory and swap limits. You can imagine > scenarios where users may want to control how much they're allowed to swap, > or maybe we want to swap for non-latency sensitive containers. However, it's > more complicated (the user and operator have to co-operate more, there are > more ways to run things, etc), and so the general advice is to disable swap > to keep things simple and deterministic. > > On Fri, Mar 18, 2016 at 11:34 AM, Dick Davies <d...@hellooperator.net> > wrote: >> >> Great! >> >> I'm not really sure why mesos even allows RSS limiting without VMEM, >> it takes down slaves like the Black Death >> if you accidentally deploy a 'leaker'. I'm sure there's a use case I'm >> not seeing :) >> >> On 18 March 2016 at 16:27, Shiyao Ma <i...@introo.me> wrote: >> > Thanks. The limit_swap works. > >
Re: How to kill tasks when memory exceeds the cgroup limit?
Great! I'm not really sure why mesos even allows RSS limiting without VMEM, it takes down slaves like the Black Death if you accidentally deploy a 'leaker'. I'm sure there's a use case I'm not seeing :) On 18 March 2016 at 16:27, Shiyao Mawrote: > Thanks. The limit_swap works.
Re: How to kill tasks when memory exceeds the cgroup limit?
Last time I tried (not on the latest release) I also had to have cgroups set to limit swap, otherwise as soon as the process hit the RAM limit it would just start to consume swap. try adding --cgroups_limit_swap to the slaves startup flags. On 17 March 2016 at 16:21, Shiyao Mawrote: > Hi, > > > For the slave side: > export MESOS_RESOURCES='cpus:4;mem:180' > export MESOS_ISOLATION='cgroups/cpu,cgroups/mem' > > For the framework, > It accepts the offer from the slave and sends tasks with memory spec less > than offered. > > > However, the task actually *deliberately* asks for an arbitrary large memory > during runtime. > > My assumption is that the slave will kill the task. However, it doesn't. > > So here goes my question. How does slave handle the 'runtime memory > exceeding cgroup limit' behavior? Will any handlers be invoked? > > > > Regards.
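Putting that together, the slave invocation being suggested looks roughly like this (the isolation and resource values are the ones from the message above; the ZooKeeper URL is a placeholder):

    mesos-slave --master=zk://zk1:2181,zk2:2181,zk3:2181/mesos \
                --isolation='cgroups/cpu,cgroups/mem' \
                --cgroups_limit_swap \
                --resources='cpus:4;mem:180'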
Re: rkt / appc support
Thanks ! will keep a close eye on that ticket. On 16 March 2016 at 09:47, Guangya Liu <gyliu...@gmail.com> wrote: > Hi Dick, > > This is new functionality, you can refer to > https://issues.apache.org/jira/browse/MESOS-2840 for more detail, there are > also some design document link append in the JIRA ticket. > > Thanks, > > Guangya > > On Wed, Mar 16, 2016 at 5:24 PM, Dick Davies <d...@hellooperator.net> wrote: >> >> Quick question - what versions of Mesos (if any) support rkt/appc? >> >> Saw the announcement of the Unified Containerizer >> >> ( http://mesos.apache.org/documentation/container-image/ ) >> >> but I wasn't clear if this was a refactoring of existing support, or >> new functionality. > >
rkt / appc support
Quick question - what versions of Mesos (if any) support rkt/appc? Saw the announcement of the Unified Containerizer ( http://mesos.apache.org/documentation/container-image/ ) but I wasn't clear if this was a refactoring of existing support, or new functionality.
Re: AW: Feature request: move in-flight containers w/o stopping them
Agreed, vMotion always struck me as something for those monolithic apps with a lot of local state. The industry seems to be moving away from that as fast as its little legs will carry it. On 19 February 2016 at 11:35, Jason Giedyminwrote: > Food for thought: > > One should refrain from monolithic apps. If they're small and stateless you > should be doing rolling upgrades. > > If you find yourself with one container and you can't easily distribute that > work load by just scaling and load balancing then you have a monolith. Time > to enhance it. > > Containers should not be treated like VMs. > > -Jason > > On Feb 19, 2016, at 6:05 AM, Mike Michel wrote: > > Question is if you really need this when you are moving in the world of > containers/microservices where it is about building stateless 12factor apps > except databases. Why moving a service when you can just kill it and let the > work be done by 10 other containers doing the same? I remember a talk on > dockercon about containers and live migration. It was like: „And now where > you know how to do it, dont’t do it!“ > > > > Von: Avinash Sridharan [mailto:avin...@mesosphere.io] > Gesendet: Freitag, 19. Februar 2016 05:48 > An: user@mesos.apache.org > Betreff: Re: Feature request: move in-flight containers w/o stopping them > > > > One problem with implementing something like vMotion for Mesos is to address > seamless movement of network connectivity as well. This effectively requires > moving the IP address of the container across hosts. If the container shares > host network stack, this won't be possible since this would imply moving the > host IP address from one host to another. When a container has its network > namespace, attached to the host, using a bridge, moving across L2 segments > might be a possibility. To move across L3 segments you will need some form > of overlay (VxLAN maybe ?) . > > > > On Thu, Feb 18, 2016 at 7:34 PM, Jay Taylor wrote: > > Is this theoretically feasible with Linux checkpoint and restore, perhaps > via CRIU?http://criu.org/Main_Page > > > On Feb 18, 2016, at 4:35 AM, Paul Bell wrote: > > Hello All, > > > > Has there ever been any consideration of the ability to move in-flight > containers from one Mesos host node to another? > > > > I see this as analogous to VMware's "vMotion" facility wherein VMs can be > moved from one ESXi host to another. > > > > I suppose something like this could be useful from a load-balancing > perspective. > > > > Just curious if it's ever been considered and if so - and rejected - why > rejected? > > > > Thanks. > > > > -Paul > > > > > > > > > > -- > > Avinash Sridharan, Mesosphere > > +1 (323) 702 5245
Re: make slaves not getting tasks anymore
It sounds like you want to use checkpointing, that should keep the tasks alive as you update the mesos slave process itself. On 30 December 2015 at 11:43, Mike Michelwrote: > Hi, > > > > i need to update slaves from time to time and looking for a way to take them > out of the cluster but without killing the running tasks. I need to wait > until all tasks are done and during this time no new tasks should be started > on this slave. My first idea was to set a constraint „status:online“ for > every task i start and then change the attribute of the slave to „offline“, > restart slave process while executer still runs the tasks but it seems if > you change the attributes of a slave it can not connect to the cluster > without rm -rf /tmp before which will kill all tasks. > > > > Also the maintenance mode seems not to be an option: > > > > „When maintenance is triggered by the operator, all agents on the machine > are told to shutdown. These agents are subsequently removed from the master > which causes tasks to be updated as TASK_LOST. Any agents from machines in > maintenance are also prevented from registering with the master.“ > > > > Is there another way? > > > > > > Cheers > > > > Mike
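A rough sketch of the moving parts (flag names are from the Mesos agent docs of this era, so check them against your version): the framework has to register with checkpointing enabled (FrameworkInfo.checkpoint = true, e.g. Marathon's --checkpoint flag), and then the slave binary can be restarted in place for the upgrade:

    # upgrade the package, then restart the slave; with checkpointing on,
    # it reconnects to the still-running executors instead of killing them
    mesos-slave --master=zk://zk1:2181/mesos \
                --work_dir=/var/lib/mesos \
                --recover=reconnect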
Re: Mesos masters and zookeeper running together?
zookeeper really wants a dedicated cluster IMO; preferably with SSD under it - if zookeeper starts to run slow then everything else will start to bog down. I've co-hosted it with mesos masters before now for demo purposes etc. but for production it's probably worth choosing dedicated hosts. On 24 December 2015 at 20:36, Rodrick Brownwrote: > With our design we end up building out a stand alone zookeeper cluster 3 > nodes. Zookeeper seems to be the default dumping ground for many Apache > based products these days. You will eventually see many services and > frameworks require a zk instance for leader election, coordination, Kv store > etc.. I've seen situations where the masters can become extremely busy and > cause performance problem with Zk which can be huge issue for mesos. > > Sent from Outlook Mobile > > > > > On Thu, Dec 24, 2015 at 8:01 AM -0800, "Ron Lipke" wrote: > >> Hello, I've been working on setting up a mesos cluster for eventual >> production use and I have a question on configuring zookeeper alongside >> the mesos masters. >> Is it best practice to run zookeeper/exhibitor as a separate cluster (in >> our case, three nodes) or on the same machines as the mesos masters? I >> understand the drawbacks of increased cost for compute resources that >> will just be running a single service and most of the reference docs >> have them running together, but just wondering if it's beneficial to >> have them uncoupled. >> >> Thanks in advance for any input. >> >> Ron Lipke >> @neverminding > > > NOTICE TO RECIPIENTS: This communication is confidential and intended for > the use of the addressee only. If you are not an intended recipient of this > communication, please delete it immediately and notify the sender by return > email. Unauthorized reading, dissemination, distribution or copying of this > communication is prohibited. This communication does not constitute an offer > to sell or a solicitation of an indication of interest to purchase any loan, > security or any other financial product or instrument, nor is it an offer to > sell or a solicitation of an indication of interest to purchase any products > or services to any persons who are prohibited from receiving such > information under applicable law. The contents of this communication may not > be accurate or complete and are subject to change without notice. As such, > Orchard App, Inc. (including its subsidiaries and affiliates, "Orchard") > makes no representation regarding the accuracy or completeness of the > information contained herein. The intended recipient is advised to consult > its own professional advisors, including those specializing in legal, tax > and accounting matters. Orchard does not provide legal, tax or accounting > advice.
Re: what's the best way to monitor mesos cluster
+1 for the collectd plugin. been using that for about 9 months and it does the job nicely. On 11 November 2015 at 06:59, Du, Fanwrote: > Hi Mesos experts > > There is server and client snapshot metrics in jason format provided by > Mesos itself. > but more often we want to extend the metrics a bit more than that. > > I have been looking for this for a couple of days, while > https://collectd.org/ comes > to my sight, it also has a mesos plugin > https://github.com/rayrod2030/collectd-mesos. > > Is there any recommended such open source project to do this task? > Thanks. >
Re: Cluster Maintanence
You might want to look at the maintenance primitives feature in 0.25.0: https://mesos.apache.org/blog/mesos-0-25-0-released/ On 29 October 2015 at 18:19, John Omernikwrote: > I am wondering if there are some easy ways to take a healthy slave/agent > and start a process to bleed processes out. > > Basically, without having to do something where every framework would > support it, I'd like the option to > > 1. Stop offering resources to new frameworks. I.e. no new resources would > be offered, but existing jobs/tasks continue to run. > 2. Offer the ability, especially in the UI, but potentially in API as > well to "kill" a task. This would cause a failure that force the framework > to respond. For example, if it was a docker container running in marathon, > if I said "please kill this task" it would, marathon would recognize the > failure and try to restart the container. Since our agent (in point 1) is > not offering resources, then that task would not fall on the agent in > question. > > > The reason for this manual bleeding is to say run updates on a node or > pull it out of service for other reasons (memory upgrades etc) and do so in > a manual way. You may want to address what's running on the node manually, > thus a whole scale "kill everything" while it SHOULD be doable, may not > always be feasible. In addition, the inverse offers thing seems neat, but > frameworks have to support it. > > So, is there any thing like that now and I am just missing it in the > documentation? I am curious to hear how others are handling this situation > in their environments. > > John > > > >
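For reference, a hedged sketch of those primitives in use (endpoint names and JSON shape as described in the 0.25.0 maintenance documentation; the hostname, IP and times below are made up):

    # schedule a maintenance window for one agent - offers from it then carry
    # an unavailability, and frameworks receive inverse offers
    curl -s -X POST http://<leading-master>:5050/maintenance/schedule \
      -H 'Content-Type: application/json' -d '{
        "windows": [{
          "machine_ids": [{"hostname": "agent1.example.com", "ip": "10.0.0.5"}],
          "unavailability": {
            "start":    {"nanoseconds": 1446000000000000000},
            "duration": {"nanoseconds": 3600000000000}
          }
        }]
      }'

    # when you actually take the machine away / bring it back:
    curl -s -X POST http://<leading-master>:5050/machine/down \
      -d '[{"hostname": "agent1.example.com", "ip": "10.0.0.5"}]'
    curl -s -X POST http://<leading-master>:5050/machine/up \
      -d '[{"hostname": "agent1.example.com", "ip": "10.0.0.5"}]'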
Re: How production un-ready are Mesos Cassandra, Spark and Kafka Frameworks?
Hi Chris Spark is a Mesos native, I'd have no hesitation running it on Mesos. Cassandra not so much - that's not to disparage the work people are putting in there, I think it's really interesting. But personally with complex beasts like Cassandra I want to be running as 'stock' as possible, as it makes it easier to learn from other peoples experiences. On 12 October 2015 at 17:47, Chris Elsmorewrote: > Hi all, > > Have just got back from a brilliant MesosCon Europe in Dublin, I learnt a > huge amount and a big thank-you for putting on a great conference to all > involved! > > > I am looking to deploy a small (maybe 5 max) Cassandra & Spark cluster to > do some data analysis at my current employer, and am a little unsure of the > current status of the frameworks this would need to run on Mesos- both the > mesosphere docs (which I’m guessing use the frameworks of the same name > hosted on Github) and the Github ReadMes mention that these are not > production ready, and the rough timeline of Q1 2016. > > I’m just wondering how production un-ready these are!? I am looking at > using Mesos to deploy future stateless services in the next 6 months or so, > and so I like the idea of adding to that system and the look of the > configuration that is handled for you to bind nodes together in these > frameworks. However it feels like for a smallish cluster of production > ready machines it might be better to deploy them standalone and stay > observant on the status of such things in the near future, and the > configuration wins are not that large especially for a small cluster. > > > Any experience and advice on the above would be greatly received! > > > Chris > > > >
Re: Java detector for mesos masters and leader
The active master has a flag set in /metrics/snapshot : master/elected which is 1 for the active master and 0 otherwise, so it's easy enough to only load the metrics from the active master. (I use the collectd plugin and push data rather than poll, but the same principle should apply). On 7 July 2015 at 14:02, Donald Laidlaw donlaid...@me.com wrote: Has anyone ever developed Java code to detect the mesos masters and leader, given a zookeeper connection? The reason I ask is because I would like to monitor mesos to report various metrics reported by the master. This requires detecting and tracking the leading master to query its /metrics/snapshot REST endpoint. Thanks, -Don
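A minimal shell version of that polling loop (the master hostnames are placeholders; the grep pattern is deliberately loose so it matches whether or not the JSON escapes the slash in master/elected):

    for m in master1 master2 master3; do
      if curl -s "http://$m:5050/metrics/snapshot" | grep -q 'elected":1'; then
        echo "$m is the elected leader"
      fi
    done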
Re: Thoughts and opinions in physically building a cluster
That doesn't sound too bad (it's a fairly typical setup e.g. on an Amazon VPC). You probably want to avoid NAT or similar things between master and slaves to avoid a lot of LIBPROCESS_IP tricks so same switch sounds good. Personally I quite like the master/slave distinction. I wouldn't want a runaway set of tasks to bog down the masters and operationally we'd alert if we're starting to lose masters whereas the slaves are 'cattle' and we can just spin up more as they die if need be (it's a little more tricky to scale out masters and zookeepers so they get treated as though they were a bit less expendable). I co-locate the zookeeper ensemble on the masters on smaller clusters to save VM count, but that's more personal taste than anything. On 25 June 2015 at 17:12, Daniel Gaston daniel.gas...@dal.ca wrote: So this may be another relatively noob question, but when designing a mesos cluster, is it basically as simple as the nodes connected by a switch? Since any of the nodes can be master nodes or acting as both master and slave, I am guessing there is no need for another head node as you would have with a traditional cluster design. But would each of the nodes then have to be connected to the external/institutional network? My rough idea was for this small cluster to not be connected to the main institutional network but for my workstation to be connected to both the cluster's network as well as to the institutional network From: CCAAT cc...@tampabay.rr.com Sent: June-19-15 4:57 PM To: user@mesos.apache.org Cc: cc...@tampabay.rr.com Subject: Re: Thoughts and opinions in physically building a cluster On 06/19/2015 01:28 PM, Daniel Gaston wrote: On 19/06/2015 18:38, Oliver Nicholas wrote: Unless you have some true HA requirements, it seems intuitively wasteful to have 3 masters and 2 slaves (unless the cost of 5 nodes is inconsequential to you and you hate the environment). Any particular reason not to have three nodes which are acting both as master and slaves? None at all. I'm not a cluster or networking guru, and have only played with mesos in cloud-based settings so I wasn't sure how this would work. But it makes sense, that way the 'standby' masters are still participating in the zookeeper quorum while still being available to do real work as slave nodes. Daniel. There is no such thing as a 'cluster guru'. It's all 'seat of the pants' flying right now; so you are fine what you are doing and propose. If codes do not exist to meet your specific needs and goals, they can (should?) be created. I'm working on an architectural expansion Where nodes (virtual, actual or bare metal) migrate from master -- entrepreneur -- worker -- slave -- embedded (bare metal or specially attached hardware. I'm proposing to do all of this with the Autonomy_Function and decisions being made bottom_up as opposed to the current top_down dichotomy. I'm prolly going to have to 'fork codes' for a while to get things stable and then hope they are included; when other minds see the validity of the ideas. Surely one box can be set up as both master and slave. Moving slaves to masters, should be an automatic function and will prolly will be address in the future codes of mesos. PS: Keep pushing your ideas and do not take no for an answer! Mesos belongs to everybody. hth, James
Re: How to upgrade mesos version from a running mesos cluster
Do the masters first, as described at the link. On 19 June 2015 at 10:17, tommy xiao xia...@gmail.com wrote: Thanks Alex Rukletsov. In my earlier try, the newer mesos slave ( version 0.21.1) can't connect to mesos master (version 0.20.0), So it annoies to me. anyway, i will test again, let me clarify the concern. 2015-06-19 17:06 GMT+08:00 Alex Rukletsov a...@mesosphere.com: Tommy, you should be able to upgrade Mesos without stopping the cluster. If you cannot, please describe the issue you face and we'll try to figure out why. On Fri, Jun 19, 2015 at 7:58 AM, tommy xiao xia...@gmail.com wrote: Hi Jeff, Thanks for your remind. my concerns is how about upgrade without stop the mesos cluster. I found the mesos cluster can't support rolling upgrade feature. 2015-06-19 12:20 GMT+08:00 Jeff Schroeder jeffschroe...@computer.org: Hello Tommy, have you read the documentation? If not, please take a look and then follow up with any specific questions here: http://mesos.apache.org/documentation/latest/upgrades/ On Thursday, June 18, 2015, tommy xiao xia...@gmail.com wrote: Hi, I have a question on upgrade strategy: How about upgrade mesos cluster seamlessly from a production cluster. Do we have some best practice on it? -- Deshi Xiao Twitter: xds2000 E-mail: xiaods(AT)gmail.com -- Text by Jeff, typos by iPhone -- Deshi Xiao Twitter: xds2000 E-mail: xiaods(AT)gmail.com -- Deshi Xiao Twitter: xds2000 E-mail: xiaods(AT)gmail.com
Re: cluster confusion after zookeeper blip
Thanks Nikolay - I checked the frameworkid in zookeeper (/marathon/state/frameworkId) matched the one attached to the running tasks, gave the old marathon leader a restart and everything reconnected ok (we did have to disable our service discovery pieces to avoid getting empty JSON back when marathon first booted, but other than that everything is peachy). On 18 May 2015 at 15:31, Nikolay Borodachev nbo...@adobe.com wrote: Have you tried to restart Marathon and Mesos processes? Once you restart them they should pick zookeepers, elect leaders, etc. If you're using Docker containers, they should reattach themselves to the respective slaves. Thanks Nikolay -Original Message- From: rasput...@gmail.com [mailto:rasput...@gmail.com] On Behalf Of Dick Davies Sent: Monday, May 18, 2015 5:26 AM To: user@mesos.apache.org Subject: cluster confusion after zookeeper blip We run a 3 node marathon cluster on top of 3 mesos masters + 6 slaves. (mesos 0.21.0, marathon 0.7.5) This morning we had a network outage long enough for everything to lose zookeeper. Now our marathon UI is empty (all 3 marathons think someone else is a master, and marathons 'proxy to leader' feature means the REST API is toast). Odd thing is, at the mesos level, the mesos master UI shows no tasks running (logs mention orphaned tasks), but if i click into the 'slaves' tab and dig down, the slave view details tasks that are in fact active. Any way to bring order to this without needing to kill those tasks? we have no actual outage from a user point of view, but the cluster itself is pretty confused and our service discovery relies on the marathon API which is timing out. Although mesos has checkpointing enabled, marathon isn't running with checkpointing on (it's the default now but doesn't apply to existing frameworks apparently, and we started this around marathon 0.4.x) Would enabling checkpointing help with this kind of issue? If so, how do i enable it for an existing framework?
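For anyone hitting the same thing, the check described above looks roughly like this (stock ZooKeeper CLI; the znode path is the one from this thread, and the framework_id field name should be verified against your Mesos version's state.json):

    # the framework id Marathon persisted in ZooKeeper
    zkCli.sh -server zk1:2181 get /marathon/state/frameworkId

    # the framework id the master has attached to the running tasks
    curl -s http://<leading-master>:5050/master/state.json | python -m json.tool | grep '"framework_id"' | sort -u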
cluster confusion after zookeeper blip
We run a 3 node marathon cluster on top of 3 mesos masters + 6 slaves. (mesos 0.21.0, marathon 0.7.5) This morning we had a network outage long enough for everything to lose zookeeper. Now our marathon UI is empty (all 3 marathons think someone else is a master, and marathons 'proxy to leader' feature means the REST API is toast). Odd thing is, at the mesos level, the mesos master UI shows no tasks running (logs mention orphaned tasks), but if i click into the 'slaves' tab and dig down, the slave view details tasks that are in fact active. Any way to bring order to this without needing to kill those tasks? we have no actual outage from a user point of view, but the cluster itself is pretty confused and our service discovery relies on the marathon API which is timing out. Although mesos has checkpointing enabled, marathon isn't running with checkpointing on (it's the default now but doesn't apply to existing frameworks apparently, and we started this around marathon 0.4.x) Would enabling checkpointing help with this kind of issue? If so, how do i enable it for an existing framework?
group memory limits are always 'soft' . how do I ensure info->pid.isNone() ?
Been banging my head against this for a while now. mesos 0.21.0, marathon 0.7.5, centos 6 servers. When I enable cgroups (flags are: --cgroups_limit_swap --isolation=cgroups/cpu,cgroups/mem) the memory limits I'm setting are reflected in memory.soft_limit_in_bytes but not in memory.limit_in_bytes or memory.memsw.limit_in_bytes. The upshot is our runaway task eats all RAM and swap on the server until the OOM killer steps in and starts firing into the crowd. This line of code seems never to lower a hard limit: https://github.com/apache/mesos/blob/master/src/slave/containerizer/isolators/cgroups/mem.cpp#L382 which means both of those tests must be true, right? The current limit is insanely high (8192 PB if I'm reading it right) - how would I make info->pid.isNone() be true? I have tried restarting the slave and scaling the marathon apps to 0 tasks then back. Bit stumped.
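For anyone reproducing this, the limits can be read straight out of the container's memory cgroup. On CentOS 6 the controller is typically mounted at /cgroup/memory and Mesos puts containers under a 'mesos' hierarchy by default; <container-id> is whatever UUID the slave created:

    cd /cgroup/memory/mesos/<container-id>
    cat memory.soft_limit_in_bytes      # tracks the task's memory allocation
    cat memory.limit_in_bytes           # hard limit - stuck at the huge default in this report
    cat memory.memsw.limit_in_bytes     # mem+swap limit, only meaningful with --cgroups_limit_swap and swap accounting enabled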
Re: group memory limits are always 'soft' . how do I ensure info->pid.isNone() ?
Thanks Ian. Digging around the cgroup there are 3 processes in there; * the mesos-executor * the shell script marathon starts the app with * the actual command to run the task ( a perl app in this case) The line of code you mention is never run in our case, because it's wrapped in the conditional I'm talking about! All I see is cpu.shares being set and then mem.soft_limit_in_bytes. On 28 April 2015 at 17:47, Ian Downes idow...@twitter.com wrote: The line of code you cite is so the hard limit is not decreased on a running container because we can't (easily) reclaim anonymous memory from running processes. See the comment above the code. The info-pid.isNone() is for when cgroup is being configured (see the update() call at the end of MemIsolatorProcess::prepare()), i.e., before any processes are added to the cgroup. The limit currentLimit.get() ensures the limit is only increased. The memory limit defaults to the maximum for the data type, I guess that's the ridiculous 8 EB. It should be set to what the initial memory allocation was for the container so this is not expected. Can you look in the slave logs for when the container was created for the log line on: https://github.com/apache/mesos/blob/master/src/slave/containerizer/isolators/cgroups/mem.cpp#L393 Ian On Tue, Apr 28, 2015 at 7:42 AM, Dick Davies d...@hellooperator.net wrote: Been banging my head against this for a while now. mesos 0.21.0 , marathon 0.7.5, centos 6 servers. When I enable cgroups (flags are : --cgroups_limit_swap --isolation=cgroups/cpu,groups/mem ) the memory limits I'm setting are reflected in memory.soft_limit_in_bytes but not in memory.limit_in_bytes or memory.memsw.limit_in_bytes. Upshot is our runaway task eats all RAM and swap on the server until the OOM steps in and starts firing into the crowd. This line of code seems to never lower a hard limit: https://github.com/apache/mesos/blob/master/src/slave/containerizer/isolators/cgroups/mem.cpp#L382 which means both of those tests must be true, right? the current limit is insanely high (8192 PB if i'm reading it right) - how would I make info-pid.isNone() be true ? Have tried restarting the slave, scaling the marathon apps to 0 tasks then back. Bit stumped.
Re: group memory limits are always 'soft' . how do I ensure info->pid.isNone() ?
That's what led me into reading the code - neither mem.limit_in_bytes or mem.memsw.limit_in_bytes are ever set down from the (insanely high) defaults. I know that second conditional is false, so the first must be too, right? It's likely I'm reading the wrong branch; we're running the 0.21.0 release - but I don't see any commits that would change this ordering. Just to confirm - we are using the default containerizer (not docker or anything else) - that shouldn't make any difference though, should it? I'm offsite til morning now (UK time), but I'll post the full slave logs when I can get to them. On 28 April 2015 at 18:18, Ian Downes idow...@twitter.com wrote: The control flow in the Mesos containerizer to launch a container is: 1. Call prepare() on each isolator 2. Then fork the executor 3. Then isolate(executor_pid) on each isolator The last part of (1) will also call Isolator::update() to set the initial memory limits (see line 288). This is done *before* the executor is in the cgroup, i.e., info-pid.isNone() will be true and that block of code should *always* be executed when a container starts. The LOG(INFO) line at 393 should be present in your logs. Can you verify this? It should be shortly after the LOG(INFO) on line 358. Ian On Tue, Apr 28, 2015 at 9:54 AM, Dick Davies d...@hellooperator.net wrote: Thanks Ian. Digging around the cgroup there are 3 processes in there; * the mesos-executor * the shell script marathon starts the app with * the actual command to run the task ( a perl app in this case) The line of code you mention is never run in our case, because it's wrapped in the conditional I'm talking about! All I see is cpu.shares being set and then mem.soft_limit_in_bytes. On 28 April 2015 at 17:47, Ian Downes idow...@twitter.com wrote: The line of code you cite is so the hard limit is not decreased on a running container because we can't (easily) reclaim anonymous memory from running processes. See the comment above the code. The info-pid.isNone() is for when cgroup is being configured (see the update() call at the end of MemIsolatorProcess::prepare()), i.e., before any processes are added to the cgroup. The limit currentLimit.get() ensures the limit is only increased. The memory limit defaults to the maximum for the data type, I guess that's the ridiculous 8 EB. It should be set to what the initial memory allocation was for the container so this is not expected. Can you look in the slave logs for when the container was created for the log line on: https://github.com/apache/mesos/blob/master/src/slave/containerizer/isolators/cgroups/mem.cpp#L393 Ian On Tue, Apr 28, 2015 at 7:42 AM, Dick Davies d...@hellooperator.net wrote: Been banging my head against this for a while now. mesos 0.21.0 , marathon 0.7.5, centos 6 servers. When I enable cgroups (flags are : --cgroups_limit_swap --isolation=cgroups/cpu,groups/mem ) the memory limits I'm setting are reflected in memory.soft_limit_in_bytes but not in memory.limit_in_bytes or memory.memsw.limit_in_bytes. Upshot is our runaway task eats all RAM and swap on the server until the OOM steps in and starts firing into the crowd. This line of code seems to never lower a hard limit: https://github.com/apache/mesos/blob/master/src/slave/containerizer/isolators/cgroups/mem.cpp#L382 which means both of those tests must be true, right? the current limit is insanely high (8192 PB if i'm reading it right) - how would I make info-pid.isNone() be true ? Have tried restarting the slave, scaling the marathon apps to 0 tasks then back. Bit stumped.
Re: group memory limits are always 'soft' . how do I ensure info->pid.isNone() ?
You may very well be right, but I'd like to keep this specific thread focussed on figuring out why the expected/implemented behaviour isn't happening in my case if that's ok. On 28 April 2015 at 19:26, CCAAT cc...@tampabay.rr.com wrote: I really hate to be the 'old fashion computer scientist' in this group, but, I think that the role of and usage of 'cgroups' is going to have to be expanded greatly as a solution to the dynamic memory management needs of both the cluster(s) and the frameworks. This problem is not going away and I see no other serious solution to cgroup use expansion.
Re: [RESULT][VOTE] Release Apache Mesos 0.22.0 (rc4)
Thanks Craig, that's really handy! Dumb question for the list: are there any plans to support multiple isolation flags somehow? I need cgroups, but would really like the disk quota feature too (and network isolation come to that. And a pony). On 25 March 2015 at 01:00, craig w codecr...@gmail.com wrote: Congrats, I was working on a quick post summarizing what's new (based on jira and the video from niklas) which I just posted (great timing) http://craigwickesser.com/2015/03/mesos-022-release/ On Tue, Mar 24, 2015 at 8:30 PM, Paul Otto p...@ottoops.com wrote: This is awesome! Thanks for all the hard work you all have put into this! I am really excited to update to the latest stable version of Apache Mesos! Regards, Paul Paul Otto Principal DevOps Architect, Co-founder Otto Ops LLC | OttoOps.com 970.343.4561 office 720.381.2383 cell On Tue, Mar 24, 2015 at 6:04 PM, Niklas Nielsen nik...@mesosphere.io wrote: Hi all, The vote for Mesos 0.22.0 (rc4) has passed with the following votes. +1 (Binding) -- Ben Mahler Tim St Clair Adam Bordelon Brenden Matthews +1 (Non-binding) -- Alex Rukletsov Craig W Ben Whitehead Elizabeth Lingg Dario Rexin Jeff Schroeder Michael Park Alexander Rojas Andrew Langhorn There were no 0 or -1 votes. Please find the release at: https://dist.apache.org/repos/dist/release/mesos/0.22.0 It is recommended to use a mirror to download the release: http://www.apache.org/dyn/closer.cgi The CHANGELOG for the release is available at: https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.22.0 The mesos-0.22.0.jar has been released to: https://repository.apache.org The website (http://mesos.apache.org) will be updated shortly to reflect this release. Thanks, Niklas -- https://github.com/mindscratch https://www.google.com/+CraigWickesser https://twitter.com/mind_scratch https://twitter.com/craig_links
Re: [RESULT][VOTE] Release Apache Mesos 0.22.0 (rc4)
Ah ok - config page at http://mesos.apache.org/documentation/latest/configuration/ gave me the impression this was an either/or. I'm happy now, thanks a lot! On 25 March 2015 at 08:47, Tim Chen t...@mesosphere.io wrote: Hi there, You can already pass in multiple values seperated by comma (cgroups/cpu,cgroups/mem,posix/disk) Tim On Wed, Mar 25, 2015 at 12:46 AM, Dick Davies d...@hellooperator.net wrote: Thanks Craig, that's really handy! Dumb question for the list: are there any plans to support multiple isolation flags somehow? I need cgroups, but would really like the disk quota feature too (and network isolation come to that. And a pony). On 25 March 2015 at 01:00, craig w codecr...@gmail.com wrote: Congrats, I was working on a quick post summarizing what's new (based on jira and the video from niklas) which I just posted (great timing) http://craigwickesser.com/2015/03/mesos-022-release/ On Tue, Mar 24, 2015 at 8:30 PM, Paul Otto p...@ottoops.com wrote: This is awesome! Thanks for all the hard work you all have put into this! I am really excited to update to the latest stable version of Apache Mesos! Regards, Paul Paul Otto Principal DevOps Architect, Co-founder Otto Ops LLC | OttoOps.com 970.343.4561 office 720.381.2383 cell On Tue, Mar 24, 2015 at 6:04 PM, Niklas Nielsen nik...@mesosphere.io wrote: Hi all, The vote for Mesos 0.22.0 (rc4) has passed with the following votes. +1 (Binding) -- Ben Mahler Tim St Clair Adam Bordelon Brenden Matthews +1 (Non-binding) -- Alex Rukletsov Craig W Ben Whitehead Elizabeth Lingg Dario Rexin Jeff Schroeder Michael Park Alexander Rojas Andrew Langhorn There were no 0 or -1 votes. Please find the release at: https://dist.apache.org/repos/dist/release/mesos/0.22.0 It is recommended to use a mirror to download the release: http://www.apache.org/dyn/closer.cgi The CHANGELOG for the release is available at: https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.22.0 The mesos-0.22.0.jar has been released to: https://repository.apache.org The website (http://mesos.apache.org) will be updated shortly to reflect this release. Thanks, Niklas -- https://github.com/mindscratch https://www.google.com/+CraigWickesser https://twitter.com/mind_scratch https://twitter.com/craig_links
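So a single slave can combine them along these lines (a sketch; the disk quota enforcement flag is the one that ships with the posix/disk isolator, check it against your build):

    mesos-slave --master=zk://zk1:2181/mesos \
                --isolation='cgroups/cpu,cgroups/mem,posix/disk' \
                --enforce_container_disk_quota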
Re: mesos-collectd-plugin
Hi Dan I can see a couple of things that could be wrong (NB: not a collectd expert, but these are differences I see from my working config). 1. Is /opt/collectd/etc/collectd.conf your main collectd config file? otherwise, it's not being read at all by collectd. 2. I configure the plugin in that file i.e. the Module mesos-master block should be in /opt/collectd/etc/collectd.conf , not tucked down in the python module path directory. 3. Are you sure your master listens on localhost? Mine doesn't, I needed to set that Host line to match the IP I set that master to listen on ( e.g. in /etc/mesos-master/ip ). Pretty sure one of those will do the trick (NB: you'll only get metrics from the elected master; the 'standby' masters still get polled but collectd will ignore any data from them unless they're the primary) On 11 March 2015 at 19:52, Dan Dong dongda...@gmail.com wrote: Hi, Dick, I put the plugin under: $ ls -l /opt/collectd/lib/collectd/plugins/python/ total 504 -rw-r--r-- 1 root root345 Mar 10 19:40 mesos-master.conf -rw-r--r-- 1 root root 1 Mar 10 15:06 mesos-master.py -rw-r--r-- 1 root root322 Mar 10 19:44 mesos-slave.conf -rw-r--r-- 1 root root 6808 Mar 10 15:06 mesos-slave.py -rw-r--r-- 1 root root 288892 Mar 10 19:35 python.a -rwxr-xr-x 1 root root969 Mar 10 19:35 python.la -rwxr-xr-x 1 root root 188262 Mar 10 19:35 python.so And in /opt/collectd/etc/collectd.conf, I set: LoadPlugin python Globals true /LoadPlugin . Plugin python ModulePath /opt/collectd/lib/collectd/plugins/python/ LogTraces true /Plugin $ cat /opt/collectd/lib/collectd/plugins/python/mesos-master.conf LoadPlugin python Globals true /LoadPlugin Plugin python ModulePath /opt/collectd/lib/collectd/plugins/python/ Import mesos-master Module mesos-master Host localhost Port 5050 Verbose false Version 0.21.0 /Module /Plugin Anything wrong with the above settings? Cheers, Dan 2015-03-10 17:21 GMT-05:00 Dick Davies d...@hellooperator.net: Hi Dan The .py files (the plugin) live in the collectd python path, it sounds like maybe you're not loading the plugin .conf file into your collectd config? The output will depend on what your collectd is set to write to, I use it with write_graphite. On 10 March 2015 at 20:41, Dan Dong dongda...@gmail.com wrote: Hi, All, Does anybody use this mesos-collectd-plugin: https://github.com/rayrod2030/collectd-mesos I have installed collectd and this plugin, then configured it as instructions and restarted the collectd daemon, why seems nothing happens on the mesos:5050 web UI( python plugin has been turned on in collectd.conf). My question is: 1. Should I install collectd and this mesos-collectd-plugin on each master and slave nodes and restart collectd daemon? (This is what I have done.) 2. Should the config file mesos-master.conf only configured on master node and mesos-slave.conf only configured on slave node?(This is what I have done.) Or both of them should only appear on master node? 3. Is there an example( or a figure) of what output one is expected to see by this plugin? Cheers, Dan
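Pulling that advice together, the Module block belongs in the main config, so /opt/collectd/etc/collectd.conf ends up looking roughly like this (a sketch based on the paths in this thread; the Host value is an example IP and must be whatever the master actually listens on, e.g. the value in /etc/mesos-master/ip, not localhost):

    <LoadPlugin python>
      Globals true
    </LoadPlugin>

    <Plugin python>
      ModulePath "/opt/collectd/lib/collectd/plugins/python/"
      LogTraces true
      Import "mesos-master"
      <Module mesos-master>
        Host "10.0.0.10"
        Port 5050
        Verbose false
        Version "0.21.0"
      </Module>
    </Plugin>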
Re: Question on Monitoring a Mesos Cluster
Yeah, that confused me too - I think that figure is specific to the master/slave polled (and that'll just be the active one since you're only reporting when master/elected is true. I'm using this one https://github.com/rayrod2030/collectd-mesos , not sure if that's the same as yours? On 7 March 2015 at 18:56, Jeff Schroeder jeffschroe...@computer.org wrote: Responses inline On Sat, Mar 7, 2015 at 12:48 PM, CCAAT cc...@tampabay.rr.com wrote: ... snip ... After getting everything working, I built a few dashboards, one of which displays these stats from http://master:5051/metrics/snapshot: master/disk_percent master/cpus_percent master/mem_percent I had assumed that this was something like aggregate cluster utilization, but this seems incorrect in practice. I have a small cluster with ~1T of memory, ~25T of Disks, and ~540 CPU cores. I had a dozen or so small tasks running, and launched 500 tasks with 1G of memory and 1 CPU each. Now I'd expect to se the disk/cpu/mem percentage metrics above go up considerably. I did notice that cpus_percent went to around 0.94. What is the correct way to measure overall cluster utilization for capacity planning? We can have the NOC watch this and simply add more hardware when the number starts getting low. Boy, I cannot wait to read the tidbits of wisdom here. Maybe the development group has more accurate information if not some vague roadmap on resource/process monitoring. Sooner or later, this is going to become a quintessential need; so I hope the deep thinkers are all over this need both in the user and dev groups. In fact the monitoring can easily create a significant loading on the cluster/cloud, if one is not judicious in how this is architect, implemented and dynamically tuned. Monitoring via passive metrics gathering and application telemetry is one of the best ways to do it. That is how I've implemented things The beauty of the rest api is that it isn't heavyweight, and every master has it on port 5050 (by default) and every slave has it on port 5051 (by default). Since I'm throwing this all into graphite (well technically cassandra fronted by cyanite fronted by graphite-api... but same difference), I found a reasonable way to do capacity planning. Collectd will poll the master/slave on each mesos host every 10 seconds (localhost:5050 on masters and localhost:5151 on slaves). This gets put into graphite via collectd's write_graphite plugin. These 3 graphite targets give me percentages of utilization for nice graphs: alias(asPercent(collectd.mesos.clustername.gauge-master_cpu_used, collectd.mesos.clustername.gauge-master_cpu_total), Total CPU Usage) alias(asPercent(collectd.mesos.clustername.gauge-master_mem_used, collectd.mesos.clustername.gauge-master_mem_total), Total Memory Usage) alias(asPercent(collectd.mesos.clustername.gauge-master_disk_used, collectd.mesos.clustername.gauge-master_disk_total), Total Disk Usage) With that data, you can have your monitoring tools such as nagios/icinga poll graphite. Using the native graphite render api, you can do things like: * if the cpu usage is over 80% for 24 hours, send a warning event * if the cpu usage is over 95% for 6 hours, send a critical event This allows mostly no-impact monitoring since the monitoring tools are hitting graphite. Anyways, back to the original questions: How does everyone do proper monitoring and capacity planning for large mesos clusters? I expect my cluster to grow beyond what it currently is by quite a bit. 
-- Jeff Schroeder Don't drink and derive, alcohol and analysis don't mix. http://www.digitalprognosis.com
Re: Mesosphere on Centos 6.6
This is due to the upstart scripts shipped with the RPM. mesos has been shipping these since at least 0.17.x (as that's when we started using it). Where's the repo to send a PR to correct the docs? On 5 February 2015 at 09:48, Chengwei Yang chengwei.yang...@gmail.com wrote: On Mon, Feb 02, 2015 at 04:58:43PM -0800, Viswanathan Ramachandran wrote: Hi, I followed instructions to setup multi-node mesos cluster on CentOS 6.6 using http://mesosphere.com/docs/getting-started/datacenter/install/ I found that I was able to install zookeeper, mesos and marathon using yum without any issues. No errors during install. However, there was no service mesos-master or mesos-slave or marathon installed. Any restart command issued would result in unrecognized service. Try # start mesos-master/mesos-slave instead. -- Thanks, Chengwei That said the binaries were all in tact. I used ubuntu-trusty VMs instead, and was able to install as per instructions. Any updated instructions for CentOS 6 available? Thanks Vish
Re: Is mesos spamming me?
The offer is only for 455 Mb of RAM. You can check that in the slave UI, but it looks like you have other tasks running that are using some of that 1863Mb. On 2 February 2015 at 05:11, Hepple, Robert rhep...@tnsi.com wrote: Yeah but ... the slave is reporting 1863Mb RAM and 2 CPUS - so how come that is rejected by jenkins which is asking for the default 0.1 cpu and 512Mb RAM??? Thanks Bob
Re: Slave cannot be registered while masters keep switching to another one.
Be careful, there's now nothing stopping those 2 masters from forming 2 clusters. Add a third asap. On 28 January 2015 at 08:25, xiaokun xiaokun...@gmail.com wrote: hi, I changed the quorum to 1. Slave can be displayed now! Thanks! 2015-01-28 16:19 GMT+08:00 xiaokun xiaokun...@gmail.com: Thanks for your reply. I will try to modify quorum to 1. Here is log from server side. Attachment is added. I0128 03:15:36.608562 15350 replica.cpp:638] Replica in VOTING status received a broadcasted recover request I0128 03:15:37.552141 15346 replica.cpp:638] Replica in VOTING status received a broadcasted recover request I0128 03:15:38.479542 15345 network.hpp:424] ZooKeeper group memberships changed I0128 03:15:38.479799 15345 group.cpp:659] Trying to get '/mesos/log_replicas/002270' in ZooKeeper I0128 03:15:38.480613 15345 group.cpp:659] Trying to get '/mesos/log_replicas/002271' in ZooKeeper I0128 03:15:38.481050 15345 group.cpp:659] Trying to get '/mesos/log_replicas/002272' in ZooKeeper I0128 03:15:38.481679 15345 network.hpp:466] ZooKeeper group PIDs: { log-replica(1)@10.27.17.135:5050, log-replica(1)@10.27.16.214:5050 } I0128 03:15:38.621351 15345 replica.cpp:638] Replica in VOTING status received a broadcasted recover request I0128 03:15:39.544558 15345 replica.cpp:638] Replica in VOTING status received a broadcasted recover request I0128 03:15:40.072347 15343 replica.cpp:638] Replica in VOTING status received a broadcasted recover request I0128 03:15:41.025926 15345 replica.cpp:638] Replica in VOTING status received a broadcasted recover request I0128 03:15:41.695303 15349 replica.cpp:638] Replica in VOTING status received a broadcasted recover request I0128 03:15:42.493906 15345 replica.cpp:638] Replica in VOTING status received a broadcasted recover request I0128 03:15:43.086762 15343 replica.cpp:638] Replica in VOTING status received a broadcasted recover request I0128 03:15:43.831442 15346 replica.cpp:638] Replica in VOTING status received a broadcasted recover request I0128 03:15:44.787384 15343 replica.cpp:638] Replica in VOTING status received a broadcasted recover request I0128 03:15:45.527914 15345 replica.cpp:638] Replica in VOTING status received a broadcasted recover request I0128 03:15:46.005728 15349 detector.cpp:138] Detected a new leader: (id='2272') I0128 03:15:46.005892 15349 group.cpp:659] Trying to get '/mesos/info_002272' in ZooKeeper I0128 03:15:46.006530 15349 detector.cpp:433] A new leading master (UPID=master@10.27.16.214:5050) is detected I0128 03:15:46.006624 15349 master.cpp:1263] The newly elected leader is master@10.27.16.214:5050 with id 20150128-031430-3591379722-5050-15326 I0128 03:15:46.006664 15349 master.cpp:1276] Elected as the leading master!
Re: how to create rpm package
Those RPMs are built for CentOS 6 i think. For testing, you can get it to start up by just dropping in a symlink : /lib64/libsasl2.so.2 - /lib64/libsasl2.so.3 On 26 January 2015 at 01:33, Yu Wenhua s...@yuwh.net wrote: [root@zone1_0 ~]# uname -a Linux zone1_0 3.10.0-123.el7.x86_64 #1 SMP Mon Jun 30 12:09:22 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux I use CentOS7 and install the rpm from https://mesosphere.com/2014/07/17/mesosphere-package-repositories/ then got this error message: mesos-slave --master=192.168.3.5:5050 mesos-slave: error while loading shared libraries: libsasl2.so.2: cannot open shared object file: No such file or directory [root@zone1_0 ~]# ls /usr/lib64/libsasl2.so* -l lrwxrwxrwx. 1 root root 17 Nov 28 04:31 /usr/lib64/libsasl2.so.3 - libsasl2.so.3.0.0 -rwxr-xr-x. 1 root root 121296 Jun 10 2014 /usr/lib64/libsasl2.so.3.0.0 [root@zone1_0 ~]# Maybe I have to build a rpm from the src file. Right? From: Tim St Clair [mailto:tstcl...@redhat.com] Sent: 2015年1月23日 22:59 To: user@mesos.apache.org Subject: Re: how to create rpm package What's your distro+version? Cheers, Tim From: Yu Wenhua s...@yuwh.net To: user@mesos.apache.org Sent: Friday, January 23, 2015 3:27:36 AM Subject: how to create rpm package Hi, Can anyone tell me how to build a mesos rpm package? So I can deploy it to slave node easily Thanks. Yu. -- Cheers, Timothy St. Clair Red Hat Inc.
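i.e. something like (assuming the paths from the ls output above — on CentOS 7, /lib64 is the same place as /usr/lib64):

    ln -s /lib64/libsasl2.so.3 /lib64/libsasl2.so.2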
Re: cluster wide init
On 23 January 2015 at 21:20, Sharma Podila spod...@netflix.com wrote: Here's one possible scenario: A DataCenter runs Databases, Webservers, MicroServices, Hadoop or other batch jobs, stream processing jobs, etc. There's 1000s, if not 100s, of systems available for all of this. Ideally, systems running Databases are configured to run different services in their init than one running batch jobs. However, because one would want to achieve elasticity of different services (#systems running DBs vs. batch, for example) within a single Mesos cluster, Mesos would have to determine what services run on the system at the current time. It's like a newly installed system comes up and connects into Mesos and says, hello there, I am an 8-core 48GB 1Gb eth system ready for service, what would you like me to do?. Mesos can choose to make it run any one or more of the services which would determine the set of init services to launch. And that may change over time as DC traffic changes. What you're describing here is essentially the value proposition of Mesos+marathon. But there are still many services you need to provide in a datacenter that aren't as elastic as we'd like, and don't necessarily benefit from the flexibility you're describing. It's easy enough to lay those out with the same automation you'd use to setup your mesos processes under a more conventional init system. Someone mentioned Ansible earlier, that's worked out really well in my experience. There's a simple Vagrant based playbook here if anyone's interested. https://github.com/rasputnik/mesos-centos The nice thing about Ansible is this scales up to hundreds of servers easily, simply by changing the inventory file.
Re: how to create rpm package
There's an RPM repo, see documentation at: https://mesosphere.com/2014/07/17/mesosphere-package-repositories/ On 23 January 2015 at 09:27, Yu Wenhua s...@yuwh.net wrote: Hi, Can anyone tell me how to build a mesos rpm package? So I can deploy it to slave node easily Thanks. Yu.
Re: hadoop job stuck.
To view the slaves logs, you need to be able to connect to that URL from your browser, not the master (the data is read directly from the slave by your browser, it doesn't go via the master). On 15 January 2015 at 21:42, Dan Dong dongda...@gmail.com wrote: Hi, All, Now sandbox could be viewed on mesos UI, I see the following info( The same error appears on every slave sandbox.): Failed to connect to slave '20150115-144719-3205108908-5050-4552-S0' on 'centos-2.local:5051'. Potential reasons: The slave's hostname, 'centos-2.local', is not accessible from your network The slave's port, '5051', is not accessible from your network I checked that: slave centos-2.local can be login from any machine in the cluster without password by ssh centos-2.local ; port 5051 on slave centos-2.local could be connected from master by telnet centos-2.local 5051 Confused what's the problem here? Cheers, Dan 2015-01-14 15:33 GMT-06:00 Brenden Matthews brenden.matth...@airbnb.com: Would need the task logs from the slave which the TaskTracker was launched on, to debug this further. On Wed, Jan 14, 2015 at 1:28 PM, Dan Dong dongda...@gmail.com wrote: Checked /etc/hosts is correct, master and slave can ssh login each other by hostname without password, and hadoop runs well without mesos, but it stucks when running on mesos. Cheers, Dan 2015-01-14 15:02 GMT-06:00 Brenden Matthews brenden.matth...@airbnb.com: At a first glance, it looks like `/etc/hosts` might be set incorrectly and it cannot resolve the hostname of the worker. See here for more: https://wiki.apache.org/hadoop/UnknownHost On Wed, Jan 14, 2015 at 12:32 PM, Vinod Kone vinodk...@apache.org wrote: What do the master logs say? On Wed, Jan 14, 2015 at 12:21 PM, Dan Dong dongda...@gmail.com wrote: Hi, When I run hadoop jobs on Mesos(0.21.0), the jobs are stuck for ever: 15/01/14 13:59:30 INFO mapred.FileInputFormat: Total input paths to process : 8 15/01/14 13:59:30 INFO mapred.JobClient: Running job: job_201501141358_0001 15/01/14 13:59:31 INFO mapred.JobClient: map 0% reduce 0% From jobtracker log I see: 2015-01-14 13:59:35,542 INFO org.apache.hadoop.mapred.ResourcePolicy: Launching task Task_Tracker_0 on http://centos-2.local:31911 with mapSlots=1 reduceSlots=0 2015-01-14 14:04:35,552 WARN org.apache.hadoop.mapred.MesosScheduler: Tracker http://centos-2.local:31911 failed to launch within 300 seconds, killing it I started manually namenode and jobtracker on master node and datanode on slave, but I could not see tasktracker started by mesos on slave. Note that if I ran hadoop directly without Mesos( of course the conf files are different and tasktracker will be started manually on slave), everything works fine. Any hints? Cheers, Dan
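A quick sanity check, run from the machine the browser is on rather than from the master (hostname and port taken from the error message above; the state.json path is the agent's usual HTTP endpoint, adjust if your version differs):

    # can the browser host resolve the slave's hostname at all?
    getent hosts centos-2.local

    # can it reach the slave's HTTP port and get a response?
    curl -v http://centos-2.local:5051/state.json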
Re: conf files location of mesos.
Might be worth getting a packaged release for your OS, especially if you're new to this. On 7 January 2015 at 16:53, Dan Dong dongda...@gmail.com wrote: Hi, Brian, It's not there: ls /etc/default/mesos ls: cannot access /etc/default/mesos: No such file or directory I installed mesos from source tar ball by configure;make;make install as normal user. Cheers, Dan 2015-01-07 10:43 GMT-06:00 Brian Devins brian.dev...@dealer.com: Try ls /etc/default/mesos instead From: Dan Dong dongda...@gmail.com Reply-To: user@mesos.apache.org user@mesos.apache.org Date: Wednesday, January 7, 2015 at 11:38 AM To: user@mesos.apache.org user@mesos.apache.org Subject: Re: conf files location of mesos. Hi, All, Thanks for your helps, I'm using version 0.21.0 of mesos. But I do not see any of the dirs of 'etc' or 'var' under my build directory(and any subdirs). What is the default conf files location for mesos 0.21.0? ls ~/mesos-0.21.0/build/ 3rdparty bin config.log config.lt config.status ec2 include lib libexec libtool Makefile mesos.pc mpi sbin share src Cheers, Dan 2015-01-07 9:47 GMT-06:00 Tomas Barton barton.to...@gmail.com: Hi Dan, this depends on your distribution. Mesosphere package comes with wrapper script which uses configuration placed in /etc/default/mesos and /etc/mesos-master, /etc/mesos-slave https://github.com/mesosphere/mesos-deb-packaging/blob/master/mesos-init-wrapper which distribution do you use? Tomas On 7 January 2015 at 16:23, Dan Dong dongda...@gmail.com wrote: Hi, After installation of mesos on my cluster, where could I find the location of configuration files? E.g: mesos.conf, masters, slaves etc. I could not find any of them under the prefix dir and subdirs (configure --prefix=/home/dan/mesos-0.21.0/build/). Are there examples for the conf files? Thanks! Cheers, Dan Brian Devins* |* Java Developer brian.dev...@dealer.com [image: Dealer.com]
Re: Problems of running mesos-0.20.0 with zookeeper
The quorum flag is for the number of mesos masters, not zookeepers. if you only have one master, it's going to have trouble reaching a quorum of 2 :) either set --quorum=1 or spin up more masters. On 6 November 2014 21:01, sujinzhao sujinz...@gmail.com wrote: Hi,all, I set up zookeeper service with three machines zoo1, zoo2, zoo3, and also installed 1 mesos master and 2 slaves on another three nodes, I tried to run master and slaves with: ./mesos-master.sh --ip=master-ip --zk=zk://zoo1:2181,zoo2:2181,zoo3:2181/mesos --quorum=2 ./mesos-slave.sh --ip=slave-ip --master=zk://zoo1:2181,zoo2:2181,zoo3:2181/mesos I also created the /mesos znode before running the above commands, but I got the following error: Recovering from registrar Recovering registrar Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins *** Check failure stack trace: *** @ 0x7f3c1ea105cd google::LogMessage::Fail() ... after reading the master log, I found that before causing error, master has already been elected successfully, but the leader failed in recovering from registrar, so I guess this error has little relationship with zookeeper. after googleing I found that other people also encountered this problem, but with no solution, I also exclude the possible reason of ssh between master/slave and zookeeper servers with no password. So, could somebody be kindly to tell me how to solve this error? any suggestions will be appreciated. THANKS.
Re: Problems of running mesos-0.20.0 with zookeeper
Golden Rule : Don't use even numbers of members with quorum systems. You need a quorum to function so with 2 masters and quorum=2, you can't ever take a member down. With 2 masters and quorum=1, you're asking for split brain. (this is exactly the same with zookeeper by the way, it's also a quorum system) If you have 1 master, quorum=1 if you have 3 masters, quorum=2 if you have 5 masters, quorum=3 and so on. Try that and see if it helps. On 7 November 2014 09:42, sujinzhao sujinz...@gmail.com wrote: In fact, I also tried with launching 2 masters on two separate machines, at first, one of them was successfully elected as a leader, and both of them printed several lines of messages: Replica in EMPTY status received a broadcasted recover request Received a recover response from a replica in EMPTY status then the leader master aborted after outputing errors: Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins *** Check failure stack trace: *** @ 0x7f3c1ea105cd google::LogMessage::Fail() .. and next, the second master became the new leader, it also tried to recovery from the registrar, but also failed and printed errors before aborted: Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins *** Check failure stack trace: *** @ 0x7f3c1ea105cd google::LogMessage::Fail() ... So I guess that's not problems of zookeeper, it's the elected leader can not recover from registrar, could somebody be kind to illustrate some principles of mesos registry, or give me some suggestions? THANKS. david.j.palaitis david.j.palai...@gmail.com编写: With a single master, you should not set quorum=2 Original message From: sujinzhao sujinz...@gmail.com Date:11/06/2014 4:01 PM (GMT-05:00) To: user@mesos.apache.org Cc: Subject: Problems of running mesos-0.20.0 with zookeeper Hi,all, I set up zookeeper service with three machines zoo1, zoo2, zoo3, and also installed 1 mesos master and 2 slaves on another three nodes, I tried to run master and slaves with: ./mesos-master.sh --ip=master-ip --zk=zk://zoo1:2181,zoo2:2181,zoo3:2181/mesos --quorum=2 ./mesos-slave.sh --ip=slave-ip --master=zk://zoo1:2181,zoo2:2181,zoo3:2181/mesos I also created the /mesos znode before running the above commands, but I got the following error: Recovering from registrar Recovering registrar Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins *** Check failure stack trace: *** @ 0x7f3c1ea105cd google::LogMessage::Fail() ... after reading the master log, I found that before causing error, master has already been elected successfully, but the leader failed in recovering from registrar, so I guess this error has little relationship with zookeeper. after googleing I found that other people also encountered this problem, but with no solution, I also exclude the possible reason of ssh between master/slave and zookeeper servers with no password. So, could somebody be kindly to tell me how to solve this error? any suggestions will be appreciated. THANKS.
Re: Do i really need HDFS?
Be interested to know what that is, if you don't mind sharing. We're thinking of deploying a Ceph cluster for another project anyway, it seems to remove some of the chokepoints/points of failure HDFS suffers from but I've no idea how well it can interoperate with the usual HDFS clients (Spark in my particular case but I'm trying to keep this general). On 21 October 2014 13:16, David Greenberg dsg123456...@gmail.com wrote: We use spark without HDFS--in our case, we just use ansible to copy the spark executors onto all hosts at the same path. We also load and store our spark data from non-HDFS sources. On Tue, Oct 21, 2014 at 4:57 AM, Dick Davies d...@hellooperator.net wrote: I think Spark needs a way to send jobs to/from the workers - the Spark distro itself will pull down the executor ok, but in my (very basic) tests I got stuck without HDFS. So basically it depends on the framework. I think in Sparks case they assume most users are migrating from an existing Hadoop deployment, so HDFS is sort of assumed. On 20 October 2014 23:18, CCAAT cc...@tampabay.rr.com wrote: On 10/20/14 11:46, Steven Schlansker wrote: We are running Mesos entirely without HDFS with no problems. We use Docker to distribute our application to slave nodes, and keep no state on individual nodes. Background: I'm building up a 3 node cluster to run mesos and spark. No legacy Hadoop needed or wanted. I am using btrfs for the local file system, with (2) drives set up for raid1 on each system. So you are suggesting that I can install mesos + spark + docker and not a DFS on these (3) machines? Will I need any other softwares? My application is a geophysical fluid simulator, so scala, R, and all sorts of advanced math will be required on the cluster for the Finite Element Methods. James
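For what it's worth, one way to avoid HDFS for the executor distribution with Spark on Mesos is to serve the Spark tarball over plain HTTP and point spark.executor.uri at it - the Mesos fetcher pulls it onto each slave. A hedged sketch of spark-defaults.conf (the URL and Spark version are made up):
spark.master         mesos://zk://zoo1:2181,zoo2:2181,zoo3:2181/mesos
spark.executor.uri   http://fileserver.example.com/spark-1.1.0-bin-hadoop2.4.tgz
Input/output data then has to come from something the job can reach directly (S3, Cassandra, NFS, etc.) rather than hdfs:// paths.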
Re: Cassandra Mesos Framework Issue
Issue seems to be with how the tasks are asking for port resources - I'd guess whichever tutorial you're using may be using an old/invalid syntax. What tutorial are you working from? On 18 October 2014 15:08, David Palaitis david.palai...@twosigma.com wrote: I am having trouble getting Cassandra Mesos to work in a simple test environment. The framework connects, but tasks get lost with the following error. 215872 [Thread-113] INFO mesosphere.cassandra.CassandraScheduler - Got new resource offers ArrayBuffer(abc.def.ghi.com) 215875 [Thread-113] INFO mesosphere.cassandra.CassandraScheduler - resources offered: List((cpus,32.0), (mem,127877.0), (disk,2167529.0), (ports,0.0)) 215875 [Thread-113] INFO mesosphere.cassandra.CassandraScheduler - resources required: List((cpus,1.0), (mem,2048.0), (ports,0.0), (disk,1000.0)) 215877 [Thread-113] INFO mesosphere.cassandra.CassandraScheduler - Accepted offer: abc.def.ghi.com 215889 [Thread-114] INFO mesosphere.cassandra.CassandraScheduler - Received status update for task task1413640484861: TASK_LOST (Task uses invalid resources: ports(*):0) I tried configuring a port resource in the slave and restarting but still get the same error e.g. ${INSTALL_DIR}/sbin/mesos-slave \ --master=zk://abc.def.ghi.com:2181/mesos \ --resources='mem:245760;ports(*):[31000-32000]' Any leads?
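To rule the slave side out, it's worth confirming the ports range is actually advertised after the restart - something like this (hostname is a placeholder):
curl http://<slave-host>:5051/state.json
and checking the "ports" entry under resources. If that looks right, the fix is on the framework side: the task has to request a non-empty range (e.g. ports(*):[31000-31001]) rather than ports:0, which is exactly what the TASK_LOST message above is complaining about.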
Re: Staging docker task KILLED after 1 minute
One gotcha - the marathon timeout is in seconds, so pass '300' in your case. let us know if it works, I spotted this the other day and anecdotally it addresses the issue for some users, be good to get more feedback. On 16 October 2014 09:49, Grzegorz Graczyk gregor...@gmail.com wrote: Make sure you have --task_launch_timeout in marathon set to same value as executor_registration_timeout. https://github.com/mesosphere/marathon/blob/master/docs/docs/native-docker.md#configure-marathon On 16 October 2014 10:37, Nils De Moor nils.de.m...@gmail.com wrote: Hi, Environment: - Clean vagrant install, 1 master, 1 slave (same behaviour on production cluster with 3 masters, 6 slaves) - Mesos 0.20.1 - Marathon 0.7.3 - Docker 1.2.0 Slave config: - containerizers: docker,mesos - executor_registration_timeout: 5mins When is start docker container tasks, they start being pulled from the HUB, but after 1 minute mesos kills them. In the background though the pull is still finishing and when everything is pulled in the docker container is started, without mesos knowing about it. When I start the same task in mesos again (after I know the pull of the image is done), they run normally. So this leaves slaves with 'dirty' docker containers, as mesos has no knowledge about them. From the logs I get this: --- I1009 15:30:02.990291 1414 slave.cpp:1002] Got assigned task test-app.23755452-4fc9-11e4-839b-080027c4337a for framework 20140904-160348-185204746-5050-27588- I1009 15:30:02.990979 1414 slave.cpp:1112] Launching task test-app.23755452-4fc9-11e4-839b-080027c4337a for framework 20140904-160348-185204746-5050-27588- I1009 15:30:02.993341 1414 slave.cpp:1222] Queuing task 'test-app.23755452-4fc9-11e4-839b-080027c4337a' for executor test-app.23755452-4fc9-11e4-839b-080027c4337a of framework '20140904-160348-185204746-5050-27588- I1009 15:30:02.995818 1409 docker.cpp:743] Starting container '25ac3310-71e4-4d10-8a4b-38add4537308' for task 'test-app.23755452-4fc9-11e4-839b-080027c4337a' (and executor 'test-app.23755452-4fc9-11e4-839b-080027c4337a') of framework '20140904-160348-185204746-5050-27588-' I1009 15:31:07.033287 1413 slave.cpp:1278] Asked to kill task test-app.23755452-4fc9-11e4-839b-080027c4337a of framework 20140904-160348-185204746-5050-27588- I1009 15:31:07.034742 1413 slave.cpp:2088] Handling status update TASK_KILLED (UUID: a8ec88a1-1809-4108-b2ed-056a725ecd41) for task test-app.23755452-4fc9-11e4-839b-080027c4337a of framework 20140904-160348-185204746-5050-27588- from @0.0.0.0:0 W1009 15:31:07.034881 1413 slave.cpp:1354] Killing the unregistered executor 'test-app.23755452-4fc9-11e4-839b-080027c4337a' of framework 20140904-160348-185204746-5050-27588- because it has no tasks E1009 15:31:07.034945 1413 slave.cpp:2205] Failed to update resources for container 25ac3310-71e4-4d10-8a4b-38add4537308 of executor test-app.23755452-4fc9-11e4-839b-080027c4337a running task test-app.23755452-4fc9-11e4-839b-080027c4337a on status update for terminal task, destroying container: No container found I1009 15:31:07.035133 1413 status_update_manager.cpp:320] Received status update TASK_KILLED (UUID: a8ec88a1-1809-4108-b2ed-056a725ecd41) for task test-app.23755452-4fc9-11e4-839b-080027c4337a of framework 20140904-160348-185204746-5050-27588- I1009 15:31:07.035210 1413 status_update_manager.cpp:373] Forwarding status update TASK_KILLED (UUID: a8ec88a1-1809-4108-b2ed-056a725ecd41) for task test-app.23755452-4fc9-11e4-839b-080027c4337a of framework 20140904-160348-185204746-5050-27588- to master@10.0.10.11:5050 I1009 
15:31:07.046167 1408 status_update_manager.cpp:398] Received status update acknowledgement (UUID: a8ec88a1-1809-4108-b2ed-056a725ecd41) for task test-app.23755452-4fc9-11e4-839b-080027c4337a of framework 20140904-160348-185204746-5050-27588- I1009 15:35:02.993736 1414 slave.cpp:3010] Terminating executor test-app.23755452-4fc9-11e4-839b-080027c4337a of framework 20140904-160348-185204746-5050-27588- because it did not register within 5mins --- I already posted my question on the marathon board, as I first thought it was an issue on marathon's end: https://groups.google.com/forum/#!topic/marathon-framework/NT7_YIZnNoY Kind regards, Nils
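To make the two settings concrete, something like the following should line up the timeouts (a sketch only, and assuming - per the advice above - that the Marathon flag takes seconds on this version; worth verifying against your version's --help):
mesos-slave --containerizers=docker,mesos --executor_registration_timeout=5mins ...
marathon --master zk://... --task_launch_timeout 300 ...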
Re: Multiple disks with Mesos
To answer point 2) - yes, your executors will create their 'sandboxes' under work_dir. On 8 October 2014 00:13, Arunabha Ghosh arunabha...@gmail.com wrote: Thanks Steven ! On Tue, Oct 7, 2014 at 4:08 PM, Steven Schlansker sschlans...@opentable.com wrote: On Oct 7, 2014, at 4:06 PM, Arunabha Ghosh arunabha...@gmail.com wrote: Hi, I would like to run Mesos slaves on machines that have multiple disks. According to the Mesos configuration page I can specify a work_dir argument to the slaves. 1) Can the work_dir argument contain multiple directories ? 2) Is the work_dir where Mesos will place all of its data ? So If I started a task on Mesos, would the slave place the task's data (stderr, stdout, task created directories) inside work_dir ? We stitch our disks together before Mesos gets its hands on it using a technology such as LVM or btrfs, so that the work_dir is spread across the multiple disks transparently.
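A minimal sketch of the "stitch the disks together first" approach with LVM (device names and mount points are examples only):
pvcreate /dev/sdb /dev/sdc
vgcreate mesosvg /dev/sdb /dev/sdc
lvcreate -l 100%FREE -n work mesosvg
mkfs.ext4 /dev/mesosvg/work
mount /dev/mesosvg/work /data
mesos-slave --work_dir=/data/mesos ...
The slave then sees one large work_dir and sandboxes are spread across the underlying disks transparently.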
Re: Orphaned Docker containers in Mesos 0.20.1
One thing to check - have you upped --executor_registration_timeout from the default of 1min? a docker pull can easily take longer than that. On 2 October 2014 22:18, Michael Babineau michael.babin...@gmail.com wrote: I'm seeing an issue where tasks are being marked as killed but remain running. The tasks all run via the native Docker containerizer and are started from Marathon. The net result is additional, orphaned Docker containers that must be stopped/removed manually. Versions: - Mesos 0.20.1 - Marathon 0.7.1 - Docker 1.2.0 - Ubuntu 14.04 Environment: - 3 ZK nodes, 3 Mesos Masters, and 3 Mesos Slaves (all separate instances) on EC2 Here's the task in the Mesos UI: (note that stderr continues to update with the latest container output) Here's the still-running Docker container: $ docker ps|grep 1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f 3d451b8213ea docker.thefactory.com/ace-serialization:f7aa1d4f46f72d52f5a20ef7ae8680e4acf88bc0 \/bin/sh -c 'java26 minutes ago Up 26 minutes 9990/tcp mesos-1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f Here are the Mesos logs associated with the task: $ grep eda431d7-4a74-11e4-a320-56847afe9799 /var/log/mesos/mesos-slave.INFO I1002 20:44:39.176024 1528 slave.cpp:1002] Got assigned task serialization.eda431d7-4a74-11e4-a320-56847afe9799 for framework 20140919-224934-1593967114-5050-1518- I1002 20:44:39.176257 1528 slave.cpp:1112] Launching task serialization.eda431d7-4a74-11e4-a320-56847afe9799 for framework 20140919-224934-1593967114-5050-1518- I1002 20:44:39.177287 1528 slave.cpp:1222] Queuing task 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' for executor serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework '20140919-224934-1593967114-5050-1518- I1002 20:44:39.191769 1528 docker.cpp:743] Starting container '1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f' for task 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' (and executor 'serialization.eda431d7-4a74-11e4-a320-56847afe9799') of framework '20140919-224934-1593967114-5050-1518-' I1002 20:44:43.707033 1521 slave.cpp:1278] Asked to kill task serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework 20140919-224934-1593967114-5050-1518- I1002 20:44:43.707811 1521 slave.cpp:2088] Handling status update TASK_KILLED (UUID: 4f5bd9f9-0625-43de-81f6-2c3423b1ce12) for task serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework 20140919-224934-1593967114-5050-1518- from @0.0.0.0:0 W1002 20:44:43.708273 1521 slave.cpp:1354] Killing the unregistered executor 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' of framework 20140919-224934-1593967114-5050-1518- because it has no tasks E1002 20:44:43.708375 1521 slave.cpp:2205] Failed to update resources for container 1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f of executor serialization.eda431d7-4a74-11e4-a320-56847afe9799 running task serialization.eda431d7-4a74-11e4-a320-56847afe9799 on status update for terminal task, destroying container: No container found I1002 20:44:43.708524 1521 status_update_manager.cpp:320] Received status update TASK_KILLED (UUID: 4f5bd9f9-0625-43de-81f6-2c3423b1ce12) for task serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework 20140919-224934-1593967114-5050-1518- I1002 20:44:43.708709 1521 status_update_manager.cpp:373] Forwarding status update TASK_KILLED (UUID: 4f5bd9f9-0625-43de-81f6-2c3423b1ce12) for task serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework 20140919-224934-1593967114-5050-1518- to master@10.2.0.182:5050 I1002 20:44:43.728991 1526 status_update_manager.cpp:398] Received status update 
acknowledgement (UUID: 4f5bd9f9-0625-43de-81f6-2c3423b1ce12) for task serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework 20140919-224934-1593967114-5050-1518- I1002 20:47:05.904324 1527 slave.cpp:2538] Monitoring executor 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' of framework '20140919-224934-1593967114-5050-1518-' in container '1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f' I1002 20:47:06.311027 1525 slave.cpp:1733] Got registration for executor 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' of framework 20140919-224934-1593967114-5050-1518- from executor(1)@10.2.1.34:29920 I'll typically see a barrage of these in association with a Marathon app update (which deploys new tasks). Eventually, one container sticks and we get a RUNNING task instead of a KILLED one. Where else can I look?
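If the default is still in place, a first thing to try is simply raising that timeout on the slaves (a sketch; keep the rest of your flags as they are):
mesos-slave --containerizers=docker,mesos --executor_registration_timeout=10mins ...
Already-orphaned containers still have to be cleaned up by hand, e.g.:
docker ps | grep mesos-
docker rm -f <container-id>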
Re: Build on Amazon Linux
What version of docker does that give you, out of interest? mainline EL7 is still shipping a pre-1.0 that won't work with mesos (although since docker is just a static Go binary, it's trivial to overwrite /usr/bin/docker and get everything to work). On 25 September 2014 20:23, John Mickey j...@pithoslabs.com wrote: Thanks to all for the help Tim - thanks for pointing out the obvious CCAAT - Great article Here are the instructions for getting Mesos to run on Amazon Linux amzn-ami-hvm-2014.03.1.x86_64-ebs (ami-383a5008) (us-west-2) On a single instance, as root, proof of concept setup Install Docker yum -y docker service docker start Install Tools to build Mesos (From Apache Mesos documentation) yum -y groupinstall Development Tools yum -y install python-devel java-1.7.0-openjdk-devel zlib-devel libcurl-devel openssl-devel cyrus-sasl-devel cyrus-sasl-md5 wget http://mirror.nexcess.net/apache/maven/maven-3/3.0.5/binaries/apache-maven-3.0.5-bin.tar.gz tar -zxf apache-maven-3.0.5-bin.tar.gz -C /opt/ ln -s /opt/apache-maven-3.0.5/bin/mvn /usr/bin/mvn Install Oracle Java (Amazon Linux ships with OpenJDK) wget --no-check-certificate --no-cookies --header Cookie: oraclelicense=accept-securebackup-cookie http://download.oracle.com/otn-pub/java/jdk/7u67-b01/jdk-7u67-linux-x64.rpm rpm -i jdk-7u67-linux-x64.rpm export JAVA_HOME=/usr/java/jdk1.7.0_67 export PATH=$PATH:/usr/java/jdk1.7.0_67/bin alternatives --install /usr/bin/java java /usr/java/jdk1.7.0_67/bin/java 2 alternatives --config java java -version Build Mesos wget http://mirror.olnevhost.net/pub/apache/mesos/0.20.1/mesos-0.20.1.tar.gz tar -zxf mesos-0.20.1.tar.gz cd mesos mkdir build cd build ../configure make make check (This will fail on a cgroups issues, see earlier in this thread) make install Run Mesos Master and Slave /usr/local/sbin/mesos-master --work_dir=/tmp/mesos --zk=zk://localhost:2181/mesos --quorum=1 --ip=1.2.3.4 /usr/local/sbin/mesos-slave --master=zk://localhost:2181/mesos --containerizers=docker,mesos On Thu, Sep 25, 2014 at 1:56 PM, Tim St Clair tstcl...@redhat.com wrote: It looks like docker-daemon isn't running. Cheers, Tim - Original Message - From: John Mickey j...@pithoslabs.com To: user@mesos.apache.org Sent: Thursday, September 25, 2014 10:33:42 AM Subject: Re: Build on Amazon Linux I tried the --help options before replying in my previous post, but did not do a good job of explaining what I was seeing --isolation=VALUE Isolation mechanisms to use, e.g., 'posix/cpu,posix/mem', or 'cgroups/cpu,cgroups/mem', or network/port_mapping (configure with flag: --with-network-isolator to enable), or 'external'. (default: posix/cpu,posix/mem) If I run this (Master is running) $ /usr/local/sbin/mesos-slave --master=zk://localhost:2181/mesos --containerizers=docker,mesos --isolation=posix/cpu,posix/mem Slave will not start with this message $ I0925 15:26:19.118268 18604 main.cpp:128] Version: 0.20.0 Failed to create a containerizer: Could not create DockerContainerizer: Failed to find a mounted cgroups hierarchy for the 'cpu' subsystem; you probably need to mount cgroups manually! The default is posix/cpu,posix/mem Any ideas why it is still trying to use cgroups? Once I get this working, I will post the steps for Amazon Linux. Thank you again for the help. On Wed, Sep 24, 2014 at 4:31 PM, Tim St Clair tstcl...@redhat.com wrote: $ mesos-slave --isolation='posix/cpu,posix/mem' ... for ref: $ mesos-slave --help ... 
--isolation=VALUE Isolation mechanisms to use, e.g., 'posix/cpu,posix/mem', or 'cgroups/cpu,cgroups/mem', or network/port_mapping (configure with flag: --with-network-isolator to enable), ... Cheers, Tim - Original Message - From: John Mickey j...@pithoslabs.com To: user@mesos.apache.org Sent: Wednesday, September 24, 2014 4:03:37 PM Subject: Re: Build on Amazon Linux Thank you again for the responses. What is the option to remove cgroups isolation from the slave start? I ran /usr/local/sbin/mesos-slave --help and do not see an option to remove cgroups isolation from the slave start On Wed, Sep 24, 2014 at 1:48 PM, Tim St Clair tstcl...@redhat.com wrote: You likely have a systemd problem, and you can edit your slave startup to remove cgroups isolation until 0.21.0 is released. # systemd cgroup integration, *only* enable on master/0.21.0 #export MESOS_isolation='cgroups/cpu,cgroups/mem' #export MESOS_cgroups_root='system.slice/mesos-slave.service' #export
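On the cgroups error above, a hedged sketch of mounting the hierarchies by hand (the mount point varies by distro, and this assumes the docker step earlier was meant to read "yum -y install docker"):
mount -t tmpfs cgroup_root /sys/fs/cgroup
for sys in cpu cpuacct memory freezer; do mkdir -p /sys/fs/cgroup/$sys; mount -t cgroup -o $sys $sys /sys/fs/cgroup/$sys; done
After that the cgroups-based isolators (and the Docker containerizer) should be able to find the 'cpu' subsystem.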
Re: Running mesos-slave in Docker container
The master is advertising itself as being on 127.0.0.1 - try running it with an --ip flag. On 23 September 2014 11:10, Grzegorz Graczyk gregor...@gmail.com wrote: Thanks for your response! Mounting /sys did the job, cgroups are working, but now mesos-slave is just crushing after detecting new master or so (there's nothing useful in the logs - is there a way to make them more verbose?) Last lines of logs from mesos-slave: I0923 10:03:24.07985910 detector.cpp:426] A new leading master (UPID=master@127.0.0.1:5050) is detected I0923 10:03:26.076053 9 slave.cpp:3195] Finished recovery I0923 10:03:26.076505 9 slave.cpp:589] New master detected at master@127.0.0.1:5050 I0923 10:03:26.076732 9 slave.cpp:625] No credentials provided. Attempting to register without authentication I0923 10:03:26.076812 9 slave.cpp:636] Detecting new master I0923 10:03:26.076864 9 status_update_manager.cpp:167] New master detected at master@127.0.0.1:5050 There's no problem in running mesos-master in the container(at least there wasn't any in my case, for now) On 23 September 2014 09:41, Tim Chen t...@mesosphere.io wrote: Hi Grzegorz, To run Mesos master|slave in a docker container is not straight forward because we utilize kernel features therefore you need to explicitly test out the features you like to use with Mesos with slave/master in Docker. Gabriel during the Mesosphere hackathon has got master and slave running in docker containers, and he can probably share his Dockerfile and run command. I believe one work around to get cgroups working with Docker run is to mount /sys into the container (mount -v /sys:/sys). Gabriel do you still have the command you used to run slave/master with Docker? Tim On Tue, Sep 23, 2014 at 12:24 AM, Grzegorz Graczyk gregor...@gmail.com wrote: I'm trying to run mesos-slave inside Docker container, but it can't start due to problem with mounting cgroups. I'm using: Kernel Version: 3.13.0-32-generic Operating System: Ubuntu 14.04.1 LTS Docker: 1.2.0(commit fa7b24f) Mesos: 0.20.0 Following error appears: I0923 07:11:20.92147519 main.cpp:126] Build: 2014-08-22 05:04:26 by root I0923 07:11:20.92160819 main.cpp:128] Version: 0.20.0 I0923 07:11:20.92162019 main.cpp:131] Git tag: 0.20.0 I0923 07:11:20.92162819 main.cpp:135] Git SHA: f421ffdf8d32a8834b3a6ee483b5b59f65956497 Failed to create a containerizer: Could not create DockerContainerizer: Failed to find a mounted cgroups hierarchy for the 'cpu' subsystem; you probably need to mount cgroups manually! I'm running docker container with command: docker run --name mesos-slave --privileged --net=host -v /var/run/docker.sock:/var/run/docker.sock -v /var/lib/docker:/var/lib/docker -v /usr/local/bin/docker:/usr/local/bin/docker gregory90/mesos-slave --containerizers=docker,mesos --master=zk://localhost:2181/mesos --ip=127.0.0.1 Everything is running on single machine. Everything works as expected when mesos-slave is run outside docker container. I'd appreciate some help.
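A sketch of what that would look like for the master (the address is a placeholder for whatever the slaves can actually reach; other flags as in your current setup):
mesos-master --ip=192.168.1.10 --hostname=192.168.1.10 --zk=zk://localhost:2181/mesos --quorum=1 --work_dir=/var/lib/mesos
With --net=host the container shares the host's interfaces, so the flag just controls which address the master advertises to slaves and frameworks.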
Re: [VOTE] Release Apache Mesos 0.20.1 (rc2)
Don't suppose there's any chance of a fix for https://issues.apache.org/jira/browse/MESOS-1195 is there? (I'll settle for a workaround to get mesos running on EL7 somehow, mind) On 18 September 2014 18:18, Adam Bordelon a...@mesosphere.io wrote: Great. I'll roll that into an rc3 today. Any other patch requests for rc3? On Thu, Sep 18, 2014 at 2:36 AM, Benjamin Hindman benjamin.hind...@gmail.com wrote: I've committed Tim's fix, we can cut another release candidate and restart the vote. On Wed, Sep 17, 2014 at 11:07 PM, Tim Chen t...@mesosphere.io wrote: -1 The docker test failed when I removed the image, and found a problem from the docker pull implementation. I've created a reviewboard for a fix: https://reviews.apache.org/r/25758 Will like to get this fixed before releasing it. Tim On Wed, Sep 17, 2014 at 9:10 PM, Vinod Kone vinodk...@gmail.com wrote: +1 (binding) make check passes on CentOS 5.5 w/ gcc 4.8.2. On Wed, Sep 17, 2014 at 7:42 PM, Adam Bordelon a...@mesosphere.io wrote: Update: The vote is open until Mon Sep 22 10:00:00 PDT 2014 and passes if a majority of at least 3 +1 PMC votes are cast. On Wed, Sep 17, 2014 at 6:27 PM, Adam Bordelon a...@mesosphere.io wrote: Hi all, Please vote on releasing the following candidate as Apache Mesos 0.20.1. 0.20.1 includes the following: Minor bug fixes for docker integration, network isolation, etc. The CHANGELOG for the release is available at: https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.20.1-rc2 The candidate for Mesos 0.20.1 release is available at: https://dist.apache.org/repos/dist/dev/mesos/0.20.1-rc2/mesos-0.20.1.tar.gz The tag to be voted on is 0.20.1-rc2: https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.20.1-rc2 The MD5 checksum of the tarball can be found at: https://dist.apache.org/repos/dist/dev/mesos/0.20.1-rc2/mesos-0.20.1.tar.gz.md5 The signature of the tarball can be found at: https://dist.apache.org/repos/dist/dev/mesos/0.20.1-rc2/mesos-0.20.1.tar.gz.asc The PGP key used to sign the release is here: https://dist.apache.org/repos/dist/release/mesos/KEYS The JAR is up in Maven in a staging repository here: https://repository.apache.org/content/repositories/orgapachemesos-1034 Please vote on releasing this package as Apache Mesos 0.20.1! The vote is open until and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Mesos 0.20.1 [ ] -1 Do not release this package because ... Thanks, -Adam-
Re: Sandbox Log Links
I don't think that's the issue - I have a custom work_dir too and can see the logs fine. Don't they still get served up from the slaves themselves (port 5051)? Maybe you've got a firewall blocking that from where you're viewing the mesos UI? On 4 September 2014 23:58, John Omernik j...@omernik.com wrote: Thanks Tim. Some testing showed that when I moved to 0.20, I set up the slaves to use a specific log directory rather than just defaulting to /tmp. Basically, if you specify a custom work_dir for the slave, the master doesn't know (I am guessing?) where to find the logs? This seems like something that should work (if you change the work_dir, it should update the master with where to look for logs in the gui). Thoughts? On Thu, Sep 4, 2014 at 5:34 PM, Tim Chen t...@mesosphere.io wrote: Hi John, Take a look at the slave log and see if your task failed, and what failure message came with the task failure. Tim On Thu, Sep 4, 2014 at 3:24 PM, John Omernik j...@omernik.com wrote: Hey all, I upgraded to 0.20 and when I click on sandbox, the link is good, but there are no further links for logs (i.e. standard err, out etc) like there were in 0.19. I have changed my log location, but it should still work... Curious what I can look at to troubleshoot. Thanks! John
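An easy way to test that theory from the machine you're browsing on (the hostname is whatever the sandbox link points at):
curl http://<slave-hostname>:5051/state.json
If that times out or fails to resolve, it's a firewall/DNS problem between your browser and the slave rather than anything to do with the work_dir.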
Re: Mesos 0.19 registrar upgrade
On 22 July 2014 10:40, Tomas Barton barton.to...@gmail.com wrote: I have 4 Mesos masters, which would mean quorum > 2, i.e. quorum=3, right? Yes, that's right. 2 won't be enough. quorum=1, mesos-masters=1 quorum=2, mesos-masters=3 quorum=3, mesos-masters=5 quorum=4, mesos-masters=7 Is it possible to have a non-even number of Mesos masters, or is it just a bad idea? Yes, it's a bad idea since this change - it's always been a bad idea to run an even number of zookeepers and now that extends to the mesos masters. 4 masters gives you no extra redundancy over 3, and your likelihood of node loss increases slightly (as you now have an extra server to potentially break).
Re: how to update master cluster
For provisioning yes , for ad-hoc maintenance tasks won't help at all. On 16 July 2014 11:29, Nayeem Syed nay...@cronycle.com wrote: Thanks for those! I will give it a try to get some deployment through ansible. I was also wondering if Cloudformation might be good for this? As it clears up the things very cleanly when you remove the formation? Though I find their JSON file very difficult to navigate and their Update Feature doesnt seem to work too well.. On Wed, Jul 16, 2014 at 10:46 AM, Dick Davies d...@hellooperator.net wrote: I'd like to show you my playbooks, but unfortunately they're for a client - I can vouch for it being very easy to add nodes to a cluster etc. if you just have to edit an 'inventory' file and add IPs into the correct groups. (NB: puppet and chef will automate your infrastructure too, it's just they're not as useful for things like rolling deployments in my experience because they're agent based, so it's harder to control when each server will update and restart services). A quick Google found: http://blog.michaelhamrah.com/2014/06/setting-up-a-multi-node-mesos-cluster-running-docker-haproxy-and-marathon-with-ansible/ which might be useful. The play books linked from that post are for bootstrapping a cluster, but it's pretty simple to add a second playbook to manage rolling deploys etc. There's some Ansible examples of rolling deploys (not Mesos specific) at : http://docs.ansible.com/guide_rolling_upgrade.html On 15 July 2014 14:41, Nayeem Syed nay...@cronycle.com wrote: thanks! do you have some examples of how you are using it with ansible? i dont have specific preferences, whatever works really. On Tue, Jul 15, 2014 at 2:35 PM, Dick Davies d...@hellooperator.net wrote: You want a rolling restart i'd guess, unless you want downtime for some reason. We use Ansible, it's pretty nice. On 15 July 2014 10:47, Nayeem Syed nay...@cronycle.com wrote: whats the best way to update mesos master instances. eg I want to update things in there, install new frameworks, but at the moment I am ssh'ing to the instances and installing them one by one. that feels wrong, shouldnt it be done in parallel to all the instances? what do people currently do to keep all the masters in sync?
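For the ad-hoc side, even a one-liner goes a long way - e.g. restarting the masters one at a time (the group and inventory names are made up, and this assumes the masters run under a 'mesos-master' service):
ansible mesos-masters -i production.ini -m service -a "name=mesos-master state=restarted" --forks 1
Limiting forks to 1 keeps it strictly serial, so the cluster never loses quorum during the restart.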
Re: mesos isolation
Are you using cgroups, or the default (posix) isolation? On 11 July 2014 17:06, Asim linka...@gmail.com wrote: Hi, I am running a job on a few machines in my Linux cluster. Each machine is an Intel 8 core (with 32 threads). I see a total of 32 CPUs in /proc/cpuinfo and within the mesos web interface. When I launch a job using mesos, I see that all CPUs are used equally and not just the number of CPUs I specify for that task. Furthermore, I also see that the average per-task running time on a single machine with 5 tasks/machine is half of that with 10 tasks/machine. Within mesos, each task has 1 CPU assigned and is completely CPU bound (no dataset, no file access). As per mesos, the 5-task job uses 5 CPUs while the 10-task job uses 10 CPUs (so average task run times should be the same, unlike what I am seeing). Also, when I monitor CPU utilization, I see that all CPUs are used equally. I am really confused. Is this how mesos/container isolation is supposed to work? Thanks, Asim
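For context: with the default posix isolation the 'cpus' figure only drives offers and accounting, it isn't enforced on the host, so a CPU-bound task will happily spread across every core - which matches what's described above. Switching the slaves to cgroups looks roughly like this (a sketch; other flags unchanged, and the exact isolator names depend on your version's --help):
mesos-slave --isolation=cgroups/cpu,cgroups/mem ...
With cgroups/cpu each executor gets CPU shares proportional to its 'cpus' allocation; hard caps only apply if CFS quotas are enabled on top of that.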
number of masters and quorum
I might be wrong, but doesn't the new quorum setting mean it only makes sense to run an odd number of masters (a la zookeepers)? I.e. 4 masters is no more resilient than 3 (in fact less so, since the chance of a node failure increases with the number of nodes).
Re: Docker support in Mesos core
That's fantastic news, really good to see some integration happening between chocolate and peanut butter here. Deimos has been pretty difficult for us to deploy on our platforms (largely down to the python implementation, which has problems on the ancient python EL6 ships with). On 20 June 2014 23:40, Tobias Knaup t...@knaup.me wrote: Hi all, We've got a lot of feedback from folks who use Mesos to run Dockers at scale via Deimos, and the main wish was to make Docker a first class citizen in Mesos, instead of a plugin that needs to be installed separately. Mesosphere wants to contribute this and I already chatted with Ben H about what an implementation could look like. I'd love for folks on here that are working with Docker to chime in! I created a JIRA here: https://issues.apache.org/jira/browse/MESOS-1524 Cheers, Tobi
Re: Failed to perform recovery: Incompatible slave info detected
Fair enough, appreciate the explanation (and that you've clearly thought hard about this in the design). The cluster I hit this on was in the process of being built and had no tasks deployed, it just violated my Principle of Least Astonishment that dropping some more cores into the slaves seemed to kill them off. I can see there must be cases where this design choice is the right thing to do, now we know we can work around it easily enough - so thanks for the lesson :) On 19 June 2014 18:43, Vinod Kone vinodk...@gmail.com wrote: Yes. The idea behind storing the whole slave info is to provide safety. Imagine, the slave resources were reduced on a restart. What does this mean for already running tasks that are using more resources than the newly configured resources? Should the slave kill them? If yes, which ones? Similarly what happens when the slave attributes are changed (e.g., secure to unsecure)? Is it safe to keep running the existing tasks? As you can see, reconciliation of slave info is a complex problem. While there are some smarts we can add to the slave (e.g., increase of resources is OK while decrease is not) we haven't really seen a need for it yet. On Thu, Jun 19, 2014 at 3:03 AM, Dick Davies d...@hellooperator.net wrote: Fab, thanks Vinod. Turns out that feature (different FQDN to serve the ui up on) might well be really useful for us, so every cloud has a silver lining :) back to the metadata feature though - do you know why just the 'id' of the slaves isn't used? As it stands adding disk storage, cores or RAM to a slave will cause it to drop out of cluster - does checking the whole metadata provide any benefit vs. checking the id? On 18 June 2014 19:46, Vinod Kone vinodk...@gmail.com wrote: Filed https://issues.apache.org/jira/browse/MESOS-1506 for fixing flags/documentation. On Wed, Jun 18, 2014 at 11:33 AM, Dick Davies d...@hellooperator.net wrote: Thanks, it might be worth correcting the docs in that case then. This URL says it'll use the system hostname, not the reverse DNS of the ip argument: http://mesos.apache.org/documentation/latest/configuration/ re: the CFS thing - this was while running Docker on the slaves - that also uses cgroups so maybe resources were getting split with mesos or something? (I'm still reading up on cgroups) - definitely wasn't the case until cfs was enabled. On 18 June 2014 18:34, Vinod Kone vinodk...@gmail.com wrote: Hey Dick, Regarding slave recovery, any changes in the SlaveInfo (see mesos.proto) are considered as a new slave and hence recovery doesn't proceed forward. This is because Master caches SlaveInfo and it is quite complex to reconcile the differences in SlaveInfo. So we decided to fail on any SlaveInfo changes for now. In your particular case, https://issues.apache.org/jira/browse/MESOS-672 was committed in 0.18.0 which fixed redirection of WebUI. Included in this fix is https://reviews.apache.org/r/17573/ which changed how SlaveInfo.hostname is calculated. Since you are not providing a hostname via --hostname flag, slave now deduces the hostname from --ip flag. Looks like in your cluster the hostname corresponding to that ip is different than what 'os::hostname()' gives. Couple of options to move forward. If you want slave recovery, provide --hostname that matches the previous hostname. 
If you don't care above recovery, just remove the meta directory (rm -rf /var/mesos/meta) so that the slave starts as a fresh one (since you are not using cgroups, you will have to manually kill any old executors/tasks that are still alive on the slave). Not sure about your comment on CFS. Enabling CFS shouldn't change how much memory the slave sees as available. More details/logs would help diagnose the issue. HTH, On Wed, Jun 18, 2014 at 4:26 AM, Dick Davies d...@hellooperator.net wrote: Should have said, the CLI for this is : /usr/local/sbin/mesos-slave --master=zk://10.10.10.105:2181/mesos --log_dir=/var/log/mesos --ip=10.10.10.101 --work_dir=/var/mesos (note IP is specified, hostname is not - docs indicated hostname arg will default to the fqdn of host, but it appears to be using the value passed as 'ip' instead.) On 18 June 2014 12:00, Dick Davies d...@hellooperator.net wrote: Hi, we recently bumped 0.17.0 - 0.18.2 and the slaves now show their IPs rather than their FQDNs on the mesos UI. This broke slave recovery with the error: Failed to perform recovery: Incompatible slave info detected cpu, mem, disk, ports are all the same. so is the 'id' field. the only thing that's changed is are the 'hostname' and webui_hostname arguments (the CLI we're passing in is exactly the same as it was on 0.17.0, so presumably this is down
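Summarising the two options above as commands (flags and paths are taken from the thread; the hostname value is whatever the slave reported before the upgrade):
/usr/local/sbin/mesos-slave --master=zk://10.10.10.105:2181/mesos --log_dir=/var/log/mesos --ip=10.10.10.101 --hostname=<previous-fqdn> --work_dir=/var/mesos
or, to start the slave fresh and skip recovery entirely:
rm -rf /var/mesos/meta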
Re: Failed to perform recovery: Incompatible slave info detected
Thanks, it might be worth correcting the docs in that case then. This URL says it'll use the system hostname, not the reverse DNS of the ip argument: http://mesos.apache.org/documentation/latest/configuration/ re: the CFS thing - this was while running Docker on the slaves - that also uses cgroups so maybe resources were getting split with mesos or something? (I'm still reading up on cgroups) - definitely wasn't the case until cfs was enabled. On 18 June 2014 18:34, Vinod Kone vinodk...@gmail.com wrote: Hey Dick, Regarding slave recovery, any changes in the SlaveInfo (see mesos.proto) are considered as a new slave and hence recovery doesn't proceed forward. This is because Master caches SlaveInfo and it is quite complex to reconcile the differences in SlaveInfo. So we decided to fail on any SlaveInfo changes for now. In your particular case, https://issues.apache.org/jira/browse/MESOS-672 was committed in 0.18.0 which fixed redirection of WebUI. Included in this fix is https://reviews.apache.org/r/17573/ which changed how SlaveInfo.hostname is calculated. Since you are not providing a hostname via --hostname flag, slave now deduces the hostname from --ip flag. Looks like in your cluster the hostname corresponding to that ip is different than what 'os::hostname()' gives. Couple of options to move forward. If you want slave recovery, provide --hostname that matches the previous hostname. If you don't care above recovery, just remove the meta directory (rm -rf /var/mesos/meta) so that the slave starts as a fresh one (since you are not using cgroups, you will have to manually kill any old executors/tasks that are still alive on the slave). Not sure about your comment on CFS. Enabling CFS shouldn't change how much memory the slave sees as available. More details/logs would help diagnose the issue. HTH, On Wed, Jun 18, 2014 at 4:26 AM, Dick Davies d...@hellooperator.net wrote: Should have said, the CLI for this is : /usr/local/sbin/mesos-slave --master=zk://10.10.10.105:2181/mesos --log_dir=/var/log/mesos --ip=10.10.10.101 --work_dir=/var/mesos (note IP is specified, hostname is not - docs indicated hostname arg will default to the fqdn of host, but it appears to be using the value passed as 'ip' instead.) On 18 June 2014 12:00, Dick Davies d...@hellooperator.net wrote: Hi, we recently bumped 0.17.0 - 0.18.2 and the slaves now show their IPs rather than their FQDNs on the mesos UI. This broke slave recovery with the error: Failed to perform recovery: Incompatible slave info detected cpu, mem, disk, ports are all the same. so is the 'id' field. the only thing that's changed is are the 'hostname' and webui_hostname arguments (the CLI we're passing in is exactly the same as it was on 0.17.0, so presumably this is down to a change in mesos conventions). I've had similar issues enabling CFS in test environments (slaves show less free memory and refuse to recover). is the 'id' field not enough to uniquely identify a slave?
n00b isolation docs?
So we're running with default isolation (posix) and thinking about enabling cgroups (mesos 0.17.0 right now, but the upgrade to 0.18.2 was seamless in dev, so that'll probably happen too). I just need to justify the effort and extra complexity, so can someone explain briefly: * what cgroup isolation provides over stock posix / process isolation * the configuration required to set up cgroups Thanks!
Re: Log managment
I'd try a newer version before you file bugs - but to be honest log rotation is logrotate's job, it's really not very hard to set up. In our stack we run under upstart, so things make it into syslog and we don't have to worry about rotation - it scales better too, as it's easier to centralize. On 14 May 2014 09:46, Damien Hardy dha...@viadeoteam.com wrote: Hello, Logs in mesos are problematic for me so far. We are used to using the log4j facility in the java world, which permits a lot of things. Mainly I would like log rotation (ideally with the logrotate tool, to be homogeneous with other things) without restarting processes, because in my experience that loses history (mesos 0.16.0 so far). Best regards, -- Damien HARDY IT Infrastructure Architect Viadeo - 30 rue de la Victoire - 75009 Paris - France PGP : 45D7F89A
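For anyone who does want logrotate to handle it, a minimal sketch of /etc/logrotate.d/mesos (paths and patterns depend entirely on how you start the daemons; copytruncate avoids restarting processes at the cost of possibly losing a few lines at rotation time):
/var/log/mesos/* {
  daily
  rotate 7
  compress
  missingok
  notifempty
  copytruncate
}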
Re: how does the web UI get sandbox logs?
Won't that also set the IP the slave will advertise for tasks? ( Might not be a problem but thought it was worth pointing out, since Mike said that was currently on the internal IP. ) On 13 May 2014 18:38, Adam Bordelon a...@mesosphere.io wrote: mesos-slave --hostname=foo The hostname the slave should report. If left unset, system hostname will be used (recommended). On Tue, May 13, 2014 at 8:41 AM, Mike Milner m...@immun.io wrote: I just ran into something similar myself with mesos on EC2. I can reach the master just fine using the master's public dns name but when I go to the sandbox it's trying to connect to the slaves private internal DNS name. Is there a configuration option on the slave to manually specify the hostname that should be used in the web UI? I couldn't find anything on http://mesos.apache.org/documentation/latest/configuration/ Thanks! Mike On Mon, May 12, 2014 at 4:27 PM, Ross Allen r...@mesosphere.io wrote: For example, a particular slave's webUI (forwarded through master) can be reached at: http://localhost:5050/#/slaves/201405120912-16777343-5050-23673-0 Though it looks like the requests are being proxied through the master, your browser is talking directly to the slave for any slave data. Your browser first gets HTML, CSS, and JavaScript from the master and then sends requests directly to the slave's webserver via JSONP for any slave data shown in the UI. Ross Allen r...@mesosphere.io On 12 May 2014 09:21, Adam Bordelon a...@mesosphere.io wrote: Does each slave expose a webserver ...? Yes. Each slave hosts a webserver not just for the sandbox, but also for the slave's own webUI and RESTful API For example, a particular slave's webUI (forwarded through master) can be reached at: http://localhost:5050/#/slaves/201405120912-16777343-5050-23673-0 On Thu, May 8, 2014 at 9:21 AM, Dick Davies d...@hellooperator.net wrote: I've found the sandbox logs to be very useful in debugging misbehaving frameworks, typos, etc. - the usual n00b stuff I suppose. I've got a vagrant stack running quite nicely. If i port forward I can view marathon and mesos UIs nicely from my host, but I can't get the sandbox logs because 'slaveN' isn't resolving from outside the Vagrant stack. I was a bit surprised because I didn't expect to need to reach the slaves directly. Does each slave expose a webserver to serve up sandbox logs or something? Just trying to work out how/if I can map things so that UI can be tunnelled easily.
Re: protecting mesos from fat fingers
Not quite - looks to me like mesos slave disks filled with failed jobs (because marathon continued to throw a broken .zip into them) and with /tmp on the root fs the servers became unresponsive. Tobi mentions there's a way to set that at deploy time, but in this case the guy who can't type 'hello world' correctly would have been responsible for setting the rate limits too (that's me by the way!) so in itself that's not protection from pilot error. I'm not sure if GC was able to clear /var any better (I doubt it very much, my impression was that's on the order of days). Think it's more the deploy could be cancelled better while the system was still functioning (speculation - i'm still in early stages of learning the internals of this). On 30 April 2014 22:08, Vinod Kone vinodk...@gmail.com wrote: Dick, I've also briefly skimmed at your original email to marathon mailing list and it sounded like executor sandboxes were not getting garbage collected (a mesos feature) when the slave work directory was rooted in /tmp vs /var? Did I understand that right? If yes, I would love to see some logs. On Wed, Apr 30, 2014 at 1:51 PM, Tobias Knaup t...@knaup.me wrote: In Marathon you can specify taskRateLimit (max number of tasks to start per second) as part of your app definition. On Wed, Apr 30, 2014 at 11:30 AM, Dick Davies d...@hellooperator.net wrote: Managed to take out a mesos slave today with a typo while launching a marathon app, and wondered if there are throttles/limits that can be applied to repeated launches to limit the risk of such mistakes in the future. I started a thread on the marathon list ( https://groups.google.com/forum/?hl=en#!topic/marathon-framework/4iWLqTYTvgM ) [ TL:DR: marathon throws an app that will never deploy correctly at slaves until the disk fills with debris and the slave dies ] but I suppose this could be something available in mesos itself. I can't find a lot of advice about operational aspects of Mesos admin; could others here provide some good advice about their experience in preventing failed task deploys from causing trouble on their clusters? Thanks!
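Two slave flags worth knowing about here (a sketch only - defaults and accepted units differ by version, so check --help on your build): --gc_delay controls how long finished executor sandboxes hang around, and --disk_watch_interval how often disk usage is checked; keeping work_dir off the root filesystem also limits the blast radius when something does fill up:
mesos-slave --work_dir=/var/mesos --gc_delay=1days --disk_watch_interval=1mins ...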
Re: How about disable the irc ASFBot to flood the irc channel?
Can't you just '/ignore' the IRC bot if it bothers you? On 17 April 2014 03:01, Chengwei Yang chengwei.yang...@gmail.com wrote: Hi All, I am an irc guy, maybe you are too. However, I found that there are two bots for JIRA: one for the mesos-dev mailing list, one for the irc channel. I generally think the bot for the mailing list is fine, since it pushes a JIRA message into a mail thread, with full context, readable and easy to follow. However, the irc channel is a room for human beings to chat with each other, so I don't think it's suitable for it to be flooded by the JIRA bot. I find it very hard to figure out what human beings are talking about amid the ASFBot flooding. Could we just kill ASFBot for the irc channel, but leave the one for the mesos-dev mailing list? -- Thanks, Chengwei footnote: I have to Cc myself, otherwise Gmail doesn't mark the email as unread and I can't pull it into my mutt mbox.