Re: [Breaking Change, MESOS-1865] Redirect to the leader master when current master is not a leader
Hi, sorry, I have not kept up with all the new endpoints :) If there is already an endpoint (/redirect ?) that essentially addresses the issue raised by MESOS-3841 (https://issues.apache.org/jira/browse/MESOS-3841) then I'd suggest closing it and adding a note. (I just saw it and thought it would have been useful, and fun, to fix.) -- *Marco Massenzio* http://codetrips.com On Sat, Apr 30, 2016 at 9:24 PM, haosdent <haosd...@gmail.com> wrote: > Oh, @Marco. Thank you very much for your reply. vinodkone shepherded this > and it has already been submitted after reviews from other kind folks. > > MESOS-3841 should be resolved now, because the leading > master can be obtained via the "/redirect" endpoint. Do you have any concerns about it? I would > like to address them. > > On Sun, May 1, 2016 at 12:12 PM, Marco Massenzio <m.massen...@gmail.com> > wrote: > >> @haosdent - thanks for doing this, very useful indeed! >> >> On a related issue [0], I'd like to get that one moving: >> >> - can anyone comment on whether it's a good or a bad idea; and >> - would anyone be willing to shepherd it? >> >> Thanks! >> >> [0] Master HTTP API support to get the leader ( >> https://issues.apache.org/jira/browse/MESOS-3841) >> >> -- >> *Marco Massenzio* >> http://codetrips.com >> >> On Tue, Apr 19, 2016 at 12:34 AM, haosdent <haosd...@gmail.com> wrote: >> >>> Hi All, >>> >>> We intend to introduce a breaking change[1] in the HTTP endpoints >>> without a deprecation cycle. >>> For the HTTP endpoints below, when a user sends a request to a master that is not the >>> leader, >>> the user gets a 307 (TEMPORARY_REDIRECT) redirect to the leading master. 
>>> >>> * /create-volumes >>> * /destroy-volumes >>> * /frameworks >>> * /reserve >>> * /slaves >>> * /quota >>> * /weights >>> * /state >>> * /state.json >>> * /state-summary >>> * /tasks >>> * /tasks.json >>> * /roles >>> * /roles.json >>> * /teardown >>> * /maintenance/schedule >>> * /maintenance/status >>> * /machine/down >>> * /machine/up >>> * /unreserve >>> >>> For the other master endpoints, the behaviour is unchanged. >>> >>> If your existing framework/tool relied on the old behaviour, I suggest >>> adding logic to handle the 307 redirect response. >>> Please let me know if you have any queries/concerns. Any comments will >>> be appreciated. >>> >>> Links: >>> [1] Tracking JIRA: https://issues.apache.org/jira/browse/MESOS-1865 >>> >>> -- >>> Best Regards, >>> Haosdent Huang >>> >> >> > > > -- > Best Regards, > Haosdent Huang >
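The "handle the 307 redirect" advice above can be sketched in a few lines. This is a minimal illustration, not an official client: `fetch` is a hypothetical injectable HTTP callable (returning status, headers, body) so the redirect logic can be shown and tested without a live cluster; Mesos puts the leader's address in the `Location` header, which may be scheme-relative, so `urljoin` is used to resolve it.

```python
# Sketch: follow the 307 (TEMPORARY_REDIRECT) a non-leading master returns
# for the endpoints listed above until we reach the leading master.
from urllib.parse import urljoin

def get_from_leader(url, fetch, max_redirects=3):
    """Follow up to `max_redirects` 307 responses to reach the leader.

    `fetch(url)` is any callable returning (status, headers, body).
    """
    for _ in range(max_redirects):
        status, headers, body = fetch(url)
        if status != 307:
            return url, status, body
        # The Location header names the leading master; resolve it
        # against the URL we just used (handles scheme-relative forms).
        url = urljoin(url, headers["Location"])
    raise RuntimeError("too many redirects; no leading master found")
```

A framework or tool that previously assumed any master would answer can wrap its existing request function this way and keep working unchanged against the new behaviour.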
Re: [Breaking Change, MESOS-1865] Redirect to the leader master when current master is not a leader
@haosdent - thanks for doing this, very useful indeed! On a related issue [0], I'd like to get that one moving: - can anyone comment on whether it's a good or a bad idea; and - would anyone be willing to shepherd it? Thanks! [0] Master HTTP API support to get the leader ( https://issues.apache.org/jira/browse/MESOS-3841) -- *Marco Massenzio* http://codetrips.com On Tue, Apr 19, 2016 at 12:34 AM, haosdent <haosd...@gmail.com> wrote: > Hi All, > > We intend to introduce a breaking change[1] in the HTTP endpoints without > a deprecation cycle. > For the HTTP endpoints below, when a user sends a request to a master that is not the > leader, > the user gets a 307 (TEMPORARY_REDIRECT) redirect to the leading master. > > * /create-volumes > * /destroy-volumes > * /frameworks > * /reserve > * /slaves > * /quota > * /weights > * /state > * /state.json > * /state-summary > * /tasks > * /tasks.json > * /roles > * /roles.json > * /teardown > * /maintenance/schedule > * /maintenance/status > * /machine/down > * /machine/up > * /unreserve > > For the other master endpoints, the behaviour is unchanged. > > If your existing framework/tool relied on the old behaviour, I suggest > adding logic to handle the 307 redirect response. > Please let me know if you have any queries/concerns. Any comments will be > appreciated. > > Links: > [1] Tracking JIRA: https://issues.apache.org/jira/browse/MESOS-1865 > > -- > Best Regards, > Haosdent Huang >
Re: Safe update of agent attributes
IIRC you can avoid the issue by either using a different work_dir for the agent, or removing (and, possibly, re-creating) it. I'm afraid I don't have a running instance of Mesos on this machine and can't test it out. Also (and this is strictly my opinion :) I would consider a change of attributes a "material" change for the Agent, and I would avoid trying to recover state from previous runs; but, again, there may be perfectly legitimate cases in which this is desirable. -- *Marco Massenzio* http://codetrips.com On Mon, Feb 22, 2016 at 12:11 PM, Zhitao Li <zhitaoli...@gmail.com> wrote: > Hi, > > We recently discovered that updating attributes on Mesos agents is a very > risky operation, and has the potential to send agent(s) into a crash loop if > not done properly, with errors like "Failed to perform recovery: > Incompatible slave info detected". This, combined with --recovery_timeout, > makes the situation even worse. > > In our setup, some of the attributes are generated by an automated > configuration management system, so this opens the possibility that a "bad" > configuration could be left on the machine, causing big trouble on the next > agent upgrade if the USR1 signal was not sent in time. > > Some questions: > > 1. Does anyone have a good practice to recommend for managing these > attributes safely? > 2. Has Mesos considered falling back to the old metadata if it detects an > incompatibility, so agents would keep running with the old attributes instead > of falling into a crash loop? > > Thanks. > > -- > Cheers, > > Zhitao Li >
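One way to treat an attribute change as "material", as suggested above, is to wipe the agent's recovery metadata before restarting it with the new attributes, so it registers as a brand-new agent instead of failing with "Incompatible slave info detected". A minimal sketch, assuming the `<work_dir>/meta/slaves/latest` layout used by the agent's recovery code (verify the path against your Mesos version before scripting this in production):

```shell
# Sketch: drop the pointer to the previous run so the agent has nothing
# to recover and never compares the stored (old) SlaveInfo.
wipe_agent_metadata() {
    work_dir="$1"
    # 'latest' is a symlink to the most recent run's metadata; removing
    # it is enough to make the next start a fresh registration.
    rm -f "${work_dir}/meta/slaves/latest"
}

# e.g.:  wipe_agent_metadata /var/lib/mesos   # then restart the agent
```

Note that this forfeits recovery of running tasks on that agent, which is exactly the trade-off discussed above: a "material" change means giving up the old state.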
Re: Using Virtual Hosts
How are you launching your tasks, and are they containerized? If you use your own framework and launch tasks in containers, you can configure the networking mode as BRIDGE (in ContainerInfo): your Framework will then obtain the port (in the response / callback it receives after the task is launched) and will already know the hostname/IP of the Agent whose Offer it accepted. This information can then be fed to whatever discovery mechanism you use (or, more trivially, surfaced in the Framework's Web UI - which itself can be advertised to the Master via the `webui_url` field in the FrameworkInfo protobuf [0]). I don't know enough about Marathon to really be able to help there - but if you post the question in their user group, I'm sure there's a less involved way to do this if you use it. :) [0] ./include/mesos/mesos.proto LL #206 -- *Marco Massenzio* http://codetrips.com On Thu, Feb 11, 2016 at 9:27 PM, Jeff Schroeder <jeffschroe...@computer.org> wrote: > With a few of the newly added features, marathon-lb is actually a pretty > elegant solution: > > https://github.com/mesosphere/marathon-lb > > > On Thursday, February 11, 2016, Alfredo Carneiro < > alfr...@simbioseventures.com> wrote: > >> Hi guys, >> >> I have been searching for the past few weeks about Mesos and VHosts; >> sadly, I have not found anything useful. >> >> I have a mesos cluster running some webapps, and I have assigned specific >> ports to these apps, so I access them using >> *http://<host>:<port>*. How could I use Virtual Hosts to >> access these apps, e.g. *http://myapp.com <http://myapp.com>*? >> >> 1x Mesos Master with HAProxy and Chronos >> 9x Mesos Slave with Docker >> >> Thanks, >> >> -- >> Alfredo Miranda >> > > > -- > Text by Jeff, typos by iPhone >
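The BRIDGE setup described above, expressed as the JSON shape of the ContainerInfo a framework might attach to its TaskInfo. Field names follow mesos.proto; the image and ports below are illustrative assumptions, and in a real framework `host_port` must come out of the `ports` resource in the accepted offer:

```python
# Sketch: build the ContainerInfo for a Docker task in BRIDGE networking
# mode; the host_port/container_port mapping is what your discovery
# mechanism (or the framework Web UI) would read back and publish.
def bridged_container(image, host_port, container_port):
    return {
        "type": "DOCKER",
        "docker": {
            "image": image,
            "network": "BRIDGE",
            "port_mappings": [{
                # host_port: taken from the offer's 'ports' resource.
                "host_port": host_port,
                "container_port": container_port,
                "protocol": "tcp",
            }],
        },
    }
```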
Re: mesos 0.23, long time querying state.json data.
+1 to what Neil says; plus, if you don't need all the info contained in /state, /state-summary is a much faster option. -- *Marco Massenzio* http://codetrips.com On Mon, Feb 1, 2016 at 8:27 AM, Neil Conway <neil.con...@gmail.com> wrote: > There are some known performance problems with the implementation of > the /state endpoint in prior versions of Mesos (see MESOS-2353 for > details). In Mesos 0.27, the performance of /state should be much, > much faster. > > Neil > > On Mon, Feb 1, 2016 at 8:02 AM, tommy xiao <xia...@gmail.com> wrote: > > David, thanks for your quick response; I will report back the results ASAP. > > > > haosdent, I am not sure, but the cluster is only 6 nodes; it is not a > large > > cluster. In MESOS-2353 I found the description: "Looking at > perf > > data, it seems most of the time is spent doing memory allocation / > > de-allocation." Do you know how to run that command? I can test > it. > > > > > > > > 2016-02-01 20:09 GMT+08:00 haosdent <haosd...@gmail.com>: > >> > >> Maybe this is related to your problem: > >> https://issues.apache.org/jira/browse/MESOS-2353 ? > >> > >> On Mon, Feb 1, 2016 at 8:02 PM, tommy xiao <xia...@gmail.com> wrote: > >>> > >>> On a bare metal server, the state.json query on port 5051 hangs for a > >>> long time (about 5 minutes) before returning the JSON data. The curious > thing > >>> is that, on the same port 5051, the /help command works correctly. So I wonder: > has anyone > >>> come across this case before? No clues to debug, in my view. > >>> > >>> -- > >>> Deshi Xiao > >>> Twitter: xds2000 > >>> E-mail: xiaods(AT)gmail.com > >> > >> > >> > >> > >> -- > >> Best Regards, > >> Haosdent Huang > > > > > > > > > > -- > > Deshi Xiao > > Twitter: xds2000 > > E-mail: xiaods(AT)gmail.com >
Re: [VOTE] Release Apache Mesos 0.27.0 (rc2)
On Fri, Jan 29, 2016 at 7:00 PM, Marco Massenzio <m.massen...@gmail.com> wrote: > Is there a 0.27.0-rc2 branch cut? > > $ git fetch --all > Fetching origin > > $ git co 0.27.0-rc2 > error: pathspec '0.27.0-rc2' did not match any file(s) known to git. > > well, or a tag, for that matter... $ git tag | grep 27 0.27.0-rc1 > > -- > *Marco Massenzio* > http://codetrips.com > > On Wed, Jan 27, 2016 at 11:12 PM, Michael Park <mp...@apache.org> wrote: > >> Hi all, >> >> Please vote on releasing the following candidate as Apache Mesos 0.27.0. >> >> 0.27.0 includes the following: >> >> >> We added major features such as Implicit Roles, Quota, Multiple Disks and >> more. >> >> We also added major bug fixes such as performance improvements to >> state.json requests and GLOG. >> >> The CHANGELOG for the release is available at: >> >> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.27.0-rc2 >> >> >> >> The candidate for Mesos 0.27.0 release is available at: >> >> https://dist.apache.org/repos/dist/dev/mesos/0.27.0-rc2/mesos-0.27.0.tar.gz >> >> The tag to be voted on is 0.27.0-rc2: >> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.27.0-rc2 >> >> The MD5 checksum of the tarball can be found at: >> >> https://dist.apache.org/repos/dist/dev/mesos/0.27.0-rc2/mesos-0.27.0.tar.gz.md5 >> >> The signature of the tarball can be found at: >> >> https://dist.apache.org/repos/dist/dev/mesos/0.27.0-rc2/mesos-0.27.0.tar.gz.asc >> >> The PGP key used to sign the release is here: >> https://dist.apache.org/repos/dist/release/mesos/KEYS >> >> The JAR is up in Maven in a staging repository here: >> https://repository.apache.org/content/repositories/orgapachemesos-1100 >> >> Please vote on releasing this package as Apache Mesos 0.27.0! >> >> The vote is open until Sat Jan 30 23:59:59 PST 2016 and passes if a >> majority of at least 3 +1 PMC votes are cast. 
>> >> [ ] +1 Release this package as Apache Mesos 0.27.0 >> [ ] -1 Do not release this package because ... >> >> Thanks, >> >> Tim, Kapil, MPark >> > >
Re: [VOTE] Release Apache Mesos 0.27.0 (rc2)
Is there a 0.27.0-rc2 branch cut? $ git fetch --all Fetching origin $ git co 0.27.0-rc2 error: pathspec '0.27.0-rc2' did not match any file(s) known to git. -- *Marco Massenzio* http://codetrips.com On Wed, Jan 27, 2016 at 11:12 PM, Michael Park <mp...@apache.org> wrote: > Hi all, > > Please vote on releasing the following candidate as Apache Mesos 0.27.0. > > 0.27.0 includes the following: > > > We added major features such as Implicit Roles, Quota, Multiple Disks and > more. > > We also added major bug fixes such as performance improvements to > state.json requests and GLOG. > > The CHANGELOG for the release is available at: > > https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.27.0-rc2 > > > > The candidate for Mesos 0.27.0 release is available at: > https://dist.apache.org/repos/dist/dev/mesos/0.27.0-rc2/mesos-0.27.0.tar.gz > > The tag to be voted on is 0.27.0-rc2: > https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.27.0-rc2 > > The MD5 checksum of the tarball can be found at: > > https://dist.apache.org/repos/dist/dev/mesos/0.27.0-rc2/mesos-0.27.0.tar.gz.md5 > > The signature of the tarball can be found at: > > https://dist.apache.org/repos/dist/dev/mesos/0.27.0-rc2/mesos-0.27.0.tar.gz.asc > > The PGP key used to sign the release is here: > https://dist.apache.org/repos/dist/release/mesos/KEYS > > The JAR is up in Maven in a staging repository here: > https://repository.apache.org/content/repositories/orgapachemesos-1100 > > Please vote on releasing this package as Apache Mesos 0.27.0! > > The vote is open until Sat Jan 30 23:59:59 PST 2016 and passes if a > majority of at least 3 +1 PMC votes are cast. > > [ ] +1 Release this package as Apache Mesos 0.27.0 > [ ] -1 Do not release this package because ... > > Thanks, > > Tim, Kapil, MPark >
Re: [VOTE] Release Apache Mesos 0.27.0 (rc2)
Thanks, buddy - I keep forgetting that one! (one assumes --all would, well, take care of that too :) Have a great weekend! -- *Marco Massenzio* http://codetrips.com On Fri, Jan 29, 2016 at 7:06 PM, Vinod Kone <vinodk...@gmail.com> wrote: > Git fetch --tags > > @vinodkone > > On Jan 29, 2016, at 7:00 PM, Marco Massenzio <m.massen...@gmail.com> > wrote: > > Is there a 0.27.0-rc2 branch cut? > > $ git fetch --all > Fetching origin > > $ git co 0.27.0-rc2 > error: pathspec '0.27.0-rc2' did not match any file(s) known to git. > > > -- > *Marco Massenzio* > http://codetrips.com > > On Wed, Jan 27, 2016 at 11:12 PM, Michael Park <mp...@apache.org> wrote: > >> Hi all, >> >> Please vote on releasing the following candidate as Apache Mesos 0.27.0. >> >> 0.27.0 includes the following: >> >> >> We added major features such as Implicit Roles, Quota, Multiple Disks and >> more. >> >> We also added major bug fixes such as performance improvements to >> state.json requests and GLOG. >> >> The CHANGELOG for the release is available at: >> >> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.27.0-rc2 >> >> >> >> The candidate for Mesos 0.27.0 release is available at: >> >> https://dist.apache.org/repos/dist/dev/mesos/0.27.0-rc2/mesos-0.27.0.tar.gz >> >> The tag to be voted on is 0.27.0-rc2: >> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.27.0-rc2 >> >> The MD5 checksum of the tarball can be found at: >> >> https://dist.apache.org/repos/dist/dev/mesos/0.27.0-rc2/mesos-0.27.0.tar.gz.md5 >> >> The signature of the tarball can be found at: >> >> https://dist.apache.org/repos/dist/dev/mesos/0.27.0-rc2/mesos-0.27.0.tar.gz.asc >> >> The PGP key used to sign the release is here: >> https://dist.apache.org/repos/dist/release/mesos/KEYS >> >> The JAR is up in Maven in a staging repository here: >> https://repository.apache.org/content/repositories/orgapachemesos-1100 >> >> Please vote on releasing this package as Apache Mesos 0.27.0! 
>> >> The vote is open until Sat Jan 30 23:59:59 PST 2016 and passes if a >> majority of at least 3 +1 PMC votes are cast. >> >> [ ] +1 Release this package as Apache Mesos 0.27.0 >> [ ] -1 Do not release this package because ... >> >> Thanks, >> >> Tim, Kapil, MPark >> > >
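The tag-fetching tip from this exchange can be verified end-to-end on a throwaway local repository (the tag name matches the rc discussed above; the helper name is illustrative):

```shell
# Sketch: `git fetch --all` updates branches but does not guarantee new
# tags arrive; asking for tags explicitly does, after which the rc tag
# can be checked out (detached HEAD).
fetch_release_tag() {
    tag="$1"
    git fetch --all --tags
    git checkout -q "$tag"
}

# e.g., inside a mesos clone:
#   fetch_release_tag 0.27.0-rc2
```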
Re: Basic questions about use of ZooKeeper
Hi Michal, 1. While watching some talk I've heard that maybe in the future ZooKeeper > won't be needed. Is this still planned? > At some point there has been talk of moving towards using etcd instead of ZooKeeper: you can look into Jira[0], and it seems that MESOS-1806[1] is the one that has received the most attention/activity. Others may be able to provide more detailed guidance, but the impression I have is that it may be some time before this becomes available as a Production-ready alternative. > 2. We're using mainly quite large boxes (>= 20 CPUs, >= 48GB RAM). Is it > advised to put Mesos master and warm backup nodes inside ZooKeeper's > cluster? (just to avoid wasting resources). > > There is really no reason not to have Master/ZooKeeper servers co-located - in fact, this is the way DCOS CE is deployed in AWS (or, at least, used to be, last time I looked into it). Hope this helps! [0] https://issues.apache.org/jira/issues/?jql=project%20%3D%20Mesos%20and%20text%20~%20%22etcd%22 [1] https://issues.apache.org/jira/browse/MESOS-1806 -- *Marco Massenzio* http://codetrips.com
Re: installing a framework after teardown
Hey Viktor, I'm not clear on what you mean by "re-install the same framework" - do you mean just restarting the binary? If so, as Vinod pointed out, you should re-register with a SUBSCRIBE message and obtain a new FrameworkId in the response. And, yes, the name can stay perfectly the same (in fact, you can have several frameworks with the same name - but different IDs - connected to the same Master). Are you using the C++ API or the new HTTP API? If the latter, please have a look at the example here[0] showing how to "terminate" a framework and then reconnect it (in particular, see the sections "Registering a Framework" and "Terminating a Framework"). If the former, see [1], where I set the `name` in the `FrameworkInfo` (and that one stays the same across runs) but not the ID (that one gets returned in the `registered()`[2] method and can be used, if necessary, elsewhere in the code - for example, when accepting offers - but should otherwise stay unique to the framework). There are many (better!) framework examples in the "examples" folder of the Mesos source code[3]; you may want to take a look there too. [0] https://github.com/massenz/zk-mesos/blob/develop/notebooks/Demo-API.ipynb [1] https://github.com/massenz/mongo_fw/blob/develop/src/mongo_scheduler.cpp#L194 [2] https://github.com/massenz/mongo_fw/blob/develop/src/mongo_scheduler.cpp#L70 [3] https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=tree;f=src/examples -- *Marco Massenzio* http://codetrips.com On Sat, Jan 16, 2016 at 12:52 AM, Viktor Sadovnikov <vik...@jv-ration.com> wrote: > Yes, I need to re-install the same framework. It can get another ID, but > its name should remain the same. > I thought the framework ID was dynamically assigned by the Master upon > connection, and did not expect the Master to provide the same ID. > > On Fri, Jan 15, 2016 at 10:15 PM, Vinod Kone <vinodk...@apache.org> wrote: >> What do you mean by gracefully recover? 
If you mean the ability to >> reconnect, you need to change the framework id in the FrameworkInfo when >> registering with the master. >> >> As a hack, you could restart the master, so that it forgets that it >> removed the framework with id and hence allows it to re-register with >> the old id. >> >> On Fri, Jan 15, 2016 at 5:37 AM, Viktor Sadovnikov <vik...@jv-ration.com> >> wrote: >> >>> Hello, >>> >>> I have removed a framework from Mesos Cluster by curl -X POST -d >>> 'frameworkId=-b036-4cb7-af53-4c837dc9521d-0002' >>> http://${MASTER_IP}:5050/master/teardown;. >>> This successfully removed all the framework tasks and scheduler. >>> >>> However now Mesos Cluster rejects my attempts to re-install the >>> framework. Is there a way to gracefully recover from this situation? >>> >>> I0115 12:54:57.916470 28856 sched.cpp:1024] Got error 'Framework has >>> been removed' >>> I0115 12:54:57.916509 28856 sched.cpp:1805] Asked to abort the driver >>> I0115 12:54:57.916824 28856 sched.cpp:1070] Aborting framework >>> '8ca5c18f-b036-4cb7-af53-4c837dc9521d-0001' >>> >>> With regards, >>> Viktor >>> >> >> >
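The re-registration flow discussed in this thread maps, in the v1 scheduler HTTP API, to a SUBSCRIBE call whose FrameworkInfo deliberately omits the `id` field: the master then assigns a fresh FrameworkId, which is exactly what a torn-down framework coming back needs. A sketch of building that call body (the framework name is illustrative):

```python
# Sketch: the SUBSCRIBE body POSTed to http://<master>:5050/api/v1/scheduler
# (Content-Type: application/json). Leaving out framework_info.id makes
# this a *new* registration; re-using a torn-down id would be rejected.
import json

def subscribe_call(name, user="root"):
    return json.dumps({
        "type": "SUBSCRIBE",
        "subscribe": {
            "framework_info": {
                "user": user,
                "name": name,  # names need not be unique across frameworks
            }
        },
    })
```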
Re: Running mesos slave in Docker on CoreOS
Granted, I know close to nothing about CoreOS (and very little about docker itself), but usually exit code 127 means a "not found" binary - are you sure that `docker` is on the PATH of the user/process running the Mesos agent? Much longer shot - but worth a try: look into the permissions around the /var/run folder - what happens if you try to run the very same command that failed, from the shell? (but I do see that you mount it with the -v, so that should work, shouldn't it?) -- *Marco Massenzio* http://codetrips.com On Thu, Dec 31, 2015 at 1:17 PM, Taylor, Graham < graham.x.tay...@capgemini.com> wrote: > I did try removing the /proc mount and adding just pid=host but still no dice > with that. Need to have a deeper dig into the docker 1.9 changelog. Will > post back if I find anything. > > Thanks, > Graham. > > On 31 Dec 2015, at 20:27, Tim Chen <t...@mesosphere.io> wrote: > > I don't think you need to mount in /proc if you have --pid=host already, > can you try that? > > Tim > > On Thu, Dec 31, 2015 at 4:16 AM, Taylor, Graham < > graham.x.tay...@capgemini.com> wrote: > >> Hey folks, >> I’m trying to get Mesos slave up and running in a docker container on >> CoreOS. 
I’ve successfully got the master up and running but anytime I start >> the slave container I receive the following error - >> >> Failed to create a containerizer: Could not create DockerContainerizer: >> Failed to create docker: Failed to get docker version: Failed to execute >> 'docker -H unix:///var/run/docker.sock --version': exited with status 127 >> >> I’m starting the slave container with the following command - >> >> /usr/bin/docker run --rm --name mesos_slave \ >> --net=host \ >> --privileged \ >> --pid=host \ >> -p 5051:5051 \ >> -v /sys:/sys \ >> -v /proc:/host/proc:ro \ >> -v /usr/bin/docker:/usr/bin/docker:ro \ >> -v /var/run/docker.sock:/var/run/docker.sock \ >> -v /lib64/libdevmapper.so.1.02:/lib/libdevmapper.so.1.02:ro \ >> -e "MESOS_MASTER=zk://172.31.1.11:2181,172.31.1.12:2181, >> 172.31.1.13:2181/mesos" \ >> -e "MESOS_EXECUTOR_REGISTRATION_TIMEOUT=10mins" \ >> -e "MESOS_CONTAINERIZERS=docker" \ >> -e "MESOS_RESOURCES=ports(*):[31000-32000]" \ >> -e "MESOS_IP=172.31.1.14" \ >> -e "MESOS_WORK_DIR=/tmp/mesos" \ >> -e "MESOS_HOSTNAME=172.31.1.14" \ >> mesosphere/mesos-slave:0.25.0-0.2.70.ubuntu1404 >> >> I’ve also tried with various other versions of the Docker image >> (including 0.26.0) but I keep receiving the same error. 
>> >> I’m running on CoreOS beta channel (877.1.0) which has docker installed >> and the service running - >> >> docker --version >> Docker version 1.9.1, build 4419fdb-dirty >> >> >> If I change the /proc mount to be /proc:/proc I get past the docker >> version error but receive a different error - >> >> Error response from daemon: Cannot start container >> 51a9b60f702a0f13f975fd2e7f4b642180d5363565e042702665098e8761b758: [8] >> System error: >> "/var/lib/docker/overlay/51a9b60f702a0f13f975fd2e7f4b642180d5363565e042702665098e8761b758/merged/proc" >> cannot be mounted because it is located inside "/proc” >> >> >> I had a search on the wiki and found some similar related issues >> https://issues.apache.org/jira/browse/MESOS-3498?jql=project%20%3D%20MESOS%20AND%20text%20~%20%22Failed%20to%20execute%20%27docker%20version%22 >> but >> they all seem to be closed/resolved/won’t fix. >> >> Is anyone successfully running a slave on CoreOS and can help me fix up >> my Docker command? >> >> Thanks, >> Graham. >> >> >> -- >> >> Capgemini is a trading name used by the Capgemini Group of companies >> which includes Capgemini UK plc, a company registered in England and Wales >> (number 943935) whose registered office is at No. 1, Forge End, Woking, >> Surrey, GU21 6DB. >> > > > -- > > Capgemini is a trading name used by the Capgemini Group of companies which > includes Capgemini UK plc, a company registered in England and Wales > (number 943935) whose registered office is at No. 1, Forge End, Woking, > Surrey, GU21 6DB. >
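A quick way to confirm the "exit status 127" theory from inside the slave container: 127 is the POSIX shell's "command not found" status, so reproducing it and probing the PATH takes two lines (the missing-binary name below is obviously illustrative):

```shell
# Sketch: exit status 127 from a shell means "command not found".
sh -c 'no-such-binary --version' 2>/dev/null || status=$?
echo "exit=${status}"    # prints exit=127, matching the agent's error

# The check worth running inside the slave container: can the agent's
# shell actually find the docker binary it was told to exec?
if sh -c 'command -v docker >/dev/null'; then
    echo "docker is on the PATH"
else
    echo "docker is NOT on the PATH"  # the agent would fail with 127
fi
```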
Re: How can mesos print logs from VLOG function?
Mesos uses Google Logging[0] and, according to the documentation there, VLOG(n) calls are only logged if the verbosity variable GLOG_v=m (with m >= n) is set in the environment when running Mesos (the other suggested way, using --v=m, won't work for Mesos). Having said that, I have recently been unable to make this work - so there may be some other trickery at work. [0] https://google-glog.googlecode.com/svn/trunk/doc/glog.html -- *Marco Massenzio* http://codetrips.com On Wed, Dec 30, 2015 at 12:30 AM, Nan Xiao <xiaonan830...@gmail.com> wrote: > Hi all, > > I want Mesos to print logs from the VLOG function: > > VLOG(1) << "Executor started at: " << self() > << " with pid " << getpid(); > > But from the mesos help: > > $ sudo ./bin/mesos-master.sh --help | grep -i LOG > --external_log_file=VALUE Specified the externally > managed log file. This file will be > stderr logging as the log > file is otherwise unknown to Mesos. > --[no-]initialize_driver_logging Whether to automatically > initialize Google logging of scheduler > --[no-]log_auto_initialize Whether to automatically > initialize the replicated log used for the > registry. If this is set to > false, the log has to be manually > --log_dir=VALUE Directory path to put log > files (no default, nothing > does not affect logging to > stderr). > NOTE: 3rd party log > messages (e.g. ZooKeeper) are > --logbufsecs=VALUE How many seconds to buffer > log messages for (default: 0) > --logging_level=VALUE Log message at or above > this level; possible values: > will affect just the logs > from log_dir (if specified) (default: INFO) > --[no-]quiet Disable logging to stderr > (default: false) > --quorum=VALUE The size of the quorum of > replicas when using 'replicated_log' based > available options are > 'replicated_log', 'in_memory' (for testing). (default: replicated_log) > > I can't find a related configuration flag. > > So how can Mesos print logs from the VLOG function? Thanks in advance! > > Best Regards > Nan Xiao >
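To make the GLOG_v route concrete (the mesos-master.sh path in the comment is illustrative; the runnable line just demonstrates that a variable set this way reaches the child process's environment, which is all glog needs):

```shell
# Sketch: VLOG(n) messages appear only when GLOG_v=m with m >= n is in
# the daemon's environment at launch, e.g.:
#
#   GLOG_v=1 ./bin/mesos-master.sh --work_dir=/tmp/mesos
#
# A prefix assignment is visible inside the launched process:
GLOG_v=1 sh -c 'echo "verbosity=$GLOG_v"'   # prints verbosity=1
```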
Re: Is it safe to replace mesos-master on the fly
The closest I could find is [0] but, granted, much more detail would be desirable :) FYI - you may also want to check out the Maintenance Primitives [1] and upgrades [2] (the latter is actually not directly applicable to your stated use case, but may be of interest for future reference). In any event, you're doing it right. As for the "reasonable time to wait" - I'm afraid I don't really have a good feel for it: keeping an eye on the logs will probably help, but I'm sure other folks on this list will have a much more satisfying answer. Let us know how you get along and, if you want to contribute back by documenting how you did it, contributions are always welcome! [0] http://mesos.apache.org/documentation/latest/operational-guide/ [1] http://mesos.apache.org/documentation/latest/maintenance/ [2] http://mesos.apache.org/documentation/latest/upgrades/ -- *Marco Massenzio* Distributed Systems Engineer http://codetrips.com On Tue, Nov 24, 2015 at 8:41 AM, Chengwei Yang <chengwei.yang...@gmail.com> wrote: > Thanks @Tommy, > > Since I didn't find any official document about migrating mesos-masters or resizing the > mesos-master quorum, and to avoid any missing detail that would surprise me, > I came here to confirm. :-) > > -- > Thanks, > Chengwei > > On Wed, Nov 25, 2015 at 12:07:43AM +0800, tommy xiao wrote: > > This is the correct way to upgrade your mesos cluster; for more details see the mesos > > documentation and release notes. > > > > 2015-11-24 9:47 GMT+08:00 Chengwei Yang <chengwei.yang...@gmail.com>: > > > > Hi all, > > > > We're using mesos in production on CentOS 6 and plan to upgrade CentOS > to 7.1; > > to > > avoid affecting any tasks running on mesos, we're about to replace all > > mesos-masters on the fly. > > > > The procedure is listed below: > > > > 0. 3 mesos-masters running on CentOS 6 > > 1. shutdown 1 mesos-master (CentOS 6) and bring up 1 > mesos-master (CentOS 7); > >wait for the new master to sync for some time (is there a simple way > to know > > when?) > > 2. 
repeat step 1 > > > > NOTE: we plan to shut down the non-leaders first, and shut down the > leader (CentOS > > 6) > > last. > > > > Can we do it this way? Or are there any better suggestions? > > > > -- > > Thanks, > > Chengwei > > > > > > > > > > -- > > Deshi Xiao > > Twitter: xds2000 > > E-mail: xiaods(AT)gmail.com > > SECURITY NOTE: file ~/.netrc must not be accessible by others >
Re: Zookeeper cluster changes
The way I would do it in a production cluster would be *not* to use IP addresses directly for the ZK ensemble, but instead rely on some form of internal DNS and use internally-resolvable hostnames (e.g., {zk1, zk2, ...}.prod.example.com), and have the provisioning tooling (Chef, Puppet, Ansible, what have you) handle the setting of the hostname when restarting/replacing a failing/crashed ZK server. This way your list of ZKs given to Mesos never changes, even though the FQDNs will map to different IPs / VMs. Obviously, this may not always be desirable / feasible (e.g., if your prod environment does not support DNS resolution). You are correct in that Mesos does not currently support dynamically changing the ZK addresses, but I don't know whether that's a limitation of Mesos code or of the ZK C++ client driver. I'll look into it and let you know what I find (if anything). -- *Marco Massenzio* Distributed Systems Engineer http://codetrips.com On Mon, Nov 9, 2015 at 6:01 AM, Donald Laidlaw <donlaid...@me.com> wrote: > How do mesos masters and slaves react to zookeeper cluster changes? When > the masters and slaves start they are given a set of addresses to connect > to zookeeper. But over time, one of those zookeepers fails, and is replaced > by a new server at a new address. How should this be handled in the mesos > servers? > > I am guessing that mesos does not automatically detect and react to that > change. But obviously we should do something to keep the mesos servers > happy as well. What should we do? > > The obvious thing is to stop the mesos servers, one at a time, and restart > them with the new configuration. But it would be really nice to be able to > do this dynamically without restarting the server. After all, coordinating > a rolling restart is a fairly hard job. > > Any suggestions or pointers? > > Best regards, > Don Laidlaw > > >
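A sketch of what the hostname-based setup looks like on the master command line (the hostnames and quorum size are illustrative assumptions; replacing a dead ZK node then only means repointing a DNS record, never reconfiguring or restarting Mesos):

```shell
# Sketch: stable internal hostnames in the --zk connection string
# instead of raw IPs; the ensemble membership Mesos sees never changes.
mesos-master \
  --zk=zk://zk1.prod.example.com:2181,zk2.prod.example.com:2181,zk3.prod.example.com:2181/mesos \
  --quorum=2 \
  --work_dir=/var/lib/mesos
```

Agents take the same `--master=zk://...` string, so the same DNS indirection covers them too.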
Re: Welcome Kapil as Mesos committer and PMC member!
Awesome stuff! Congratulations, Kapil - totally deserved! On Thursday, November 5, 2015, Vinod Kone <vinodk...@gmail.com> wrote: > welcome kapil! > > On Thu, Nov 5, 2015 at 6:49 AM, <connor@gmail.com> wrote: > >> Congrats Dr. Arya! >> >> > On Nov 5, 2015, at 02:02, Till Toenshoff <toensh...@me.com> wrote: >> > >> > I'm happy to announce that Kapil Arya has been voted a Mesos committer >> and PMC member! >> > >> > Welcome Kapil, and thanks for all of your great contributions to the >> project so far! >> > >> > Looking forward to lots more of your contributions! >> > >> > Thanks >> > Till >> > > -- -- *Marco Massenzio* Distributed Systems Engineer http://codetrips.com
Re: error: 'sasl_errdetail' is deprecated: first deprecated in OS X 10.11
I'm almost sure that you're running into https://issues.apache.org/jira/browse/MESOS-3030 (there is a patch out to fix this: https://reviews.apache.org/r/39230/) -- *Marco Massenzio* Distributed Systems Engineer http://codetrips.com On Mon, Oct 12, 2015 at 4:54 PM, yuankui <kui.y...@fraudmetrix.cn> wrote: > hello,buddies > > I'm compiling mesos on mac os x 10.11 (EI capitan) and come across with > some error as flowing > version: mesos-0.24.0 & mesos-0.25.0-rc3 > > > /usr/include/sasl/sasl.h:757:25: note: 'sasl_errstring' has been > explicitly marked deprecated here > LIBSASL_API const char *sasl_errstring(int saslerr, >^ > ../../src/authentication/cram_md5/authenticator.cpp:334:20: error: > 'sasl_errdetail' is deprecated: first deprecated in OS X 10.11 > [-Werror,-Wdeprecated-declarations] > string error(sasl_errdetail(connection)); > ^ > /usr/include/sasl/sasl.h:770:25: note: 'sasl_errdetail' has been > explicitly marked deprecated here > LIBSASL_API const char *sasl_errdetail(sasl_conn_t *conn) > __OSX_AVAILABLE_BUT_DEPRECATED(__MAC_10_0,__MAC_10_11,__IPHONE_NA,__IPHONE_NA); >^ > ../../src/authentication/cram_md5/authenticator.cpp:514:18: error: > 'sasl_server_init' is deprecated: first deprecated in OS X 10.11 > [-Werror,-Wdeprecated-declarations] >int result = sasl_server_init(NULL, "mesos"); > ^ > /usr/include/sasl/sasl.h:1016:17: note: 'sasl_server_init' has been > explicitly marked deprecated here > LIBSASL_API int sasl_server_init(const sasl_callback_t *callbacks, >^ > ../../src/authentication/cram_md5/authenticator.cpp:519:11: error: > 'sasl_errstring' is deprecated: first deprecated in OS X 10.11 > [-Werror,-Wdeprecated-declarations] > sasl_errstring(result, NULL, NULL)); > ^ > /usr/include/sasl/sasl.h:757:25: note: 'sasl_errstring' has been > explicitly marked deprecated here > LIBSASL_API const char *sasl_errstring(int saslerr, >^ > ../../src/authentication/cram_md5/authenticator.cpp:521:16: error: > 'sasl_auxprop_add_plugin' is deprecated: first 
deprecated in OS X 10.11 > [-Werror,-Wdeprecated-declarations] > result = sasl_auxprop_add_plugin( > ^ > /usr/include/sasl/saslplug.h:1013:17: note: 'sasl_auxprop_add_plugin' has > been explicitly marked deprecated here > LIBSASL_API int sasl_auxprop_add_plugin(const char *plugname, >^ > ../../src/authentication/cram_md5/authenticator.cpp:528:13: error: > 'sasl_errstring' is deprecated: first deprecated in OS X 10.11 > [-Werror,-Wdeprecated-declarations] >sasl_errstring(result, NULL, NULL)); >^ > /usr/include/sasl/sasl.h:757:25: note: 'sasl_errstring' has been > explicitly marked deprecated here > LIBSASL_API const char *sasl_errstring(int saslerr, >^ > > as I'm not familiar with c++, I don't know how to solve this > > I believe I'm not the first one who came across with this problem, So I'm > here for help! > thanks. > > >
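For readers hitting the same wall before the MESOS-3030 patch (https://reviews.apache.org/r/39230/) lands: the errors come from OS X 10.11 marking the system SASL API as deprecated while the Mesos build promotes warnings to errors. A common local workaround - a sketch, assuming an autotools build of Mesos from the source tarball - is to demote just that warning class:

```shell
# OS X 10.11 (El Capitan) marks sasl_errdetail, sasl_errstring,
# sasl_server_init, etc. as deprecated, and the Mesos build uses -Werror,
# so the deprecation warnings become hard errors. Telling the compiler not
# to treat deprecated declarations as warnings lets the build proceed.
./configure CXXFLAGS="-Wno-deprecated-declarations"
make
```

This only silences the warnings; the proper fix is the patch under review above.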
Re: Can health-checks be run by Mesos for docker tasks?
Are those the stdout logs of the Agent? Because I don't see the --launcher-dir set, however, if I look into one that is running off the same 0.24.1 package, this is what I see: I1012 14:56:36.933856 1704 slave.cpp:191] Flags at startup: --appc_store_dir="/tmp/mesos/store/appc" --attributes="rack:r2d2;pod:demo,dev" --authenticatee="crammd5" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="docker,mesos" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_kill_orphans="true" --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --enforce_container_disk_quota="false" --executor_registration_timeout="1mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --initialize_driver_logging="true" --ip="192.168.33.11" --isolation="cgroups/cpu,cgroups/mem" --launcher_dir="/usr/libexec/mesos" --log_dir="/var/local/mesos/logs/agent" --logbufsecs="0" --logging_level="INFO" --master="zk://192.168.33.1:2181/mesos/vagrant" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --resource_monitoring_interval="1secs" --resources="ports:[9000-1];ephemeral_ports:[32768-57344]" --revocable_cpu_low_priority="true" --sandbox_directory="/var/local/sandbox" --strict="true" --switch_user="true" --version="false" --work_dir="/var/local/mesos/agent" (this is run off the Vagrantfile at [0] in case you want to reproduce). 
That agent is not run via the init command, though, I execute it manually via the `run-agent.sh` in the same directory. I don't really think this matters, but I assume you also restarted the agent after making the config changes? (and, for your own sanity - you can double check the version by looking at the very head of the logs). -- *Marco Massenzio* Distributed Systems Engineer http://codetrips.com On Mon, Oct 12, 2015 at 10:50 PM, Jay Taylor <outtat...@gmail.com> wrote: > Hi Haosdent and Mesos friends, > > I've rebuilt the cluster from scratch and installed mesos 0.24.1 from the > mesosphere apt repo: > > $ dpkg -l | grep mesos > ii mesos 0.24.1-0.2.35.ubuntu1404 > amd64Cluster resource manager with efficient resource isolation > > Then added the `launcher_dir' flag to /etc/mesos-slave/launcher_dir on the > slaves: > > mesos-worker1a:~$ cat /etc/mesos-slave/launcher_dir > /usr/libexec/mesos > > And yet the task health-checks are still being launched from the sandbox > directory like before! 
> > I've also tested setting the MESOS_LAUNCHER_DIR env var and get the > identical result (just as before on the cluster where many versions of > mesos had been installed): > > STDOUT: > > --container="mesos-20151012-184440-1625401536-5050-23953-S0.62d43b8f-6cd1-4c53-9ac8-84dbfc45bbcb" >> --docker="docker" --help="false" --initialize_driver_logging="true" >> --logbufsecs="0" --logging_level="INFO" >> --mapped_directory="/mnt/mesos/sandbox" --quiet="false" >> --sandbox_directory="/tmp/mesos/slaves/20151012-184440-1625401536-5050-23953-S0/frameworks/20151012-184440-1625401536-5050-23953-/executors/hello-app_web-v3.33597b73-1943-41b4-a308-76132eebcc91/runs/62d43b8f-6cd1-4c53-9ac8-84dbfc45bbcb" >> --stop_timeout="0ns" >> --container="mesos-20151012-184440-1625401536-5050-23953-S0.62d43b8f-6cd1-4c53-9ac8-84dbfc45bbcb" >> --docker="docker" --help="false" --initialize_driver_logging="true" >> --logbufsecs="0" --logging_level="INFO" >> --mapped_directory="/mnt/mesos/sandbox" --quiet="false" >> --sandbox_directory="/tmp/mesos/slaves/20151012-184440-1625401536-5050-23953-S0/frameworks/20151012-184440-1625401536-5050-23953-/executors/hello-app_web-v3.33597b73
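To summarize the configuration being debugged in this thread, there are three equivalent ways to point the agent at the launcher directory (paths as used above; in every case the agent must be restarted for the flag to take effect, which can be confirmed in the "Flags at startup" line at the head of the agent log):

```shell
# 1. Directly on the command line when starting the agent:
mesos-slave --launcher_dir=/usr/libexec/mesos

# 2. Via the environment variable the agent reads at startup:
export MESOS_LAUNCHER_DIR=/usr/libexec/mesos

# 3. Via the per-flag file read by the mesosphere package's init scripts,
#    followed by a restart so the new value is picked up:
echo /usr/libexec/mesos | sudo tee /etc/mesos-slave/launcher_dir
sudo service mesos-slave restart
```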
Re: Can health-checks be run by Mesos for docker tasks?
On Mon, Oct 12, 2015 at 11:26 PM, Marco Massenzio <ma...@mesosphere.io> wrote: > Are those the stdout logs of the Agent? Because I don't see the > --launcher-dir set, however, if I look into one that is running off the > same 0.24.1 package, this is what I see: > > I1012 14:56:36.933856 1704 slave.cpp:191] Flags at startup: > --appc_store_dir="/tmp/mesos/store/appc" > --attributes="rack:r2d2;pod:demo,dev" --authenticatee="crammd5" > --cgroups_cpu_enable_pids_and_tids_count="false" > --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" > --cgroups_limit_swap="false" --cgroups_root="mesos" > --container_disk_watch_interval="15secs" --containerizers="docker,mesos" > --default_role="*" --disk_watch_interval="1mins" --docker="docker" > --docker_kill_orphans="true" --docker_remove_delay="6hrs" > --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" > --enforce_container_disk_quota="false" > --executor_registration_timeout="1mins" > --executor_shutdown_grace_period="5secs" > --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" > --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" > --hadoop_home="" --help="false" --initialize_driver_logging="true" > --ip="192.168.33.11" --isolation="cgroups/cpu,cgroups/mem" > --launcher_dir="/usr/libexec/mesos" > --log_dir="/var/local/mesos/logs/agent" --logbufsecs="0" > --logging_level="INFO" --master="zk://192.168.33.1:2181/mesos/vagrant" > --oversubscribed_resources_interval="15secs" --perf_duration="10secs" > --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" > --quiet="false" --recover="reconnect" --recovery_timeout="15mins" > --registration_backoff_factor="1secs" > --resource_monitoring_interval="1secs" > --resources="ports:[9000-1];ephemeral_ports:[32768-57344]" > --revocable_cpu_low_priority="true" > --sandbox_directory="/var/local/sandbox" --strict="true" > --switch_user="true" --version="false" --work_dir="/var/local/mesos/agent" > (this is run off the 
Vagrantfile at [0] in case you want to reproduce). > That agent is not run via the init command, though, I execute it manually > via the `run-agent.sh` in the same directory. > > I don't really think this matters, but I assume you also restarted the > agent after making the config changes? > (and, for your own sanity - you can double check the version by looking at > the very head of the logs). > > > [0] http://github.com/massenz/zk-mesos > > > > > -- > *Marco Massenzio* > Distributed Systems Engineer > http://codetrips.com > > On Mon, Oct 12, 2015 at 10:50 PM, Jay Taylor <outtat...@gmail.com> wrote: > >> Hi Haosdent and Mesos friends, >> >> I've rebuilt the cluster from scratch and installed mesos 0.24.1 from the >> mesosphere apt repo: >> >> $ dpkg -l | grep mesos >> ii mesos 0.24.1-0.2.35.ubuntu1404 >>amd64Cluster resource manager with efficient resource isolation >> >> Then added the `launcher_dir' flag to /etc/mesos-slave/launcher_dir on >> the slaves: >> >> mesos-worker1a:~$ cat /etc/mesos-slave/launcher_dir >> /usr/libexec/mesos >> >> And yet the task health-checks are still being launched from the sandbox >> directory like before! >> >> I've also tested setting the MESOS_LAUNCHER_DIR env var and get the >> identical result (just as before on the cluster where many versions of >> mesos had been installed): >> >> STDOUT: >> >> --container="mesos-20151012-184440-1625401536-5050-23953-S0.62d43b8f-6cd1-4c53-9ac8-84dbfc45bbcb" >>> --docker="docker" --help="false" --initialize_driver_logging="true" >>> --logbufsecs="0" --logging_level="INFO" >>> --mapped_directory="/mnt/mesos/sandbox" --quiet="false" >>> --sandbox_directory="/tmp/mesos/slaves/20151012-184440-1625401536-5050-23953-S0/frameworks/20151012-184440-1625401536-5050-23953-/executors/hello-app_web-v3.33597b73-1943-41b4-a308-76132eebcc91/runs/62d43b8f-6cd1-4c53-9ac8-84dbfc45bbcb" >>> --stop_timeout="0ns" >>> --container="mesos-20151012-184440-162540
Re: Framework control over slave recovery
It sounds to me like a reasonable expectation that the framework should be notified if the agent(s) running one or more of its tasks start showing signs of unhealthiness - in most instances, we would expect frameworks to happily ignore such a situation and just let Mesos take care of the matter, but if they do care, they should be able to know. Not so sure about the feasibility of a 'per task timeout', but the notification would probably not be too complicated (although it does open up a whole new area of debate around implementation and how to modify the API to enable that). Could you please file a Jira requesting this as a feature on the Master? Thanks! *Marco Massenzio* *Distributed Systems Engineer http://codetrips.com <http://codetrips.com>* On Fri, Oct 9, 2015 at 3:29 PM, Marcus Larsson <marcus.lars...@oracle.com> wrote: > Hi, > > On 2015-10-09 15:26, Marco Massenzio wrote: > > The 'marking' of the task is not immediate: the Master actually waits a beat > or two to see if the Agent reconnects; there are various flags that control > behavior around this [0]. > > Naive question: I am assuming that you already looked into a combination > of: > > --max_slave_ping_timeouts=VALUE > --slave_ping_timeout=VALUE > --slave_removal_rate_limit=VALUE > --slave_reregister_timeout=VALUE > > that may help with your use case? > I'm not really an expert on these flags, so I'm not entirely sure whether a > combination thereof may work with your scenario. > > > Yeah, I've seen and tried using these flags. While they can be used to > prevent Mesos from killing the agents too quickly, the framework will not > be notified about the slave failing the health checks unless it times out > completely and the task is lost. Also, ideally we would want per-task > timeouts, whereas these settings are global. 
> > Thanks, > Marcus > > > [0] http://mesos.apache.org/documentation/latest/configuration/ > > > > > *Marco Massenzio* > > *Distributed Systems Engineer http://codetrips.com <http://codetrips.com>* > > On Fri, Oct 9, 2015 at 11:48 AM, Marcus Larsson <marcus.lars...@oracle.com > > wrote: > >> Hi, >> >> I'm part of a project investigating the use of Mesos for a distributed >> build and test system. For some of our tasks we would like to have more >> control over the slave recovery policy. Currently, when a slave fails its >> health check, it seems Mesos will always mark any task on the slave as >> lost, and shutdown the slave when (or if) it reconnects. We would like the >> framework to have more information and control over this. >> >> I found an issue [1] in JIRA that mentions implementing something like >> this, but it seems only the part with the slave removal rate limiter was >> implemented. What I'm wondering is if there is any support in Mesos for >> letting the framework decide how to handle slave removal/recovery? >> >> For our case, we would like the framework to be notified when a slave >> fails its health check, so that the appropriate action for the task running >> on that slave can be taken. Some of our tasks will be very long running and >> we don't want to restart a few days worth of work because the network was >> down for a while. >> >> Thanks, >> Marcus >> >> [1]: https://issues.apache.org/jira/browse/MESOS-2246 >> > > >
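To make the interplay of the first two flags concrete: the master declares an agent unhealthy only after `--max_slave_ping_timeouts` consecutive pings have gone unanswered, each waiting `--slave_ping_timeout`. A small sketch (the helper is hypothetical; the defaults of 15 seconds and 5 retries are from the configuration docs linked above at [0]):

```python
# Hypothetical helper illustrating how the two master flags combine into the
# window after which an agent is considered unhealthy. Defaults (15secs, 5)
# are taken from the Mesos configuration documentation.
def agent_health_check_window(slave_ping_timeout_secs=15.0,
                              max_slave_ping_timeouts=5):
    """Seconds of silence before the master marks an agent unhealthy."""
    return slave_ping_timeout_secs * max_slave_ping_timeouts

print(agent_health_check_window())        # 75.0 with the defaults
print(agent_health_check_window(30, 10))  # 300.0 for a more tolerant setup
```

Note that, as Marcus points out, these are global settings: there is currently no way to express a per-task tolerance, which is exactly what the requested feature would add.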
Re: mesos-ui
Re: the version: I tested against 0.24.1 and it worked just fine (apart from not being able to access info about tasks running for a framework, but that seems to be a "known issue": #17 if memory serves). And it did show resources utilized against available on the (one) node. It definitely looks pretty, so quite looking forward to where you guys are going to take it! *Marco Massenzio* *Distributed Systems Engineer http://codetrips.com <http://codetrips.com>* On Fri, Oct 9, 2015 at 4:48 PM, Taylor, Graham < graham.x.tay...@capgemini.com> wrote: > > > > Capgemini is a trading name used by the Capgemini Group of companies which > includes Capgemini UK plc, a company registered in England and Wales > (number 943935) whose registered office is at No. 1, Forge End, Woking, > Surrey, GU21 6DB. > This message contains information that may be privileged or confidential > and is the property of the Capgemini Group. It is intended only for the > person to whom it is addressed. If you are not the intended recipient, you > are not authorized to read, print, retain, copy, disseminate, distribute, > or use this message or any part thereof. If you receive this message in > error, please notify the sender immediately and delete all copies of this > message. > > > -- Forwarded message -- > From: "Taylor, Graham" <graham.x.tay...@capgemini.com> > To: "user@mesos.apache.org" <user@mesos.apache.org> > Cc: > Date: Fri, 9 Oct 2015 15:48:56 + > Subject: Re: mesos-ui > Hey Andras, > Yep - we’ve admittedly only tested it against 0.23 and on fairly small > sized clusters at the moment. There’s a ticket to look at supporting > different versions in the future > https://github.com/Capgemini/mesos-ui/issues/3 > > @Marco - Cam should still be there, if you fire him a message on twitter > at https://twitter.com/Wallies9 he might respond and you can hook up. > > Cheers, > Graham. 
> > > On 9 Oct 2015, at 16:26, Andras Kerekes <andras.kere...@ishisystems.com> > wrote: > > Hi Graham, > > I was able to set up the UI in Marathon pretty quickly. Two quick things: I > had to increase the allocated memory to 2Gb (vs the 512Mb on the website). > Also the node level stats were showing only available resources but no > actual allocation/utilization, I assume this might be because of the > version > we're using (0.22.1) vs the version it is tested against (0.23), right? > > The UI looks nice, thanks for open sourcing it! > > Andras > > -Original Message- > From: Taylor, Graham [mailto:graham.x.tay...@capgemini.com > <graham.x.tay...@capgemini.com>] > Sent: Friday, October 09, 2015 6:17 AM > To: user@mesos.apache.org > Subject: Re: mesos-ui > > >
Re: Is there any APIs for status monitering, how did the Webui got the status of mesos?
Probably the most appropriate endpoint(s) would be something like http://mesos-master:5050/system/stats.json http://mesos-master:5050/metrics/snapshot. For a much more basic 'health' check you can use the /health endpoint (this just gives you back a 200 OK if the Master/Agent is... feeling well :) I would recommend staying away from /state.json (soon to be /state) as it takes a heavy toll on the Master and you may end up DOS'ing your own cluster. *Marco Massenzio* *Distributed Systems Engineer http://codetrips.com <http://codetrips.com>* On Wed, Oct 7, 2015 at 6:37 PM, Klaus Ma <kl...@cguru.net> wrote: > Hi Chong, > > I think you can use Mesos’s REST API to achieve that; please refer to the > following URL for more detail: > http://mesos.apache.org/documentation/latest/monitoring/ > > > Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer > Platform Symphony/DCOS Development & Support, STG, IBM GCG > +86-10-8245 4084 | mad...@cn.ibm.com | http://www.cguru.net > > On Oct 8, 2015, at 09:04, Chong Chen <chong.ch...@huawei.com> wrote: > > Hi, > I want to implement a program to monitor Mesos. Are there any APIs > already implemented in Mesos that I can use to get its status? > Just like what the webui does: acquire information about the amount of > total resources, allocated resources, dispatched tasks, finished/lost > tasks…. > How does the webui of Mesos get this information? I think the fastest way > for me is to use the same method the webui does. > > Thanks! > > Best Regards, > Chong > > >
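A minimal poller against the endpoints recommended above might look like the following (a sketch only: the host name is illustrative, and error handling is kept deliberately simple):

```python
import json
from urllib.request import urlopen

def endpoint(host, path):
    # Build a master/agent URL, e.g. http://mesos-master:5050/metrics/snapshot
    return "http://%s%s" % (host, path)

def is_healthy(host):
    # /health returns 200 OK with an empty body when the process is up.
    try:
        return urlopen(endpoint(host, "/health"), timeout=5).getcode() == 200
    except OSError:  # DNS failure, refused connection, timeout, ...
        return False

def metrics_snapshot(host):
    # /metrics/snapshot returns a flat JSON map of counters and gauges.
    with urlopen(endpoint(host, "/metrics/snapshot"), timeout=5) as response:
        return json.load(response)

if __name__ == "__main__":
    host = "mesos-master:5050"  # illustrative hostname
    if is_healthy(host):
        print(metrics_snapshot(host))
```

Unlike /state.json, /metrics/snapshot is cheap to serve, which is what makes it suitable for frequent polling.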
Re: mesos-tail in 0.24.1
Granted, I'm not familiar at all with mesos-tail and/or mesos-resolve, but you are correct in that this is due to the recent changes (in 0.24) to the way we write MasterInfo data to ZooKeeper. This is a genuine bug, thanks for reporting it: would you mind terribly filing a Jira and assigning it to me, please? (marco-mesos) Thanks! *Marco Massenzio* *Distributed Systems Engineer http://codetrips.com <http://codetrips.com>* On Tue, Sep 29, 2015 at 6:28 AM, Rad Gruchalski <ra...@gruchalski.com> wrote: > Thank you, that’s some progress: > > I changed the code at this line: > > > https://github.com/mesosphere/mesos-cli/blob/master/mesos/cli/master.py#L107 > > to: > > try: > parsed = json.loads(val) > return parsed["address"]["ip"] + ":" + > str(parsed["address"]["port"]) > except Exception: > return val.split("@")[-1] > > And now it gives me the correct master. However, executing mesos-tail or > mesos-ps does not do anything, it just hangs there without any output. > Something obviously does not work as advertised. > Or I should possibly switch to https://github.com/mesosphere/dcos-cli ( > https://pypi.python.org/pypi/dcoscli), but will this work with just a > regular mesos 0.24.1 installation? > > Kind regards, > Radek Gruchalski > ra...@gruchalski.com <ra...@gruchalski.com> > de.linkedin.com/in/radgruchalski/ > > > *Confidentiality:* This communication is intended for the above-named > person and may be confidential and/or legally privileged. > If it has come to you in error you must take no action based on it, nor > must you copy or show it to anyone; please delete/destroy and inform the > sender immediately. > > On Tuesday, 29 September 2015 at 15:20, haosdent wrote: > > I think the problem here is that you use zk as the schema in your config file ( > .mesos.json) or MESOS_CLI_CONFIG ( > https://github.com/mesosphere/mesos-cli/blob/master/mesos/cli/cfg.py#L42 > and > https://github.com/mesosphere/mesos-cli/blob/master/mesos/cli/master.py#L119). 
> It is not because of 0.24.1; you would hit the same issue with 0.24.0. > > On Tue, Sep 29, 2015 at 9:14 PM, haosdent <haosd...@gmail.com> wrote: > > I think you installed mesos-cli from https://github.com/mesosphere/mesos-cli > > On Tue, Sep 29, 2015 at 8:51 PM, Rad Gruchalski <ra...@gruchalski.com> > wrote: > > It seems that I found the reason for this behaviour. > When I execute mesos-resolve, I get an output like this: > > 10.100.1.100:5050","port":5050,"version":"0.24.1"} > > I managed to get to the python sources on the machine, especially > master.py. I verified that in my case the zookeeper_resolver is used. > However, what gets returned from the zookeeper resolver is: > > return val.split("@")[-1] > > Where val is a JSON string: > > > > {"address":{"hostname":"mesos-master","ip":"10.100.1.100","port":5050},"hostname":"mesos-master","id":"20150929-113531-244404234-5050-18065","ip":...,"pid":" master@10.100.1.100:5050","port":5050,"version":"0.24.1"} > > Looking at these two, it is obvious why it does not work. I’m trying to > find the code for master.py but it does not exist in > https://github.com/apache/mesos/tree/master/src/python/interface/src/mesos/interface > . > Where does it come from? Is it somehow generated or is it a separate repo? > > Kind regards, > Radek Gruchalski > ra...@gruchalski.com <ra...@gruchalski.com> > de.linkedin.com/in/radgruchalski/ > > On Tuesday, 29 September 2015 at 13:02, Rad Gruchalski wrote: > > Hi everyone, > > I have upgraded my development mesos environment to 0.24.1 this morning. > It’s a clean installation with new zookeeper and everything. 
> Since the upgrade I get an error while executing mesos-tail: > > mesos-master ~$ mesos tail -f -n 50 service > Traceback (most recent call last): > File "/usr/local/bin/mesos-tail", line 11, in > sys.exit(main()) > File "/usr/local/lib/python2.7/dist-packages/mesos/cli/cli.py", line 61, > in wrapper > return fn(*args, **kwargs) > File "/usr/local/lib/python2.7/dist-packages/mesos/cli/cmds/tail.py", > line 55,
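The resolver fix Rad describes earlier in the thread can be captured in a small, version-tolerant helper - a sketch of the patched `zookeeper_resolver` logic, not the upstream mesos-cli code:

```python
import json

def resolve_master(val):
    """Return "ip:port" from the raw value of a master znode.

    Mesos >= 0.24 stores MasterInfo in the znode as JSON; older masters
    stored a serialized PID such as "master@10.100.1.100:5050", so fall
    back to the old split when the value is not valid JSON.
    """
    try:
        parsed = json.loads(val)
        return "%s:%d" % (parsed["address"]["ip"], parsed["address"]["port"])
    except ValueError:
        return val.split("@")[-1]

# Both formats resolve to the same address:
print(resolve_master('{"address": {"ip": "10.100.1.100", "port": 5050}}'))
print(resolve_master("master@10.100.1.100:5050"))
# -> 10.100.1.100:5050 in both cases
```

Handling both formats is what keeps the CLI working across a rolling upgrade, where old- and new-format znodes can briefly coexist.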
Re: Fwd: [Breaking Change 0.24 & Upgrade path] ZooKeeper MasterInfo change.
+1 to what Alex says. As far as we know, the functionality we use (ephemeral sequential nodes and writing simple data to a znode) is part of the "base API" offered by ZooKeeper and every version would support it. (then again, not a ZK expert here - if anyone knows better, please feel free to correct me). *Marco Massenzio* *Distributed Systems Engineerhttp://codetrips.com <http://codetrips.com>* On Fri, Sep 25, 2015 at 6:24 AM, Alex Rukletsov <a...@mesosphere.com> wrote: > James— > > Marco will correct me if I'm wrong, but my understanding is that this > change does *not* impact what ZooKeeper version you can use with Mesos. We > have changed the format of the message stored in ZK from protobuf to JSON. > This message is needed by frameworks for mesos master leader detection. > > HTH, > Alex > > On Fri, Sep 25, 2015 at 11:12 AM, CCAAT <cc...@tampabay.rr.com> wrote: > >> On 09/25/2015 08:13 AM, Marco Massenzio wrote: >> >>> Folks: >>> >>> as a reminder, please be aware that as of Mesos 0.24.0, as announced >>> back in June, Mesos Master will write its information (`MasterInfo`) to >>> ZooKeeper in JSON format (see below for details). >>> >> >> >> What versions of Zookeeper are supported by this change? That is, what >> is the oldest version of Zookeeper known to work or not work with this >> change in Mesos? >> >> >> James >> >> >> >> >> >>> If your framework relied on parsing the info (either de-serializing the >>> Protocol Buffer or just looking for an "IP-like" string) this change >>> will be a breaking change. >>> >>> Just to confirm (see also Vinod's comments below) any rolling upgrades >>> (i.e., clusters with 0.22+0.23 and 0.23+0.24) of Mesos will just work. >>> >>> This was in conjunction with the HTTP API release and removing the need >>> for non-C++ developers to have to link with libmesos and have to deal >>> with Protocol Buffers. 
>>> >>> An example of how to access the new format in Python can be found in [0] >>> and we're happy to help with other languages too. >>> Any questions, please just ask. >>> >>> [0] http://github.com/massenz/zk-mesos >>> >>> Marco Massenzio >>> /Distributed Systems Engineer >>> http://codetrips.com/ >>> >>> -- Forwarded message -- >>> From: *Vinod Kone* <vinodk...@gmail.com <mailto:vinodk...@gmail.com>> >>> Date: Wed, Jun 24, 2015 at 4:17 PM >>> Subject: Re: [Breaking Change 0.24 & Upgrade path] ZooKeeper MasterInfo >>> change. >>> To: dev <d...@mesos.apache.org <mailto:d...@mesos.apache.org>> >>> >>> >>> Just to clarify, any frameworks that are using the Mesos provided >>> bindings >>> (aka libmesos.so) should not worry, as long as the version of the >>> bindings >>> and version of the mesos master are not separated by more than 1 version. >>> In other words, you should be able to live upgrade a cluster from 0.23.0 >>> to >>> 0.24.0. >>> >>> For framework schedulers that don't use the bindings (pesos, jesos etc), >>> it >>> is prudent to add support for JSON formatted ZNODE to their master >>> detection code. >>> >>> Thanks, >>> >>> On Wed, Jun 24, 2015 at 4:10 PM, Marco Massenzio <ma...@mesosphere.io >>> <mailto:ma...@mesosphere.io>> >>> wrote: >>> >>> Folks, >>>> >>>> as heads-up, we are planning to convert the format of the MasterInfo >>>> information stored in ZooKeeper from the Protocol Buffer binary format >>>> to >>>> JSON - this is in conjunction with the HTTP API development, to allow >>>> frameworks *not* to depend on libmesos and other binary dependencies to >>>> interact with Mesos Master nodes. >>>> >>>> > *NOTE* - there is no change in 0.23 (so any Master/Slave/Framework >>> that is >>> > currently working in 0.22 *will continue to work* in 0.23 too) but as >>> of >>> >>>> Mesos 0.24, frameworks and other clients relying on the binary format >>>> will >>>> break. 
>>>> >>>> The details of the design are in this Google Doc: >>>> >>>> >>>> https://docs.google.com/document/d/1i2pWJaIjnFYhuR-000NG-AC1rFKKrRh3Wn47Y2G6lRE/edit >>>> >>>> the actual work is detailed in MESOS-2340: >>>> https://issues.apache.org/jira/browse/MESOS-2340 >>>> >>>> and the patch (and associated test) are here: >>>> https://reviews.apache.org/r/35571/ >>>> https://reviews.apache.org/r/35815/ >>>> >>>> > *Marco Massenzio* >>> > *Distributed Systems Engineer* >>> > >>> >>> >> >
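Since the thread is about leader detection: masters register under the ZooKeeper path as ephemeral sequential znodes, and the leader is the one with the lowest sequence number. A sketch of detection over the new JSON format follows (the znode names are illustrative, and the ZooKeeper client is abstracted away as a lookup function; the massenz/zk-mesos repository linked above has a full Python implementation):

```python
import json

def leading_master(children, read_znode):
    """Pick the leading master among the registered znodes.

    children: znode names ending in a ZooKeeper sequence number
              (illustrative names, e.g. "json.info_0000000051");
    read_znode: maps a znode name to the raw JSON the master wrote.
    """
    leader = min(children, key=lambda name: int(name.rsplit("_", 1)[-1]))
    info = json.loads(read_znode(leader))
    return "%s:%d" % (info["address"]["ip"], info["address"]["port"])

# Example with two registered masters; the lower sequence number wins:
znodes = {
    "json.info_0000000051": '{"address": {"ip": "10.0.0.1", "port": 5050}}',
    "json.info_0000000052": '{"address": {"ip": "10.0.0.2", "port": 5050}}',
}
print(leading_master(list(znodes), znodes.get))  # 10.0.0.1:5050
```

Because the data is plain JSON, any language with a ZooKeeper client can do this without linking against libmesos or Protocol Buffers, which was the point of the change.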
Re: Official RPMs
Yes, the plan is definitely to make the tooling available to the project: there is nothing "secret" about it - at the moment, unfortunately, it relies on a bit of internal infrastructure and, well, yesss, it's a bit too crafty to be ready for "external consumption" but we're working on it! *Marco Massenzio* *Distributed Systems Engineerhttp://codetrips.com <http://codetrips.com>* On Fri, Sep 25, 2015 at 11:33 AM, Zameer Manji <zma...@apache.org> wrote: > Could mesosphere donate their tooling for packaging mesos to the project? > This way any project member or contributor can build packages and it can be > apart of the release process. > > On Fri, Sep 25, 2015 at 10:53 AM, Artem Harutyunyan <ar...@mesosphere.io> > wrote: > >> The repositories have been updated yesterday, and the downloads page >> was updated today. Mesos 0.24 packages are now available at >> https://mesosphere.com/downloads/. Thank you very much for your >> patience! >> >> Cheers, >> Artem. >> >> On Tue, Sep 22, 2015 at 11:02 AM, Marco Massenzio <ma...@mesosphere.io> >> wrote: >> > Hi guys, >> > >> > just wanted to let you all know that we (Mesosphere) fully intend to >> > continue supporting distributing binary packages for the current set of >> > supported OSes (namely, Ubuntu / Debian / RedHat / CentOS as listed in >> [0]). >> > >> > Sorry that 0.24 slipped through the cracks, the person who actually >> takes >> > care of that and knows the magic incantations has been unwell, and a >> number >> > of other competing priorities got in the way - we will eventually be >> > automating the process, so that downloadable binary packages are >> created out >> > of each release/RC build (and, possibly, even more often) without pesky >> > humans getting in the way :) but this may take some time. >> > We're building the 0.24 ones as we speak, so please bear with us while >> this >> > gets done. >> > >> > Any questions / suggestions, we'd love to hear those too! 
>> > >> > [0] https://mesosphere.com/downloads/ >> > >> > Marco Massenzio >> > Distributed Systems Engineer >> > http://codetrips.com >> > >> > On Tue, Sep 22, 2015 at 10:54 AM, CCAAT <cc...@tampabay.rr.com> wrote: >> >> >> >> On 09/21/2015 03:01 PM, Vinod Kone wrote: >> >>> >> >>> +Jake Farrell >> >>> >> >>> The mesos project doesn't publish platform dependent artifacts. We >> >>> currently only publish platform independent artifacts like JAR (to >> >>> apache maven) and interface EGG (to PyPI). >> >>> >> >>> Recently we made the decision >> >>> <http://www.mail-archive.com/dev%40mesos.apache.org/msg33148.html> >> for >> >>> the project to not officially support different language (java, >> python) >> >>> framework libraries going forward (likely after 1.0). The project will >> >>> only support C++ libraries which will live in the repo and link to >> other >> >>> language libraries from our website. >> >>> >> >>> The main reason was that the PMC lacks the expertise to support >> various >> >>> language bindings and hence we wanted to remove the support burden. >> >>> >> >>> Option #1) It looks like we could do a similar thing with RPMs/DEBs, >> >>> i.e., link to 3rd party artifacts from the project website. Similar to >> >>> the client library authors, we could hold package maintainers >> >>> accountable by providing guidelines. >> >>> >> >>> Option #2) Since the project officially supports certain platforms >> >>> (Ubuntu, CentOS, OSX) and continuously tests this via CI, we could've >> >>> the CI continuously build and upload the packages. Not sure what's ASF >> >>> stance on this is. I filed a ticket >> >>> <https://issues.apache.org/jira/browse/INFRA-10385> a while ago with >> >>> INFRA regarding something similar, but never received any response. >> >>> >> >>> Personally, with the direction the project is headed towards, I prefer >> >>> #1. 
>> >> >> >> >> >> +1 (Option #1) >> >> >> >> This 'Option #1' approach will require the core dev team to clearly >> convey >> >> what is needed for any OS supported
Fwd: [Breaking Change 0.24 & Upgrade path] ZooKeeper MasterInfo change.
Folks: as a reminder, please be aware that as of Mesos 0.24.0, as announced back in June, Mesos Master will write its information (`MasterInfo`) to ZooKeeper in JSON format (see below for details). If your framework relied on parsing the info (either de-serializing the Protocol Buffer or just looking for an "IP-like" string) this change will be a breaking change. Just to confirm (see also Vinod's comments below) any rolling upgrades (i.e., clusters with 0.22+0.23 and 0.23+0.24) of Mesos will just work. This was in conjunction with the HTTP API release and removing the need for non-C++ developers to have to link with libmesos and have to deal with Protocol Buffers. An example of how to access the new format in Python can be found in [0] and we're happy to help with other languages too. Any questions, please just ask. [0] http://github.com/massenz/zk-mesos Marco Massenzio *Distributed Systems Engineerhttp://codetrips.com <http://codetrips.com>* -- Forwarded message -- From: Vinod Kone <vinodk...@gmail.com> Date: Wed, Jun 24, 2015 at 4:17 PM Subject: Re: [Breaking Change 0.24 & Upgrade path] ZooKeeper MasterInfo change. To: dev <d...@mesos.apache.org> Just to clarify, any frameworks that are using the Mesos provided bindings (aka libmesos.so) should not worry, as long as the version of the bindings and version of the mesos master are not separated by more than 1 version. In other words, you should be able to live upgrade a cluster from 0.23.0 to 0.24.0. For framework schedulers that don't use the bindings (pesos, jesos etc), it is prudent to add support for JSON formatted ZNODE to their master detection code. 
Thanks, On Wed, Jun 24, 2015 at 4:10 PM, Marco Massenzio <ma...@mesosphere.io> wrote: > Folks, > > as heads-up, we are planning to convert the format of the MasterInfo > information stored in ZooKeeper from the Protocol Buffer binary format to > JSON - this is in conjunction with the HTTP API development, to allow > frameworks *not* to depend on libmesos and other binary dependencies to > interact with Mesos Master nodes. > > *NOTE* - there is no change in 0.23 (so any Master/Slave/Framework that is > currently working in 0.22 *will continue to work* in 0.23 too) but as of > Mesos 0.24, frameworks and other clients relying on the binary format will > break. > > The details of the design are in this Google Doc: > > https://docs.google.com/document/d/1i2pWJaIjnFYhuR-000NG-AC1rFKKrRh3Wn47Y2G6lRE/edit > > the actual work is detailed in MESOS-2340: > https://issues.apache.org/jira/browse/MESOS-2340 > > and the patch (and associated test) are here: > https://reviews.apache.org/r/35571/ > https://reviews.apache.org/r/35815/ > > *Marco Massenzio* > *Distributed Systems Engineer* >
Re: Official RPMs
Hi guys, just wanted to let you all know that we (Mesosphere) fully intend to continue supporting distributing binary packages for the current set of supported OSes (namely, Ubuntu / Debian / RedHat / CentOS as listed in [0]). Sorry that 0.24 slipped through the cracks, the person who actually takes care of that and knows the magic incantations has been unwell, and a number of other competing priorities got in the way - we will eventually be automating the process, so that downloadable binary packages are created out of each release/RC build (and, possibly, even more often) without pesky humans getting in the way :) but this may take some time. We're building the 0.24 ones as we speak, so please bear with us while this gets done. Any questions / suggestions, we'd love to hear those too! [0] https://mesosphere.com/downloads/ *Marco Massenzio* *Distributed Systems Engineerhttp://codetrips.com <http://codetrips.com>* On Tue, Sep 22, 2015 at 10:54 AM, CCAAT <cc...@tampabay.rr.com> wrote: > On 09/21/2015 03:01 PM, Vinod Kone wrote: > >> +Jake Farrell >> >> The mesos project doesn't publish platform dependent artifacts. We >> currently only publish platform independent artifacts like JAR (to >> apache maven) and interface EGG (to PyPI). >> >> Recently we made the decision >> <http://www.mail-archive.com/dev%40mesos.apache.org/msg33148.html> for >> the project to not officially support different language (java, python) >> framework libraries going forward (likely after 1.0). The project will >> only support C++ libraries which will live in the repo and link to other >> language libraries from our website. >> >> The main reason was that the PMC lacks the expertise to support various >> language bindings and hence we wanted to remove the support burden. >> >> Option #1) It looks like we could do a similar thing with RPMs/DEBs, >> i.e., link to 3rd party artifacts from the project website. 
Similar to >> the client library authors, we could hold package maintainers >> accountable by providing guidelines. >> >> Option #2) Since the project officially supports certain platforms >> (Ubuntu, CentOS, OSX) and continuously tests this via CI, we could've >> the CI continuously build and upload the packages. Not sure what's ASF >> stance on this is. I filed a ticket >> <https://issues.apache.org/jira/browse/INFRA-10385> a while ago with >> INFRA regarding something similar, but never received any response. >> >> Personally, with the direction the project is headed towards, I prefer #1. >> > > +1 (Option #1) > > This 'Option #1' approach will require the core dev team to clearly convey > what is needed for any OS supported, not the chosen OSes for support. Right > now, I'm having to parse many documents to figure out how to extend the > gentoo ebuild for mesos. And where to cut off what I do in the ebuilds and > what to put into the configuration documents for gentoo. Naturally the > minimial is only what should be in the the gentoo ebuild; with other items, > such as HDFS as a compiler option. Once I get the btrfs/ceph work > stabilized, there will be a compile time option for btrfs/ceph with the > gentoo ebuild. Other distros that are not going that > way should have other Distributed File System options 'baked into' their > installation on that OS. > > > > 'Option #1' sets the stage for many OSes to be supported and the core dev > team only has to support a single document to clarify what any distro > needs to robustly support mesos for their user community. This will > facilitate a wider variety of experimentation, at the companion repos too. > This Option #1 approach will further accelerate adoption of Mesos on a > very wide variety of platforms and architectures, imho. 
It sets the stage > for valid benchmark performance comparison between distros; something that > the gentoo community will no doubt win > > ;-) > > James > > > > > >> On Sat, Sep 19, 2015 at 3:39 AM, Carlos Sanchez <car...@apache.org >> <mailto:car...@apache.org>> wrote: >> >> I'm using the same repo with some changes to build SSL enabled >> packages >> >> >> https://github.com/carlossg/mesos-deb-packaging/compare/master...carlossg:ssl >> >> >> On Sat, Sep 19, 2015 at 4:22 AM, Rad Gruchalski >> <ra...@gruchalski.com <mailto:ra...@gruchalski.com>> wrote: >> > Should be rather easy to package it with this little tool from >> Mesosphere: >> > https://github.com/mesosphere/mesos-deb-packaging. I’ve done it >> myself for >> > ubuntu 12.04 and 14.04. >> >
Re: Help interpreting output from running java test-framework example
Thanks, Stephen - feedback much appreciated! *Marco Massenzio* *Distributed Systems Engineerhttp://codetrips.com <http://codetrips.com>* On Thu, Sep 17, 2015 at 5:03 PM, Stephen Boesch <java...@gmail.com> wrote: > Compared to Yarn Mesos is just faster. Mesos has a smaller startup time > and the delay between tasks is smaller. The run times for terasort 100GB > tended towards 110sec median on Mesos vs about double that on Yarn. > > Unfortunately we require mature Multi-Tenancy/Isolation/Queues support > -which is still initial stages of WIP for Mesos. So we will need to use > YARN for the near and likely medium term. > > > > 2015-09-17 15:52 GMT-07:00 Marco Massenzio <ma...@mesosphere.io>: > >> Hey Stephen, >> >> The spark on mesos is twice as fast as yarn on our 20 node cluster. In >>> addition Mesos is handling datasizes that yarn simply dies on it. But >>> mesos is still just taking linearly increased time compared to smaller >>> datasizes. >> >> >> Obviously delighted to hear that, BUT me not much like "but" :) >> I've added Tim who is one of the main contributors to our Mesos/Spark >> bindings, and it would be great to hear your use case/experience and find >> out whether we can improve on that front too! >> >> As the case may be, we could also jump on a hangout if it makes >> conversation easier/faster. >> >> Cheers, >> >> *Marco Massenzio* >> >> *Distributed Systems Engineerhttp://codetrips.com <http://codetrips.com>* >> >> On Wed, Sep 9, 2015 at 1:33 PM, Stephen Boesch <java...@gmail.com> wrote: >> >>> Thanks Vinod. I went back to see the logs and nothing interesting . >>> However int he process I found that my spark port was pointing to 7077 >>> instead of 5050. After re-running .. spark on mesos worked! >>> >>> The spark on mesos is twice as fast as yarn on our 20 node cluster. In >>> addition Mesos is handling datasizes that yarn simply dies on it. But >>> mesos is still just taking linearly increased time compared to smaller >>> datasizes. 
>>> >>> We have significant additional work to incorporate mesos into operations >>> and support but given the strong perforrmance and stability characterstics >>> we are initially seeing here that effort is likely to get underway. >>> >>> >>> >>> 2015-09-09 12:54 GMT-07:00 Vinod Kone <vinodk...@gmail.com>: >>> >>>> sounds like it. can you see what the slave/agent and executor logs say? >>>> >>>> On Tue, Sep 8, 2015 at 11:46 AM, Stephen Boesch <java...@gmail.com> >>>> wrote: >>>> >>>>> >>>>> I am in the process of learning how to run a mesos cluster with the >>>>> intent for it to be the resource manager for Spark. As a small step in >>>>> that direction a basic test of mesos was performed, as suggested by the >>>>> Mesos Getting Started page. >>>>> >>>>> In the following output we see tasks launched and resources offered on >>>>> a 20 node cluster: >>>>> >>>>> [stack@yarnmaster-8245 build]$ ./src/examples/java/test-framework >>>>> $(hostname -s):5050 >>>>> I0908 18:40:10.900964 31959 sched.cpp:157] Version: 0.23.0 >>>>> I0908 18:40:10.918957 32000 sched.cpp:254] New master detected at >>>>> master@10.64.204.124:5050 >>>>> I0908 18:40:10.921525 32000 sched.cpp:264] No credentials provided. >>>>> Attempting to register without authentication >>>>> I0908 18:40:10.928963 31997 sched.cpp:448] Framework registered with >>>>> 20150908-182014-2093760522-5050-15313- >>>>> Registered! 
ID = 20150908-182014-2093760522-5050-15313- >>>>> Received offer 20150908-182014-2093760522-5050-15313-O0 with cpus: >>>>> 16.0 and mem: 119855.0 >>>>> Launching task 0 using offer 20150908-182014-2093760522-5050-15313-O0 >>>>> Launching task 1 using offer 20150908-182014-2093760522-5050-15313-O0 >>>>> Launching task 2 using offer 20150908-182014-2093760522-5050-15313-O0 >>>>> Launching task 3 using offer 20150908-182014-2093760522-5050-15313-O0 >>>>> Launching task 4 using offer 20150908-182014-2093760522-5050-15313-O0 >>>>> Received offer 20150908-182014-2093760522-5050-15313-O1 with cpus: >>>>> 16.0 and mem: 119855.0 >>>>> Received offer 20150908-182014-2093760522-5050-15313-O2 with
Re: Help interpreting output from running java test-framework example
Hey Stephen, The spark on mesos is twice as fast as yarn on our 20 node cluster. In > addition Mesos is handling datasizes that yarn simply dies on it. But > mesos is still just taking linearly increased time compared to smaller > datasizes. Obviously delighted to hear that, BUT me not much like "but" :) I've added Tim who is one of the main contributors to our Mesos/Spark bindings, and it would be great to hear your use case/experience and find out whether we can improve on that front too! As the case may be, we could also jump on a hangout if it makes conversation easier/faster. Cheers, *Marco Massenzio* *Distributed Systems Engineerhttp://codetrips.com <http://codetrips.com>* On Wed, Sep 9, 2015 at 1:33 PM, Stephen Boesch <java...@gmail.com> wrote: > Thanks Vinod. I went back to see the logs and nothing interesting . > However int he process I found that my spark port was pointing to 7077 > instead of 5050. After re-running .. spark on mesos worked! > > The spark on mesos is twice as fast as yarn on our 20 node cluster. In > addition Mesos is handling datasizes that yarn simply dies on it. But > mesos is still just taking linearly increased time compared to smaller > datasizes. > > We have significant additional work to incorporate mesos into operations > and support but given the strong perforrmance and stability characterstics > we are initially seeing here that effort is likely to get underway. > > > > 2015-09-09 12:54 GMT-07:00 Vinod Kone <vinodk...@gmail.com>: > >> sounds like it. can you see what the slave/agent and executor logs say? >> >> On Tue, Sep 8, 2015 at 11:46 AM, Stephen Boesch <java...@gmail.com> >> wrote: >> >>> >>> I am in the process of learning how to run a mesos cluster with the >>> intent for it to be the resource manager for Spark. As a small step in >>> that direction a basic test of mesos was performed, as suggested by the >>> Mesos Getting Started page. 
>>> >>> In the following output we see tasks launched and resources offered on a >>> 20 node cluster: >>> >>> [stack@yarnmaster-8245 build]$ ./src/examples/java/test-framework >>> $(hostname -s):5050 >>> I0908 18:40:10.900964 31959 sched.cpp:157] Version: 0.23.0 >>> I0908 18:40:10.918957 32000 sched.cpp:254] New master detected at >>> master@10.64.204.124:5050 >>> I0908 18:40:10.921525 32000 sched.cpp:264] No credentials provided. >>> Attempting to register without authentication >>> I0908 18:40:10.928963 31997 sched.cpp:448] Framework registered with >>> 20150908-182014-2093760522-5050-15313- >>> Registered! ID = 20150908-182014-2093760522-5050-15313- >>> Received offer 20150908-182014-2093760522-5050-15313-O0 with cpus: 16.0 >>> and mem: 119855.0 >>> Launching task 0 using offer 20150908-182014-2093760522-5050-15313-O0 >>> Launching task 1 using offer 20150908-182014-2093760522-5050-15313-O0 >>> Launching task 2 using offer 20150908-182014-2093760522-5050-15313-O0 >>> Launching task 3 using offer 20150908-182014-2093760522-5050-15313-O0 >>> Launching task 4 using offer 20150908-182014-2093760522-5050-15313-O0 >>> Received offer 20150908-182014-2093760522-5050-15313-O1 with cpus: 16.0 >>> and mem: 119855.0 >>> Received offer 20150908-182014-2093760522-5050-15313-O2 with cpus: 16.0 >>> and mem: 119855.0 >>> Received offer 20150908-182014-2093760522-5050-15313-O3 with cpus: 16.0 >>> and mem: 119855.0 >>> Received offer 20150908-182014-2093760522-5050-15313-O4 with cpus: 16.0 >>> and mem: 119855.0 >>> Received offer 20150908-182014-2093760522-5050-15313-O5 with cpus: 16.0 >>> and mem: 119855.0 >>> Received offer 20150908-182014-2093760522-5050-15313-O6 with cpus: 16.0 >>> and mem: 119855.0 >>> Received offer 20150908-182014-2093760522-5050-15313-O7 with cpus: 16.0 >>> and mem: 119855.0 >>> Received offer 20150908-182014-2093760522-5050-15313-O8 with cpus: 16.0 >>> and mem: 119855.0 >>> Received offer 20150908-182014-2093760522-5050-15313-O9 with cpus: 16.0 >>> and 
mem: 119855.0 >>> Received offer 20150908-182014-2093760522-5050-15313-O10 with cpus: 16.0 >>> and mem: 119855.0 >>> Received offer 20150908-182014-2093760522-5050-15313-O11 with cpus: 16.0 >>> and mem: 119855.0 >>> Received offer 20150908-182014-2093760522-5050-15313-O12 with cpus: 16.0 >>> and mem: 119855.0 >>> Received offer 20150908-182014-2093760522-5050-15313-O13 with cpus: 16.0 >>> and mem: 119855.0 >>> Received offer 20150908-182014-20937605
Re: Basic installation question
Stephen: Klaus is correct, you are starting the Master in "standalone" mode, not with zookeeper support: it needs adding the --zk=zk://10.xx.xx.124:2181/mesos --quorum=1 options (at the very least). As you correctly noted, the contents of the /mesos znode is empty and thus the agent nodes cannot find elected Master leader (also, if you are running more than one Master, they won't 'know' about each other and won't be able to elect a leader). To check that your settings work, you can (a) look in Master logs (it will log a lot of info when connecting to ZK) and (b) see that under /mesos a number of json.info_nn nodes will appear (whose contents are JSON so you can double check that the contents make sense). You can find more info here[0]. [0] http://codetrips.com/2015/08/16/apache-mesos-leader-master-discovery-using-zookeeper-part-2/ *Marco Massenzio* *Distributed Systems Engineerhttp://codetrips.com <http://codetrips.com>* On Fri, Sep 4, 2015 at 5:33 PM, Stephen Boesch <java...@gmail.com> wrote: > > I installed using yum -y install mesos. That did work. > > Now the master and slaves do not see each other. 
> > > Here is the master: > $ ps -ef | grep mesos | grep -v grep > stack30236 17902 0 00:09 pts/400:00:04 > /mnt/mesos/build/src/.libs/lt-mesos-master --work_dir=/tmp/mesos > --ip=10.xx.xx.124 > > > Here is one of the 20 slaves: > > ps -ef | grep mesos | grep -v grep > root 26086 1 0 00:10 ?00:00:00 /usr/sbin/mesos-slave > --master=zk://10.xx.xx.124:2181/mesos --log_dir=/var/log/mesos > root 26092 26086 0 00:10 ?00:00:00 logger -p user.info -t > mesos-slave[26086] > root 26093 26086 0 00:10 ?00:00:00 logger -p user.err -t > mesos-slave[26086] > > > Note the slave and master are on correct same ip address > > The /etc/mesos/zk seems to be set properly : and I do see the /mesos node > in zookeeper is updated after restarting the master > > However the zookeeper node is empty: > > [zk: localhost:2181(CONNECTED) 10] ls /mesos > [] > > The node is world accessible so no permission issue: > > [zk: localhost:2181(CONNECTED) 12] getAcl /mesos > 'world,'anyone > : cdrwa > > Why is the zookeeper node empty? Is this the reason the master and > slaves are not connecting? > > 2015-09-04 14:56 GMT-07:00 craig w <codecr...@gmail.com>: > >> No problem, they have a "downloads" link inn their menu: >> https://mesosphere.com/downloads/ >> On Sep 4, 2015 5:43 PM, "Stephen Boesch" <java...@gmail.com> wrote: >> >>> @Craig . That is an incomplete answer - given that such links are not >>> presented in an obvious manner . Maybe you managed to find a link on >>> their site that provides prebuilt for Centos7: if so then please share it. >>> >>> >>> I had previously found a link on their site for prebuilt binaries but is >>> based on using CDH4 (which is not possible for my company). It is also old. 
>>> >>> https://docs.mesosphere.com/tutorials/install_centos_rhel/ >>> >>> >>> 2015-09-04 14:27 GMT-07:00 craig w <codecr...@gmail.com>: >>> >>>> Mesosphere has packages prebuilt, go to their site to find how to >>>> install >>>> On Sep 4, 2015 5:11 PM, "Stephen Boesch" <java...@gmail.com> wrote: >>>> >>>>> >>>>> After following the directions here: >>>>> http://mesos.apache.org/gettingstarted/ >>>>> >>>>> Which for centos7 includes the following: >>>>> >>>>> >>>>> >>>>> >>>>> # Change working directory. >>>>> $ cd mesos >>>>> >>>>> # Bootstrap (Only required if building from git repository). >>>>> $ ./bootstrap >>>>> >>>>> # Configure and build. >>>>> $ mkdir build >>>>> $ cd build >>>>> $ ../configure >>>>> $ make >>>>> >>>>> In order to speed up the build and reduce verbosity of the logs, you >>>>> can append-j V=0 to make. >>>>> >>>>> # Run test suite. >>>>> $ make check >>>>> >>>>> # Install (Optional). >>>>> $ make install >>>>> >>>>> >>>>> >>>>> But the installation is not correct afterwards: here is the bin >>>>> directory: >>>>> >>>>> $ ll bin >>>>> total 92 >>&g
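To make the leader-detection step above concrete: once the Master is started with `--zk`, the children of the `/mesos` znode can be inspected programmatically. The sketch below is plain Python with no ZooKeeper client — pass in the listing you would get from `ls /mesos`. It assumes the usual ZooKeeper leader-election convention (the ephemeral sequential node with the lowest sequence number belongs to the elected leader) and the `json.info_` prefix mentioned in this thread; an empty listing means no master has registered with ZooKeeper at all, which is exactly the symptom Stephen saw.

```python
def leading_master_node(znodes):
    """Return the znode name of the elected leader, or None.

    Assumes: masters register ephemeral sequential "json.info_NNNN"
    nodes under /mesos, and the lowest sequence number is the leader.
    An empty result means no master is registered with ZooKeeper
    (e.g. the Master was started without --zk).
    """
    # Zero-padded sequence suffixes sort correctly as strings.
    candidates = sorted(n for n in znodes if n.startswith("json.info_"))
    return candidates[0] if candidates else None
```

For example, `leading_master_node([])` returning `None` corresponds to the empty `ls /mesos` output above, and signals a Master running in standalone mode rather than a permissions problem.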
Re: Basic installation question
Thanks for follow-up, Stephen - this will be also useful to others finding this in the archives! Glad it eventually worked for you, I'll drop a line to our guys to update the download page with this information, so it should hopefully be less painful in the future for others. *Marco Massenzio* *Distributed Systems Engineerhttp://codetrips.com <http://codetrips.com>* On Sat, Sep 5, 2015 at 3:00 PM, Stephen Boesch <java...@gmail.com> wrote: > Yes I had started the slaves as > > service mesos-slave start > > But had not done the correct way on the master, which is supposed to be: > > service mesos-master start > > The slaves do appear after having made that correction: thanks. > > > 2015-09-05 14:55 GMT-07:00 Marco Massenzio <ma...@mesosphere.io>: > >> Stephen: >> >> Klaus is correct, you are starting the Master in "standalone" mode, not >> with zookeeper support: it needs adding the --zk=zk://10.xx.xx.124:2181/mesos >> --quorum=1 options (at the very least). >> >> As you correctly noted, the contents of the /mesos znode is empty and >> thus the agent nodes cannot find elected Master leader (also, if you are >> running more than one Master, they won't 'know' about each other and won't >> be able to elect a leader). >> >> To check that your settings work, you can (a) look in Master logs (it >> will log a lot of info when connecting to ZK) and (b) see that under /mesos >> a number of json.info_nn nodes will appear (whose contents are JSON so >> you can double check that the contents make sense). >> >> You can find more info here[0]. >> >> [0] >> http://codetrips.com/2015/08/16/apache-mesos-leader-master-discovery-using-zookeeper-part-2/ >> >> *Marco Massenzio* >> >> *Distributed Systems Engineerhttp://codetrips.com <http://codetrips.com>* >> >> On Fri, Sep 4, 2015 at 5:33 PM, Stephen Boesch <java...@gmail.com> wrote: >> >>> >>> I installed using yum -y install mesos. That did work. >>> >>> Now the master and slaves do not see each other. 
>>> >>> >>> Here is the master: >>> $ ps -ef | grep mesos | grep -v grep >>> stack30236 17902 0 00:09 pts/400:00:04 >>> /mnt/mesos/build/src/.libs/lt-mesos-master --work_dir=/tmp/mesos >>> --ip=10.xx.xx.124 >>> >>> >>> Here is one of the 20 slaves: >>> >>> ps -ef | grep mesos | grep -v grep >>> root 26086 1 0 00:10 ?00:00:00 /usr/sbin/mesos-slave >>> --master=zk://10.xx.xx.124:2181/mesos --log_dir=/var/log/mesos >>> root 26092 26086 0 00:10 ?00:00:00 logger -p user.info -t >>> mesos-slave[26086] >>> root 26093 26086 0 00:10 ?00:00:00 logger -p user.err -t >>> mesos-slave[26086] >>> >>> >>> Note the slave and master are on correct same ip address >>> >>> The /etc/mesos/zk seems to be set properly : and I do see the /mesos >>> node in zookeeper is updated after restarting the master >>> >>> However the zookeeper node is empty: >>> >>> [zk: localhost:2181(CONNECTED) 10] ls /mesos >>> [] >>> >>> The node is world accessible so no permission issue: >>> >>> [zk: localhost:2181(CONNECTED) 12] getAcl /mesos >>> 'world,'anyone >>> : cdrwa >>> >>> Why is the zookeeper node empty? Is this the reason the master and >>> slaves are not connecting? >>> >>> 2015-09-04 14:56 GMT-07:00 craig w <codecr...@gmail.com>: >>> >>>> No problem, they have a "downloads" link inn their menu: >>>> https://mesosphere.com/downloads/ >>>> On Sep 4, 2015 5:43 PM, "Stephen Boesch" <java...@gmail.com> wrote: >>>> >>>>> @Craig . That is an incomplete answer - given that such links are not >>>>> presented in an obvious manner . Maybe you managed to find a link on >>>>> their site that provides prebuilt for Centos7: if so then please share it. >>>>> >>>>> >>>>> I had previously found a link on their site for prebuilt binaries but >>>>> is based on using CDH4 (which is not possible for my company). It is also >>>>> old. >>>>> >>>>> https://docs.mesosphere.com/tutorials/install_centos_rhel/ >>>>> >>>>> >>>>> 2015-09-04 14:27 GMT-07:00 craig w <codecr...@gmail.com>: >>>>> >>>>>> Mesosphere has pack
Re: Basic installation question
I think you are looking into the wrong bin/ folder (the one under top-level mesos/) - the actual binaries are in ${MESOS_HOME}/bin/build I am positive that the instructions work on CentOS 7.1 as I had to run all those recently on a VM of mine. BTW - If you are looking for the libmesos and various includes, they will be under /usr/local (you can change that by using something like: ../configure --prefix /path/to/install/dir *Marco Massenzio* *Distributed Systems Engineerhttp://codetrips.com <http://codetrips.com>* On Fri, Sep 4, 2015 at 2:10 PM, Stephen Boesch <java...@gmail.com> wrote: > > After following the directions here: > http://mesos.apache.org/gettingstarted/ > > Which for centos7 includes the following: > > > > > # Change working directory. > $ cd mesos > > # Bootstrap (Only required if building from git repository). > $ ./bootstrap > > # Configure and build. > $ mkdir build > $ cd build > $ ../configure > $ make > > In order to speed up the build and reduce verbosity of the logs, you can > append-j V=0 to make. > > # Run test suite. > $ make check > > # Install (Optional). > $ make install > > > > But the installation is not correct afterwards: here is the bin directory: > > $ ll bin > total 92 > -rw-r--r--. 1 stack stack 1769 Jul 17 23:14 valgrind-mesos-tests.sh.in > -rw-r--r--. 1 stack stack 1769 Jul 17 23:14 valgrind-mesos-slave.sh.in > -rw-r--r--. 1 stack stack 1772 Jul 17 23:14 valgrind-mesos-master.sh.in > -rw-r--r--. 1 stack stack 1769 Jul 17 23:14 valgrind-mesos-local.sh.in > -rw-r--r--. 1 stack stack 1026 Jul 17 23:14 mesos-tests.sh.in > -rw-r--r--. 1 stack stack 901 Jul 17 23:14 mesos-tests-flags.sh.in > -rw-r--r--. 1 stack stack 1019 Jul 17 23:14 mesos-slave.sh.in > -rw-r--r--. 1 stack stack 1721 Jul 17 23:14 mesos-slave-flags.sh.in > -rw-r--r--. 1 stack stack 1366 Jul 17 23:14 mesos.sh.in > -rw-r--r--. 1 stack stack 1026 Jul 17 23:14 mesos-master.sh.in > -rw-r--r--. 1 stack stack 858 Jul 17 23:14 mesos-master-flags.sh.in > -rw-r--r--. 
1 stack stack 1023 Jul 17 23:14 mesos-local.sh.in > -rw-r--r--. 1 stack stack 935 Jul 17 23:14 mesos-local-flags.sh.in > -rw-r--r--. 1 stack stack 1466 Jul 17 23:14 lldb-mesos-tests.sh.in > -rw-r--r--. 1 stack stack 1489 Jul 17 23:14 lldb-mesos-slave.sh.in > -rw-r--r--. 1 stack stack 1492 Jul 17 23:14 lldb-mesos-master.sh.in > -rw-r--r--. 1 stack stack 1489 Jul 17 23:14 lldb-mesos-local.sh.in > -rw-r--r--. 1 stack stack 1498 Jul 17 23:14 gdb-mesos-tests.sh.in > -rw-r--r--. 1 stack stack 1527 Jul 17 23:14 gdb-mesos-slave.sh.in > -rw-r--r--. 1 stack stack 1530 Jul 17 23:14 gdb-mesos-master.sh.in > -rw-r--r--. 1 stack stack 1521 Jul 17 23:14 gdb-mesos-local.sh.in > drwxr-xr-x. 2 stack stack 4096 Jul 17 23:21 . > drwxr-xr-x. 11 stack stack 4096 Sep 4 20:08 .. > > So .. two things: > > (a) what is missing from the installation instructions? > > (b) Is there an *up to date *rpm/yum installation for centos7? > > > > > > >
Re: Basic installation question
argh - sorry! ${MESOS_HOME}/build/bin (I'd mixed the two around) *Marco Massenzio* *Distributed Systems Engineerhttp://codetrips.com <http://codetrips.com>* On Fri, Sep 4, 2015 at 2:39 PM, Marco Massenzio <ma...@mesosphere.io> wrote: > I think you are looking into the wrong bin/ folder (the one under > top-level mesos/) - the actual binaries are in ${MESOS_HOME}/bin/build > > I am positive that the instructions work on CentOS 7.1 as I had to run all > those recently on a VM of mine. > > BTW - If you are looking for the libmesos and various includes, they will > be under /usr/local (you can change that by using something like: > > ../configure --prefix /path/to/install/dir > > > > *Marco Massenzio* > > *Distributed Systems Engineerhttp://codetrips.com <http://codetrips.com>* > > On Fri, Sep 4, 2015 at 2:10 PM, Stephen Boesch <java...@gmail.com> wrote: > >> >> After following the directions here: >> http://mesos.apache.org/gettingstarted/ >> >> Which for centos7 includes the following: >> >> >> >> >> # Change working directory. >> $ cd mesos >> >> # Bootstrap (Only required if building from git repository). >> $ ./bootstrap >> >> # Configure and build. >> $ mkdir build >> $ cd build >> $ ../configure >> $ make >> >> In order to speed up the build and reduce verbosity of the logs, you can >> append-j V=0 to make. >> >> # Run test suite. >> $ make check >> >> # Install (Optional). >> $ make install >> >> >> >> But the installation is not correct afterwards: here is the bin directory: >> >> $ ll bin >> total 92 >> -rw-r--r--. 1 stack stack 1769 Jul 17 23:14 valgrind-mesos-tests.sh.in >> -rw-r--r--. 1 stack stack 1769 Jul 17 23:14 valgrind-mesos-slave.sh.in >> -rw-r--r--. 1 stack stack 1772 Jul 17 23:14 valgrind-mesos-master.sh.in >> -rw-r--r--. 1 stack stack 1769 Jul 17 23:14 valgrind-mesos-local.sh.in >> -rw-r--r--. 1 stack stack 1026 Jul 17 23:14 mesos-tests.sh.in >> -rw-r--r--. 1 stack stack 901 Jul 17 23:14 mesos-tests-flags.sh.in >> -rw-r--r--. 
1 stack stack 1019 Jul 17 23:14 mesos-slave.sh.in >> -rw-r--r--. 1 stack stack 1721 Jul 17 23:14 mesos-slave-flags.sh.in >> -rw-r--r--. 1 stack stack 1366 Jul 17 23:14 mesos.sh.in >> -rw-r--r--. 1 stack stack 1026 Jul 17 23:14 mesos-master.sh.in >> -rw-r--r--. 1 stack stack 858 Jul 17 23:14 mesos-master-flags.sh.in >> -rw-r--r--. 1 stack stack 1023 Jul 17 23:14 mesos-local.sh.in >> -rw-r--r--. 1 stack stack 935 Jul 17 23:14 mesos-local-flags.sh.in >> -rw-r--r--. 1 stack stack 1466 Jul 17 23:14 lldb-mesos-tests.sh.in >> -rw-r--r--. 1 stack stack 1489 Jul 17 23:14 lldb-mesos-slave.sh.in >> -rw-r--r--. 1 stack stack 1492 Jul 17 23:14 lldb-mesos-master.sh.in >> -rw-r--r--. 1 stack stack 1489 Jul 17 23:14 lldb-mesos-local.sh.in >> -rw-r--r--. 1 stack stack 1498 Jul 17 23:14 gdb-mesos-tests.sh.in >> -rw-r--r--. 1 stack stack 1527 Jul 17 23:14 gdb-mesos-slave.sh.in >> -rw-r--r--. 1 stack stack 1530 Jul 17 23:14 gdb-mesos-master.sh.in >> -rw-r--r--. 1 stack stack 1521 Jul 17 23:14 gdb-mesos-local.sh.in >> drwxr-xr-x. 2 stack stack 4096 Jul 17 23:21 . >> drwxr-xr-x. 11 stack stack 4096 Sep 4 20:08 .. >> >> So .. two things: >> >> (a) what is missing from the installation instructions? >> >> (b) Is there an *up to date *rpm/yum installation for centos7? >> >> >> >> >> >> >> >
Re: Basic installation question
Hey Stephen, the Mesos packages for download from Mesosphere are available here: https://mesosphere.com/downloads/ (for Mesos, just click on the Getting Started button - sorry, no direct URL - it will show the steps to install on the supported distros using apt-get/yum). Those work and I obviously recommend them :) But I think you wanted the "full developer experience" as you pointed to the make steps. Also, if you haven't looked at the tutorials in a while (as you seem to imply in your message) I would recommend you give them another shot: we've been doing some work on revamping them and making them more accessible. *Marco Massenzio* *Distributed Systems Engineerhttp://codetrips.com <http://codetrips.com>* On Fri, Sep 4, 2015 at 2:38 PM, Stephen Boesch <java...@gmail.com> wrote: > @Craig . That is an incomplete answer - given that such links are not > presented in an obvious manner . Maybe you managed to find a link on > their site that provides prebuilt for Centos7: if so then please share it. > > > I had previously found a link on their site for prebuilt binaries but is > based on using CDH4 (which is not possible for my company). It is also old. > > https://docs.mesosphere.com/tutorials/install_centos_rhel/ > > > 2015-09-04 14:27 GMT-07:00 craig w <codecr...@gmail.com>: > >> Mesosphere has packages prebuilt, go to their site to find how to install >> On Sep 4, 2015 5:11 PM, "Stephen Boesch" <java...@gmail.com> wrote: >> >>> >>> After following the directions here: >>> http://mesos.apache.org/gettingstarted/ >>> >>> Which for centos7 includes the following: >>> >>> >>> >>> >>> # Change working directory. >>> $ cd mesos >>> >>> # Bootstrap (Only required if building from git repository). >>> $ ./bootstrap >>> >>> # Configure and build. >>> $ mkdir build >>> $ cd build >>> $ ../configure >>> $ make >>> >>> In order to speed up the build and reduce verbosity of the logs, you can >>> append-j V=0 to make. >>> >>> # Run test suite. 
>>> $ make check >>> >>> # Install (Optional). >>> $ make install >>> >>> >>> >>> But the installation is not correct afterwards: here is the bin >>> directory: >>> >>> $ ll bin >>> total 92 >>> -rw-r--r--. 1 stack stack 1769 Jul 17 23:14 valgrind-mesos-tests.sh.in >>> -rw-r--r--. 1 stack stack 1769 Jul 17 23:14 valgrind-mesos-slave.sh.in >>> -rw-r--r--. 1 stack stack 1772 Jul 17 23:14 valgrind-mesos-master.sh.in >>> -rw-r--r--. 1 stack stack 1769 Jul 17 23:14 valgrind-mesos-local.sh.in >>> -rw-r--r--. 1 stack stack 1026 Jul 17 23:14 mesos-tests.sh.in >>> -rw-r--r--. 1 stack stack 901 Jul 17 23:14 mesos-tests-flags.sh.in >>> -rw-r--r--. 1 stack stack 1019 Jul 17 23:14 mesos-slave.sh.in >>> -rw-r--r--. 1 stack stack 1721 Jul 17 23:14 mesos-slave-flags.sh.in >>> -rw-r--r--. 1 stack stack 1366 Jul 17 23:14 mesos.sh.in >>> -rw-r--r--. 1 stack stack 1026 Jul 17 23:14 mesos-master.sh.in >>> -rw-r--r--. 1 stack stack 858 Jul 17 23:14 mesos-master-flags.sh.in >>> -rw-r--r--. 1 stack stack 1023 Jul 17 23:14 mesos-local.sh.in >>> -rw-r--r--. 1 stack stack 935 Jul 17 23:14 mesos-local-flags.sh.in >>> -rw-r--r--. 1 stack stack 1466 Jul 17 23:14 lldb-mesos-tests.sh.in >>> -rw-r--r--. 1 stack stack 1489 Jul 17 23:14 lldb-mesos-slave.sh.in >>> -rw-r--r--. 1 stack stack 1492 Jul 17 23:14 lldb-mesos-master.sh.in >>> -rw-r--r--. 1 stack stack 1489 Jul 17 23:14 lldb-mesos-local.sh.in >>> -rw-r--r--. 1 stack stack 1498 Jul 17 23:14 gdb-mesos-tests.sh.in >>> -rw-r--r--. 1 stack stack 1527 Jul 17 23:14 gdb-mesos-slave.sh.in >>> -rw-r--r--. 1 stack stack 1530 Jul 17 23:14 gdb-mesos-master.sh.in >>> -rw-r--r--. 1 stack stack 1521 Jul 17 23:14 gdb-mesos-local.sh.in >>> drwxr-xr-x. 2 stack stack 4096 Jul 17 23:21 . >>> drwxr-xr-x. 11 stack stack 4096 Sep 4 20:08 .. >>> >>> So .. two things: >>> >>> (a) what is missing from the installation instructions? >>> >>> (b) Is there an *up to date *rpm/yum installation for centos7? >>> >>> >>> >>> >>> >>> >>> >
Re: [VOTE] Release Apache Mesos 0.24.0 (rc2)
+1 (non-binding) All tests (including ROOT) pass on Ubuntu 14.04 All tests pass on CentOS 7.1; ROOT tests cause 1 failure: [ FAILED ] 1 test, listed below: [ FAILED ] CgroupsAnyHierarchyMemoryPressureTest.ROOT_IncreaseRSS $ cat /etc/centos-release CentOS Linux release 7.1.1503 (Core) This seems to be new[0], but possibly related to some limitation/setting of my test machine (VirtualBox VM, running 2 CPUs on Ubuntu host). Interestingly enough, I don't see the 4 failures as Vaibhav, but in my log it shows *YOU HAVE 11 DISABLED TESTS* (he has 12). [0] https://issues.apache.org/jira/issues/?filter=12333150 *Marco Massenzio* *Distributed Systems Engineerhttp://codetrips.com <http://codetrips.com>* On Tue, Sep 1, 2015 at 5:45 PM, Vinod Kone <vinodk...@apache.org> wrote: > Hi all, > > > Please vote on releasing the following candidate as Apache Mesos 0.24.0. > > > 0.24.0 includes the following: > > > > > Experimental support for v1 scheduler HTTP API! > > This release also wraps up support for fetcher. 
> > The CHANGELOG for the release is available at: > > > https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.24.0-rc2 > > > > > > The candidate for Mesos 0.24.0 release is available at: > > https://dist.apache.org/repos/dist/dev/mesos/0.24.0-rc2/mesos-0.24.0.tar.gz > > > The tag to be voted on is 0.24.0-rc2: > > https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.24.0-rc2 > > > The MD5 checksum of the tarball can be found at: > > > https://dist.apache.org/repos/dist/dev/mesos/0.24.0-rc2/mesos-0.24.0.tar.gz.md5 > > > The signature of the tarball can be found at: > > > https://dist.apache.org/repos/dist/dev/mesos/0.24.0-rc2/mesos-0.24.0.tar.gz.asc > > > The PGP key used to sign the release is here: > > https://dist.apache.org/repos/dist/release/mesos/KEYS > > > The JAR is up in Maven in a staging repository here: > > https://repository.apache.org/content/repositories/orgapachemesos-1066 > > > Please vote on releasing this package as Apache Mesos 0.24.0! > > > The vote is open until Fri Sep 4 17:33:05 PDT 2015 and passes if a > majority of at least 3 +1 PMC votes are cast. > > > [ ] +1 Release this package as Apache Mesos 0.24.0 > > [ ] -1 Do not release this package because ... > > > Thanks, > > Vinod >
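The vote email above asks reviewers to verify the tarball against its MD5 checksum and PGP signature (normally done with `md5sum -c` and `gpg --verify` against the published KEYS file). As a self-contained sketch, here is the checksum half in Python; the tarball is a locally created stand-in file, since downloading the real artifact is outside the scope of the example.

```python
# Sketch: verifying a release tarball against its .md5 file. The file
# names match the vote email; the contents are a stand-in so the
# snippet runs on its own.
import hashlib

def md5_hex(path):
    """Compute the hex MD5 digest of a file, reading in 64 KiB blocks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 16), b""):
            h.update(block)
    return h.hexdigest()

# Stand-in for the downloaded artifacts:
with open("mesos-0.24.0.tar.gz", "wb") as f:
    f.write(b"release contents")
with open("mesos-0.24.0.tar.gz.md5", "w") as f:
    f.write(md5_hex("mesos-0.24.0.tar.gz"))

# The actual verification step: recompute and compare.
expected = open("mesos-0.24.0.tar.gz.md5").read().split()[0]
assert md5_hex("mesos-0.24.0.tar.gz") == expected, "checksum mismatch"
print("checksum OK")
```

For the real release, the PGP check is the part the checksum cannot replace: `gpg --import KEYS` followed by `gpg --verify mesos-0.24.0.tar.gz.asc mesos-0.24.0.tar.gz`.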
Re: mesos-slave crashing with CHECK_SOME
@Steven - agreed! As mentioned, if we can reduce the "footprint of unnecessary CHECKs" (so to speak) I'm all for it - let's document and add Jiras for that, by all means. @Scott - LoL: you certainly didn't; I was more worried my email would ;-) Thanks, guys! *Marco Massenzio* *Distributed Systems Engineerhttp://codetrips.com <http://codetrips.com>* On Wed, Sep 2, 2015 at 10:59 AM, Steven Schlansker < sschlans...@opentable.com> wrote: > I 100% agree with your philosophy here, and I suspect it's something > shared in the Mesos community. > > I just think that we can restrict the domain of the failure to a smaller > reasonable window -- once you are in the context of "I am doing work to > launch a specific task", there is a well defined "success / failure / here > is an error message" path defined already. Users expect tasks to fail and > can see the errors. > > I think that a lot of these assertions are in fact more appropriate as > task failures. But I agree that they should be fatal to *some* part of the > system, just not the whole agent entirely. > > On Sep 1, 2015, at 4:33 PM, Marco Massenzio <ma...@mesosphere.io> wrote: > > > That's one of those areas for discussions that is so likely to generate > a flame war that I'm hesitant to wade in :) > > > > In general, I would agree with the sentiment expressed there: > > > > > If the task fails, that is unfortunate, but not the end of the world. > Other tasks should not be affected. > > > > which is, in fact, to large extent exactly what Mesos does; the example > given in MESOS-2684, as it happens, is for a "disk full failure" - carrying > on as if nothing had happened, is only likely to lead to further (and > worse) disappointment. > > > > The general philosophy back at Google (and which certainly informs the > design of Borg[0]) was "fail early, fail hard" so that either (a) the > service is restarted and hopefully the root cause cleared or (b) someone > (who can hopefully do something) will be alerted about it. 
> > > > I think it's ultimately a matter of scale: up to a few tens of servers, > you can assume there is some sort of 'log-monitor' that looks out for > errors and other anomalies and alerts humans that will then take a look and > possibly apply some corrective action - when you're up to hundreds or > thousands (definitely Mesos territory) that's not practical: the system > should either self-heal or crash-and-restart. > > > > All this to say, that it's difficult to come up with a general > *automated* approach to unequivocally decide if a failure is "fatal" or > could just be safely "ignored" (after appropriate error logging) - in > general, when in doubt it's probably safer to "noisily crash & restart" and > rely on the overall system's HA architecture to take care of replication > and consistency. > > (and an intelligent monitoring system that only alerts when some failure > threshold is exceeded). > > > > From what I've seen so far (granted, still a novice here) it seems that > Mesos subscribes to this notion, assuming that Agent Nodes will come and > go, and usually Tasks survive (for a certain amount of time anyway) a Slave > restart (obviously, if the physical h/w is the ultimate cause of failure, > well, then all bets are off). > > > > Having said all that - if there are areas where we have been over-eager > with our CHECKs, we should definitely revisit that and make it more > crash-resistant, absolutely. 
> > > > [0] http://research.google.com/pubs/pub43438.html > > > > Marco Massenzio > > Distributed Systems Engineer > > http://codetrips.com > > > > On Mon, Aug 31, 2015 at 12:47 PM, Steven Schlansker < > sschlans...@opentable.com> wrote: > > > > > > On Aug 31, 2015, at 11:54 AM, Scott Rankin <sran...@motus.com> wrote: > > > > > > tag=mesos-slave[12858]: F0831 09:37:29.838184 12898 slave.cpp:3354] > CHECK_SOME(os::touch(path)): Failed to open file: No such file or directory > > > > I reported a similar bug a while back: > > > > https://issues.apache.org/jira/browse/MESOS-2684 > > > > This seems to be a class of bugs where some filesystem operations which > may fail for unforeseen reasons are written as assertions which crash the > process, rather than failing only the task and communicating back the error > reason. > > > > > > > >
Re: [VOTE] Release Apache Mesos 0.24.0 (rc1)
Hey guys, just a quick note to bring back the conversation on track to the 0.24-RC1 release. Is my understanding correct that there are currently no binding -1's? @Vinod: what do you think, are we good to release? Thanks! *Marco Massenzio* *Distributed Systems Engineerhttp://codetrips.com <http://codetrips.com>* On Tue, Sep 1, 2015 at 1:49 AM, Dario Rexin <dario.re...@me.com> wrote: > One more question. From the Mesos code it doesn’t look like events are > being split or combined, so given I have a client that gives me access to > the individual chunks, is it safe to assume that each chunk contains > exactly one event? Because that would make parsing the events a lot easier > for me. > > Thanks, > Dario > > On Sep 1, 2015, at 8:42 AM, dario.re...@me.com wrote: > > Hi Vinod, > > thanks for the explanation, I got it now. > > Thanks, > Dario > > On 31.08.2015, at 23:47, Vinod Kone <vinodk...@apache.org> wrote: > > I think you might be confused with the HTTP chunked encoding and RecordIO > encoding. Most HTTP client libraries dechunk the stream before presenting > it to the application. So the application needs to know the encoding of the > dechunked data to be able to process it. > > In Mesos's case, the server (master here) can encode it in JSON or > Protobuf. We wanted to have a consistent way to encode both these formats > and Record-IO format was the one we settled on. Note that this format is > also used by the Twitter streaming API > <https://dev.twitter.com/streaming/overview/processing> (see delimited > messages section). > > HTH, > > On Mon, Aug 31, 2015 at 2:09 PM, Dario Rexin <dario.re...@me.com> wrote: > >> Hi Vino, >> >> On Aug 31, 2015, at 9:36 PM, Vinod Kone <vinodk...@apache.org> wrote: >> >> Hi Dario, >> >> Can you test with "curl --no-buffer" option? Looks like your stdout might >> be line-buffered. >> >> >> that did the trick, thanks! 
>> >> >> The reason we used record-io formatting is to be consistent in how we >> stream protobuf and json encoded data. >> >> >> How does simple chunked encoding prevent you from doing this? >> >> Thanks, >> Dario >> >> On Fri, Aug 28, 2015 at 2:04 PM, <dario.re...@me.com> wrote: >> >>> Anand, >>> >>> thanks for the explanation. I didn't think about the case when you have >>> to split a message, now it makes sense. >>> >>> But the case I observed with curl is still weird. Even when splitting a >>> message, it should still receive both parts almost at the same time. Do you >>> have any idea why it could behave like this? >>> >>> On 28.08.2015, at 21:31, Anand Mazumdar <an...@mesosphere.io> wrote: >>> >>> Dario, >>> >>> Most HTTP libraries/parsers ( including one that Mesos uses internally ) >>> provide a way to specify a default size of each chunk. If a Mesos Event is >>> too big , it would get split into smaller chunks and vice-versa. >>> >>> -anand >>> >>> On Aug 28, 2015, at 11:51 AM, dario.re...@me.com wrote: >>> >>> Anand, >>> >>> in the example from my first mail you can see that curl prints the size >>> of a message and then waits for the next message and only when it receives >>> that message it will print the prior message plus the size of the next >>> message, but not the actual message. >>> >>> What's the benefit of encoding multiple messages in a single chunk? You >>> could simply create a single chunk per event. >>> >>> Cheers, >>> Dario >>> >>> On 28.08.2015, at 19:43, Anand Mazumdar <an...@mesosphere.io> wrote: >>> >>> Dario, >>> >>> Can you shed a bit more light on what you still find puzzling about the >>> CURL behavior after my explanation ? >>> >>> PS: A single HTTP chunk can have 0 or more Mesos (Scheduler API) Events. >>> So in your example, the first chunk had complete information about the >>> first “event”, followed by partial information about the subsequent event >>> from another chunk. 
>>> >>> As for the benefit of using RecordIO format here, how else do you think >>> we could have de-marcated two events in the response ? >>> >>> -anand >>> >>> >>> On Aug 28, 2015, at 10:01 AM, dario.re...@me.com wrote: >>> >>> Anand, >>> >>> thanks for the explanation. I'm still a little puzzled why curl behaves >>>
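The RecordIO framing discussed in this thread (and described in the Mesos scheduler HTTP API documentation) prefixes each event with its length in base-10 ASCII followed by a newline. A minimal sketch of a de-framing parser, showing why events and HTTP chunks need not line up one-to-one:

```python
# Sketch of a RecordIO de-framer for the Mesos streaming HTTP API.
# Each record is "<decimal length>\n<payload>". A single HTTP chunk may
# carry a partial record or several records, so the parser buffers
# across feed() calls.

class RecordIOParser:
    def __init__(self):
        self._buf = b""

    def feed(self, data):
        """Consume raw bytes; return the payloads of all completed records."""
        self._buf += data
        records = []
        while True:
            newline = self._buf.find(b"\n")
            if newline == -1:
                break  # length prefix not complete yet
            length = int(self._buf[:newline])
            start = newline + 1
            if len(self._buf) < start + length:
                break  # record body not complete yet
            records.append(self._buf[start:start + length])
            self._buf = self._buf[start + length:]
        return records

# Two events split arbitrarily across "chunks":
p = RecordIOParser()
assert p.feed(b"5\nhel") == []            # partial record: nothing yet
assert p.feed(b"lo6\nwor") == [b"hello"]  # first record completes
assert p.feed(b"ld!") == [b"world!"]
```

Because `feed()` buffers partial input, it handles both a record split across two chunks and several records arriving in one chunk, which is exactly the mismatch Dario observed with curl.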
Re: [VOTE] Release Apache Mesos 0.24.0 (rc1)
Cool - I'll ping Joseph on that one. (the -1 from Nik was related to the known ROOT test issues that -if memory serves- we agreed were non-blocking: I'll follow up with him too) *Marco Massenzio* *Distributed Systems Engineerhttp://codetrips.com <http://codetrips.com>* On Tue, Sep 1, 2015 at 10:49 AM, Vinod Kone <vinodk...@apache.org> wrote: > Thanks for the nudge Marco. > > There was a binding -1 from Niklas. > > I'm planning to cut an RC2. The cherry picks I've selected so far are in > MESOS-2562 <https://issues.apache.org/jira/browse/MESOS-2562>. > > The only one I'm currently waiting on to get a resolution for is > https://issues.apache.org/jira/browse/MESOS-3345. > > On Tue, Sep 1, 2015 at 10:44 AM, Marco Massenzio <ma...@mesosphere.io> > wrote: > >> Hey guys, >> >> just a quick note to bring back the conversation on track to the 0.24-RC1 >> release. >> Is my understanding correct that there are currently no binding -1's? >> >> @Vinod: what do you think, are we good to release? >> >> Thanks! >> >> *Marco Massenzio* >> >> *Distributed Systems Engineerhttp://codetrips.com <http://codetrips.com>* >> >> On Tue, Sep 1, 2015 at 1:49 AM, Dario Rexin <dario.re...@me.com> wrote: >> >>> One more question. From the Mesos code it doesn’t look like events are >>> being split or combined, so given I have a client that gives me access to >>> the individual chunks, is it safe to assume that each chunk contains >>> exactly one event? Because that would make parsing the events a lot easier >>> for me. >>> >>> Thanks, >>> Dario >>> >>> On Sep 1, 2015, at 8:42 AM, dario.re...@me.com wrote: >>> >>> Hi Vinod, >>> >>> thanks for the explanation, I got it now. >>> >>> Thanks, >>> Dario >>> >>> On 31.08.2015, at 23:47, Vinod Kone <vinodk...@apache.org> wrote: >>> >>> I think you might be confused with the HTTP chunked encoding and >>> RecordIO encoding. Most HTTP client libraries dechunk the stream before >>> presenting it to the application. 
So the application needs to know the >>> encoding of the dechunked data to be able to process it. >>> >>> In Mesos's case, the server (master here) can encode it in JSON or >>> Protobuf. We wanted to have a consistent way to encode both these formats >>> and Record-IO format was the one we settled on. Note that this format is >>> also used by the Twitter streaming API >>> <https://dev.twitter.com/streaming/overview/processing> (see delimited >>> messages section). >>> >>> HTH, >>> >>> On Mon, Aug 31, 2015 at 2:09 PM, Dario Rexin <dario.re...@me.com> wrote: >>> >>>> Hi Vino, >>>> >>>> On Aug 31, 2015, at 9:36 PM, Vinod Kone <vinodk...@apache.org> wrote: >>>> >>>> Hi Dario, >>>> >>>> Can you test with "curl --no-buffer" option? Looks like your stdout >>>> might be line-buffered. >>>> >>>> >>>> that did the trick, thanks! >>>> >>>> >>>> The reason we used record-io formatting is to be consistent in how we >>>> stream protobuf and json encoded data. >>>> >>>> >>>> How does simple chunked encoding prevent you from doing this? >>>> >>>> Thanks, >>>> Dario >>>> >>>> On Fri, Aug 28, 2015 at 2:04 PM, <dario.re...@me.com> wrote: >>>> >>>>> Anand, >>>>> >>>>> thanks for the explanation. I didn't think about the case when you >>>>> have to split a message, now it makes sense. >>>>> >>>>> But the case I observed with curl is still weird. Even when splitting >>>>> a message, it should still receive both parts almost at the same time. Do >>>>> you have any idea why it could behave like this? >>>>> >>>>> On 28.08.2015, at 21:31, Anand Mazumdar <an...@mesosphere.io> wrote: >>>>> >>>>> Dario, >>>>> >>>>> Most HTTP libraries/parsers ( including one that Mesos uses internally >>>>> ) provide a way to specify a default size of each chunk. If a Mesos Event >>>>> is too big , it would get split into smaller chunks and vice-versa. >>>>> >>>>> -anand >>>>> >>>>> On Aug 28, 2015, at 11:51 AM, dario.
Re: [VOTE] Release Apache Mesos 0.24.0 (rc1)
Awesome - we'll be running RC2 through our CI env and let you know the outcome soon as we know. Thanks! *Marco Massenzio* *Distributed Systems Engineerhttp://codetrips.com <http://codetrips.com>* On Tue, Sep 1, 2015 at 11:42 AM, Vinod Kone <vinodk...@apache.org> wrote: > My only concern is that if we decide to change the protobuf -> json > conversion for int64 (from json number to string), we should do that in the > scheduler http api as well (Resource protobuf uses int64 for ports). > > But since the scheduler http api is labeled beta for 0.24, we can still > change the semantics in 0.25. > > So, I'll go ahead and call the vote for RC2 today. > > On Tue, Sep 1, 2015 at 11:05 AM, Marco Massenzio <ma...@mesosphere.io> > wrote: > >> Cool - I'll ping Joseph on that one. >> >> (the -1 from Nik was related to the known ROOT test issues that -if >> memory serves- we agreed were non-blocking: I'll follow up with him too) >> >> *Marco Massenzio* >> >> *Distributed Systems Engineerhttp://codetrips.com <http://codetrips.com>* >> >> On Tue, Sep 1, 2015 at 10:49 AM, Vinod Kone <vinodk...@apache.org> wrote: >> >>> Thanks for the nudge Marco. >>> >>> There was a binding -1 from Niklas. >>> >>> I'm planning to cut an RC2. The cherry picks I've selected so far are in >>> MESOS-2562 <https://issues.apache.org/jira/browse/MESOS-2562>. >>> >>> The only one I'm currently waiting on to get a resolution for is >>> https://issues.apache.org/jira/browse/MESOS-3345. >>> >>> On Tue, Sep 1, 2015 at 10:44 AM, Marco Massenzio <ma...@mesosphere.io> >>> wrote: >>> >>>> Hey guys, >>>> >>>> just a quick note to bring back the conversation on track to the >>>> 0.24-RC1 release. >>>> Is my understanding correct that there are currently no binding -1's? >>>> >>>> @Vinod: what do you think, are we good to release? >>>> >>>> Thanks! 
>>>> >>>> *Marco Massenzio* >>>> >>>> *Distributed Systems Engineerhttp://codetrips.com >>>> <http://codetrips.com>* >>>> >>>> On Tue, Sep 1, 2015 at 1:49 AM, Dario Rexin <dario.re...@me.com> wrote: >>>> >>>>> One more question. From the Mesos code it doesn’t look like events are >>>>> being split or combined, so given I have a client that gives me access to >>>>> the individual chunks, is it safe to assume that each chunk contains >>>>> exactly one event? Because that would make parsing the events a lot easier >>>>> for me. >>>>> >>>>> Thanks, >>>>> Dario >>>>> >>>>> On Sep 1, 2015, at 8:42 AM, dario.re...@me.com wrote: >>>>> >>>>> Hi Vinod, >>>>> >>>>> thanks for the explanation, I got it now. >>>>> >>>>> Thanks, >>>>> Dario >>>>> >>>>> On 31.08.2015, at 23:47, Vinod Kone <vinodk...@apache.org> wrote: >>>>> >>>>> I think you might be confused with the HTTP chunked encoding and >>>>> RecordIO encoding. Most HTTP client libraries dechunk the stream before >>>>> presenting it to the application. So the application needs to know the >>>>> encoding of the dechunked data to be able to process it. >>>>> >>>>> In Mesos's case, the server (master here) can encode it in JSON or >>>>> Protobuf. We wanted to have a consistent way to encode both these formats >>>>> and Record-IO format was the one we settled on. Note that this format is >>>>> also used by the Twitter streaming API >>>>> <https://dev.twitter.com/streaming/overview/processing> (see >>>>> delimited messages section). >>>>> >>>>> HTH, >>>>> >>>>> On Mon, Aug 31, 2015 at 2:09 PM, Dario Rexin <dario.re...@me.com> >>>>> wrote: >>>>> >>>>>> Hi Vino, >>>>>> >>>>>> On Aug 31, 2015, at 9:36 PM, Vinod Kone <vinodk...@apache.org> wrote: >>>>>> >>>>>> Hi Dario, >>>>>> >>>>>> Can you test with "curl --no-buffer" option? Looks like your stdout >>>>>> might be line-buffered. >>>>>> >>>>>> >>>>>> that did the trick, thanks! >&
Re: mesos-slave crashing with CHECK_SOME
That's one of those areas for discussions that is so likely to generate a flame war that I'm hesitant to wade in :) In general, I would agree with the sentiment expressed there: > If the task fails, that is unfortunate, but not the end of the world. Other tasks should not be affected. which is, in fact, to a large extent exactly what Mesos does; the example given in MESOS-2684, as it happens, is for a "disk full failure" - carrying on as if nothing had happened is only likely to lead to further (and worse) disappointment. The general philosophy back at Google (and which certainly informs the design of Borg[0]) was "fail early, fail hard" so that either (a) the service is restarted and hopefully the root cause cleared or (b) someone (who can hopefully do something) will be alerted about it. I think it's ultimately a matter of scale: up to a few tens of servers, you can assume there is some sort of 'log-monitor' that looks out for errors and other anomalies and alerts humans that will then take a look and possibly apply some corrective action - when you're up to hundreds or thousands (definitely Mesos territory) that's not practical: the system should either self-heal or crash-and-restart. All this to say that it's difficult to come up with a general *automated* approach to unequivocally decide if a failure is "fatal" or could just be safely "ignored" (after appropriate error logging) - in general, when in doubt it's probably safer to "noisily crash & restart" and rely on the overall system's HA architecture to take care of replication and consistency. (and an intelligent monitoring system that only alerts when some failure threshold is exceeded). From what I've seen so far (granted, still a novice here) it seems that Mesos subscribes to this notion, assuming that Agent Nodes will come and go, and usually Tasks survive (for a certain amount of time anyway) a Slave restart (obviously, if the physical h/w is the ultimate cause of failure, well, then all bets are off).
Having said all that - if there are areas where we have been over-eager with our CHECKs, we should definitely revisit that and make it more crash-resistant, absolutely. [0] http://research.google.com/pubs/pub43438.html *Marco Massenzio* *Distributed Systems Engineerhttp://codetrips.com <http://codetrips.com>* On Mon, Aug 31, 2015 at 12:47 PM, Steven Schlansker < sschlans...@opentable.com> wrote: > > > On Aug 31, 2015, at 11:54 AM, Scott Rankin <sran...@motus.com> wrote: > > > > tag=mesos-slave[12858]: F0831 09:37:29.838184 12898 slave.cpp:3354] > CHECK_SOME(os::touch(path)): Failed to open file: No such file or directory > > I reported a similar bug a while back: > > https://issues.apache.org/jira/browse/MESOS-2684 > > This seems to be a class of bugs where some filesystem operations which > may fail for unforeseen reasons are written as assertions which crash the > process, rather than failing only the task and communicating back the error > reason. > > >
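As an illustrative sketch (not actual Mesos/stout code), the two policies debated in this thread can be contrasted on the very operation that crashed the agent, a failed touch of a sandbox path:

```python
# Illustrative sketch of the two failure policies discussed above:
# a process-fatal check vs. a task-scoped failure.

def touch(path):
    """Return None on success, or an error message (akin to stout's Try<T>)."""
    try:
        with open(path, "a"):
            pass
        return None
    except OSError as e:
        return "Failed to open file: %s" % e.strerror

def check_some(error):
    # Policy 1 ("fail early, fail hard"): any failure aborts the whole
    # agent, so HA machinery restarts it and someone gets alerted.
    assert error is None, "CHECK_SOME(os::touch(path)): %s" % error

def run_task(path):
    # Policy 2 (task-scoped): only this task fails; other tasks on the
    # agent keep running and the error is reported to the framework.
    error = touch(path)
    if error is not None:
        return ("TASK_FAILED", error)
    return ("TASK_RUNNING", None)

state, reason = run_task("/nonexistent-dir/marker")
print(state, reason)  # the task fails; the "agent" (this process) survives
```

Steven's point is that many filesystem errors sit safely inside the second pattern; Marco's is that for errors whose blast radius is unknown, the first pattern plus an HA restart is the safer default.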
Re: Prepping for next release
Uhm, that's a tricky one... Considering that JDK6 was EOL'd in 2011[0] and even JDK7 is now officially out of support from Oracle, I don't think this should be a major issue? I'm also assuming that, if anyone really needs to use JDK6 they can build from source, by simply running `mvn package` and replace the JAR? (not terribly familiar with our build process, so no idea if that would work at all). [0] http://www.oracle.com/technetwork/java/eol-135779.html *Marco Massenzio* *Distributed Systems Engineerhttp://codetrips.com <http://codetrips.com>* On Tue, Sep 1, 2015 at 4:46 PM, Vinod Kone <vinodk...@apache.org> wrote: > +user > > So looks like this issue is related to JDK6 and not my maven password > settings. > > Related ASF ticket: https://issues.apache.org/jira/browse/BUILDS-85 > > The reason it worked for me, when I tagged RC1, was because I also pointed > my maven to use JDK7. > > So we have couple options here: > > #1) (Easy) Do same thing with RC2 as we did for RC1. This does mean the > artifacts we upload to nexus will be compiled with JDK7. IIUC, if any JVM > based frameworks are still on JDK6 they can't link in the new artifacts? > > #2) (Harder) As mentioned in the ticket, have maven compile Mesos jar with > JDK6 but use JDK7 when uploading. Not sure how easy it is to adapt our > Mesos build tool chain for this. Anyone has expertise in this area? > > Thoughts? > > > On Tue, Aug 18, 2015 at 3:14 PM, Vinod Kone <vinodk...@apache.org> wrote: > > > I re-encrypted the maven passwords and that seemed to have done the > trick. > > Thanks Adam! > > > > On Tue, Aug 18, 2015 at 1:59 PM, Adam Bordelon <a...@mesosphere.io> > wrote: > > > >> Update your ~/.m2/settings.xml? > >> Also check that the output of `gpg --list-keys` and `--list-sigs` > matches > >> the keypair you expect > >> > >> On Tue, Aug 18, 2015 at 1:48 PM, Vinod Kone <vinodk...@apache.org> > wrote: > >> > >> > I definitely had to create a new gpg key because my previous one > >> expired! 
I > >> > uploaded them id.apache and our SVN repo containing KEYS. > >> > > >> > Do I need to do anything specific for maven? > >> > > >> > On Tue, Aug 18, 2015 at 1:25 PM, Adam Bordelon <a...@mesosphere.io> > >> wrote: > >> > > >> > > Haven't seen that one. Are you sure you've got your gpg key properly > >> set > >> > up > >> > > with Maven? > >> > > > >> > > On Tue, Aug 18, 2015 at 1:13 PM, Vinod Kone <vinodk...@apache.org> > >> > wrote: > >> > > > >> > > > I'm getting the following error when running ./support/tag.sh. Has > >> any > >> > of > >> > > > the recent release managers seen this one before? > >> > > > > >> > > > [ERROR] Failed to execute goal > >> > > > org.apache.maven.plugins:maven-deploy-plugin:2.7:deploy > >> > (default-deploy) > >> > > on > >> > > > project mesos: Failed to deploy artifacts: Could not transfer > >> artifact > >> > > > org.apache.mesos:mesos:jar:0.24.0-rc1 from/to > apache.releases.https > >> ( > >> > > > https://repository.apache.org/service/local/staging/deploy/maven2 > ): > >> > > > java.lang.RuntimeException: Could not generate DH keypair: Prime > >> size > >> > > must > >> > > > be multiple of 64, and can only range from 512 to 1024 (inclusive) > >> -> > >> > > [Help > >> > > > 1] > >> > > > > >> > > > On Mon, Aug 17, 2015 at 11:23 AM, Vinod Kone < > vinodk...@apache.org> > >> > > wrote: > >> > > > > >> > > > > Update: > >> > > > > > >> > > > > There are 3 outstanding tickets (all related to flaky tests), > >> that we > >> > > are > >> > > > > trying to resolve. Any help fixing those (esp. MESOS-3050 > >> > > > > <https://issues.apache.org/jira/browse/MESOS-3050>) would be > >> > > > appreciated! > >> > > > > > >> > > > > Planning to cut an RC as soon as they are fixed (assuming no new > >> ones > >> > > > crop > >> > > > > up). > >> > > > > > >> > > > > Thanks, > >> > > > > > >> > > > &g
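For option #2 above, one plausible shape (an assumption about the fix, not the actual Mesos build change) is to keep running Maven itself under JDK7, so the TLS handshake to repository.apache.org works, while pinning the produced bytecode to 1.6 via the standard maven-compiler-plugin settings; the caveat is that a fully safe JDK6 build also needs the JDK6 bootclasspath, since `-source`/`-target` alone do not stop the compiler from linking against JDK7-only library methods.

```xml
<!-- Hypothetical pom.xml fragment: run Maven under JDK7 for the deploy,
     but emit JDK6-compatible bytecode. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-compiler-plugin</artifactId>
  <configuration>
    <source>1.6</source>
    <target>1.6</target>
  </configuration>
</plugin>
```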
Re: Recommended way to discover current master
The easiest way is to access ZooKeeper directly, as you don't need to know a priori the list of Masters; if you do, however, hitting any one of them will redirect (302) to the current Leader. If you would like to see an example of how to retrieve that info from ZK, I have written about it here[0]. Finally, we're planning to make all this available via the Mesos Commons[1] library (currently, there is a PR[2] waiting to be merged). [0] http://codetrips.com/2015/08/16/apache-mesos-leader-master-discovery-using-zookeeper-part-2/ [1] https://github.com/mesos/commons [2] https://github.com/mesos/commons/pull/2/files *Marco Massenzio* *Distributed Systems Engineer http://codetrips.com <http://codetrips.com>* On Mon, Aug 31, 2015 at 10:25 AM, Philip Weaver <philip.wea...@gmail.com> wrote: > My framework knows the list of zookeeper hosts and the list of mesos > master hosts. > > I can think of a few ways for the framework to figure out which host is > the current master. What would be the best? Should I check in zookeeper > directly? Does the mesos library expose an interface to discover the master > from zookeeper or otherwise? Should I just try each possible master until > one responds? > > Apologies if this is already well documented, but I wasn't able to find > it. Thanks! > > - Philip > >
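For the ZooKeeper route described above: Mesos masters register ephemeral sequential znodes under the election path, and the leader is the contender with the lowest sequence number. A sketch of just the selection logic (the znode names are illustrative; in 0.24+ the `json.info_...` nodes carry the MasterInfo JSON):

```python
# Sketch of Mesos leader discovery from the election znode's children.
# Ephemeral sequential znode names end in a monotonically increasing
# counter; the smallest counter is the current leader.

def leader_znode(children):
    """Given the children of the Mesos election znode, return the leader's."""
    info_nodes = [c for c in children if "info_" in c]
    if not info_nodes:
        return None  # no master currently elected
    return min(info_nodes, key=lambda name: int(name.rsplit("_", 1)[1]))

children = ["json.info_0000000042", "json.info_0000000040", "log_replicas"]
print(leader_znode(children))  # -> json.info_0000000040
```

With a real ZK client such as kazoo, the children would come from `zk.get_children("/mesos")` (or whatever election path the masters were started with) and the leader's host/port from the data of the winning znode.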
Re: Mesos-master complains about quorum being a duplicate flag on CoreOS
Thanks for following up, glad we figured it out. IMO the current behavior (and the error message) are non-intuitive and I've filed a Jira[0] to address that. [0] https://issues.apache.org/jira/browse/MESOS-3340 *Marco Massenzio* *Distributed Systems Engineerhttp://codetrips.com <http://codetrips.com>* On Mon, Aug 31, 2015 at 1:59 AM, F21 <f21.gro...@gmail.com> wrote: > Ah, that makes sense! > > I have the environment variable MESOS_QUORUM exported and it was > conflicting with the --quorum passed to the command line. > > Removing the MESOS_QUORUM environment variable fixed it. > > > On 31/08/2015 5:36 PM, Marco Massenzio wrote: > > Command line flags are parsed using stout/flags.hpp[0] and the FlagsBase > class is derived in mesos::internal::master::Flags (see > src/master/flags.hpp[1]). > > I am not sure why you are seeing that behavior on CoreOS, but I'd be > curious to know what happens if you omit the --quorum when you start > master: it should usually fail and complain that it's a required flag (when > used in conjunction with --zk). If it works, it will emit in the logs > (towards the very beginning) all the values of the flags: what does it say > about --quorum? > > Completely random question: I assume you don't already have in the > environment a MESOS_QUORUM variable exported? > > If the issue persists in a "clean" OS install and a recent build, it's > definitely a bug: it'd be great if you could please file a ticket at > http://issues.apache.org/jira (feel free to assign to me). > > Thanks! 
> > [0] > https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob;f=3rdparty/libprocess/3rdparty/stout/include/stout/flags.hpp > [1] > https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob;f=src/master/flags.hpp > > *Marco Massenzio* > > *Distributed Systems Engineer http://codetrips.com <http://codetrips.com>* > > On Sun, Aug 30, 2015 at 8:21 PM, F21 <f21.gro...@gmail.com> wrote: > >> I've gotten the mesos binaries compiled and packaged and deployed them >> onto a CoreOS instance. >> >> >> When I run the master, it complains that the quorum flag is duplicated: >> >> $ ./mesos-master --zk=zk://192.168.1.4/mesos --quorum=1 >> --hostname=192.168.1.4 --ip=192.168.1.4 >> Duplicate flag 'quorum' on command line >> ... >> >> However, if I try and run mesos-master on Ubuntu 15.04 64-bit (where the >> binaries were built), it seems to work properly: >> >> $ ./mesos-master --zk=zk://192.168.1.4/mesos --quorum=1 >> --hostname=192.168.1.4 --ip=192.168.1.4 >> >> I0830 18:31:20.983999 2830 main.cpp:181] Build: 2015-08-30 10:11:54 by >> I0830 18:31:20.984246 2830 main.cpp:183] Version: 0.23.0 >> I0830 18:31:20.984694 2830 main.cpp:204] Using 'HierarchicalDRF' >> allocator --work_dir needed for replicated log based registry >> >> How are the command line flags parsed in mesos? What causes this strange >> behavior on CoreOS? >> >> >> > >
Re: Mesos-master complains about quorum being a duplicate flag on CoreOS
Command line flags are parsed using stout/flags.hpp[0] and the FlagsBase class is derived in mesos::internal::master::Flags (see src/master/flags.hpp[1]). I am not sure why you are seeing that behavior on CoreOS, but I'd be curious to know what happens if you omit the --quorum when you start master: it should usually fail and complain that it's a required flag (when used in conjunction with --zk). If it works, it will emit in the logs (towards the very beginning) all the values of the flags: what does it say about --quorum? Completely random question: I assume you don't already have in the environment a MESOS_QUORUM variable exported? If the issue persists in a "clean" OS install and a recent build, it's definitely a bug: it'd be great if you could please file a ticket at http://issues.apache.org/jira (feel free to assign to me). Thanks! [0] https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob;f=3rdparty/libprocess/3rdparty/stout/include/stout/flags.hpp [1] https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob;f=src/master/flags.hpp *Marco Massenzio* *Distributed Systems Engineerhttp://codetrips.com <http://codetrips.com>* On Sun, Aug 30, 2015 at 8:21 PM, F21 <f21.gro...@gmail.com> wrote: > I've gotten the mesos binaries compiled and packaged and deployed them > onto a CoreOS instance. > > > When I run the master, it complains that the quorum flag is duplicated: > > $ ./mesos-master --zk=zk://192.168.1.4/mesos --quorum=1 > --hostname=192.168.1.4 --ip=192.168.1.4 > Duplicate flag 'quorum' on command line > ... 
> > However, if I try and run mesos-master on Ubuntu 15.04 64-bit (where the > binaries were built), it seems to work properly: > > $ ./mesos-master --zk=zk://192.168.1.4/mesos --quorum=1 > --hostname=192.168.1.4 --ip=192.168.1.4 > > I0830 18:31:20.983999 2830 main.cpp:181] Build: 2015-08-30 10:11:54 by > I0830 18:31:20.984246 2830 main.cpp:183] Version: 0.23.0 > I0830 18:31:20.984694 2830 main.cpp:204] Using 'HierarchicalDRF' > allocator --work_dir needed for replicated log based registry > > How are the command line flags parsed in mesos? What causes this strange > behavior on CoreOS? > > >
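A simplified model (an assumption about stout's flag loading, not its actual code) that reproduces the reported symptom: flags are gathered from both the `MESOS_`-prefixed environment and argv, and setting the same flag twice raises the "Duplicate flag" error.

```python
# Simplified model of Mesos flag loading: environment variables with a
# MESOS_ prefix and --name=value argv entries feed the same flag table,
# and a second assignment of any flag is rejected.

def load_flags(environ, argv, prefix="MESOS_"):
    flags = {}

    def set_flag(name, value):
        if name in flags:
            raise ValueError("Duplicate flag '%s' on command line" % name)
        flags[name] = value

    for key, value in environ.items():
        if key.startswith(prefix):
            set_flag(key[len(prefix):].lower(), value)
    for arg in argv:
        if arg.startswith("--"):
            name, _, value = arg[2:].partition("=")
            set_flag(name, value)
    return flags
```

Under this model, exporting `MESOS_QUORUM` and also passing `--quorum=1` trips the duplicate check, which matches both the error message and the fix (unsetting the environment variable) reported in this thread.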
Re: Use docker start rather than docker run?
Hi Paul, +1 to what Alex/Tim say. Maybe a (simple) example will help: a very basic framework I created recently, does away with the Executor and only uses the Scheduler, sending a CommandInfo structure to Mesos' Agent node to execute. See: https://github.com/massenz/mongo_fw/blob/develop/src/mongo_scheduler.cpp#L124 If Python is more your thing, there are examples in the Mesos repository, or you can take a look at something I started recently to use the new (0.24) HTTP API (NOTE - this is still very much still WIP): https://github.com/massenz/zk-mesos/blob/develop/notebooks/HTTP%20API%20Tests.ipynb *Marco Massenzio* *Distributed Systems Engineerhttp://codetrips.com http://codetrips.com* On Fri, Aug 28, 2015 at 8:44 AM, Paul Bell arach...@gmail.com wrote: Alex Tim, Thank you both; most helpful. Alex, can you dispel my confusion on this point: I keep reading that a framework in Mesos (e.g., Marathon) consists of a scheduler and an executor. This reference to executor made me think that Marathon must have *some* kind of presence on the slave node. But the more familiar I become with Mesos the less likely this seems to me. So, what does it mean to talk about the Marathon framework executor? Tim, I did come up with a simple work-around that involves re-copying the needed file into the container each time the application is started. For reasons unknown, this file is not kept in a location that would readily lend itself to my use of persistent storage (Docker -v). That said, I am keenly interested in learning how to write both custom executors schedulers. Any sense for what release of Mesos will see persistent volumes? Thanks again, gents. -Paul On Fri, Aug 28, 2015 at 2:26 PM, Tim Chen t...@mesosphere.io wrote: Hi Paul, We don't [re]start a container since we assume once the task terminated the container is no longer reused. In Mesos to allow tasks to reuse the same executor and handle task logic accordingly people will opt to choose the custom executor route. 
We're working on a way to keep your sandbox data beyond a container's lifecycle, which is called persistent volumes. We haven't integrated that with the Docker containerizer yet, so you'll have to wait to use that feature. You could also choose to implement a custom executor for now if you like. Tim

On Fri, Aug 28, 2015 at 10:43 AM, Alex Rukletsov a...@mesosphere.com wrote: Paul, that component is called DockerContainerizer and it's part of the Mesos Agent (check src/slave/containerizer/docker.hpp in the source tree). @Tim, could you answer the docker start vs. docker run question?

On Fri, Aug 28, 2015 at 1:26 PM, Paul Bell arach...@gmail.com wrote: Hi All, I first posted this to the Marathon list, but someone suggested I try it here. I'm still not sure which component (mesos-master, mesos-slave, marathon) generates the docker run command that launches containers on a slave node. I suppose that it's the framework executor (Marathon) on the slave that actually executes the docker run, but I'm not sure. What I'm really after is whether or not we can cause the use of docker start rather than docker run. At issue here is some persistent data inside /var/lib/docker/aufs/mnt/CTR_ID. docker run will, by design, (re)launch my application with a different CTR_ID, effectively rendering that data inaccessible. But docker start will restart the container, and its old data will still be there. Thanks. -Paul
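Tim's distinction is the crux: `docker run` always creates a brand-new container (new CTR_ID), so anything written into the container layer is orphaned, while `docker start` would reuse the old one. Until persistent volumes land, the workaround Paul alludes to is a `-v` bind mount so the data lives on the host and every fresh container sees it. A small illustrative sketch (image name, container name, and paths are made-up examples, not something from the thread):

```python
def docker_run_cmd(image, name, host_data_dir, container_data_dir="/data"):
    """Build a `docker run` invocation that bind-mounts host storage.

    Because relaunching with `docker run` produces a brand-new container
    ID, data written inside the container layer is lost; a -v bind mount
    keeps it on the host across relaunches. All names and paths here are
    hypothetical examples.
    """
    return [
        "docker", "run",
        "--name", name,
        # bind mount: host dir survives container relaunches
        "-v", f"{host_data_dir}:{container_data_dir}",
        image,
    ]
```

With this, relaunching via `docker run` still produces a new container ID, but the host directory carries the state across relaunches.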
Re: Talks at MesosCon 2015
On Fri, Aug 21, 2015 at 12:07 AM, Marco Massenzio ma...@mesosphere.io wrote:

Great talks today, can't wait to get my hands on the new APIs.

You can ;) Mesos 0.25-rc1 is out for grabs and testing...

And, of course, I meant *0.24-rc1* ... this is what happens when one spends all day building @HEAD :D Apologies for the confusion and thanks to @mpark for being eagle-eyed!

will require building from source: not for the faint of heart, but not an insurmountable hurdle either.

*Marco Massenzio*
*Distributed Systems Engineer*
http://codetrips.com

On Thu, Aug 20, 2015 at 9:22 PM, Haripriya Ayyalasomayajula aharipriy...@gmail.com wrote: thanks! that would be very helpful. Great talks today, can't wait to get hands on the new APIs.

On Thu, Aug 20, 2015 at 10:57 AM, Chris Aniszczyk z...@twitter.com wrote: Yes, they will all be recorded and posted on YouTube.

On Thu, Aug 20, 2015 at 10:30 AM, craig w codecr...@gmail.com wrote: An earlier post said all videos will be online a few days after the conference.

On Thu, Aug 20, 2015 at 1:29 PM, Kenneth Su su.ke...@gmail.com wrote: Yes, agreed. I was looking for live video today on YouTube, but nothing was there. A forum will definitely help to follow up and discuss from there. Thanks! Kenneth

On Thu, Aug 20, 2015 at 11:26 AM, Haripriya Ayyalasomayajula aharipriy...@gmail.com wrote: Hi all, I'm at MesosCon 2015 today and was just curious if all the talks/presentations would be captured anywhere (Mesosphere blog / YouTube). It would be very helpful to have them recorded. There are multiple interesting talks scheduled at the same time and it's not possible to cover them all. I strongly believe a forum to follow up on the talks/topics presented here would be helpful. Thanks. -- Regards, Haripriya Ayyalasomayajula

-- https://github.com/mindscratch https://www.google.com/+CraigWickesser https://twitter.com/mind_scratch https://twitter.com/craig_links

-- Cheers, Chris Aniszczyk | Open Source | Twitter, Inc.
@cra | +1 512 961 6719 -- Regards, Haripriya Ayyalasomayajula
Re: Talks at MesosCon 2015
Great talks today, can't wait to get hands on the new APIs. You can ;) Mesos 0.25-rc1 is out for grabs and testing... will require building from source: not for the faint of heart, but not an insurmountable hurdle either.

*Marco Massenzio*
*Distributed Systems Engineer*
http://codetrips.com

On Thu, Aug 20, 2015 at 9:22 PM, Haripriya Ayyalasomayajula aharipriy...@gmail.com wrote: thanks! that would be very helpful. Great talks today, can't wait to get hands on the new APIs.

On Thu, Aug 20, 2015 at 10:57 AM, Chris Aniszczyk z...@twitter.com wrote: Yes, they will all be recorded and posted on YouTube.

On Thu, Aug 20, 2015 at 10:30 AM, craig w codecr...@gmail.com wrote: An earlier post said all videos will be online a few days after the conference.

On Thu, Aug 20, 2015 at 1:29 PM, Kenneth Su su.ke...@gmail.com wrote: Yes, agreed. I was looking for live video today on YouTube, but nothing was there. A forum will definitely help to follow up and discuss from there. Thanks! Kenneth

On Thu, Aug 20, 2015 at 11:26 AM, Haripriya Ayyalasomayajula aharipriy...@gmail.com wrote: Hi all, I'm at MesosCon 2015 today and was just curious if all the talks/presentations would be captured anywhere (Mesosphere blog / YouTube). It would be very helpful to have them recorded. There are multiple interesting talks scheduled at the same time and it's not possible to cover them all. I strongly believe a forum to follow up on the talks/topics presented here would be helpful. Thanks. -- Regards, Haripriya Ayyalasomayajula

-- https://github.com/mindscratch https://www.google.com/+CraigWickesser https://twitter.com/mind_scratch https://twitter.com/craig_links

-- Cheers, Chris Aniszczyk | Open Source | Twitter, Inc.
Re: Assertion `data.isNone()' failed
Are you sure this is a 0.21.1 cluster? The line numbers in the logs match the code in Mesos 0.23.0.

This is, however, a genuine bug (src/launcher/fetcher.cpp#L99):

    Try<bool> available = hdfs.available();
    if (available.isError() || !available.get()) {
      return Error("Skipping fetch with Hadoop Client as "
                   "Hadoop Client not available: " + available.error());
    }

The root cause is that (probably) the HDFS client is not available on the slave; however, we do not 'error()' but rather return a 'false' - this is all good. The bug is exposed in the return line, where we try to retrieve available.error() (which we should not - it's just `false`). This was a 'latent' bug that *may* have been exposed by (my) recent refactoring of os::shell, which is used by hdfs.available() under the covers. (This is a bit unclear, though, as that refactoring is post-0.23.)

Be that as it may, I've filed https://issues.apache.org/jira/browse/MESOS-3287: the fix is trivial and I may be able to sneak it into 0.24 (which we're cutting now). Thanks for reporting!

PS - bad code aside, the root cause is that the `hdfs` binary seems to be unreachable on the slave: is it installed in the PATH of the user under which the slave binary executes?

*Marco Massenzio*
*Distributed Systems Engineer*
http://codetrips.com

On Mon, Aug 17, 2015 at 10:46 PM, Ashwanth Kumar ashwa...@indix.com wrote: We have a 20-node Mesos cluster running v0.21.1. We've been running Marathon on top of this setup without any problems for ~4 months now.
I'm now trying to get hadoop mesos https://github.com/mesos/hadoop/ integration working but I see the TaskTrackers that gets launched are failing with the following error I0818 05:36:35.058688 24428 fetcher.cpp:409] Fetcher Info: {cache_directory:\/tmp\/mesos\/fetch\/slaves\/20150706-075218-1611773194-5050-28439-S473\/hadoop,items:[{action:BYPASS_CACHE,uri:{extract:true,value:hdfs:\/\/hdfs.prod:54310\/user\/ashwanth\/hadoop-with-mesos-2.6.0-cdh5.4.4.tar.gz}}],sandbox_directory:\/var\/lib\/mesos\/slaves\/20150706-075218-1611773194-5050-28439-S473\/frameworks\/20150706-075218-1611773194-5050-28439-4532\/executors\/executor_Task_Tracker_4129\/runs\/c26f52d4-4055-46fa-b999-11d73f2096dd,user:hadoop} I0818 05:36:35.059806 24428 fetcher.cpp:364] Fetching URI 'hdfs://hdfs.prod:54310/user/ashwanth/hadoop-with-mesos-2.6.0-cdh5.4.4.tar.gz' I0818 05:36:35.059821 24428 fetcher.cpp:238] Fetching directly into the sandbox directory I0818 05:36:35.059835 24428 fetcher.cpp:176] Fetching URI 'hdfs://hdfs.prod:54310/user/ashwanth/hadoop-with-mesos-2.6.0-cdh5.4.4.tar.gz' *mesos-fetcher: /tmp/mesos-build/mesos-repo/3rdparty/libprocess/3rdparty/stout/include/stout/try.hpp:90: const string TryT::error() const [with T = bool; std::string = std::basic_stringchar]: Assertion `data.isNone()' failed.* *** Aborted at 1439876195 (unix time) try date -d @1439876195 if you are using GNU date *** PC: @ 0x343ee32635 (unknown) *** SIGABRT (@0x5f6c) received by PID 24428 (TID 0x7f988832f820) from PID 24428; stack trace: *** @ 0x343f20f710 (unknown) @ 0x343ee32635 (unknown) @ 0x343ee33e15 (unknown) @ 0x343ee2b75e (unknown) @ 0x343ee2b820 (unknown) @ 0x408b0a Try::error() @ 0x40cbcf download() @ 0x4098a3 main @ 0x343ee1ed5d (unknown) @ 0x40aeb5 (unknown) Failed to synchronize with slave (it's probably exited) Environment - EC2 Machines - Output of lsb_release -a LSB Version: 
:base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch Distributor ID: CentOS Description: CentOS release 6.5 (Final) Release: 6.5 Codename: Final Any ideas what I'm doing wrong? -- -- Ashwanth Kumar
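Marco's diagnosis can be made concrete with a toy model. The following Python sketch mimics the semantics of stout's Try<T> (it is not the real C++ class): error() may only be called when the Try actually holds an error, which is exactly the `data.isNone()` assertion that fired in the TaskTracker logs, and the guarded version shows the shape of the fix:

```python
class Try:
    """Toy model of stout's Try<T>: exactly one of value/error is set."""
    def __init__(self, value=None, error=None):
        assert (value is None) != (error is None)
        self._value, self._error = value, error

    def is_error(self):
        return self._error is not None

    def error(self):
        # Mirrors the `data.isNone()` assertion in try.hpp: error() is
        # only legal when the Try actually carries an error.
        assert self._error is not None, "error() called on a successful Try"
        return self._error

    def get(self):
        assert self._error is None, "get() called on an errored Try"
        return self._value


def fetch_guard(available):
    """Fixed control flow: only read .error() when is_error() is true."""
    if available.is_error():
        return "Hadoop Client not available: " + available.error()
    if not available.get():
        # hdfs binary missing, but no error object to read
        return "Hadoop Client not available"
    return None  # proceed with the fetch
```

The original code reached `available.error()` on the `!available.get()` path, a Try holding a plain `false`, which is the aborted assertion seen in the crash.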
Re: Assertion `data.isNone()' failed
Hi Ashwanth, I've pushed a fix out for review https://reviews.apache.org/r/37584/; we'll see if it makes it in time for 0.24. As for the version, you can quickly verify that by running `mesos-master --version` (or just look at the very beginning of the logs, which will tell you a bunch of stuff about version, build, etc.)

I am sorry, I don't really know enough about setting up Hadoop on Mesos to give you any useful guidance; from a quick glance at the code, it seems to me that, if the URI is a `hdfs://` one, the only way to retrieve the tarball is via HDFS (so you will need the hdfs client to be available on the Slave(s)). If you do use an HTTP URI (http://) then it should work just fine. Hopefully others will be able to chime in with a more informed view.

*Marco Massenzio*
*Distributed Systems Engineer*
http://codetrips.com

On Tue, Aug 18, 2015 at 2:46 AM, Ashwanth Kumar ashwa...@indix.com wrote: Thanks Marco for the update. My understanding of the hadoop-mesos framework was that the executor would download the hadoop distro from mapred.mesos.executor.uri and execute the TTs. I didn't know that downloading from HDFS needs the `hdfs` binary in PATH. I don't have a hadoop setup on the mesos slaves. Should I go ahead and add them? Regarding the line number mismatch, I installed the package through Mesosphere; not sure if that's the reason.

On Tue, Aug 18, 2015 at 1:22 PM, Marco Massenzio ma...@mesosphere.io wrote: Are you sure this is a 0.21.1 cluster? The line numbers in the logs match the code in Mesos 0.23.0.

This is, however, a genuine bug (src/launcher/fetcher.cpp#L99):

    Try<bool> available = hdfs.available();
    if (available.isError() || !available.get()) {
      return Error("Skipping fetch with Hadoop Client as "
                   "Hadoop Client not available: " + available.error());
    }

The root cause is that (probably) the HDFS client is not available on the slave; however, we do not 'error()' but rather return a 'false' - this is all good.
The bug is exposed in the return line, where we try to retrieve available.error() (which we should not - it's just `false`). This was a 'latent' bug that *may* have been exposed by (my) recent refactoring of os::shell which is used by hdfs.available() under the covers. (this is a bit unclear, though, as that refactoring is post-0.23) Be that as it may, I've filed https://issues.apache.org/jira/browse/MESOS-3287: the fix is trivial and I may be able to sneak it into 0.24 (which we're cutting now). Thanks for reporting! PS - bad code aside, the root cause is that the `hdfs` binary seems to be unreachable on the slave: is it installed in the PATH of the user under which the slave binary executes? *Marco Massenzio* *Distributed Systems Engineerhttp://codetrips.com http://codetrips.com* On Mon, Aug 17, 2015 at 10:46 PM, Ashwanth Kumar ashwa...@indix.com wrote: We've a 20 node mesos cluster running mesos v0.21.1, We run marathon on top of this setup without any problems for ~4 months now. I'm now trying to get hadoop mesos https://github.com/mesos/hadoop/ integration working but I see the TaskTrackers that gets launched are failing with the following error I0818 05:36:35.058688 24428 fetcher.cpp:409] Fetcher Info: {cache_directory:\/tmp\/mesos\/fetch\/slaves\/20150706-075218-1611773194-5050-28439-S473\/hadoop,items:[{action:BYPASS_CACHE,uri:{extract:true,value:hdfs:\/\/hdfs.prod:54310\/user\/ashwanth\/hadoop-with-mesos-2.6.0-cdh5.4.4.tar.gz}}],sandbox_directory:\/var\/lib\/mesos\/slaves\/20150706-075218-1611773194-5050-28439-S473\/frameworks\/20150706-075218-1611773194-5050-28439-4532\/executors\/executor_Task_Tracker_4129\/runs\/c26f52d4-4055-46fa-b999-11d73f2096dd,user:hadoop} I0818 05:36:35.059806 24428 fetcher.cpp:364] Fetching URI 'hdfs://hdfs.prod:54310/user/ashwanth/hadoop-with-mesos-2.6.0-cdh5.4.4.tar.gz' I0818 05:36:35.059821 24428 fetcher.cpp:238] Fetching directly into the sandbox directory I0818 05:36:35.059835 24428 fetcher.cpp:176] Fetching URI 
'hdfs://hdfs.prod:54310/user/ashwanth/hadoop-with-mesos-2.6.0-cdh5.4.4.tar.gz' *mesos-fetcher: /tmp/mesos-build/mesos-repo/3rdparty/libprocess/3rdparty/stout/include/stout/try.hpp:90: const string TryT::error() const [with T = bool; std::string = std::basic_stringchar]: Assertion `data.isNone()' failed.* *** Aborted at 1439876195 (unix time) try date -d @1439876195 if you are using GNU date *** PC: @ 0x343ee32635 (unknown) *** SIGABRT (@0x5f6c) received by PID 24428 (TID 0x7f988832f820) from PID 24428; stack trace: *** @ 0x343f20f710 (unknown) @ 0x343ee32635 (unknown) @ 0x343ee33e15 (unknown) @ 0x343ee2b75e (unknown) @ 0x343ee2b820 (unknown) @ 0x408b0a Try::error() @ 0x40cbcf download() @ 0x4098a3 main @ 0x343ee1ed5d (unknown) @ 0x40aeb5 (unknown) Failed to synchronize with slave (it's
Re: Assertion `data.isNone()' failed
For info, the patch was committed today and made the cut to 0.24-rc1. Thanks to @vinodkone for super-quick turnaround. — Sent from Mailbox On Tue, Aug 18, 2015 at 10:45 AM, Marco Massenzio ma...@mesosphere.io wrote: Hi Ashwanth, I've pushed a fix out for review https://reviews.apache.org/r/37584/, we'll see if it makes it in time for 0.24. As for the version, you can quickly verify that by running `mesos-master --version` (or just look at the very beginning of the logs, it will tell you a bunch of stuff about version, build, etc.) I am sorry, I don't really know enough about setting up Hadoop on Mesos to give you any useful guidance; from a quick glance at the code, it seems to me that, if the URI is a `hdfs://` one, the only way to retrieve the tarball is via HDFS (so you will need the hdfs client to be available on the Slave(s)). If you do use an HTTP URI (http://) then it should work just fine. Hopefully others will be able to chime in with a more informed view. *Marco Massenzio* *Distributed Systems Engineerhttp://codetrips.com http://codetrips.com* On Tue, Aug 18, 2015 at 2:46 AM, Ashwanth Kumar ashwa...@indix.com wrote: Thanks Marco for the update. My understanding of the hadoop mesos framework was that the executor would download the hadoop distro from mapred.mesos.executor.uri and execute the TTs. I didn't know that to download from HDFS it needs `hdfs` binary in PATH. I don't have a hadoop setup on the mesos slave. Should I go ahead and add them? Regarding the line number mismatch, I installed the package through mesosphere not sure if that's the reason. On Tue, Aug 18, 2015 at 1:22 PM, Marco Massenzio ma...@mesosphere.io wrote: Are you sure this is a 0.21.1 cluster? 
The line numbers in the logs match the code in Mesos 0.23.0.

This is, however, a genuine bug (src/launcher/fetcher.cpp#L99):

    Try<bool> available = hdfs.available();
    if (available.isError() || !available.get()) {
      return Error("Skipping fetch with Hadoop Client as "
                   "Hadoop Client not available: " + available.error());
    }

The root cause is that (probably) the HDFS client is not available on the slave; however, we do not 'error()' but rather return a 'false' - this is all good. The bug is exposed in the return line, where we try to retrieve available.error() (which we should not - it's just `false`). This was a 'latent' bug that *may* have been exposed by (my) recent refactoring of os::shell, which is used by hdfs.available() under the covers. (This is a bit unclear, though, as that refactoring is post-0.23.)

Be that as it may, I've filed https://issues.apache.org/jira/browse/MESOS-3287: the fix is trivial and I may be able to sneak it into 0.24 (which we're cutting now). Thanks for reporting!

PS - bad code aside, the root cause is that the `hdfs` binary seems to be unreachable on the slave: is it installed in the PATH of the user under which the slave binary executes?

*Marco Massenzio*
*Distributed Systems Engineer*
http://codetrips.com

On Mon, Aug 17, 2015 at 10:46 PM, Ashwanth Kumar ashwa...@indix.com wrote: We have a 20-node Mesos cluster running v0.21.1. We've been running Marathon on top of this setup without any problems for ~4 months now.
I'm now trying to get hadoop mesos https://github.com/mesos/hadoop/ integration working but I see the TaskTrackers that gets launched are failing with the following error I0818 05:36:35.058688 24428 fetcher.cpp:409] Fetcher Info: {cache_directory:\/tmp\/mesos\/fetch\/slaves\/20150706-075218-1611773194-5050-28439-S473\/hadoop,items:[{action:BYPASS_CACHE,uri:{extract:true,value:hdfs:\/\/hdfs.prod:54310\/user\/ashwanth\/hadoop-with-mesos-2.6.0-cdh5.4.4.tar.gz}}],sandbox_directory:\/var\/lib\/mesos\/slaves\/20150706-075218-1611773194-5050-28439-S473\/frameworks\/20150706-075218-1611773194-5050-28439-4532\/executors\/executor_Task_Tracker_4129\/runs\/c26f52d4-4055-46fa-b999-11d73f2096dd,user:hadoop} I0818 05:36:35.059806 24428 fetcher.cpp:364] Fetching URI 'hdfs://hdfs.prod:54310/user/ashwanth/hadoop-with-mesos-2.6.0-cdh5.4.4.tar.gz' I0818 05:36:35.059821 24428 fetcher.cpp:238] Fetching directly into the sandbox directory I0818 05:36:35.059835 24428 fetcher.cpp:176] Fetching URI 'hdfs://hdfs.prod:54310/user/ashwanth/hadoop-with-mesos-2.6.0-cdh5.4.4.tar.gz' *mesos-fetcher: /tmp/mesos-build/mesos-repo/3rdparty/libprocess/3rdparty/stout/include/stout/try.hpp:90: const string TryT::error() const [with T = bool; std::string = std::basic_stringchar]: Assertion `data.isNone()' failed.* *** Aborted at 1439876195 (unix time) try date -d @1439876195 if you are using GNU date *** PC: @ 0x343ee32635 (unknown) *** SIGABRT (@0x5f6c) received by PID 24428 (TID 0x7f988832f820) from PID 24428; stack trace: *** @ 0x343f20f710 (unknown) @ 0x343ee32635 (unknown) @ 0x343ee33e15 (unknown) @ 0x343ee2b75e (unknown
Re: SSL in Mesos 0.23
FYI - Joris is out this week; he'll probably be able to get back to you early next (modulo MesosCon craziness :)

*Marco Massenzio*
*Distributed Systems Engineer*

On Fri, Aug 14, 2015 at 9:14 AM, Carlos Sanchez car...@apache.org wrote: no suggestions?

On Tue, Aug 11, 2015 at 6:47 PM, Vinod Kone vinodk...@apache.org wrote: @joris, can you help out here?

On Tue, Aug 11, 2015 at 9:43 AM, Carlos Sanchez car...@apache.org wrote: I have tried to enable SSL with no success, even compiling from source with the ssl flags --enable-libevent --enable-ssl

export SSL_ENABLED=true
export SSL_SUPPORT_DOWNGRADE=false
export SSL_REQUIRE_CERT=true
export SSL_CERT_FILE=/etc/mesos/...
export SSL_KEY_FILE=/etc/mesos/...
export SSL_CA_FILE=/etc/mesos/...
/home/ubuntu/mesos-deb-packaging/mesos-repo/build/src/mesos-master --work_dir=/var/lib/mesos

Port 5050 is still served as plain HTTP, no SSL. Nothing about SSL shows up in the logs, any ideas? Thanks

From: Dharmit Shah shahdhar...@gmail.com To: user@mesos.apache.org Date: Mon, 10 Aug 2015 14:13:04 +0530 Subject: Re: SSL in Mesos 0.23

Hi Jeff, Thanks for the suggestion. I modified the systemd service file to use `/etc/sysconfig/mesos-master` and `/etc/sysconfig/mesos-slave` as environment files for the master and slave services respectively. In these files, I specified the environment variables that I used to specify on the command line. Now if I check `strings /proc/pid/environ | grep SSL` for the pids of the master and slave services, I see the environment variables that I set in the /etc/sysconfig environment file. Now that it looks like I have started the master and slave services with SSL enabled, how do I really confirm that communication between master and slaves is really happening over SSL? Also, how do I enable SSL communication for a framework like Marathon? Regards, Dharmit.
On Fri, Aug 7, 2015 at 10:56 PM, Jeff Schroeder jeffschroe...@computer.org wrote: The sudo command defaults to env_reset (look for that in the man page), which strips all env variables sans a select few. I'd almost bet that your SSL_* variables are not present and were not passed to the slave. Just `sudo -i` and start the slaves *as root* without sudo. There is no benefit to starting them with sudo. You can verify what I'm saying with something along the lines of:

strings /proc/$(pidof mesos-slave)/environ | grep ^SSL_

On Friday, August 7, 2015, Dharmit Shah shahdhar...@gmail.com wrote: Hello again, Thanks for your responses. I will share what I tried after your suggestions.

1. `ldd /usr/sbin/mesos-master` and `ldd /usr/sbin/mesos-slave` returned similar output as the one suggested by Craig. So, I guess, the Mesosphere repo binaries have SSL enabled. Right?

2. I created an SSL private key and cert on one system in my cluster by referring to this guide on DO [1]. Admittedly, my knowledge of SSL is limited.

3. Next, I copied the key and cert to all three mesos-master nodes and four mesos-slave nodes. Shouldn't slave nodes be provided only with the cert and not the private key? Whereas all master nodes may have both the private key and the cert. Or am I understanding SSL incorrectly here?

4. After copying the cert and key, I started the mesos-master service on the master nodes with the below command:

$ sudo SSL_ENABLED=true SSL_KEY_FILE=~/ssl/mesos.key SSL_CERT_FILE=~/ssl/mesos.crt /usr/sbin/mesos-master --zk=zk://172.19.10.111:2181,172.19.10.112:2181,172.19.10.193:2181/mesos --port=5050 --log_dir=/var/log/mesos --acls=file:///root/acls.json --credentials=/home/isys/mesos --quorum=2 --work_dir=/var/lib/mesos

I checked the web UI and things look good. I am not completely sure if https should have worked for the Mesos web UI, but it didn't.

5.
Next, I start the slave nodes with the below command:

$ sudo SSL_ENABLED=true SSL_CERT_FILE=~/mesos.crt SSL_KEY_FILE=~/mesos.key /usr/sbin/mesos-slave --master=zk://172.19.10.111:2181,172.19.10.112:2181,172.19.10.193:2181/mesos --log_dir=/var/log/mesos --containerizers=docker,mesos --executor_registration_timeout=15mins

The Mesos web UI reported four mesos-slave nodes in Activated mode. So far so good. I am still wondering how I should verify if communication is happening over SSL.

6. To check if SSL is indeed working, I stopped one slave node and started it without SSL using `systemctl start mesos-slave`. I was expecting it not to get into the Activated state on the Mesos web UI, but it did. So, I think SSL is not configured properly by me. I am attaching logs from the master nodes. These logs were generated after starting the masters with the command specified in point 4. Let
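On Dharmit's question of how to confirm the master/slave ports actually speak SSL: one practical check is to attempt a TLS handshake against the port yourself (for example with `openssl s_client -connect host:5050`, which typically fails fast against a plain-HTTP port). A rough Python equivalent, offered only as a sketch, with hostnames/ports as placeholders and certificate validation deliberately skipped:

```python
import socket
import ssl

def speaks_tls(host, port, timeout=3.0):
    """Return True if host:port completes a TLS handshake.

    A plain-HTTP Mesos endpoint (i.e. SSL not actually enabled) will
    reset or garble the handshake, so this returns False for it. This
    only probes for TLS; it deliberately does NOT validate the cert.
    """
    ctx = ssl.create_default_context()
    ctx.check_hostname = False          # probing only
    ctx.verify_mode = ssl.CERT_NONE     # probing only
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with ctx.wrap_socket(raw, server_hostname=host):
                return True
    except (ssl.SSLError, OSError):
        return False

# e.g. speaks_tls("mesos-master.example.com", 5050)  # hostname is a placeholder
```

Running it once against a master started with SSL_ENABLED=true and once against one started without would make the difference from step 6 visible directly.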
Re: Can't start master properly (stale state issue?); help!
Thanks for the summary, Paul. As mentioned, I'm not terribly familiar with what happens in the 'log-replicas' folder, so I will not even try to comment as I don't want to mislead you (and future readers) on red-herring chases.

I can tell you that 'zapping' the ZK data folders is essentially harmless (well, as far as Mesos is concerned - not sure if you use ZK for other stuff) so long as that happens while the Master/Agent Nodes are NOT running (or you can seriously send them into a spin), and I would certainly strongly suggest that the hostname/hosts files be touched *before* Mesos starts up (if you think that was not the case, it would certainly explain the weirdness).

If you do see it again, my recommendation (or, at least, what I do when I see Leader-related weirdness) is to use zkCli.sh and go looking into the znode contents, as I mentioned in my previous emails. The good news is that, as of 0.24 (out probably next week), we write to ZK in JSON, so that will be easy to parse for humans (and for non-PB-aware code too). The Leader is always the one with the lowest-numbered sequential node there, and you should be able to confirm that by looking at the logs too.

Good luck with your app, sounds fun and exciting!

*Marco Massenzio*
*Distributed Systems Engineer*
http://codetrips.com

On Fri, Aug 14, 2015 at 5:53 AM, Paul Bell arach...@gmail.com wrote: All, By way of some background: I'm not running a data center (or centers). Rather, I work on a distributed application whose trajectory is taking it into a realm of many Docker containers distributed across many hosts (mostly virtual hosts at the outset). An environment that supports isolation, multi-tenancy, scalability, and some fault tolerance is desirable for this application. Also, the mere ability to simplify - at least somewhat - the management of multiple hosts is of great importance. So, that's more or less how I got to Mesos and to here...
I ended up writing a Java program that configures a collection of host VMs as a Mesos cluster and then, via Marathon, distributes the application containers across the cluster. Configuring & building the cluster is largely a lot of SSH work. Doing the same for the application is part Marathon, part Docker remote API. The containers that need to talk to each other via TCP are connected with Weave's (http://weave.works) overlay network. So the main infrastructure consists of Mesos, Docker, and Weave. The whole thing is pretty amazing - for which I take very little credit. Rather, these are some wonderful technologies, and the folks who write & support them are very helpful. That said, I sometimes feel like I'm juggling chain saws!

*In re* the issues raised on this thread: All Mesos components were installed via the Mesosphere packages. The 4 VMs in the cluster are all running Ubuntu 14.04 LTS. My suspicions about the IP@ 127.0.1.1 were raised a few months ago when, after seeing this IP in a mesos-master log when things weren't working, I discovered these articles:

https://groups.google.com/forum/#!topic/marathon-framework/1qboeZTOLU4
http://frankhinek.com/build-mesos-multi-node-ha-cluster/ (see note 2)

So, to the point raised just now by Klaus (and earlier in the thread), the aforementioned configuration program does change /etc/hosts (and /etc/hostname) in the way Klaus suggested. But, as I mentioned to Marco & haosdent, I might have encountered a race condition wherein ZK/mesos-master saw the unchanged /etc/hosts before I altered it. I believe that I fixed that issue yesterday.

Also, as part of the cluster create step, I get a bit aggressive (perhaps unwisely) with what I believe are some state repositories. Specifically, I

rm /var/lib/zookeeper/version-2/*
rm -Rf /var/lib/mesos/replicated_log

Should I NOT be doing this?
I know from experience that zapping the version-2 directory (ZK's dataDir, IIRC) can solve occasional weirdness. Marco: is /var/lib/mesos/replicated_log what you are referring to when you say some issue with the log-replica?

Just a day or two ago I first heard the term znode & learned a little about zkCli.sh. I will experiment with it more in the coming days. As matters now stand, I have the cluster up and running. But before I again deploy the application, I am trying to put the cluster through its paces by periodically cycling it through the states my program can bring about, e.g.,

--cluster create (takes a clean VM and configures it to act as one or more Mesos components: ZK, master, slave)
--cluster stop (stops the Mesos services on each node)
--cluster destroy (configures the VM back to its original clean state)
--cluster create
--cluster stop
--cluster start

et cetera. *The only way I got rid of the no leading master issue that started
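Since wiping ZK's dataDir contents and the Mesos replicated log is only safe while the services are stopped (per Marco's caveat above), it may help to wrap the two `rm` commands from this thread in a guarded helper. A hedged sketch, where the dry-run default and the return value are my additions, and verifying that the services are actually stopped is left as an assumption:

```python
import shutil
from pathlib import Path

# State paths mentioned in the thread: ZK's dataDir contents and the
# Mesos replicated log. Removing them is harmless to Mesos ONLY while
# ZK and the master/agent processes are stopped.
STATE_DIRS = [
    "/var/lib/zookeeper/version-2",
    "/var/lib/mesos/replicated_log",
]

def reset_node_state(dirs=STATE_DIRS, dry_run=True):
    """Remove cluster-state directories; with dry_run=True, only report
    which of them exist and would be removed."""
    removed = []
    for d in dirs:
        p = Path(d)
        if p.exists():
            if not dry_run:
                shutil.rmtree(p)  # destructive: stop services first!
            removed.append(str(p))
    return removed
```

Calling `reset_node_state()` first with the default dry run gives a chance to eyeball what a `--cluster destroy` pass would actually delete.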
Re: Can't start master properly (stale state issue?); help!
To be really sure about the possible root cause, I'd need to know how you installed Mesos on your server; if it's via Mesosphere packages, the configuration is described here: https://open.mesosphere.com/reference/packages/

I am almost[0] sure the behavior you are seeing has something to do with how the server resolves the hostname to an IP for your Master - unless you give an explicit IP address to bind to (--ip), libprocess will look up the hostname, reverse-DNS it, and resolve to an IP address: if that fails, it falls back to localhost. If you want to try a quick hack, you can run `cat /etc/hostname` on that server, and add a line in /etc/hosts that resolves that name to the actual IP address (71.100.14.9, in your logs).

The other possibility is that it's really a 'stale state' in ZK - you can either drop the znode (whichever you used for the --zk path) or launch with a different one.

Finally, if you have the option, try running the master without `service start`: SSH into the server and do something like:

/path/to/install/bin/mesos-master.sh --quorum=1 --work_dir=/tmp/mesos --zk=zk://ZK-IP:ZK-PORT/mesos/test --ip=71.100.14.9

and see whether that works. If none of the above helps, please let us know what you see and we'll keep debugging it :)

BTW - the new leading master is a bit of a logging decoy; it's not actually new per se - so I'm almost[0] sure the leader never changed.

[0] 'almost', as this line confuses me:

I0813 10:19:46.601297 2612 network.hpp:466] ZooKeeper group PIDs: { log-replica(1)@127.0.1.1:5050, log-replica(1)@71.100.14.9:5050 }

(but that's because of my lack of deep understanding of how the log-replicas work)

*Marco Massenzio*
*Distributed Systems Engineer*

On Thu, Aug 13, 2015 at 7:37 AM, Paul Bell arach...@gmail.com wrote: Hi All, I hope someone can shed some light on this because I'm getting desperate! I try to start the components zk, mesos-master, and marathon in that order.
They are started via a program that SSHs to the sole host and does service xxx start. Everyone starts happily enough. But the Mesos UI shows me:

*This master is not the leader, redirecting in 0 seconds ... go now*

The pattern seen in all of the mesos-master.INFO logs (one of which is shown below) is that the mesos-master with the correct IP@ starts. But then a new leader is detected and becomes the leading master. This new leader shows the UPID master@127.0.1.1:5050. I've tried clearing what ZK and mesos-master state I can find, but this problem will not go away. Would someone be so kind as to a) explain what is happening here and b) suggest remedies? Thanks very much. -Paul

Log file created at: 2015/08/13 10:19:43 Running on machine: 71.100.14.9 Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg I0813 10:19:43.225636 2542 logging.cpp:172] INFO level logging started! I0813 10:19:43.235213 2542 main.cpp:181] Build: 2015-05-05 06:15:50 by root I0813 10:19:43.235244 2542 main.cpp:183] Version: 0.22.1 I0813 10:19:43.235257 2542 main.cpp:186] Git tag: 0.22.1 I0813 10:19:43.235268 2542 main.cpp:190] Git SHA: d6309f92a7f9af3ab61a878403e3d9c284ea87e0 I0813 10:19:43.245098 2542 leveldb.cpp:176] Opened db in 9.386828ms I0813 10:19:43.247138 2542 leveldb.cpp:183] Compacted db in 1.956669ms I0813 10:19:43.247194 2542 leveldb.cpp:198] Created db iterator in 13961ns I0813 10:19:43.247206 2542 leveldb.cpp:204] Seeked to beginning of db in 677ns I0813 10:19:43.247215 2542 leveldb.cpp:273] Iterated through 0 keys in the db in 243ns I0813 10:19:43.247252 2542 replica.cpp:744] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0813 10:19:43.248755 2611 log.cpp:238] Attempting to join replica to ZooKeeper group I0813 10:19:43.248924 2542 main.cpp:306] Starting Mesos master I0813 10:19:43.249244 2612 recover.cpp:449] Starting replica recovery I0813 10:19:43.250239 2612 recover.cpp:475] Replica is in EMPTY status I0813
10:19:43.250819 2612 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request I0813 10:19:43.251014 2607 recover.cpp:195] Received a recover response from a replica in EMPTY status *I0813 10:19:43.249503 2542 master.cpp:349] Master 20150813-101943-151938119-5050-2542 (71.100.14.9) started on 71.100.14.9:5050 http://71.100.14.9:5050* I0813 10:19:43.252053 2610 recover.cpp:566] Updating replica status to STARTING I0813 10:19:43.252571 2542 master.cpp:397] Master allowing unauthenticated frameworks to register I0813 10:19:43.253159 2542 master.cpp:402] Master allowing unauthenticated slaves to register I0813 10:19:43.254276 2612 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 1.816161ms I0813 10:19:43.254323 2612 replica.cpp:323] Persisted replica status to STARTING I0813 10:19:43.254905 2612 recover.cpp:475] Replica is in STARTING status I0813 10:19:43.255203 2612 replica.cpp:641] Replica in STARTING status
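The hostname-resolution behaviour described above (look up the hostname; fall back to localhost on failure) can be sketched roughly in Python. This is an illustrative sketch of the idea, not the actual libprocess code:

```python
import socket

def resolve_bind_address(hostname):
    """Roughly what libprocess does when no --ip is given: resolve the
    hostname; if resolution fails, fall back to localhost."""
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:
        return "127.0.0.1"  # the localhost fallback described above

# On a stock Debian/Ubuntu box /etc/hosts maps the machine's hostname to
# 127.0.1.1, so resolution "succeeds" but yields a loopback address and
# the master binds to 127.0.1.1 -- exactly the symptom in Paul's logs.
```

The quick hack of adding a `hostname -> 71.100.14.9` line to /etc/hosts works because it changes what the first branch returns.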
Re: Can't start master properly (stale state issue?); help!
On Thu, Aug 13, 2015 at 11:53 AM, Paul Bell arach...@gmail.com wrote: Marco & haosdent, this is just a quick note to say thank you for your replies.

No problem, you're welcome.

I will answer you much more fully tomorrow, but for now can only manage a few quick observations & questions:

1. Having some months ago encountered a known problem with the IP@ 127.0.1.1 (I'll provide references tomorrow), I early on configured /etc/hosts, replacing "myHostName 127.0.1.1" with "myHostName Real_IP". That said, I can't rule out a race condition whereby ZK | mesos-master saw the original unchanged /etc/hosts before I zapped it.

2. What is a znode and how would I drop it?

So, the "znode" is the fancy name that ZK gives to the nodes in its tree (trivially, the path). Assuming that you give Mesos the following ZK URL: zk://10.10.0.5:2181/mesos/prod, the znode would be `/mesos/prod` and you could go inspect it (using zkCli.sh) by doing:

    ls /mesos/prod

You should see at least one (with the Master running) file: info_001 or json.info_0001 (depending on whether you're running 0.23 or 0.24), and you could then inspect its contents with:

    get /mesos/prod/info_001

For example, if I run a Mesos 0.23 on my localhost, against ZK on the same:

    $ ./bin/mesos-master.sh --zk=zk://localhost:2181/mesos/test --quorum=1 --work_dir=/tmp/m23-2 --port=5053

I can connect to ZK via zkCli.sh and:

    [zk: localhost:2181(CONNECTED) 4] ls /mesos/test
    [info_06, log_replicas]
    [zk: localhost:2181(CONNECTED) 6] get /mesos/test/info_06
    #20150813-120952-18983104-5053-14072 'master@192.168.33.1:5053 * 192.168.33.120.23.0
    cZxid = 0x314
    dataLength = 93
    // a bunch of other metadata
    numChildren = 0

(You can remove it with - you guessed it - `rmr /mesos/test` at the CLI prompt. Stop Mesos first, or it will be a very unhappy Master :)
In the corresponding logs I see (note the "new leader" here too, even though this was the one and only):

    I0813 12:09:52.126509 105455616 group.cpp:656] Trying to get '/mesos/test/info_06' in ZooKeeper
    W0813 12:09:52.127071 107065344 detector.cpp:444] Leading master master@192.168.33.1:5053 is using a Protobuf binary format when registering with ZooKeeper (info): this will be deprecated as of Mesos 0.24 (see MESOS-2340)
    I0813 12:09:52.127094 107065344 detector.cpp:481] A new leading master (UPID=master@192.168.33.1:5053) is detected
    I0813 12:09:52.127187 103845888 master.cpp:1481] The newly elected leader is master@192.168.33.1:5053 with id 20150813-120952-18983104-5053-14072
    I0813 12:09:52.127209 103845888 master.cpp:1494] Elected as the leading master!

At this point, I'm almost sure you're running up against some issue with the log-replica; but I'm the least competent guy here to help you on that one - hopefully someone else will be able to add insight here.

I start the services (zk, master, marathon; all on the same host) by SSHing into the host and doing `service start` commands. Again, thanks very much; and more tomorrow.

Cordially,
Paul

On Thu, Aug 13, 2015 at 1:08 PM, haosdent haosd...@gmail.com wrote: Hello, how do you start the master? And could you try `netstat -antp | grep 5050` to see whether there are multiple master processes running on the same machine?

On Thu, Aug 13, 2015 at 10:37 PM, Paul Bell arach...@gmail.com wrote: [...]
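Since 0.24 stores the master info in the znode as JSON (the json.info_0001 file mentioned above), inspecting it is scriptable. A hedged sketch: the field names below follow the MasterInfo message loosely and should be checked against your Mesos version; the sample znode contents are hypothetical:

```python
import json

# Hypothetical contents of a json.info_... znode; the real znode holds
# the serialized MasterInfo of the leading master.
sample_znode = json.dumps({
    "id": "20150813-120952-18983104-5053-14072",
    "hostname": "192.168.33.1",
    "port": 5053,
})

def leader_endpoint(raw):
    """Extract host:port of the leading master from the znode contents."""
    info = json.loads(raw)
    return "{0}:{1}".format(info["hostname"], info["port"])

# leader_endpoint(sample_znode) -> "192.168.33.1:5053"
```

With the 0.23-era binary (Protobuf) znodes you would have to deserialize MasterInfo instead, which is why the JSON format is the friendlier target for tooling.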
Re: Mesos slave help
Hi Stephen,

You can see all the launch flags here: http://mesos.apache.org/documentation/latest/configuration/ (or by just running .../mesos-slave.sh --help).

If you launch it via systemd (which is actually how we run it ourselves in DCOS) you will have to configure your nodes (masters/agents) via the MESOS_* environment variables.

In production, obviously, you want to use ZooKeeper as the discovery/coordination method (as you correctly did here): you can use whatever you like as the znode path there, but it must be the same for all masters/agents.

Make sure, if you run a test/dev configuration with multiple masters/agents on the same node, to (a) configure each master on its own port (--port) and (b) make each node point to a different work_dir (or you'll get confusing errors around log-replicas).

(@haosdent: I'm *almost* sure the packaging is correct, but it needs the env vars to be configured properly)

*Marco Massenzio*
*Distributed Systems Engineer*
http://codetrips.com

On Thu, Aug 6, 2015 at 4:12 AM, Stephen Knight skni...@pivotal.io wrote: Ok, that's working if I run it like this:

    /usr/sbin/mesos-slave --master=zk://172.31.x.x:2181/mesos > /dev/null 2>&1

Thanks for your help, really appreciate it.
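The MESOS_* environment variable convention mentioned above can be illustrated with a small sketch. Assumption: Mesos maps `MESOS_FOO=bar` to the `--foo=bar` flag; the real flag loading lives inside Mesos (stout), this just shows the idea:

```python
def flags_from_env(environ):
    """Turn MESOS_* environment variables into flag name/value pairs,
    e.g. MESOS_MASTER=zk://... becomes the master flag."""
    prefix = "MESOS_"
    return {
        key[len(prefix):].lower(): value
        for key, value in environ.items()
        if key.startswith(prefix)
    }
```

Under systemd you would set these variables via `Environment=` lines (or an `EnvironmentFile=`) in the unit, which is how the packaged init wrapper expects to be configured.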
On Thu, Aug 6, 2015 at 3:03 PM, haosdent haosd...@gmail.com wrote: Hm, you need to pass your master location, for example:

    /usr/sbin/mesos-slave --master=x.x.x.x:5050

If you use ZooKeeper, use the format:

    /usr/sbin/mesos-slave --master=zk://host1:port1,host2:port2,.../path

On Thu, Aug 6, 2015 at 6:55 PM, Stephen Knight skni...@pivotal.io wrote: My system doesn't support `cat` with systemctl for some reason, but here is the contents of /usr/lib/systemd/system/mesos-slave.service:

    [Unit]
    Description=Mesos Slave
    After=network.target
    Wants=network.target

    [Service]
    ExecStart=/usr/bin/mesos-init-wrapper slave
    KillMode=process
    Restart=always
    RestartSec=20
    LimitNOFILE=16384
    CPUAccounting=true
    MemoryAccounting=true

    [Install]
    WantedBy=multi-user.target

What are the required flags to start it manually?

On Thu, Aug 6, 2015 at 2:51 PM, haosdent haosd...@gmail.com wrote: Or you could try `systemctl cat mesos-slave.service` and show us the file content.

On Thu, Aug 6, 2015 at 6:49 PM, haosdent haosd...@gmail.com wrote: From this message, I think `systemctl status mesos-slave.service -l` shows that mesos-slave ran with incorrect flags, and the status output is the help message of the slave. Could you try to start mesos-slave manually, not through systemctl?
On Thu, Aug 6, 2015 at 6:41 PM, Stephen Knight skni...@pivotal.io wrote: systemctl gives me the following output on CentOS. The start command I ran was `systemctl start mesos-slave.service`:

    [root@ip-172-31-35-167 mesos]# systemctl status mesos-slave.service -l
    mesos-slave.service - Mesos Slave
       Loaded: loaded (/usr/lib/systemd/system/mesos-slave.service; enabled)
      Drop-In: /etc/systemd/system/mesos-slave.service.d
               └─mesos-slave-containerizers.conf
       Active: activating (auto-restart) (Result: exit-code) since Thu 2015-08-06 10:38:08 UTC; 2s ago
      Process: 1472 ExecStart=/usr/bin/mesos-init-wrapper slave (code=exited, status=1/FAILURE)
     Main PID: 1472 (code=exited, status=1/FAILURE)

    Aug 06 10:38:08 ip-172-31-35-167.ec2.internal mesos-slave[1483]: If strict=false, any expected errors (e.g., slave cannot recover
    Aug 06 10:38:08 ip-172-31-35-167.ec2.internal mesos-slave[1483]: information about an executor, because the slave died right before
    Aug 06 10:38:08 ip-172-31-35-167.ec2.internal mesos-slave[1483]: the executor registered.) during recovery are ignored and as much
    Aug 06 10:38:08 ip-172-31-35-167.ec2.internal mesos-slave[1483]: state as possible is recovered.
    Aug 06 10:38:08 ip-172-31-35-167.ec2.internal mesos-slave[1483]: (default: true)
    Aug 06 10:38:08 ip-172-31-35-167.ec2.internal mesos-slave[1483]: --[no-]switch_user Whether to run tasks as the user who
    Aug 06 10:38:08 ip-172-31-35-167.ec2.internal mesos-slave[1483]: submitted them rather than the user running
    Aug 06 10:38:08 ip-172-31-35-167.ec2.internal mesos-slave[1483]: the slave (requires setuid permission) (default: true)
    Aug 06 10:38:08 ip-172-31-35-167.ec2.internal mesos-slave[1483]: --[no-]version Show version and exit. (default: false)
    Aug 06 10:38:08 ip-172-31-35-167.ec2.internal mesos-slave[1483]: --work_dir=VALUE Directory path to place framework work directories

I've also run strace against it, nothing sticks out:

    strace systemctl start mesos-slave.service
    execve("/bin/systemctl", ["systemctl", "start", "mesos-slave.service"], [/* 18 vars */]) = 0
    brk(0) = 0x7f5c2af9f000
    mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f5c2a5c6000
    access
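The two `--master` formats haosdent describes (`host:port`, or `zk://host1:port1,host2:port2,.../path`) can be sanity-checked with a quick sketch. The regexes are illustrative and stricter than Mesos' own parser (which, for example, also accepts credentials in the zk URL):

```python
import re

# zk://host:port[,host:port...]/path -- credentials omitted for simplicity.
ZK_RE = re.compile(r"^zk://[^/,:]+:\d+(,[^/,:]+:\d+)*/.+$")
HOSTPORT_RE = re.compile(r"^[^/:]+:\d+$")

def looks_like_master_flag(value):
    """Sanity-check a --master value against the two formats above."""
    return bool(ZK_RE.match(value) or HOSTPORT_RE.match(value))
```

A check like this at deploy time would have turned the cryptic "help text in the journal" failure above into an immediate, readable error.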
Re: Metering for Mesos
Hi Sam,

Mesos (both Masters and Agents) publishes a wealth of metrics that can be used for metering, diagnostics, fault discovery/prediction and, I presume, accounting and billing too (that very much depends on what pricing model you guys use). As an example, you may want to take a look at https://github.com/nqn/nibbler

Hope this helps.

*Marco Massenzio*
*Distributed Systems Engineer*

On Thu, Aug 6, 2015 at 9:48 PM, Sam Chen usultra...@gmail.com wrote: Haosdent, let me bring one example to the table. We are using Mesos and Marathon, and have deployed a two-tier application (the web tier is Tomcat, the database layer is MySQL). We are struggling with how to charge for this service, so we are wondering whether Mesos or Marathon can provide a metering service for us to reference. Hope it's clear :) Sam

On Fri, Aug 7, 2015 at 12:33 PM, haosdent haosd...@gmail.com wrote: You mean metering by resource? You could get every task's resource usage by sending an HTTP request to state.json.

On Fri, Aug 7, 2015 at 12:23 PM, Sam Chen usultra...@gmail.com wrote: Guys, we are planning to use Mesos as a production platform based on OpenStack. My question is: is there any solution for metering, and then billing? We want to have our platform online with a pay-as-you-go model. Anyone have any suggestions? Very appreciated. Sam

-- Best Regards, Haosdent Huang
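haosdent's state.json suggestion can be turned into a first-cut metering script. A sketch under assumptions: the field names below (`frameworks`, `tasks`, `resources.cpus`/`mem`) mirror the state.json shape loosely and should be verified against your Mesos version before relying on the numbers:

```python
def total_task_resources(state):
    """Sum cpus/mem across all tasks in a /state.json-style document."""
    totals = {"cpus": 0.0, "mem": 0.0}
    for framework in state.get("frameworks", []):
        for task in framework.get("tasks", []):
            resources = task.get("resources", {})
            totals["cpus"] += resources.get("cpus", 0.0)
            totals["mem"] += resources.get("mem", 0.0)
    return totals

# In practice you would poll the endpoint periodically, e.g.
#   state = json.load(urllib.request.urlopen("http://master:5050/state.json"))
# and feed per-interval totals into the billing system.
```

For pay-as-you-go billing you would record these totals per framework (or per task label) on a schedule, since Mesos itself only exposes the instantaneous state, not usage history.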
Re: Get List of Active Slaves
Now that Mesos (0.24, to be released soon) publishes the Master info to ZooKeeper in JSON, it should be (relatively) easier to get the info about the leading master directly from there (or even set a Watcher on the znode to be alerted of leadership changes). Not as easy as hitting an HTTP endpoint, granted, but that's just a hard problem to solve anyway. I'm planning to provide sample code and a blog entry about this as soon as I have time, but it won't be before this weekend at the earliest (and more likely the next one).

*Marco Massenzio*
*Distributed Systems Engineer*

On Tue, Aug 4, 2015 at 5:04 PM, Steven Schlansker sschlans...@opentable.com wrote: Unfortunately that sort of solution is also prone to races. I do not think this is really possible (at least not even remotely elegantly) to solve externally to Mesos itself.

On Aug 4, 2015, at 4:49 PM, James DeFelice james.defel...@gmail.com wrote: If you're using mesos-dns I think you can query slave.mesos to get an A record for each. I believe it responds to SRV requests too.

On Aug 4, 2015 7:29 PM, Steven Schlansker sschlans...@opentable.com wrote: Unfortunately this is racey. If you redirect to a master just as it is removed from leadership, you can still get bogus data, with no indication anything went wrong. Some people are reporting that this breaks tools that generate HTTP proxy configurations. I filed this issue a while ago as https://issues.apache.org/jira/browse/MESOS-1865

On Aug 4, 2015, at 3:49 PM, Vinod Kone vinodk...@gmail.com wrote: Not today, no. But you could hit the /redirect endpoint on any master, which should redirect you to the leading master.

On Tue, Aug 4, 2015 at 3:29 PM, Nastooh Avessta (navesta) nave...@cisco.com wrote: I see. Nope, and pointing to the leading master shows the proper result :) Thanks. Is there a REST equivalent to mesos-resolve, so that one can ascertain who is the leader without having to point to the leader?
Cheers,

Nastooh Avessta
ENGINEER.SOFTWARE ENGINEERING
nave...@cisco.com
Phone: +1 604 647 1527
Cisco Systems Limited
595 Burrard Street, Suite 2123 Three Bentall Centre, PO Box 49121
VANCOUVER BRITISH COLUMBIA V7X 1J1 CA
Cisco.com

Think before you print. This email may contain confidential and privileged material for the sole use of the intended recipient. Any review, use, distribution or disclosure by others is strictly prohibited. If you are not the intended recipient (or authorized to receive for the recipient), please contact the sender by reply email and delete all copies of this message. For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html Cisco Systems Canada Co, 181 Bay St., Suite 3400, Toronto, ON, Canada, M5J 2T3. Phone: 416-306-7000; Fax: 416-306-7099.

From: Vinod Kone [mailto:vinodk...@gmail.com]
Sent: Tuesday, August 04, 2015 3:19 PM
To: user@mesos.apache.org
Subject: Re: Get List of Active Slaves

Is that the leading master?

On Tue, Aug 4, 2015 at 3:09 PM, Nastooh Avessta (navesta) nave...@cisco.com wrote: Hi, trying to get the list of active slaves via the CLI, e.g.

    curl http://10.4.50.80:5050/master/slaves | python -m json.tool

and I am not getting the expected results. The returned value is empty: { "slaves": [] }, whereas looking at the web GUI I can see that there are deployed slaves. Am I missing something?

Cheers,

Nastooh Avessta
ENGINEER.SOFTWARE ENGINEERING
Cisco Systems Limited
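Vinod's /redirect suggestion can be sketched as follows. Assumption: a non-leading master answers /redirect with a 307 whose Location header names the leader, often in scheme-relative `//host:port` form; and, as Steven points out above, the answer can be stale by the time you use it:

```python
from urllib.parse import urlparse

def leader_from_redirect(status, location, queried_master):
    """Decide which master to talk to from a /redirect response."""
    if status == 307 and location:
        # Location is typically scheme-relative, e.g. //10.4.50.80:5050
        parsed = urlparse(location)
        return parsed.netloc or location.lstrip("/")
    # No redirect: the master we asked believes it is the leader.
    return queried_master
```

Querying /master/slaves on the address this returns (instead of an arbitrary master) would have avoided the empty `{"slaves": []}` answer Nastooh saw from a non-leading master, modulo the leadership-change race.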
Re: How to measure the ZooKeeper Resilience on mesos cluster
Distributed systems are hard - but most importantly, they all differ in various ways.

"I feel the zookeeper is almost unstable for a cluster."

This is too general and vague a statement to be either true or false (or to provide any guidance): it all depends on how you deploy your ensemble, what hardware it runs on, what virtualization layer you use, how you manage failovers and recovery. But, way more importantly, it all depends on *your* requirements: a configuration that works perfectly fine for a few hundred nodes, distributed across 2-3 DCs in a geographically contained region (e.g., North America), would be woefully inadequate for a system running across 6 global DCs, covering several thousand nodes, with tight latency requirements.

Outside of Google (where we would use our own stuff - Borg, Chubby & friends) I've never really had any trouble with ZK - then again, maybe the stuff I worked on was nowhere near as complex as what you're trying to achieve.

My suggestion would be to try it out in a staging environment, conduct some performance and stress tests, and find out whether the performance, stability and availability of the ZK ensemble (and, consequently, of the Mesos cluster) meet your requirements.

Hope this helps.

*Marco Massenzio*
*Distributed Systems Engineer*

On Sun, Aug 2, 2015 at 10:15 AM, tommy xiao xia...@gmail.com wrote: Today I was reading "ZooKeeper Resilience at Pinterest" (https://engineering.pinterest.com/blog/zookeeper-resilience-pinterest?route=/post/%3Aid/%3Asummary), and I feel that ZooKeeper is almost unstable for a cluster. Does anyone have some experience with ZooKeeper usage? -- Deshi Xiao Twitter: xds2000 E-mail: xiaods(AT)gmail.com
Re: [RESULT] [VOTE] Release Apache Mesos 0.23.0 (rc4)
Great news, indeed! Thanks, Adam, for all the hard work in driving this release to fruition - you're a star!

*Marco Massenzio*
*Distributed Systems Engineer*

On Wed, Jul 22, 2015 at 9:29 PM, Adam Bordelon a...@mesosphere.io wrote: Good news, everyone! The vote for Mesos 0.23.0 (rc4) has passed with the following votes.

+1 (Binding)
*** Vinod Kone
*** Adam B
*** Benjamin Hindman
*** Timothy Chen

+1 (Non-binding)
*** Vaibhav Khanduja
*** Marco Massenzio

There were no 0 or -1 votes.

Known issue: `sudo make check` may fail on some OSes. These tests have been fixed in 0.24.0 without any changes to the rest of the code.

Please find the release at: https://dist.apache.org/repos/dist/release/mesos/0.23.0
It is recommended to use a mirror to download the release: http://www.apache.org/dyn/closer.cgi
The CHANGELOG for the release is available at: https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.23.0
The mesos-0.23.0.jar has been released to: https://repository.apache.org
The website (http://mesos.apache.org) will be updated shortly to reflect this release.

Thanks, -Adam-

On Wed, Jul 22, 2015 at 1:20 PM, Timothy Chen tnac...@gmail.com wrote: +1. The Docker bridge network test failed because of some iptables rules that were set on the environment. I will comment on the JIRA, but it's not a blocker. Tim

On Jul 22, 2015, at 1:07 PM, Benjamin Hindman benjamin.hind...@gmail.com wrote: +1 (binding). On Ubuntu 14.04: `make check` ... all tests pass ... `sudo make check` ... tests with known issues fail, but ignoring because these have all been resolved and are issues with the tests alone. Thanks Adam.

On Fri, Jul 17, 2015 at 4:42 PM Adam Bordelon a...@mesosphere.io wrote: Hello Mesos community, please vote on releasing the following candidate as Apache Mesos 0.23.0. 0.23.0 includes the following:
- Per-container network isolation
- Dockerized slaves will properly recover Docker containers upon failover.
- Upgraded minimum required compilers to GCC 4.8+ or clang 3.5+. as well as experimental support for: - Fetcher Caching - Revocable Resources - SSL encryption - Persistent Volumes - Dynamic Reservations The CHANGELOG for the release is available at: https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.23.0-rc4 The candidate for Mesos 0.23.0 release is available at: https://dist.apache.org/repos/dist/dev/mesos/0.23.0-rc4/mesos-0.23.0.tar.gz The tag to be voted on is 0.23.0-rc4: https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.23.0-rc4 The MD5 checksum of the tarball can be found at: https://dist.apache.org/repos/dist/dev/mesos/0.23.0-rc4/mesos-0.23.0.tar.gz.md5 The signature of the tarball can be found at: https://dist.apache.org/repos/dist/dev/mesos/0.23.0-rc4/mesos-0.23.0.tar.gz.asc The PGP key used to sign the release is here: https://dist.apache.org/repos/dist/release/mesos/KEYS The JAR is up in Maven in a staging repository here: https://repository.apache.org/content/repositories/orgapachemesos-1062 Please vote on releasing this package as Apache Mesos 0.23.0! The vote is open until Wed July 22nd, 17:00 PDT 2015 and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Mesos 0.23.0 (I've tested it!) [ ] -1 Do not release this package because ... Thanks, -Adam-
Re: [VOTE] Release Apache Mesos 0.23.0 (rc4)
+1

Ran all tests on Ubuntu 14.04 (physical box, not a VM). All tests pass (as a regular user).

`sudo make distcheck` still fails with the following errors; I am assuming these are known issues and not deemed to be blockers?

    [ FAILED ] 9 tests, listed below:
    [ FAILED ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample
    [ FAILED ] UserCgroupIsolatorTest/2.ROOT_CGROUPS_UserCgroup, where TypeParam = mesos::internal::slave::CgroupsPerfEventIsolatorProcess
    [ FAILED ] MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PerfRollForward
    [ FAILED ] CgroupsAnyHierarchyWithPerfEventTest.ROOT_CGROUPS_Perf
    [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics
    [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery
    [ FAILED ] NsTest.ROOT_setns
    [ FAILED ] PerfTest.ROOT_Events
    [ FAILED ] PerfTest.ROOT_SamplePid

I tried to check out 0.22.1 and run the same tests, but it has several failures and it complains about already-existing cgroups hierarchies; so I'm assuming the earlier test run left the system in an unclean state.

*Marco Massenzio*
*Distributed Systems Engineer*

On Tue, Jul 21, 2015 at 3:12 PM, Adam Bordelon a...@mesosphere.io wrote: +1 (binding) to Mesos 0.23.0-rc4 as 0.23.0. As I mentioned before for rc3, basic integration tests passed for Mesos 0.23.0 on CoreOS with DCOS GUI/CLI, Marathon, Chronos, Spark, HDFS, Cassandra, and Kafka. We have been tracking the Ubuntu `sudo make check` failures in https://issues.apache.org/jira/browse/MESOS-3079 and related CentOS ROOT_ test failures in https://issues.apache.org/jira/browse/MESOS-3050 (some fixes already pulled into rc4). After pulling down the latest master, including a series of test-code-only fixes for MESOS-3079, `sudo make check` passed for me on Ubuntu 14.04, excluding only ROOT_DOCKER_Launch_Executor_Bridged (segfault tracked in MESOS-3123). There are at least two remaining test-only fixes tracked in MESOS-3079, but none of these are critical for Mesos 0.23.0, so I'm not inclined to call for a rc5.
We can call out the ROOT_ test failures as a known issue with the release. Anybody else have any test results? Please vote, -Adam- On Fri, Jul 17, 2015 at 8:18 PM, Marco Massenzio ma...@mesosphere.io wrote: I am almost sure (more like hoping) I'm missing something fundamental here and/or there is some basic configuration missing on my box. Running tests as root, causes a significant number of failures. Has anyone else *ever* run tests as root in the last few weeks? Here's the headline, the full log of the failed tests attached. $ lsb_release -a LSB Version: core-2.0-amd64:core-2.0-noarch:core-3.0-amd64:core-3.0-noarch:core-3.1-amd64:core-3.1-noarch:core-3.2-amd64:core-3.2-noarch:core-4.0-amd64:core-4.0-noarch:core-4.1-amd64:core-4.1-noarch:cxx-3.0-amd64:cxx-3.0-noarch:cxx-3.1-amd64:cxx-3.1-noarch:cxx-3.2-amd64:cxx-3.2-noarch:cxx-4.0-amd64:cxx-4.0-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-3.1-amd64:desktop-3.1-noarch:desktop-3.2-amd64:desktop-3.2-noarch:desktop-4.0-amd64:desktop-4.0-noarch:desktop-4.1-amd64:desktop-4.1-noarch:graphics-2.0-amd64:graphics-2.0-noarch:graphics-3.0-amd64:graphics-3.0-noarch:graphics-3.1-amd64:graphics-3.1-noarch:graphics-3.2-amd64:graphics-3.2-noarch:graphics-4.0-amd64:graphics-4.0-noarch:graphics-4.1-amd64:graphics-4.1-noarch:languages-3.2-amd64:languages-3.2-noarch:languages-4.0-amd64:languages-4.0-noarch:languages-4.1-amd64:languages-4.1-noarch:multimedia-3.2-amd64:multimedia-3.2-noarch:multimedia-4.0-amd64:multimedia-4.0-noarch:multimedia-4.1-amd64:multimedia-4.1-noarch:printing-3.2-amd64:printing-3.2-noarch:printing-4.0-amd64:printing-4.0-noarch:printing-4.1-amd64:printing-4.1-noarch:qt4-3.1-amd64:qt4-3.1-noarch:security-4.0-amd64:security-4.0-noarch:security-4.1-amd64:security-4.1-noarch Distributor ID: Ubuntu Description:*Ubuntu 14.04.2 LTS* Release:14.04 Codename: *trusty* $ sudo make -j12 V=0 check [==] 712 tests from 116 test cases ran. (318672 ms total) [ PASSED ] 676 tests. 
[ FAILED ] 36 tests, listed below: [ FAILED ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample [ FAILED ] UserCgroupIsolatorTest/2.ROOT_CGROUPS_UserCgroup, where TypeParam = mesos::internal::slave::CgroupsPerfEventIsolatorProcess [ FAILED ] SlaveRecoveryTest/0.RecoverSlaveState, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryTest/0.RecoverStatusUpdateManager, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryTest/0.ReconnectExecutor, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryTest/0.RecoverUnregisteredExecutor, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryTest/0.RecoverTerminatedExecutor, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryTest/0.RecoverCompletedExecutor, where TypeParam = mesos::internal::slave::MesosContainerizer
Re: Cluster of Workstations type design for a Mesos cluster
You're not crazy :) This will work just fine; the Master takes up very little CPU/RAM, and, as you plan to have it on your desktop, you could even wrap it with some send-notify script so that, should it fail or something, you could get an alert.

I'm not sure why you want to segment out the Agent nodes and isolate them from (outbound) web connectivity; the one thing to bear in mind is that you won't be able to install packages directly (apt-get), and anything you want to run on them you will instead need to fiddle around with binary installers and the like - then again, you may just install the smallest-footprint OS (CoreOS springs to mind) and maximize resources for tasks.

Keep us posted on how you progress, I may eventually go down the same path :)

*Marco Massenzio*
*Distributed Systems Engineer*

On Tue, Jul 21, 2015 at 6:44 AM, Gaston, Dan dan.gas...@nshealth.ca wrote: Is there likely to be any issue with the Master? Given it would be an active desktop, it would be running all of the typical Mesos master stuff, plus, say, an active Ubuntu desktop environment. It would also need to host things like a local Docker registry and the like as well, since the compute nodes wouldn't have direct access to the wider internet.

*From:* jeffschr...@gmail.com [mailto:jeffschr...@gmail.com] *On Behalf Of *Jeff Schroeder *Sent:* Tuesday, July 21, 2015 10:42 AM *To:* user@mesos.apache.org *Subject:* Re: Cluster of Workstations type design for a Mesos cluster

As far as Mesos is concerned, compute is a commodity. This should work just fine. Put Aurora or Marathon on top of Mesos if you need a general-purpose scheduler and you're good to go. The nice thing is that you can add additional slaves as you need. I believe heterogeneous clusters are best if possible, but absolutely not a requirement of any sort.
On Tuesday, July 21, 2015, Gaston, Dan dan.gas...@nshealth.ca wrote: Let's say I had 2 high-performance workstations kicking around (dual 6-core 2.4GHz Xeon processors; 128 GB RAM each; etc.) and a smaller workstation (single 4-core 3.5GHz Xeon and 16 GB RAM) available, and I wanted to cluster them together with Mesos. What is the best way of doing this?

My thought was that the smaller workstation would be at my desk (the other two would be in the same office) because it would be used for development work and some general tasks, but it would also be the master node of the Mesos cluster (note that HA isn't a requirement here). This workstation would have two NICs, one connected to our institutional network and the other making up the private network between the clusters. Is this even doable? Normally you would have some sort of client submitting to the Master, but in this case the Master node would be serving up multiple roles. The other workstations would probably not have access to the institutional network, so all software updates and the like would have to be piped through the master workstation. There would also be a relatively large NAS device connected to this network as well.

Thoughts and suggestions welcome, even if it is to tell me I'm crazy. I'm building a small-scale compute "cluster" that is fairly limited by budget (and the needs aren't high either) and it may not be able to be located in a datacenter, hence the cluster-of-workstations type setup.

Dan Gaston, PhD
Clinical Laboratory Bioinformatician
Department of Pathology and Laboratory Medicine
Division of Hematopathology
Rm 511, 5788 University Ave. Halifax, NS B3H 1V8

--
Text by Jeff, typos by iPhone
Re: [VOTE] Release Apache Mesos 0.23.0 (rc4)
Ubuntu 14.04. Not sure if I'm doing something wrong; `sudo make distcheck` fails - re-running after a `make clean`. If it continues failing, I'll provide more detailed log output. In the meantime, if anyone has any suggestions as to what I may be doing wrong, please let me know.

    $ ../configure
    $ make -j8 V=0
    $ make -j12 V=0 check
    [==] 649 tests from 94 test cases ran. (254152 ms total)
    [ PASSED ] 649 tests.

    $ sudo make -j12 V=0 distcheck
    [==] 712 tests from 116 test cases ran. (325751 ms total)
    [ PASSED ] 702 tests.
    [ FAILED ] 10 tests, listed below:
    [ FAILED ] LimitedCpuIsolatorTest.ROOT_CGROUPS_Pids_and_Tids
    [ FAILED ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample
    [ FAILED ] UserCgroupIsolatorTest/2.ROOT_CGROUPS_UserCgroup, where TypeParam = mesos::internal::slave::CgroupsPerfEventIsolatorProcess
    [ FAILED ] MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PerfRollForward
    [ FAILED ] CgroupsAnyHierarchyWithPerfEventTest.ROOT_CGROUPS_Perf
    [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics
    [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery
    [ FAILED ] NsTest.ROOT_setns
    [ FAILED ] PerfTest.ROOT_Events
    [ FAILED ] PerfTest.ROOT_SamplePid
    10 FAILED TESTS
    YOU HAVE 12 DISABLED TESTS

*Marco Massenzio*
*Distributed Systems Engineer*

On Fri, Jul 17, 2015 at 6:49 PM, Vinod Kone vinodk...@gmail.com wrote: +1 (binding). Successfully built RPMs for CentOS5 and CentOS6 with the network isolator.

On Fri, Jul 17, 2015 at 4:56 PM, Khanduja, Vaibhav vaibhav.khand...@emc.com wrote: +1. Sent from my iPhone. Please excuse the typos and brevity of this message.

On Jul 17, 2015, at 4:43 PM, Adam Bordelon a...@mesosphere.io wrote: [...]
Re: [VOTE] Release Apache Mesos 0.23.0 (rc4)
] MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceForward [ FAILED ] MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceBackward [ FAILED ] CgroupsAnyHierarchyWithPerfEventTest.ROOT_CGROUPS_Perf [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery [ FAILED ] NsTest.ROOT_setns [ FAILED ] PerfTest.ROOT_Events [ FAILED ] PerfTest.ROOT_SamplePid 36 FAILED TESTS YOU HAVE 12 DISABLED TESTS *Marco Massenzio* *Distributed Systems Engineer* On Fri, Jul 17, 2015 at 7:26 PM, Marco Massenzio ma...@mesosphere.io wrote: Ubuntu 14.04 Not sure if I'm doing something wrong, `sudo make distcheck` fails - re-running after a `make clean` If it continues failing, I'll provide more detailed log output. In the meantime, if anyone has any suggestions as to what I may be doing wrong, please let me know. $ ../configure make -j8 V=0 make -j12 V=0 check [==] 649 tests from 94 test cases ran. (254152 ms total) [ PASSED ] 649 tests. $ sudo make -j12 V=0 distcheck [==] 712 tests from 116 test cases ran. (325751 ms total) [ PASSED ] 702 tests. 
[ FAILED ] 10 tests, listed below: [ FAILED ] LimitedCpuIsolatorTest.ROOT_CGROUPS_Pids_and_Tids [ FAILED ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample [ FAILED ] UserCgroupIsolatorTest/2.ROOT_CGROUPS_UserCgroup, where TypeParam = mesos::internal::slave::CgroupsPerfEventIsolatorProcess [ FAILED ] MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PerfRollForward [ FAILED ] CgroupsAnyHierarchyWithPerfEventTest.ROOT_CGROUPS_Perf [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery [ FAILED ] NsTest.ROOT_setns [ FAILED ] PerfTest.ROOT_Events [ FAILED ] PerfTest.ROOT_SamplePid 10 FAILED TESTS YOU HAVE 12 DISABLED TESTS *Marco Massenzio* *Distributed Systems Engineer* On Fri, Jul 17, 2015 at 6:49 PM, Vinod Kone vinodk...@gmail.com wrote: +1 (binding) Successfully built RPMs for CentOS5 and CentOS6 with network isolator. On Fri, Jul 17, 2015 at 4:56 PM, Khanduja, Vaibhav vaibhav.khand...@emc.com wrote: +1 Sent from my iPhone. Please excuse the typos and brevity of this message. On Jul 17, 2015, at 4:43 PM, Adam Bordelon a...@mesosphere.io wrote: Hello Mesos community, Please vote on releasing the following candidate as Apache Mesos 0.23.0. 0.23.0 includes the following: - Per-container network isolation - Dockerized slaves will properly recover Docker containers upon failover. - Upgraded minimum required compilers to GCC 4.8+ or clang 3.5+. 
as well as experimental support for: - Fetcher Caching - Revocable Resources - SSL encryption - Persistent Volumes - Dynamic Reservations The CHANGELOG for the release is available at: https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.23.0-rc4 The candidate for Mesos 0.23.0 release is available at: https://dist.apache.org/repos/dist/dev/mesos/0.23.0-rc4/mesos-0.23.0.tar.gz The tag to be voted on is 0.23.0-rc4: https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.23.0-rc4 The MD5 checksum of the tarball can be found at: https://dist.apache.org/repos/dist/dev/mesos/0.23.0-rc4/mesos-0.23.0.tar.gz.md5 The signature of the tarball can be found at: https://dist.apache.org/repos/dist/dev/mesos/0.23.0-rc4/mesos-0.23.0.tar.gz.asc The PGP key used to sign the release is here: https://dist.apache.org/repos/dist/release/mesos/KEYS The JAR is up in Maven in a staging repository here: https://repository.apache.org/content/repositories/orgapachemesos-1062 Please vote on releasing this package as Apache Mesos 0.23.0! The vote is open until Wed July 22nd, 17:00 PDT 2015 and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Mesos 0.23.0 (I've tested it!) [ ] -1 Do not release this package because ... Thanks, -Adam- $ sudo make -j12 V=0 check [==] 712 tests from 116 test cases ran. (318672 ms total) [ PASSED ] 676 tests. 
[ FAILED ] 36 tests, listed below: [ FAILED ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample [ FAILED ] UserCgroupIsolatorTest/2.ROOT_CGROUPS_UserCgroup, where TypeParam = mesos::internal::slave::CgroupsPerfEventIsolatorProcess [ FAILED ] SlaveRecoveryTest/0.RecoverSlaveState, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryTest/0.RecoverStatusUpdateManager, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryTest/0.ReconnectExecutor, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryTest/0
Re: [VOTE] Release Apache Mesos 0.23.0 (rc3)
Just to add my +1: built and ran `make check` on Ubuntu 14.04, with/without SSL / libevent (no 'sudo' - can test all 4 variants this evening on rc4) — Sent from Mailbox On Thu, Jul 16, 2015 at 3:10 PM, Timothy Chen tnac...@gmail.com wrote: As Adam mentioned, I also think this is not a blocker, as it only affects the way we test the cgroup on CentOS 7.x due to a CentOS bug and doesn't actually impact Mesos normal operations. My vote is +1 as well. Tim On Thu, Jul 16, 2015 at 12:10 PM, Vinod Kone vinodk...@gmail.com wrote: Found a bug in HTTP API related code: MESOS-3055 https://issues.apache.org/jira/browse/MESOS-3055 If we don't fix this in 0.23.0, we cannot expect the 0.24.0 scheduler driver (that will send Calls) to properly subscribe with a 0.23.0 master. I could add a workaround in the driver to only send Calls if the master version is 0.24.0, but would prefer to not have to do that. Also, on the review https://reviews.apache.org/r/36518/ for that bug, we realized that we might want to make Subscribe.force 'optional' instead of 'required'. That's an API change, which would be nice to go into 0.23.0 as well. So, not a -1 per se, but if you are willing to cut another RC, I can land the fixes today. Sorry for the trouble. On Thu, Jul 16, 2015 at 11:48 AM, Adam Bordelon a...@mesosphere.io wrote: +1 (binding) This vote has been silent for almost a week. I assume everybody's busy testing. My testing results: basic integration tests passed for Mesos 0.23.0 on CoreOS with DCOS GUI/CLI, Marathon, Chronos, Spark, HDFS, Cassandra, and Kafka. `make check` passes on Ubuntu and CentOS, but `sudo make check` fails on CentOS 7.1 due to errors in CentOS. See https://issues.apache.org/jira/browse/MESOS-3050 for more details. I'm not convinced this is serious enough to do another release candidate and voting round, but I'll let Tim and others chime in with their thoughts. If we don't get enough deciding votes by 6pm Pacific today, I'll extend the vote for another day. 
On Thu, Jul 9, 2015 at 6:09 PM, Khanduja, Vaibhav vaibhav.khand...@emc.com wrote: +1 Sent from my iPhone. Please excuse the typos and brevity of this message. On Jul 9, 2015, at 6:07 PM, Adam Bordelon a...@mesosphere.io wrote: Hello Mesos community, Please vote on releasing the following candidate as Apache Mesos 0.23.0. 0.23.0 includes the following: - Per-container network isolation - Dockerized slaves will properly recover Docker containers upon failover. - Upgraded minimum required compilers to GCC 4.8+ or clang 3.5+. as well as experimental support for: - Fetcher Caching - Revocable Resources - SSL encryption - Persistent Volumes - Dynamic Reservations The CHANGELOG for the release is available at: https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.23.0-rc3 The candidate for Mesos 0.23.0 release is available at: https://dist.apache.org/repos/dist/dev/mesos/0.23.0-rc3/mesos-0.23.0.tar.gz The tag to be voted on is 0.23.0-rc3: https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.23.0-rc3 The MD5 checksum of the tarball can be found at: https://dist.apache.org/repos/dist/dev/mesos/0.23.0-rc3/mesos-0.23.0.tar.gz.md5 The signature of the tarball can be found at: https://dist.apache.org/repos/dist/dev/mesos/0.23.0-rc3/mesos-0.23.0.tar.gz.asc The PGP key used to sign the release is here: https://dist.apache.org/repos/dist/release/mesos/KEYS The JAR is up in Maven in a staging repository here: https://repository.apache.org/content/repositories/orgapachemesos-1060 Please vote on releasing this package as Apache Mesos 0.23.0! The vote is open until Thurs July 16th, 18:00 PDT 2015 and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Mesos 0.23.0 [ ] -1 Do not release this package because ... Thanks, -Adam-
Re: [VOTE] Release Apache Mesos 0.23.0 (rc3)
Adam - thanks. Please let me know soon as you push an rc4, if I'm still home, I can test it against Ubuntu 14.04 with/without SSL, with/without sudo (or I can always VPN in :) Very minor doc update: https://reviews.apache.org/r/36532/ (feel free to ignore). Thanks, everyone! *Marco Massenzio* *Distributed Systems Engineer* On Thu, Jul 16, 2015 at 8:05 PM, Adam Bordelon a...@mesosphere.io wrote: Thanks, Vinod. I've got those commits in the list already. We'll pull in fixes for MESOS-3055 and others for rc4. I'll give it another night for Bernd to commit the fetcher fix and for Niklas to update the oversubscription doc. Then I'll cut rc4 tomorrow and leave the new vote open until next Wednesday. See the dashboard for status on remaining issues: https://issues.apache.org/jira/secure/Dashboard.jspa?selectPageId=12326227 Jeff, see my cherry-pick spreadsheet to see what we're planning to pull into rc4: https://docs.google.com/spreadsheets/d/14yUtwfU0mGQ7x7UcjfzZg2o1TuRMkn5SvJvetARM7JQ/edit#gid=0 If anybody has any other high priority fixes or doc updates that they want in rc4, let me know asap. On Thu, Jul 16, 2015 at 7:58 PM, Jeff Schroeder jeffschroe...@computer.org wrote: What about MESOS-3055 in 0.23? Is that going to get passed up on even if we are going to cut another rc? On Thursday, July 16, 2015, Vinod Kone vinodk...@gmail.com wrote: -1 so that we can cherry pick MESOS-3055. The master crash bug is MESOS-3070 https://issues.apache.org/jira/browse/MESOS-3070 but the fix is non-trivial and the bug has been in the code base prior to 23.0. So I won't make it a blocker. Can't update the spreadsheet. So here are the commits I would like cherry-picked. fc85cc512b7767fc2e3921b15cf6602c0c68593e bfe6c07b79550bb3d1f2ab6f5344d740e6eb6f60 Thanks Adam. On Thu, Jul 16, 2015 at 7:39 PM, Adam Bordelon a...@mesosphere.io wrote: The 7 day voting period has ended with only 2 binding +1s (we needed 3) and no explicit -1s. 
However, Vinod says they've found a bug that crashes master when a framework uses duplicate task ids. Vinod, can you please share the new JIRA and officially vote -1 for rc3 if you want to call for an rc4? Assuming we'll cut an rc4, I'm tracking the JIRAs/patches to pull in here: https://docs.google.com/spreadsheets/d/14yUtwfU0mGQ7x7UcjfzZg2o1TuRMkn5SvJvetARM7JQ/edit#gid=0 Since the rc4 changes are minor (mostly tests) and we've heavily tested rc3, the next vote will only last for 3 (business) days. On Thu, Jul 16, 2015 at 6:38 PM, Marco Massenzio ma...@mesosphere.io wrote: Just to add my +1 Built Make check on Ubuntu 14.04 With Without SSL / libevent (no 'sudo' - can test all 4 variants this evening on rc4) — Sent from Mailbox https://www.dropbox.com/mailbox On Thu, Jul 16, 2015 at 3:10 PM, Timothy Chen tnac...@gmail.com wrote: As Adam mention I also think this is not a blocker, as it only affects the way we test the cgroup on CentOS 7.x due to a CentOS bug and doesn't actually impact Mesos normal operations. My vote is +1 as well. Tim On Thu, Jul 16, 2015 at 12:10 PM, Vinod Kone vinodk...@gmail.com wrote: Found a bug in HTTP API related code: MESOS-3055 https://issues.apache.org/jira/browse/MESOS-3055 If we don't fix this in 0.23.0, we cannot expect the 0.24.0 scheduler driver (that will send Calls) to properly subscribe with a 0.23.0 master. I could add a work around in the driver to only send Calls if the master version is 0.24.0, but would prefer to not have to do that. Also, on the review https://reviews.apache.org/r/36518/ for that bug, we realized that we might want to make Subscribe.force 'optional' instead of 'required'. That's an API change, which would be nice to go into 0.23.0 as well. So, not a -1 per se, but if you are willing to cut another RC, I can land the fixes today. Sorry for the trouble. On Thu, Jul 16, 2015 at 11:48 AM, Adam Bordelon a...@mesosphere.io wrote: +1 (binding) This vote has been silent for almost a week. 
I assume everybody's busy testing. My testing results: basic integration tests passed for Mesos 0.23.0 on CoreOS with DCOS GUI/CLI, Marathon, Chronos, Spark, HDFS, Cassandra, and Kafka. `make check` passes on Ubuntu and CentOS, but `sudo make check` fails on CentOS 7.1 due to errors in CentOS. See https://issues.apache.org/jira/browse/MESOS-3050 for more details. I'm not convinced this is serious enough to do another release candidate and voting round, but I'll let Tim and others chime in with their thoughts. If we don't get enough deciding votes by 6pm Pacific today, I'll extend the vote for another day. On Thu, Jul 9, 2015 at 6:09 PM, Khanduja, Vaibhav vaibhav.khand...@emc.com wrote: +1 Sent from my iPhone. Please excuse the typos and brevity of this message. On Jul 9, 2015, at 6:07 PM
Re: [VOTE] Release Apache Mesos 0.23.0 (rc2)
This seems to be somewhat related to PB 2.4 vs. 2.5 (what Mesos uses) - and possibly, indirectly, to Py 2.6 vs. 2.7 (wild guess here). The problem with Python is that it's always difficult to figure out where it goes looking for imports (unless you have a virtualenv and/or munge sys.path) so it may well be that it finds `mesos.interface` from the main system site-packages folder (where you may have an old version of the protobuf libraries) instead of the correct (for 2.5.0) place (under our build/3rdparty/... folders). As in the other instance, a log dump of sys.path just before the import *may* shed some light (or add to the confusion). IMO we should require Python == 2.7 (no idea if we can support Python 3, my guess is we can't, because of this https://github.com/google/protobuf/issues/9), but that's probably another story. *Marco Massenzio* *Distributed Systems Engineer* On Thu, Jul 9, 2015 at 3:21 PM, Ian Downes idow...@twitter.com wrote: The ExamplesTest.PythonFramework test fails differently for me on CentOS5 with python 2.6.6. I presume we don't require 2.7? [idownes@hostname build]$ MESOS_VERBOSE=1 ./bin/mesos-tests.sh --gtest_filter=ExamplesTest.PythonFramework Source directory: /home/idownes/workspace/mesos Build directory: /home/idownes/workspace/mesos/build - We cannot run any cgroups tests that require mounting hierarchies because you have the following hierarchies mounted: /sys/fs/cgroup/cpu, /sys/fs/cgroup/cpuacct, /sys/fs/cgroup/freezer, /sys/fs/cgroup/memory, /sys/fs/cgroup/perf_event We'll disable the CgroupsNoHierarchyTest test fixture for now. - - We cannot run any Docker tests because: Failed to get docker version: Failed to execute 'docker --version': exited with status 127 - /usr/bin/nc Note: Google Test filter = trimmed [==] Running 1 test from 1 test case. [--] Global test environment set-up. 
[--] 1 test from ExamplesTest [ RUN ] ExamplesTest.PythonFramework Using temporary directory '/tmp/ExamplesTest_PythonFramework_igPnUB' Traceback (most recent call last): File "/home/idownes/workspace/mesos/build/../src/examples/python/test_framework.py", line 24, in <module> from mesos.interface import mesos_pb2 File "build/bdist.linux-x86_64/egg/mesos/interface/mesos_pb2.py", line 4, in <module> ImportError: cannot import name enum_type_wrapper ../../src/tests/script.cpp:83: Failure Failed python_framework_test.sh exited with status 1 [ FAILED ] ExamplesTest.PythonFramework (136 ms) [--] 1 test from ExamplesTest (136 ms total) [--] Global test environment tear-down [==] 1 test from 1 test case ran. (169 ms total) [ PASSED ] 0 tests. [ FAILED ] 1 test, listed below: [ FAILED ] ExamplesTest.PythonFramework 1 FAILED TEST YOU HAVE 10 DISABLED TESTS [idownes@hostname build]$ python --version Python 2.6.6 On Thu, Jul 9, 2015 at 2:53 PM, Vinod Kone vinodk...@gmail.com wrote: I'm assuming the 50 min Jeff mentioned was when doing a 'make check' on a fresh copy of mesos source code. The majority of that time should be due to compilation of source and test code (both of which will be sped up by -j); a sequential run of the test suite should be within 10 min IIRC. On Thu, Jul 9, 2015 at 2:40 PM, Marco Massenzio ma...@mesosphere.io wrote: @Vinod: unfortunately, the tests must be run sequentially, so (at least, as far as I can tell) there's virtually no speedup in 'make check' by using the -j switch. As someone else pointed out, it would be grand if we could have a 'test compilation' step (which can be run in parallel and speeds up) distinct from a 'run tests' step (which must run sequentially). *Marco Massenzio* *Distributed Systems Engineer* On Thu, Jul 9, 2015 at 2:28 PM, Vinod Kone vinodk...@gmail.com wrote: As a tangent, you can speed up the build by doing make -j#threads check. 
On Thu, Jul 9, 2015 at 1:35 PM, Jeff Schroeder jeffschroe...@computer.org wrote: I'm unable to replicate the same failure on another up-to-date RHEL 7.1 machine for some strange reason. Even blowing away the checkout, doing a fresh clone, and waiting ~50 minutes for make check to finish, it still pops. However on my laptop, this test passes fine. Let's chalk this one up to "works on my *other* machine". = jschroeder@omniscience:~/git/mesos (master)$ bin/mesos-tests.sh --gtest_filter=ExamplesTest.PythonFramework --verbose Source directory: /home/jschroeder/git/mesos Build directory: /home/jschroeder/git/mesos - We cannot run any cgroups tests that require mounting hierarchies because you have the following hierarchies mounted: /sys/fs/cgroup
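The sys.path dump Marco suggests can be sketched as a small diagnostic; the `mesos.interface` import mirrors the failing line in the traceback above, and on a machine without the Mesos egg installed it simply reports the failure instead of crashing:

```python
# Diagnostic sketch: show exactly where Python will look for imports
# before attempting the mesos.interface import that fails in the thread.
# Only stdlib is used; "mesos.interface" is the module under discussion.
import sys

for i, path in enumerate(sys.path):
    print("sys.path[%d] = %s" % (i, path))

try:
    # If a stale protobuf/mesos egg shadows the 3rdparty build output,
    # this is where the ImportError surfaces.
    from mesos.interface import mesos_pb2
except ImportError as e:
    print("import failed: %s" % e)
```

Running this inside `python_framework_test.sh` (or at the top of `test_framework.py`) would show whether the system site-packages directory precedes the `build/3rdparty` eggs on the path.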
Re: [VOTE] Release Apache Mesos 0.23.0 (rc2)
@Vinod: unfortunately, the tests must be run sequentially, so (at least, as far as I can tell) there's virtually no speedup in 'make check' by using the -j switch. As someone else pointed out, it would be grand if we could have a 'test compilation' step (which can be run in parallel and speeds up) distinct from a 'run tests' step (which must run sequentially). *Marco Massenzio* *Distributed Systems Engineer* On Thu, Jul 9, 2015 at 2:28 PM, Vinod Kone vinodk...@gmail.com wrote: As a tangent, you can speed up the build by doing make -j#threads check. On Thu, Jul 9, 2015 at 1:35 PM, Jeff Schroeder jeffschroe...@computer.org wrote: I'm unable to replicate the same failure on another up-to-date RHEL 7.1 machine for some strange reason. Even blowing away the checkout, doing a fresh clone, and waiting ~50 minutes for make check to finish, it still pops. However on my laptop, this test passes fine. Let's chalk this one up to "works on my *other* machine". = jschroeder@omniscience:~/git/mesos (master)$ bin/mesos-tests.sh --gtest_filter=ExamplesTest.PythonFramework --verbose Source directory: /home/jschroeder/git/mesos Build directory: /home/jschroeder/git/mesos - We cannot run any cgroups tests that require mounting hierarchies because you have the following hierarchies mounted: /sys/fs/cgroup/blkio, /sys/fs/cgroup/cpu,cpuacct, /sys/fs/cgroup/cpuset, /sys/fs/cgroup/devices, /sys/fs/cgroup/freezer, /sys/fs/cgroup/hugetlb, /sys/fs/cgroup/memory, /sys/fs/cgroup/net_cls, /sys/fs/cgroup/perf_event, /sys/fs/cgroup/systemd We'll disable the CgroupsNoHierarchyTest test fixture for now. 
- /usr/bin/nc Note: Google Test filter = ExamplesTest.PythonFramework-DockerContainerizerTest.ROOT_DOCKER_Launch_Executor:DockerContainerizerTest.ROOT_DOCKER_Launch_Executor_Bridged:DockerContainerizerTest.ROOT_DOCKER_Launch:DockerContainerizerTest.ROOT_DOCKER_Kill:DockerContainerizerTest.ROOT_DOCKER_Usage:DockerContainerizerTest.ROOT_DOCKER_Update:DockerContainerizerTest.ROOT_DOCKER_Recover:DockerContainerizerTest.ROOT_DOCKER_SkipRecoverNonDocker:DockerContainerizerTest.ROOT_DOCKER_Logs:DockerContainerizerTest.ROOT_DOCKER_Default_CMD:DockerContainerizerTest.ROOT_DOCKER_Default_CMD_Override:DockerContainerizerTest.ROOT_DOCKER_Default_CMD_Args:DockerContainerizerTest.ROOT_DOCKER_SlaveRecoveryTaskContainer:DockerContainerizerTest.DISABLED_ROOT_DOCKER_SlaveRecoveryExecutorContainer:DockerContainerizerTest.ROOT_DOCKER_NC_PortMapping:DockerContainerizerTest.ROOT_DOCKER_LaunchSandboxWithColon:DockerContainerizerTest.ROOT_DOCKER_DestroyWhileFetching:DockerContainerizerTest.ROOT_DOCKER_DestroyWhilePulling:DockerContainerizerTest.ROOT_DOCKER_ExecutorCleanupWhenLaunchFailed:DockerContainerizerTest.ROOT_DOCKER_FetchFailure:DockerContainerizerTest.ROOT_DOCKER_DockerPullFailure:DockerContainerizerTest.ROOT_DOCKER_DockerInspectDiscard:DockerTest.ROOT_DOCKER_interface:DockerTest.ROOT_DOCKER_CheckCommandWithShell:DockerTest.ROOT_DOCKER_CheckPortResource:DockerTest.ROOT_DOCKER_CancelPull:DockerTest.ROOT_DOCKER_MountRelative:DockerTest.ROOT_DOCKER_MountAbsolute:CpuIsolatorTest/1.UserCpuUsage:CpuIsolatorTest/1.SystemCpuUsage:RevocableCpuIsolatorTest.ROOT_CGROUPS_RevocableCpu:LimitedCpuIsolatorTest.ROOT_CGROUPS_Cfs:LimitedCpuIsolatorTest.ROOT_CGROUPS_Cfs_Big_Quota:LimitedCpuIsolatorTest.ROOT_CGROUPS_Pids_and_Tids:MemIsolatorTest/0.MemUsage:MemIsolatorTest/1.MemUsage:MemIsolatorTest/2.MemUsage:PerfEventIsolatorTest.ROOT_CGROUPS_Sample:SharedFilesystemIsolatorTest.ROOT_RelativeVolume:SharedFilesystemIsolatorTest.ROOT_AbsoluteVolume:NamespacesPidIsolatorTest.ROOT_PidNamespace:UserCgroupIs
olatorTest/0.ROOT_CGROUPS_UserCgroup:UserCgroupIsolatorTest/1.ROOT_CGROUPS_UserCgroup:UserCgroupIsolatorTest/2.ROOT_CGROUPS_UserCgroup:MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PerfRollForward:MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceForward:MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceBackward:SlaveTest.ROOT_RunTaskWithCommandInfoWithoutUser:SlaveTest.DISABLED_ROOT_RunTaskWithCommandInfoWithUser:ContainerizerTest.ROOT_CGROUPS_BalloonFramework:CgroupsAnyHierarchyTest.ROOT_CGROUPS_Enabled:CgroupsAnyHierarchyTest.ROOT_CGROUPS_Subsystems:CgroupsAnyHierarchyTest.ROOT_CGROUPS_Mounted:CgroupsAnyHierarchyTest.ROOT_CGROUPS_Get:CgroupsAnyHierarchyTest.ROOT_CGROUPS_NestedCgroups:CgroupsAnyHierarchyTest.ROOT_CGROUPS_Tasks:CgroupsAnyHierarchyTest.ROOT_CGROUPS_Read:CgroupsAnyHierarchyTest.ROOT_CGROUPS_Write:CgroupsAnyHierarchyTest.ROOT_CGROUPS_Cfs_Big_Quota:CgroupsAnyHierarchyWithCpuMemoryTest.ROOT_CGROUPS_Busy:CgroupsAnyHierarchyWithCpuMemoryTest.ROOT_CGROUPS_SubsystemsHierarchy:CgroupsAnyHierarchyWithCpuMemoryTest.ROOT_CGROUPS_FindCgroupSubsystems:CgroupsAnyHierarchyWithCpuMemoryTes
Re: Multi-masters
(I'm sure I'm missing something here, so please forgive if I'm stating the obvious) This is actually very well supported right now: you can use slave attributes (if, eg, you want to name the various clusters differently and launch tasks according to those criteria) that would be passed on to the Frameworks along with the resource offers: the frameworks could then decide whether to accept the offer and launch tasks based on whatever logic you want to implement. You could use something like --attributes=cluster:01z99; os:ubuntu-14-04; jdk:8 or whatever makes sense. *Marco Massenzio* *Distributed Systems Engineer* On Tue, Jul 7, 2015 at 8:55 AM, CCAAT cc...@tampabay.rr.com wrote: Hello team_mesos, Is there any reason one set of (3) masters cannot talk to and manage several (many) different slave clusters of (3)? These slave clusters would be different arch, different mixes of resources and be running different frameworks, but all share/use the same (3) masters. Ideas on how to architect this experiment, would be keenly appreciated. James
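The attribute-based offer filtering Marco describes could look roughly like this on the framework side. This is purely illustrative: the dict-shaped offers and the `matches` helper are stand-ins for the Offer protobufs a real scheduler receives in its resourceOffers() callback.

```python
# Hypothetical sketch: accept only offers from agents whose attributes
# match the cluster we care about (agents started with e.g.
# --attributes=cluster:01z99;os:ubuntu-14-04).
def matches(offer_attributes, required):
    # True iff every required key/value pair is present on the offer.
    return all(offer_attributes.get(k) == v for k, v in required.items())

offers = [
    {"slave": "a", "attributes": {"cluster": "01z99", "os": "ubuntu-14-04"}},
    {"slave": "b", "attributes": {"cluster": "02x41", "os": "centos-7"}},
]

wanted = {"cluster": "01z99"}
accepted = [o["slave"] for o in offers if matches(o["attributes"], wanted)]
print(accepted)  # only the agent in the wanted cluster
```

Offers that don't match would simply be declined, so each "virtual cluster" of agents ends up serving only the frameworks that target its attributes.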
Re: [RESULT] [VOTE] Release Apache Mesos 0.23.0 (rc1)
As a general rule, we should not include anything other than the fixes in an RC, to avoid introducing further bugs in a never-ending cycle. Please keep the cherry-picking strictly limited to a very narrow set (which I'm sure you're already doing, but your email seemed to imply otherwise ;-) Thanks! — Sent from Mailbox On Tue, Jul 7, 2015 at 3:56 PM, Adam Bordelon a...@mesosphere.io wrote: In case it wasn't obvious, rc1 did not pass the vote, due to a few build and unit test issues. Most of those fixes have been committed, so we will cut rc2 when the last blocker is resolved. This is your last chance to get any recently committed patches or resolved issues into 0.23.0. I am tracking the 0.23.0-rc2 cherry picks in https://docs.google.com/spreadsheets/d/14yUtwfU0mGQ7x7UcjfzZg2o1TuRMkn5SvJvetARM7JQ/edit#gid=0 Please contact me ASAP if you want anything else included. Thanks, -Adam- P.S. 0.23 Dashboard is still in action: https://issues.apache.org/jira/secure/Dashboard.jspa?selectPageId=12326227 On Tue, Jul 7, 2015 at 1:59 PM, Adam Bordelon a...@mesosphere.io wrote: -1 (non-binding) Network isolator will not compile. https://issues.apache.org/jira/browse/MESOS-3002 The changes for MESOS-2800 https://issues.apache.org/jira/browse/MESOS-2800 to Rename OptionT::get(const T _t) to getOrElse() happened after the 0.23.0-rc1 cut and are not planned for cherry-picking into the release. The Fix Version of MESOS-2800 https://issues.apache.org/jira/browse/MESOS-2800 is 0.24.0, so the Affects Version of MESOS-3002 https://issues.apache.org/jira/browse/MESOS-3002 is really 0.24.0, and hence its Target Version should also be 0.24.0. Please let me know otherwise if you actually saw this build error when building from the 0.23.0-rc1 tag. On Tue, Jul 7, 2015 at 11:48 AM, Paul Brett pbr...@twitter.com wrote: -1 (non-binding) Network isolator will not compile. 
https://issues.apache.org/jira/browse/MESOS-3002 On Tue, Jul 7, 2015 at 11:38 AM, Alexander Rojas alexan...@mesosphere.io wrote: +1 (non-binding) Ubuntu Server 15.04 gcc 4.9.2 and clang 3.6.0 OS X Yosemite clang Apple LLVM based on 3.6.0 On 06 Jul 2015, at 21:14, Jörg Schad jo...@mesosphere.io wrote: After more testing: -1 (non-binding) Docker tests failing on CentOS Linux release 7.1.1503 (Core) , Tim is already on the issue (see MESOS-2996) On Mon, Jul 6, 2015 at 8:59 PM, Kapil Arya ka...@mesosphere.io wrote: +1 (non-binding) OpenSUSE Tumbleweed, Linux 4.0.3 / gcc 4.8.3 On Mon, Jul 6, 2015 at 2:33 PM, Ben Whitehead ben.whiteh...@mesosphere.io wrote: +1 (non-binding) openSUSE 13.2 Linux 3.16.7 / gcc-4.8.3 Tested running Marathon 0.9.0-RC3 and Cassandra on Mesos 0.1.1-SNAPSHOT. On Mon, Jul 6, 2015 at 6:57 AM, Till Toenshoff toensh...@me.com wrote: Even though Alex has IMHO already “busted” this vote ;) .. THANKS ALEX! … , here are my results. +1 OS 10.10.4 (14E46) + Apple LLVM version 6.1.0 (clang-602.0.53) (based on LLVM 3.6.0svn), make check - OK Ubuntu 14.04.1 LTS (GNU/Linux 3.13.0-32-generic x86_64) + gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2, make check - OK On Jul 6, 2015, at 3:22 PM, Alex Rukletsov a...@mesosphere.com wrote: -1 Compilation error on Mac OS 10.10.4 with clang 3.5, which is supported according to release notes. More details: https://issues.apache.org/jira/browse/MESOS-2991 On Mon, Jul 6, 2015 at 11:55 AM, Jörg Schad jo...@mesosphere.io wrote: P.S. to my prior +1 Tested on ubuntu-trusty-14.04 including docker. 
On Sun, Jul 5, 2015 at 6:44 PM, Jörg Schad jo...@mesosphere.io wrote: +1 On Sun, Jul 5, 2015 at 4:36 PM, Nikolaos Ballas neXus nikolaos.bal...@nexusgroup.com wrote: +1 Sent from my Samsung device Original message From: tommy xiao xia...@gmail.com Date: 05/07/2015 15:14 (GMT+01:00) To: user@mesos.apache.org Subject: Re: [VOTE] Release Apache Mesos 0.23.0 (rc1) +1 2015-07-04 12:32 GMT+08:00 Weitao zhouwtl...@gmail.com: +1 发自我的 iPhone 在 2015年7月4日,09:41,Marco Massenzio ma...@mesosphere.io 写道: +1 *Marco Massenzio* *Distributed Systems Engineer* On Fri, Jul 3, 2015 at 12:25 PM, Adam Bordelon a...@mesosphere.io wrote: Hello Mesos community, Please vote on releasing the following candidate as Apache Mesos 0.23.0. 0.23.0 includes the following: - Per-container network isolation - Upgraded minimum required compilers to GCC 4.8+ or clang 3.5+. - Dockerized slaves will properly recover Docker containers upon failover. as well as experimental support for: - Fetcher Caching - Revocable Resources - SSL encryption - Persistent Volumes - Dynamic Reservations The CHANGELOG for the release is available at: https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.23.0-rc1
Re: Java detector for mesos masters and leader
Hi Donald, the information stored in the Zookeeper znode is a serialized Protocol Buffer (see MasterInfo in mesos/mesos.proto https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob;f=include/mesos/mesos.proto;h=3dd4a5b7a4b3bc56bdc690d6adf05f88c0d28273;hb=HEAD); here is a brief explanation of what is in there, plus an example as to how to retrieve that info (in Python - but Java would work pretty much the same): http://codetrips.com/2015/06/12/apache-mesos-leader-master-discovery-using-zookeeper/ Please be aware that, as of 0.24 (currently planned for mid-August), we plan to publish that information *only* in JSON (exactly to help all the folks like you) so the method presented there will no longer work (for all intents and purposes, the serialized MasterInfo to ZK is considered deprecated as of 0.23 which is going out any day now: we're currently testing a RC). Note that if you intend to follow the leader you will need to set a Watcher on the node itself or, perhaps better, on the znode path, so as to get a callback whenever anything changes: the elected leader will always be the lowest-numbered ephemeral znode (I am guessing you know all this, but feel free to ping me if you need more info). Hope this helps. *Marco Massenzio* *Distributed Systems Engineer* On Tue, Jul 7, 2015 at 6:02 AM, Donald Laidlaw donlaid...@me.com wrote: Has anyone ever developed Java code to detect the mesos masters and leader, given a zookeeper connection? The reason I ask is because I would like to monitor mesos to report various metrics reported by the master. This requires detecting and tracking the leading master to query its /metrics/snapshot REST endpoint. Thanks, -Don
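A minimal sketch of decoding such a JSON MasterInfo blob once masters publish JSON to ZooKeeper (planned for 0.24). The field names below (hostname, port, id) follow the MasterInfo message in mesos.proto, but the exact JSON shape is an assumption here, not a stable contract; the znode payload is likewise a made-up example:

```python
import json

# Hypothetical znode data: MasterInfo serialized as JSON rather than a
# binary protobuf. A detector would read these bytes from the
# lowest-numbered ephemeral znode under the election path.
znode_data = b'{"hostname": "master1.example.com", "port": 5050, "id": "20150707-abc"}'

info = json.loads(znode_data.decode("utf-8"))

# Build the metrics endpoint URL Donald wants to poll on the leader.
leader_url = "http://%s:%d/metrics/snapshot" % (info["hostname"], info["port"])
print(leader_url)
```

In Java the same decode is a one-liner with any JSON library; the interesting part remains the ZooKeeper watch on the election path so the URL is rebuilt whenever leadership changes.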
Re: [VOTE] Release Apache Mesos 0.23.0 (rc1)
+1 *Marco Massenzio* *Distributed Systems Engineer* On Fri, Jul 3, 2015 at 12:25 PM, Adam Bordelon a...@mesosphere.io wrote: Hello Mesos community, Please vote on releasing the following candidate as Apache Mesos 0.23.0. 0.23.0 includes the following: - Per-container network isolation - Upgraded minimum required compilers to GCC 4.8+ or clang 3.5+. - Dockerized slaves will properly recover Docker containers upon failover. as well as experimental support for: - Fetcher Caching - Revocable Resources - SSL encryption - Persistent Volumes - Dynamic Reservations The CHANGELOG for the release is available at: https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.23.0-rc1 The candidate for Mesos 0.23.0 release is available at: https://dist.apache.org/repos/dist/dev/mesos/0.23.0-rc1/mesos-0.23.0.tar.gz The tag to be voted on is 0.23.0-rc1: https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.23.0-rc1 The MD5 checksum of the tarball can be found at: https://dist.apache.org/repos/dist/dev/mesos/0.23.0-rc1/mesos-0.23.0.tar.gz.md5 The signature of the tarball can be found at: https://dist.apache.org/repos/dist/dev/mesos/0.23.0-rc1/mesos-0.23.0.tar.gz.asc The PGP key used to sign the release is here: https://dist.apache.org/repos/dist/release/mesos/KEYS The JAR is up in Maven in a staging repository here: https://repository.apache.org/content/repositories/orgapachemesos-1056 Please vote on releasing this package as Apache Mesos 0.23.0! The vote is open until Fri July 10th, 12:00 PDT 2015 and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Mesos 0.23.0 [ ] -1 Do not release this package because ... Thanks, -Adam-
Re: mesos cluster can't fit federation cluster
On Wed, Jul 1, 2015 at 11:38 PM, tommy xiao xia...@gmail.com wrote: Hi Marco, I want fault tolerance for slave nodes across multiple datacenters, but I found the possible setup methods are not production-ready. what kind of fault-tolerance are you looking for here? Against one (or either) of the DC going away or network partitioning? or one (or more) of the racks in one DC to go away? Depending on what you want to protect yourself against there may be different ways to achieve that. I'm sorry I haven't been around Mesos long enough to really be knowledgeable about the specifics here; but have built HA systems before around VPCs and On-Prem solutions, and I know bi-di routing can be achieved using gateways and/or VPN (dedicated) links (we also solved that very issue at Google too, but I can't talk about that :). I'm sure the Twitter folks have solved that same problem too, but I'm guessing they may not be able to share much either? 2015-07-02 1:38 GMT+08:00 Marco Massenzio ma...@mesosphere.io: Hi Tommy, not sure what your use-case is, but you are correct, the master/slave nodes need to have bi-directional connectivity. However, there is no fundamental reason why those have to be public IPs - so long as they are routable (either via DNS discovery and / or VPN or other network-layer mechanisms) that will work. (I mean, without even thinking too hard about this - so I may be entirely wrong here - you could place a couple of Nginx/HAproxy nodes with two NICs, one visible to the Slaves, the other in the VPC subnet, and forward all traffic? I'm sure I'm missing something here :) When you launch the master nodes, you specify the NICs they need to listen to via the --ip option, while the slave nodes have the --master flag that should have either a hostname:port or ip:port argument: so long as they are routable, this *should* work (although, admittedly, I've never tried this personally). 
One concern I would have in such an arrangement, though, would be about network partitioning: if the DC/DC connectivity were to drop, you'd suddenly lose all master/slave connectivity; it's also not clear to me that having sectioned the masters off from the slaves would give you better availability and/or reliability and/or security. It would be great to understand the use-case, so we could see what could be added (if anything) to Mesos going forward.

*Marco Massenzio*
*Distributed Systems Engineer*

On Wed, Jul 1, 2015 at 9:15 AM, tommy xiao xia...@gmail.com wrote:

Hello, I would like to deploy master nodes in a private zone, and set up Mesos slaves in another datacenter. But this multi-datacenter mode can't work: it needs the slave nodes to be able to reach the master nodes on a public network IP, and in the production zone the gateway IP does not belong to the master nodes. Does anyone have experience with a multi-datacenter deployment? I prefer the Kubernetes federation cluster proposal: https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/proposals/federation-high-level-arch.png

-- Deshi Xiao Twitter: xds2000 E-mail: xiaods(AT)gmail.com
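The --ip / --master wiring Marco describes looks like this on the command line (a sketch with placeholder addresses; the flag names are the standard mesos-master/mesos-slave ones, but the addresses and paths are made up for illustration):

```shell
# Master binds to its VPC-facing NIC (placeholder address).
mesos-master --ip=10.0.1.5 --port=5050 --work_dir=/var/lib/mesos

# Slave in the other datacenter points at any routable address for the
# master; --master accepts either hostname:port or ip:port.
mesos-slave --master=10.0.1.5:5050 --ip=192.168.0.10 \
    --work_dir=/var/lib/mesos
```

As long as the slave's --master address is routable (directly, via VPN, or through a forwarding proxy as suggested above), the IPs need not be public.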
Re: step-step guide for New-to-Mesos
Hey Hajira,

you may find this blog entry useful: https://mesosphere.com/blog/2015/04/02/continuous-deployment-with-mesos-marathon-docker/

A bit older, but more specific to Docker, please have a look here: https://mesosphere.github.io/marathon/docs/native-docker.html

Generally speaking, there is a lot of info available at http://docs.mesosphere.com that you could find useful too. Obviously, you can launch containers using a very simple framework, but that's largely not necessary; and most certainly, don't change the mesos.proto contents (this will prevent a lot of stuff from working): that is meant to be a read-only file (well, unless one is doing development on Mesos itself). We have made RENDLER publicly available as an example framework: https://github.com/mesosphere/RENDLER

HTH

*Marco Massenzio*
*Distributed Systems Engineer*

On Wed, Jul 1, 2015 at 9:23 AM, haosdent haosd...@gmail.com wrote:

Sorry, "marthon" should be "marathon": https://mesosphere.github.io/marathon/

On Wed, Jul 1, 2015 at 9:23 PM, haosdent haosd...@gmail.com wrote:

Hi, @Hajira. "Next step is to run tasks in containers in Mesos" - do you want to run something like a web application in Docker, or others? You could try marthon or another existing framework first; I think you don't need to write a framework.

On Wed, Jul 1, 2015 at 9:07 PM, Hajira Jabeen hajirajab...@gmail.com wrote:

Hello, being new to Mesos (and everything related to big data), I have been able to install Mesos and run example frameworks. The next step is to run tasks in containers in Mesos. Do I have to write a framework for this, or just change the ContainerInfo etc. fields in the mesos.proto file? Is there any step-by-step working guide? The Mesos documentation assumes a lot of background knowledge that I do not have. Any help and pointers will be appreciated.

Regards, Hajira

On 30 June 2015 at 00:23, Andras Kerekes andras.kere...@ishisystems.com wrote:

Hi, is there a preferred way to do service discovery in Mesos via mesos-dns running on CoreOS?
I'm trying to implement a simple app which consists of two Docker containers, one of which (A) depends on the other (B). What I'd like to do is to tell container A to use a fixed DNS name (containerB.marathon.mesos in the case of mesos-dns) to find the other service. There are at least three different ways I think it can be done, but the three I found all have some shortcomings.

1. Use SRV records to get the port along with the IP. Con: I'd prefer not to build the logic of handling SRV records into the app; it can be a legacy app that is difficult to modify.
2. Use haproxy on the slaves and connect via a well-known port on localhost. Cons: the Marathon-provided script does not run on CoreOS, and I also don't know how to run haproxy on CoreOS outside of a Docker container. If it is running in a Docker container, then how can it dynamically allocate ports on localhost when a new service is discovered in Marathon/Mesos?
3. Use a dedicated port to bind the containers to. Con: I can have only as many instances of a service as I have slaves, because they all bind to the same port.

What other alternatives are there?

Thanks, Andras

-- Best Regards, Haosdent Huang
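For option 1, the SRV-handling logic is small enough that it can live in a thin wrapper or sidecar rather than the legacy app itself. A hedged Python sketch of the selection step (the record layout is the standard DNS SRV tuple of priority, weight, port, target, as any DNS library would return for a name like _containerB._tcp.marathon.mesos; the helper itself is hypothetical, not part of mesos-dns):

```python
import random

def pick_srv(records):
    """Select one (host, port) from DNS SRV answers.

    `records` is a list of (priority, weight, port, target) tuples.
    Lowest priority wins; among equal priorities, pick randomly in
    proportion to weight (RFC 2782 style; zero weights treated as 1).
    """
    if not records:
        raise LookupError("no SRV records")
    best = min(r[0] for r in records)
    candidates = [r for r in records if r[0] == best]
    weights = [r[1] or 1 for r in candidates]
    _, _, port, target = random.choices(candidates, weights=weights, k=1)[0]
    return target, port
```

The wrapper would resolve the SRV name, call `pick_srv`, and hand the legacy app a plain host:port, keeping the app itself unmodified.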
Re: [Breaking Change 0.24 Upgrade path] ZooKeeper MasterInfo change.
Folks, as a heads-up, we are planning to convert the format of the MasterInfo information stored in ZooKeeper from the Protocol Buffers binary format to JSON - this is in conjunction with the HTTP API development, to allow frameworks *not* to depend on libmesos and other binary dependencies to interact with Mesos master nodes.

*NOTE* - there is no change in 0.23 (so any Master/Slave/Framework that is currently working in 0.22 *will continue to work* in 0.23 too), but as of Mesos 0.24, frameworks and other clients relying on the binary format will break.

The details of the design are in this Google Doc: https://docs.google.com/document/d/1i2pWJaIjnFYhuR-000NG-AC1rFKKrRh3Wn47Y2G6lRE/edit
The actual work is detailed in MESOS-2340: https://issues.apache.org/jira/browse/MESOS-2340
The patch (and associated test) are here: https://reviews.apache.org/r/35571/ https://reviews.apache.org/r/35815/

*Marco Massenzio*
*Distributed Systems Engineer*
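For clients, the practical effect is that leader detection no longer needs protobuf. A hedged Python sketch of the selection logic (the `json.info_` znode prefix and lowest-sequence-wins rule are assumptions taken from the linked design work; `get_data` stands in for whatever ZooKeeper client you use, e.g. kazoo's `get`):

```python
import json

def leading_master(children, get_data):
    """Pick the leading master's MasterInfo from ZooKeeper znodes.

    `children` are the child names under the Mesos ZK path (e.g. /mesos);
    JSON-format entries are named 'json.info_<sequence>' and the lowest
    sequence number is the current leader. `get_data(name)` returns that
    znode's raw bytes.
    """
    candidates = sorted(c for c in children if c.startswith("json.info_"))
    if not candidates:
        return None  # no JSON-format master registered (pre-0.24 cluster?)
    return json.loads(get_data(candidates[0]))

# Example with fake znode contents:
fake = {
    "json.info_0000000002": b'{"hostname": "standby", "port": 5050}',
    "json.info_0000000001": b'{"hostname": "leader", "port": 5050}',
}
info = leading_master(fake.keys(), lambda name: fake[name])
```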
Re: mesosphere.io broken?
Just to add some color to the Elastic Mesos thing: we're working with Google to enable deploying a complete DCOS cluster on GCP using their brand-new Deployment Manager (v2) via the Click-to-Deploy framework. We have these working on an experimental basis; we need to conduct a bit more testing and work out a couple of rough edges before we can release them as beta for people to have a good user experience. I must say it's pretty exciting to click a button and see shortly afterwards a full Mesos cluster come to life on Google Cloud, so I'm really itching to get the templates into a state where they can be used by other folks!

*Marco Massenzio*
*Distributed Systems Engineer*

On Wed, Jun 17, 2015 at 4:30 AM, Alex Rukletsov a...@mesosphere.com wrote:

For downloads, use https://mesosphere.com/downloads/. Elastic Mesos has been decommissioned; use https://google.mesosphere.com/ or https://digitalocean.mesosphere.com/, but keep in mind they will be decommissioned soon (~1 month) as well. However, if you want to try a DCOS installation on AWS, check https://mesosphere.com/product/

On Wed, Jun 17, 2015 at 12:51 PM, Brian Candler b.cand...@pobox.com wrote:

Looking for Mesos .deb packages, on Google I find links to http://mesosphere.io/downloads/ and http://elastic.mesosphere.io/ but these are giving 503 Service Unavailable errors. Is there a problem, or have these sites gone / migrated away?
Re: Introducing BDS: A datacenter scripting language
That's awesome, Pablo - will definitely be fooling around with it! Thanks for using Mesos, BTW - always good to see folks building cool stuff on top of it :)

*Marco Massenzio*
*Distributed Systems Engineer*

On Thu, May 14, 2015 at 6:45 PM, Pablo Cingolani pablo.e.cingol...@gmail.com wrote:

Hi everyone, I've been working on a simple programming language to create large data pipelines on Mesos. The language is called BDS, which stands for BigDataScript (yes, the name is kind of a joke for all jargon-lovers out there), and here is the web page: http://pcingola.github.io/BigDataScript/ Needless to say, it's open source and the code is available on GitHub. At the moment I'm using BDS mostly for analysis of large genetic datasets on our 25,000-core cluster, but it should scale to larger clusters as well.

BDS has a few interesting features:
- Runs on Mesos (obviously) as well as SunGridEngine, Torque, MOAB, a large server, or just your laptop.
- You can develop on your laptop (without having to install Mesos or any cluster management system) and then deploy your script to a Mesos cluster/datacenter without modification.
- It performs automatic task dependency detection and schedules tasks according to the implicit (or explicit) DAG.
- It has lazy processing: it checks whether performing a task is necessary and skips tasks whose output does not need to be updated (make-style).
- It performs automatic checkpointing and has absolute serialization, so you can copy the checkpoint file to another computer and continue running exactly where you left off.
- It can handle several parallel pipeline branches (threads).
- It allows defining DAGs in a declarative form (using 'goals').
- It cleans up stale files (and queued tasks in non-Mesos clusters).

Other cool features:
- Automatically parses command-line options in your scripts (it also creates help for you)
- Logs all processes' stdout/stderr and exit status
- It has a built-in debugger
- It has a built-in unit testing framework

You can read more about all these features here: http://pcingola.github.io/BigDataScript/bigDataScript_manual.html

I hope you find it useful and please do send me any feedback you have.

Yours, Pablo
Re: Cisco is Powered By Mesos
Thanks, Keith, for sharing this! That's pretty cool stuff, I guess we'll have to check Shipped out ;) Thanks for using Mesos!

*Marco Massenzio*
*Distributed Systems Engineer*

On Tue, May 12, 2015 at 1:38 PM, Keith Chambers (kechambe) kecha...@cisco.com wrote:

Hello Adam,

Yesterday at Cloud Foundry Summit Cisco first discussed our product called “Shipped”, so I guess I can talk about it now. :-) Our tag line is “Your idea running in production in 5 minutes.” It’s developed by developers, for developers. Shipped makes it simple to create on-demand production-like dev environments, build applications using microservices patterns, and deploy them to an instance of the open-source microservices-infrastructure (https://github.com/CiscoCloud/microservices-infrastructure) container runtime (multi-DC Marathon). Shipped integrates with tools developers *actually like using*. We leverage GitHub for authentication and source control, Vagrant for on-demand developer environments, and Bintray for wickedly fast Docker repos. The Shipped CI service is powered by open-source Drone, which we have 2 full-time developers working on. We’re also developing a Drone framework for Mesos that we will release to GitHub under the Apache license.

Shipped maintains a “timeline” for every project. The timeline is a chronological history of high-value events across Dev and Ops, i.e., pull requests, failed builds, production failures, etc. One killer feature of Shipped is that we automatically integrate the project timeline with a room in Cisco Spark (http://www.webex.com/ciscospark/, similar to Slack). This makes it simple for teams to work together and deliver software quicker; honestly, it's pretty slick!

Shipped itself runs on top of Marathon in Docker containers. We have a number of microservices, all written in Go and all using Cassandra for their backend DB. We use the excellent Kafka framework from Joe Stein for cross-service messaging and event collection.
We are interested in creating a multi-DC Cassandra Mesos framework, but for now Cassandra is on VMs. We're at 50 Mesos follower nodes now and growing quickly.

Thanks!
Keith

From: Adam Bordelon a...@mesosphere.io
Reply-To: user@mesos.apache.org
Date: Monday, May 11, 2015 at 10:50 PM
To: user@mesos.apache.org
Subject: Re: Cisco is Powered By Mesos

Glad to hear it, Keith! We're very excited to have you in the community. I've added Cisco to the adopters list, and it will go out with the next website update. Can you share any juicy details about how you're using Mesos and at what scale?

On Mon, May 11, 2015 at 10:20 AM, Keith Chambers (kechambe) kecha...@cisco.com wrote:

We use Mesos in production at Cisco. Please add us to the “Powered By Mesos” list too! https://mesos.apache.org/documentation/latest/powered-by-mesos/

Keith :-)
Re: Writing outside the sandbox
Out of my own curiosity (sorry, I have no fresh insights into the issue here): did you try to run the script and write to a non-NFS-mounted directory (same ownership/permissions)? This way we could at least find out whether it's something related to NFS, or a more general permission-related issue.

*Marco Massenzio*
*Distributed Systems Engineer*

On Sat, May 9, 2015 at 5:10 AM, John Omernik j...@omernik.com wrote:

Here is the testing I am doing. I used a simple script (run.sh). It writes the user it is running as to stderr (so it's in the same log as the errors from file writing) and then tries to make a directory in NFS, and then touch a file in NFS. Note: this script works on every node when run directly. You can see the JSON I used in Marathon, and in the sandbox results you can see the user is indeed darkness and the directory cannot be created. However, when run directly, the script, with the same user, creates the directory with no issue. Now, I realize this COULD still be an NFS quirk here; however, this testing points at some restriction in how Marathon kicks off the cmd. Any thoughts on where to look would be very helpful!
John

Script:

#!/bin/bash
echo "Writing whoami to stderr for one stop logging" 1>&2
whoami 1>&2
mkdir /mapr/brewpot/mesos/storm/test/test1
touch /mapr/brewpot/mesos/storm/test/test1/testing.go

Run via Marathon:

{
  "cmd": "/mapr/brewpot/mesos/storm/run.sh",
  "cpus": 1.0,
  "mem": 1024,
  "id": "permtest",
  "user": "darkness",
  "instances": 1
}

Sandbox output:

I0509 07:02:52.457242 9562 exec.cpp:132] Version: 0.21.0
I0509 07:02:52.462700 9570 exec.cpp:206] Executor registered on slave 20150505-145508-1644210368-5050-8608-S0
Writing whoami to stderr for one stop logging
darkness
mkdir: cannot create directory `/mapr/brewpot/mesos/storm/test/test1': Permission denied
touch: cannot touch `/mapr/brewpot/mesos/storm/test/test1/testing.go': No such file or directory

Run via shell:

$ /mapr/brewpot/mesos/storm/run.sh
Writing whoami to stderr for one stop logging
darkness
darkness@hadoopmapr1:/mapr/brewpot/mesos/storm$ ls ./test/
test1
darkness@hadoopmapr1:/mapr/brewpot/mesos/storm$ ls ./test/test1/
testing.go

On Sat, May 9, 2015 at 3:14 AM, Adam Bordelon a...@mesosphere.io wrote:

I don't know of anything inside of Mesos that would prevent you from writing to NFS. Maybe examine the environment variables set when running as that user. Or are you running in a Docker container? Those can have additional restrictions.

On Fri, May 8, 2015 at 4:44 PM, John Omernik j...@omernik.com wrote:

I am doing something where people may recommend against my course of action; however, I am curious if there is a way. Basically, I have a process being kicked off in Marathon that is trying to write to an NFS location. The permissions of the user running the task and of the NFS location are good. So what component of Mesos or Marathon is keeping me from writing here? (I am getting permission denied.) Is this one of those things that is just not allowed, or is there an option to pass to Marathon to allow this? Thanks!

-- Sent from my iThing
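Following Adam's suggestion to compare environments, one quick diagnostic is to capture the identity and environment from both contexts and diff them; supplementary group membership (shown by `id`) is a common culprit for NFS permission mismatches when a task runner switches users. A sketch (file paths are placeholders; run the first two commands as the Marathon task's cmd):

```shell
# As the Marathon task's cmd (write somewhere the task CAN write, e.g. /tmp):
id > /tmp/marathon.env 2>&1
env | sort >> /tmp/marathon.env

# Then on the same node, as the same user, from an interactive shell:
id > /tmp/shell.env 2>&1
env | sort >> /tmp/shell.env
diff /tmp/shell.env /tmp/marathon.env
```

If the `id` lines differ (especially the groups list), the NFS server is likely rejecting the task's credentials rather than anything Mesos/Marathon-specific blocking the write.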
Re: Google Borg paper
At Google there are always two ways to do everything: the deprecated one and the one that's not quite ready yet. I'm sure Borg is alive and well (but deprecated) and Omega has been deployed (but ain't quite ready yet). They were already working on it in 2010; I'm sure they're still at it. Will confirm as soon as I find out more.

On Apr 16, 2015 9:08 PM, Christos Kozyrakis kozyr...@gmail.com wrote:

Maxime, to the best of my knowledge, Borg is still doing just fine at Google. It may have been enhanced by the Omega effort, but it has not been replaced. Nevertheless, I will let any Googlers on the list go into details.

Christos

On Thu, Apr 16, 2015 at 4:19 PM, Maxime Brugidou maxime.brugi...@gmail.com wrote:

Hi, not sure if everyone noticed, but Google just published a paper about the Borg architecture. I guess it's been replaced by Omega now internally at Google (if anyone from Google can confirm?). It might be of interest for Mesos :) http://research.google.com/pubs/pub43438.html

Best, Maxime

-- Christos