Re: [VOTE] Release Apache Mesos 0.22.0 (rc4)
+1 (non-binding) Mac OS 10.9.5 + clang; CentOS 7 + gcc 4.4.7 [cgroups tests disabled]

On Wed, Mar 18, 2015 at 4:04 PM, Brenden Matthews bren...@diddyinc.com wrote:
+1 Tested with internal testing cluster.

On Wed, Mar 18, 2015 at 1:25 PM, craig w codecr...@gmail.com wrote:
+1

On Wed, Mar 18, 2015 at 3:52 PM, Niklas Nielsen nik...@mesosphere.io wrote:
Hi all,

Please vote on releasing the following candidate as Apache Mesos 0.22.0. 0.22.0 includes the following:

* Support for explicitly sending status update acknowledgements from schedulers; refer to the upgrades document for upgrading schedulers.
* Rate limiting of slave removal, to safeguard against unforeseen bugs leading to widespread slave removal.
* Disk quota isolation in the Mesos containerizer; refer to the containerization documentation to enable disk quota monitoring and enforcement.
* Support for module hooks in the task launch sequence; refer to the modules documentation for more information.
* Anonymous modules: a new kind of module that does not receive any callbacks but coexists with its parent process.
* New service discovery info in task info, which lets framework users specify the discoverability of tasks for external service discovery systems; refer to the framework development guide for more information.
* New '--external_log_file' flag to serve external logs through the Mesos web UI.
* New '--gc_disk_headroom' flag to control maximum executor sandbox age.
The CHANGELOG for the release is available at:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.22.0-rc4

The candidate for the Mesos 0.22.0 release is available at:
https://dist.apache.org/repos/dist/dev/mesos/0.22.0-rc4/mesos-0.22.0.tar.gz

The tag to be voted on is 0.22.0-rc4:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.22.0-rc4

The MD5 checksum of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/0.22.0-rc4/mesos-0.22.0.tar.gz.md5

The signature of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/0.22.0-rc4/mesos-0.22.0.tar.gz.asc

The PGP key used to sign the release is here:
https://dist.apache.org/repos/dist/release/mesos/KEYS

The JAR is up in Maven in a staging repository here:
https://repository.apache.org/content/repositories/orgapachemesos-1048

Please vote on releasing this package as Apache Mesos 0.22.0!

The vote is open until Sat Mar 21 12:49:56 PDT 2015 and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Mesos 0.22.0
[ ] -1 Do not release this package because ...

Thanks,
Niklas

--
https://github.com/mindscratch
https://www.google.com/+CraigWickesser
https://twitter.com/mind_scratch
https://twitter.com/craig_links
Re: Resource allocation module
Hi Gidon, and thanks for your interest. As you have already noticed, the work is currently in progress and should land in the master branch in around two weeks. It will also be part of the 0.23 release. There is no documentation so far, but we plan to document the API once the patches land. Right now you may want to look at the Allocator interface and check the DRF implementation for more details.

—Alex

On Wed, Mar 18, 2015 at 7:05 AM, Gidon Gershinsky gi...@il.ibm.com wrote:
We need to develop a new resource allocation module, replacing the off-the-shelf DRF. As I understand it, the current mechanism (http://mesos.apache.org/documentation/latest/allocation-module/) is being replaced with a less intrusive module architecture: https://issues.apache.org/jira/browse/MESOS-2160

The capabilities of the new mechanism have real advantages for us. However, it is not clear when it will be released; the JIRA has an 'in progress' status. What is the current target/horizon for making this available to users? Also, is there any documentation on the SPIs/technical interfaces of these modules (what info is passed from slaves, frameworks, and offers; what calls can be made by the modules; etc.)?

Regards,
Gidon
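For readers new to the DRF implementation mentioned above: Dominant Resource Fairness allocates the next offer to the framework with the lowest dominant share, i.e. its maximum allocated/total ratio across resource kinds. A self-contained sketch of that computation (the numbers and framework names are illustrative, not from any real cluster):

```go
package main

import "fmt"

// dominantShare returns a framework's dominant share under DRF: the
// maximum, across resource kinds, of allocated/total.
func dominantShare(allocated, total map[string]float64) float64 {
	share := 0.0
	for name, alloc := range allocated {
		if t := total[name]; t > 0 && alloc/t > share {
			share = alloc / t
		}
	}
	return share
}

func main() {
	total := map[string]float64{"cpus": 100, "mem": 1024}
	// Framework A is CPU-dominant, framework B is memory-dominant.
	a := dominantShare(map[string]float64{"cpus": 30, "mem": 64}, total)
	b := dominantShare(map[string]float64{"cpus": 10, "mem": 512}, total)
	fmt.Println(a, b)
	// DRF would offer the next resources to A, whose dominant share is lower.
}
```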
Re: Deploying containers to every mesos slave node
No, this won't make it into 0.22.

On Thu, Mar 12, 2015 at 10:28 AM, Gurvinder Singh gurvinder.si...@uninett.no wrote:
On 03/12/2015 02:00 PM, Tim St Clair wrote:
You may want to also view https://issues.apache.org/jira/browse/MESOS-1806, as folks have discussed straight-up Consul integration on that JIRA.

Any plans to resolve this JIRA for the upcoming 0.22 release?

- Gurvinder

*From: *Aaron Carey aca...@ilm.com
*To: *user@mesos.apache.org
*Sent: *Thursday, March 12, 2015 3:54:52 AM
*Subject: *Deploying containers to every mesos slave node

Hi All,

In setting up our cluster, we require things like Consul to be running on all of our nodes. I was just wondering if there was any sort of best practice (or a scheduler, perhaps) that people could share for this sort of thing? Currently the approach is to use Salt to provision each node and add the Consul/Mesos slave processes and so on to it, but it'd be nice to remove the dependency on Salt.

Thanks,
Aaron

--
Cheers,
Timothy St. Clair
Red Hat Inc.
Re: Deploying containers to every mesos slave node
You don't even need to create a custom framework: you can run a separate instance of Marathon for a dedicated role.

On Thu, Mar 12, 2015 at 10:58 AM, Brian Devins badev...@gmail.com wrote:
This was actually going to be my suggestion. You could create a custom framework/scheduler to handle these types of tasks and configure Mesos to give priority to this framework using roles and weights.

On Thu, Mar 12, 2015 at 1:38 PM, Konrad Scherer konrad.sche...@windriver.com wrote:
On 03/12/2015 04:54 AM, Aaron Carey wrote:
Hi All,

In setting up our cluster, we require things like Consul to be running on all of our nodes. I was just wondering if there was any sort of best practice (or a scheduler, perhaps) that people could share for this sort of thing?

I am in a similar situation. I want to start a single source-cache (over 200GB) data container on each of my builder nodes. I had the idea of creating a custom resource on each slave and creating a scheduler to handle this resource only. Has anyone tried this? The only problem I can see is that there is no way to prevent another scheduler from taking the offered custom resource, but since it is custom, that seems unlikely.

I would love to use Marathon for this, but it looks like Marathon does not support custom resources and the issue[1] is in the backlog. Perhaps when Mesos and Marathon get dynamic resources[2] support?

[1]: https://github.com/mesosphere/marathon/issues/375
[2]: https://issues.apache.org/jira/browse/MESOS-2018

--
Konrad Scherer, MTS, Linux Products Group, Wind River
Re: Question on Monitoring a Mesos Cluster
The master/cpus_percent metric is nothing more than used / total. Note, however, that it represents resources allocated to tasks; tasks may not use them fully (or may use more, if isolation is not enabled). You can't get actual cluster utilisation from it; the best option is to aggregate the system/* metrics, which report the node load. That, however, includes all processes running on a node, not only Mesos and its tasks. Hope this helps.

On Mon, Mar 9, 2015 at 8:16 AM, Andras Kerekes andras.kere...@ishisystems.com wrote:
We use the same monitoring script from rayrod2030, but instead of master_cpus_percent we use master_cpus_used and master_cpus_total to calculate a percentage. This gives the allocated percentage of CPUs in the cluster; actual utilization is measured by collectd.

-Original Message-
From: rasput...@gmail.com [mailto:rasput...@gmail.com] On Behalf Of Dick Davies
Sent: Saturday, March 07, 2015 2:15 PM
To: user@mesos.apache.org
Subject: Re: Question on Monitoring a Mesos Cluster

Yeah, that confused me too - I think that figure is specific to the master/slave polled (and that'll just be the active one, since you're only reporting when master/elected is true). I'm using this one: https://github.com/rayrod2030/collectd-mesos - not sure if that's the same as yours?

On 7 March 2015 at 18:56, Jeff Schroeder jeffschroe...@computer.org wrote:
Responses inline.

On Sat, Mar 7, 2015 at 12:48 PM, CCAAT cc...@tampabay.rr.com wrote:
... snip ...
After getting everything working, I built a few dashboards, one of which displays these stats from http://master:5050/metrics/snapshot:

master/disk_percent
master/cpus_percent
master/mem_percent

I had assumed that this was something like aggregate cluster utilization, but this seems incorrect in practice. I have a small cluster with ~1T of memory, ~25T of disk, and ~540 CPU cores. I had a dozen or so small tasks running, and launched 500 tasks with 1G of memory and 1 CPU each.
Now I'd expect to see the disk/cpu/mem percentage metrics above go up considerably. I did notice that cpus_percent went to around 0.94. What is the correct way to measure overall cluster utilization for capacity planning? We can have the NOC watch this and simply add more hardware when the number starts getting low.

Boy, I cannot wait to read the tidbits of wisdom here. Maybe the development group has more accurate information, if not some vague roadmap, on resource/process monitoring. Sooner or later this is going to become a quintessential need, so I hope the deep thinkers are all over this need in both the user and dev groups. In fact, the monitoring itself can easily create significant load on the cluster/cloud if one is not judicious in how it is architected, implemented, and dynamically tuned.

Monitoring via passive metrics gathering and application telemetry is one of the best ways to do it. That is how I've implemented things. The beauty of the REST API is that it isn't heavyweight, and every master has it on port 5050 (by default) and every slave has it on port 5051 (by default).

Since I'm throwing this all into Graphite (well, technically Cassandra fronted by cyanite fronted by graphite-api... but same difference), I found a reasonable way to do capacity planning. Collectd will poll the master/slave on each Mesos host every 10 seconds (localhost:5050 on masters and localhost:5051 on slaves). This gets put into Graphite via collectd's write_graphite plugin.
These 3 Graphite targets give me percentages of utilization for nice graphs:

alias(asPercent(collectd.mesos.clustername.gauge-master_cpu_used, collectd.mesos.clustername.gauge-master_cpu_total), "Total CPU Usage")
alias(asPercent(collectd.mesos.clustername.gauge-master_mem_used, collectd.mesos.clustername.gauge-master_mem_total), "Total Memory Usage")
alias(asPercent(collectd.mesos.clustername.gauge-master_disk_used, collectd.mesos.clustername.gauge-master_disk_total), "Total Disk Usage")

With that data, you can have your monitoring tools, such as Nagios/Icinga, poll Graphite. Using the native Graphite render API, you can do things like:

* if the CPU usage is over 80% for 24 hours, send a warning event
* if the CPU usage is over 95% for 6 hours, send a critical event

This allows mostly no-impact monitoring, since the monitoring tools are hitting Graphite.

Anyway, back to the original questions: how does everyone do proper monitoring and capacity planning for large Mesos clusters? I expect my cluster to grow beyond what it currently is by quite a bit.

--
Jeff Schroeder
Don't drink and derive, alcohol and analysis don't mix.
http://www.digitalprognosis.com
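The used/total computation Andras describes only needs two gauges from the master's /metrics/snapshot endpoint. A small Go sketch of the arithmetic (the HTTP fetch is elided; the JSON below is a trimmed, illustrative snapshot, not a real payload):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// percent returns 100*used/total for a pair of gauges from a
// /metrics/snapshot payload, or 0 when the total is missing or zero.
func percent(snapshot map[string]float64, used, total string) float64 {
	t := snapshot[total]
	if t == 0 {
		return 0
	}
	return 100 * snapshot[used] / t
}

func main() {
	// Trimmed example payload; a real snapshot has many more keys.
	raw := `{"master/cpus_used": 500, "master/cpus_total": 540}`
	var snapshot map[string]float64
	if err := json.Unmarshal([]byte(raw), &snapshot); err != nil {
		panic(err)
	}
	p := percent(snapshot, "master/cpus_used", "master/cpus_total")
	fmt.Printf("%.1f%% of cluster CPUs allocated\n", p)
}
```

With 500 of ~540 cores allocated, this lands near the ~0.94 cpus_percent the thread observed; remember it measures allocation, not actual utilization.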
Re: Compiling Slave on Windows
Hi Alexandre, sorry for the tardy reply.

Mesos masters and slaves (or workers, as per MESOS-1478) communicate via protobuf messages. Any agent that understands these messages can be (or pretend to be) a Mesos slave, so the answer to your question is yes: it is possible to provide an alternative slave implementation. The question is what such a slave will do with the tasks it gets from the master after being successfully registered. But since a simplified version suffices for you, you can start such a slave with custom resources, say win-cpus:4;win-mem:1024. Since there will be no overlap in resources between standard slaves and custom ones, you will have two independent subclusters in your cluster, with your win tasks sent only to the win subcluster. Does that make sense?

In reality, implementing (and maintaining) such a slave is a lot of work. Anyway, I would be happy to see, and help out with, this effort if you decide to work on it.

Alex
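Alex's subcluster idea rests on the scheduler acting only on offers that advertise the custom resource kinds. A sketch of that filtering in Go, where the Offer type is a simplified stand-in for the protobuf message (not the mesos-go API) and the slave names are made up:

```go
package main

import "fmt"

// Offer is a simplified stand-in for a Mesos resource offer.
type Offer struct {
	Slave     string
	Resources map[string]float64 // resource name -> scalar amount
}

// winOffers keeps only offers advertising the custom win-cpus
// resource, i.e. offers coming from the "win subcluster".
func winOffers(offers []Offer) []Offer {
	var out []Offer
	for _, o := range offers {
		if o.Resources["win-cpus"] > 0 {
			out = append(out, o)
		}
	}
	return out
}

func main() {
	offers := []Offer{
		{Slave: "linux-1", Resources: map[string]float64{"cpus": 8, "mem": 16384}},
		{Slave: "win-1", Resources: map[string]float64{"win-cpus": 4, "win-mem": 1024}},
	}
	for _, o := range winOffers(offers) {
		fmt.Println("launch win task on", o.Slave) // prints: launch win task on win-1
	}
}
```

Standard frameworks never see usable win-cpus amounts, which is what keeps the two subclusters independent.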
Re: Weird behavior when stopping the mesos master leader of a HA mesos cluster
Geoffroy, could you please provide master logs (both from the killed master and the one taking over)?

On Fri, Mar 6, 2015 at 4:26 AM, Geoffroy Jabouley geoffroy.jabou...@gmail.com wrote:
Hello,

we are facing some unexpected issues when testing the high-availability behavior of our Mesos cluster.

*Our use case:*
*State*: the Mesos cluster is up (3 machines), 1 Docker task is running on each slave (started from Marathon)
*Action*: stop the Mesos master leader process
*Expected*: the Mesos master leader has changed, *active tasks remain unchanged*
*Seen*: the Mesos master leader has changed, *all active tasks are now FAILED but the Docker containers are still running*; Marathon detects the FAILED tasks and starts new ones. We end up with 2 Docker containers running on each machine, but only one is linked to a RUNNING Mesos task.

Is the seen behavior correct? Have we misunderstood the high-availability concept? We thought this use case would not have any impact on the current cluster state (except leader re-election).

Thanks in advance for your help.

Regards

---
Our setup is the following: 3 identical Mesos nodes, each with:
+ zookeeper
+ docker 1.5
+ mesos master 0.21.1 configured in HA mode
+ mesos slave 0.21.1 configured with checkpointing, strict and reconnect
+ marathon 0.8.0 configured in HA mode with checkpointing

---
Command lines:

*mesos-master*
/usr/sbin/mesos-master --zk=zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/mesos --port=5050 --cluster=ECP_FFaaS_Cluster --hostname=10.195.30.19 --ip=10.195.30.19 --quorum=2 --slave_reregister_timeout=1days --work_dir=/var/lib/mesos

*mesos-slave*
/usr/sbin/mesos-slave --master=zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/mesos --checkpoint --containerizers=docker,mesos --executor_registration_timeout=5mins --hostname=10.195.30.19 --ip=10.195.30.19 --isolation=cgroups/cpu,cgroups/mem --recover=reconnect --recovery_timeout=120mins --strict --resources=ports:[31000-32000,80,443]

*marathon*
java -Djava.library.path=/usr/local/lib:/usr/lib:/usr/lib64 -Djava.util.logging.SimpleFormatter.format=%2$s%5$s%6$s%n -Xmx512m -cp /usr/bin/marathon mesosphere.marathon.Main --local_port_max 32000 --local_port_min 31000 --task_launch_timeout 30 --http_port 8080 --hostname 10.195.30.19 --event_subscriber http_callback --ha --https_port 8443 --checkpoint --zk zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/marathon --master zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/mesos
Re: multiple frameworks or one big one
Not exactly the Enterprise-oriented benchmark, but it may give some insight:
https://mesos.apache.org/documentation/latest/powered-by-mesos/
https://www.youtube.com/watch?v=LQxnuPcRl4st=1m31s

On Wed, Mar 4, 2015 at 2:18 AM, Diego Medina di...@fmpwizard.com wrote:
Well, I deeply think that in addition to the architecture and organisational concerns, Mesos needs to provide some Enterprise-oriented benchmarks and proof to be really ready for prime time in the enterprise world, and not only in startup-style enterprises. But that's not the topic of your post, and I'll make my own post regarding this specific topic.

Looking forward to that discussion.

Anyway, thank you very much for your answers. Regarding your choice of Golang instead of Scala because of some pain points, could you send me some examples (other than compile time)? Even in private, if you do not want to hijack the thread, as I'm really balancing between those two.

I'll send you a separate private message with the reply; I don't mind talking about it, but I wouldn't want to distract this mailing list with the topic.

Thanks
Diego

2015-03-03 14:26 GMT+01:00 Diego Medina di...@fmpwizard.com:
Hi Alex,

On Tue, Mar 3, 2015 at 7:37 AM, Alex Rukletsov a...@mesosphere.io wrote:
The next good big thing would be to handle task state updates. Instead of dying on TASK_LOST, you may want to retry the task several times.

Yes, this is definitely something I need to address. For now I use it to help me find bugs in the code: if the app stops, I know I did something wrong :) I also need to find out why some tasks stay in the Staging state in the Mesos UI, but I'll start a separate thread for that.

Thanks
Diego

On Tue, Mar 3, 2015 at 10:38 AM, Billy Bones gael.ther...@gmail.com wrote:
Oh, and you've got a glitch in one of your executor names in your first code block.
You've got:

extractorExe := mesos.ExecutorInfo{
    ExecutorId: util.NewExecutorID("owl-cralwer-extractor"),
    Name:       proto.String("OwlCralwer Fetcher"),
    Source:     proto.String("owl-cralwer"),
    Command: mesos.CommandInfo{
        Value: proto.String(extractorExecutorCommand),
        Uris:  executorUris,
    },
}

It should rather be:

extractorExe := mesos.ExecutorInfo{
    ExecutorId: util.NewExecutorID("owl-cralwer-extractor"),
    Name:       proto.String("OwlCralwer Extractor"),
    Source:     proto.String("owl-cralwer"),
    Command: mesos.CommandInfo{
        Value: proto.String(extractorExecutorCommand),
        Uris:  executorUris,
    },
}

2015-03-03 10:28 GMT+01:00 Billy Bones gael.ther...@gmail.com:
Hi Diego, did you already plan to benchmark your results on the Mesos platform vs. bare-metal servers? It would be really interesting for Enterprise evangelism; they love benchmarks and metrics.

I'm impressed by your project and how fast it is going. I'm myself a fan of Golang, but why did you choose it?

2015-03-03 3:03 GMT+01:00 Diego Medina di...@fmpwizard.com:
Hi everyone,

Based on all the great feedback I got here, I updated the code and now I have one scheduler and two executors: one for fetching HTML and a second one that extracts links and text from the HTML. I also run the actual work on its own goroutines (like threads, for those not familiar with Go), and it's working great. I wrote about the changes here:
http://blog.fmpwizard.com/blog/owlcrawler-multiple-executors-using-meso
and you can find the updated code here:
https://github.com/fmpwizard/owlcrawler

Again, thanks everyone for your input.

Diego

On Fri, Feb 27, 2015 at 1:52 PM, Diego Medina di...@fmpwizard.com wrote:
Thanks for looking at the code and the feedback, Alex. I'll be working on those changes later tonight!

Diego

On Fri, Feb 27, 2015 at 12:15 PM, Alex Rukletsov a...@mesosphere.io wrote:
Diego,

I've checked your code - nice effort! Great to see people hacking with Mesos and the Go bindings! One thing though: you do the actual job in the launchTask() of your executor.
This prevents you from running multiple tasks in parallel on one executor, which means you can't have more simultaneous tasks than executors in your cluster. You may want to spawn a thread for every incoming task and do the job there, while launchTask() does solely task initialization (basically, starting a thread). Check the project John referenced: https://github.com/mesosphere/RENDLER.

Best,
Alex

On Fri, Feb 27, 2015 at 11:03 AM, Diego Medina di...@fmpwizard.com wrote:
Hi Billy, comments inline:

On Fri, Feb 27, 2015 at 4:07 AM, Billy Bones gael.ther...@gmail.com wrote:
Hi Diego, as a real fan of Golang, kudos and applause for your work on this distributed crawler - I hope you'll finally release it ;-)

Thanks! My 3-month-old baby is making sure I don't sleep much and have plenty of time to work on this project :)

About your question, the common architecture is to have one scheduler and multiple
Re: multiple frameworks or one big one
Diego,

I've checked your code - nice effort! Great to see people hacking with Mesos and the Go bindings! One thing though: you do the actual job in the launchTask() of your executor. This prevents you from running multiple tasks in parallel on one executor, which means you can't have more simultaneous tasks than executors in your cluster. You may want to spawn a thread for every incoming task and do the job there, while launchTask() does solely task initialization (basically, starting a thread). Check the project John referenced: https://github.com/mesosphere/RENDLER.

Best,
Alex

On Fri, Feb 27, 2015 at 11:03 AM, Diego Medina di...@fmpwizard.com wrote:
Hi Billy, comments inline:

On Fri, Feb 27, 2015 at 4:07 AM, Billy Bones gael.ther...@gmail.com wrote:
Hi Diego, as a real fan of Golang, kudos and applause for your work on this distributed crawler - I hope you'll finally release it ;-)

Thanks! My 3-month-old baby is making sure I don't sleep much and have plenty of time to work on this project :)

About your question, the common architecture is to have one scheduler and multiple executors rather than one big executor. The basic idea of Mesos is to take all resources, pool them together, and then swarm tasks onto this pool. So the architecture of your application should share this philosophy: explode/decouple your application as much as possible, but be careful not to lock yourself up on threads and tasks if they're interdependent. I don't know if I'm explaining myself correctly, so do not hesitate to ask if you need more clarification.

Your answer was very clear. Today I started to split the executor into two: one that simply fetches the HTML, and a second one that extracts text without tags from it; this second executor gets the data from a database. So far it seems like a natural way to split the tasks. I was going with the idea of also having two schedulers, but I think I just figured out how to use just one.

Thanks!
Diego

2015-02-26 21:50 GMT+01:00 Diego Medina di...@fmpwizard.com:
@John: thanks for the link. I see that RENDLER uses the ExecutorId from ExecutorInfo to decide what to do; I'll give this a try.

@Craig: you are right - after I sent the email I continued to read more of the Mesos docs and saw that I had used the wrong term, where I meant scheduler instead of framework. Thanks.

Thanks, and looking forward to any other feedback you may all have.

Diego

On Thu, Feb 26, 2015 at 5:24 AM, craig w codecr...@gmail.com wrote:
Diego,

I'm also interested in hearing feedback on your question. One minor thing I'd point out is that a framework is made up of a scheduler and executor(s), so I think it's more correct to say you've created a scheduler (instead of "one big framework") and an executor.

Anyhow, for what it's worth, the Aurora framework has multiple executors (https://github.com/apache/incubator-aurora/blob/master/examples/vagrant/aurorabuild.sh#L61). You might pop into the #aurora IRC chat room and ask; usually a few Aurora contributors are in there answering questions when they can.

On Wed, Feb 25, 2015 at 9:02 PM, John Pampuch j...@mesosphere.io wrote:
Diego-

You might want to look at this project for some insights: https://github.com/mesosphere/RENDLER

-John

On Wed, Feb 25, 2015 at 5:27 PM, Diego Medina di...@fmpwizard.com wrote:
Hi,

Short question: when writing a Mesos app, is it better to have one big framework and executor with if statements to select what to do, or several smaller framework-executor pairs?

Longer question: last week I started a side project based on Mesos (using Go):
http://blog.fmpwizard.com/blog/web-crawler-using-mesos-and-golang
https://github.com/fmpwizard/owlcrawler

It's a web crawler written on top of Mesos. The very first version of it had a framework that sent a task to an executor, and that single executor would fetch the page, extract links from the HTML, and then send them to a message queue.
Then the framework reads that queue and starts again: runs the executor, and so on. Now I'm splitting fetching the HTML and extracting links into two separate tasks, and putting those two tasks in the same executor doesn't feel right, so I'm thinking that I need at least two different executors and one framework. But then I wonder whether people more experienced with Mesos would normally write several framework-executor pairs to keep the design cleaner. In this particular case, I can see the project growing into even more tasks that can be decoupled.

Any feedback on the design would be great, and let me know if I should explain this better.

Thanks
Diego

--
Diego Medina
Lift/Scala consultant
di...@fmpwizard.com
http://fmpwizard.telegr.am

--
https://github.com/mindscratch
https://www.google.com/+CraigWickesser
https://twitter.com/mind_scratch
https://twitter.com/craig_links

--
Diego Medina
Lift/Scala consultant
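The advice repeated in this thread - keep launchTask cheap and do the crawling in its own goroutine - can be sketched as below. The types and sendStatusUpdate are simplified stand-ins for the mesos-go messages and driver call, not the real API:

```go
package main

import (
	"fmt"
	"sync"
)

// Task is a simplified stand-in for the mesos-go TaskInfo message.
type Task struct{ ID string }

var (
	wg      sync.WaitGroup
	mu      sync.Mutex
	updates []string // records what a real driver would send to the master
)

// sendStatusUpdate stands in for the driver's status-update call.
func sendStatusUpdate(id, state string) {
	mu.Lock()
	defer mu.Unlock()
	updates = append(updates, id+" "+state)
}

// crawl stands in for the real fetch/extract work of a task.
func crawl(t Task) error { return nil }

// launchTask only acknowledges the task and hands the real work to a
// goroutine, so one executor can run many tasks concurrently.
func launchTask(t Task) {
	sendStatusUpdate(t.ID, "TASK_RUNNING")
	wg.Add(1)
	go func() {
		defer wg.Done()
		if err := crawl(t); err != nil {
			sendStatusUpdate(t.ID, "TASK_FAILED")
			return
		}
		sendStatusUpdate(t.ID, "TASK_FINISHED")
	}()
}

func main() {
	for i := 0; i < 3; i++ {
		launchTask(Task{ID: fmt.Sprintf("task-%d", i)})
	}
	wg.Wait() // a real executor would instead wait for shutdown
	fmt.Println(len(updates), "status updates sent")
}
```

With work moved off the launch path, the number of simultaneous tasks is bounded by resources rather than by the number of executors.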
Re: Mesos-slave start error
Hi Siva, it looks like you bumped into https://issues.apache.org/jira/browse/MESOS-2276. Feel free to upvote!

On Thu, Feb 5, 2015 at 1:56 PM, Sivaram Kannan sivara...@gmail.com wrote:
Hi,

In our deployments of mesos-slave, we are getting the following error during start-up. I understand the slave is failing due to a large number of fds being opened. I have increased the ulimit on fds from 1024 to 4096, but still see the same behavior. What can I do to solve this problem, and what should I do to prevent it?

Thanks,
./Siva.

Initiating client connection, host=11.0.190.1:2181 sessionTimeout=1 watcher=0x7f6de4
Feb 05 12:33:58 node-d4856455ad5c sh[32162]: I0205 12:33:58.076289 15 slave.cpp:169] Slave started on 1)@11.1.6.1:5051
Feb 05 12:33:58 node-d4856455ad5c sh[32162]: I0205 12:33:58.076544 15 slave.cpp:289] Slave resources: cpus(*):24; mem(*):47336; disk(*):469416; ports(*):[31000-32000]
Feb 05 12:33:58 node-d4856455ad5c sh[32162]: I0205 12:33:58.076575 15 slave.cpp:318] Slave hostname: 11.1.6.1
Feb 05 12:33:58 node-d4856455ad5c sh[32162]: I0205 12:33:58.076582 15 slave.cpp:319] Slave checkpoint: true
Feb 05 12:33:58 node-d4856455ad5c sh[32162]: I0205 12:33:58.078135 25 state.cpp:33] Recovering state from '/var/lib/mesos/slave/meta'
Feb 05 12:33:58 node-d4856455ad5c sh[32162]: I0205 12:33:58.078233 20 status_update_manager.cpp:197] Recovering status update manager
Feb 05 12:33:58 node-d4856455ad5c sh[32162]: I0205 12:33:58.078333 20 docker.cpp:767] Recovering Docker containers
Feb 05 12:33:58 node-d4856455ad5c sh[32162]: 2015-02-05 12:33:58,102:6(0x7f6dc3fff700):ZOO_INFO@check_events@1703: initiated connection to server [11.0.190.1:2181]
Feb 05 12:33:58 node-d4856455ad5c sh[32162]: 2015-02-05 12:33:58,104:6(0x7f6dc3fff700):ZOO_INFO@check_events@1750: session establishment complete on server [11.0.190.1:2181], sessionId=0x14b3c82555299c7,
Feb 05 12:33:58 node-d4856455ad5c sh[32162]: I0205 12:33:58.104671 30 group.cpp:313] Group process (group(1)@11.1.6.1:5051) connected to ZooKeeper
Feb 05 12:33:58 node-d4856455ad5c sh[32162]: I0205 12:33:58.104708 30 group.cpp:790] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
Feb 05 12:33:58 node-d4856455ad5c sh[32162]: I0205 12:33:58.104725 30 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
Feb 05 12:33:58 node-d4856455ad5c sh[32162]: I0205 12:33:58.106376 22 detector.cpp:138] Detected a new leader: (id='3')
Feb 05 12:33:58 node-d4856455ad5c sh[32162]: I0205 12:33:58.106477 25 group.cpp:659] Trying to get '/mesos/info_03' in ZooKeeper
Feb 05 12:33:58 node-d4856455ad5c sh[32162]: I0205 12:33:58.107293 30 detector.cpp:433] A new leading master (UPID=master@11.1.4.1:5050) is detected
Feb 05 12:33:58 node-d4856455ad5c sh[32162]: Failed to perform recovery: Collect failed: Failed to create pipe: Too many open files
Feb 05 12:33:58 node-d4856455ad5c sh[32162]: To remedy this do as follows:
Feb 05 12:33:58 node-d4856455ad5c sh[32162]: Step 1: rm -f /var/lib/mesos/slave/meta/slaves/latest
Feb 05 12:33:58 node-d4856455ad5c sh[32162]: This ensures slave doesn't recover old live executors.
Feb 05 12:33:58 node-d4856455ad5c sh[32162]: Step 2: Restart the slave.
Feb 05 12:33:58 node-d4856455ad5c systemd[1]: mesos-slave.service: main process exited, code=exited, status=1/FAILURE
Feb 05 12:33:58 node-d4856455ad5c docker[3351]: mesos_slave
Feb 05 12:33:58 node-d4856455ad5c systemd[1]: Unit mesos-slave.service entered failed state.
Re: Unable to follow Sandbox links from Mesos UI.
Let's make sure instead of assuming. Could you please add this line:

console.log('Fetch url:', url);

between lines 17 and 18, click the link, copy the output from the Firebug or Chrome dev console, and paste it here together with the link corresponding to the Download button?

Thanks,
Alex

On Tue, Jan 27, 2015 at 9:19 PM, Dan Dong dongda...@gmail.com wrote:
Hi, All,

I checked again: when I hover the cursor over stdout/stderr, the links point to the IP address of the master node, which is why nothing shows up when clicking on them. The Download button beside them points correctly to the IP address of the slave node, so I can download the files without problem. Seems like a config problem somewhere? A bit lost here. So it seems the host in the pailer function is resolved to the master instead of the slave node:

14 // Invokes the pailer for the specified host and path using the
15 // specified window_title.
16 function pailer(host, path, window_title) {
17   var url = 'http://' + host + '/files/read.json?path=' + path;
18   var pailer =
19     window.open('/static/pailer.html', url, 'width=580px, height=700px');
20
21   // Need to use window.onload instead of document.ready to make
22   // sure the title doesn't get overwritten.
23   pailer.onload = function() {
24     pailer.document.title = window_title + ' (' + host + ')';
25   };
26 }
27

Cheers,
Dan

2015-01-27 2:51 GMT-06:00 Alex Rukletsov a...@mesosphere.io:
Dan,

the link to the sandbox on a slave is prepared in the JS. As Geoffroy suggests, could you please check that the JS code works correctly and that the url is constructed properly (see controllers.js:16)? If everything's fine on your side, could you please file a JIRA for this issue?

On Tue, Jan 27, 2015 at 8:21 AM, Geoffroy Jabouley geoffroy.jabou...@gmail.com wrote:
Hello,

just in case, which internet browser are you using? Have you installed any extensions (NoScript, Ghostery, ...) that could prevent the /static/pailer display?
I personally use NoScript with Firefox, and I have to turn it off for all the IPs of our cluster to correctly access slave information from the Mesos UI.

My 2 cents.

Regards

2015-01-26 21:08 GMT+01:00 Suijian Zhou suijian.z...@ige-project.eu:
Hi, Alex,

Yes, I can see the link points to the slave machine when I hover over the Download button, and stdout/stderr can be downloaded. So do you mean it is expected/designed that clicking on stdout/stderr themselves will not show you anything? Thanks!

Cheers,
Dan

2015-01-26 7:44 GMT-06:00 Alex Rukletsov a...@mesosphere.io:
Dan,

that's correct. The 'static/pailer.html' is a page that lives on the master, and it gets a url to the actual slave as a parameter. The url is computed in 'controllers.js' based on where the associated executor lives. You should see this 'actual' url if you hover over the Download button. Please check this url for correctness and that you can access it from your browser.

On Fri, Jan 23, 2015 at 9:24 PM, Dan Dong dongda...@gmail.com wrote:
I see the problem: when I move the cursor onto the link, e.g. stderr, it actually points to the IP address of the master machine, so it tries to follow a link like Master_IP:/tmp/mesos/slaves/... which is not there. So why does the link not point to the IP address of the slaves (config problem somewhere)?

Cheers,
Dan

2015-01-23 11:25 GMT-06:00 Dick Davies d...@hellooperator.net:
Start with 'inspect element' in the browser and see if that gives any clues. Sounds like your network is a little strict, so it may be that something else needs opening up.

On 23 January 2015 at 16:56, Dan Dong dongda...@gmail.com wrote:
Hi, Alex,

That is what I expected, but when I click on it, it pops up a new blank window (pailer.html) without the content of the file (9KB size). Any hints?

Cheers,
Dan

2015-01-23 4:37 GMT-06:00 Alex Rukletsov a...@mesosphere.io:
Dan, you should be able to view file contents just by clicking on the link.
On Thu, Jan 22, 2015 at 9:57 PM, Dan Dong dongda...@gmail.com wrote:
Yes, --hostname solves the problem. Now I can see all the files there, like stdout, stderr, etc., but when I click on e.g. stdout, it pops up a new blank window (pailer.html) without the content of the file (9KB size). Although it provides a Download link beside it, it would be much more convenient if one could view stdout and stderr directly. Is this normal, or is there still a problem in my environment? Thanks!

Cheers,
Dan

2015-01-22 11:33 GMT-06:00 Adam Bordelon a...@mesosphere.io:
Try the --hostname parameter for the master/slave. If you want to be extra explicit about the IP (e.g. publish the public IP instead of the private one in a cloud environment), you
Re: Trying to debug an issue in mesos task tracking
Itamar, you are right, the Mesos executor and containerizer cannot distinguish between busy and stuck processes. However, since you use your own custom executor, you may want to implement some sort of health check. It depends on what your task processes are doing. There are hundreds of reasons why an OS process may get stuck; it doesn't look like it's Mesos-related in this case.

On Sat, Jan 24, 2015 at 9:17 PM, Itamar Ostricher ita...@yowza3d.com wrote: Alex, Sharma, thanks for your input! Trying to recreate the issue with a small cluster for the last few days, I was not able to observe a scenario where I could be sure that my executor sent the TASK_FINISHED update but the scheduler did not receive it. I did observe multiple times a scenario where a task seemed to be stuck in the TASK_RUNNING state, but when I SSH'ed into the slave that had the task, I always saw that the process related to that task was still running (by grepping `ps aux`). Most of the time, it seemed that the process did the work (judging by the logs produced by the PID), but for some reason it was stuck without exiting cleanly. Sometimes it seemed that the process didn't do any work (an empty log file for the PID). Every time, as soon as I killed the PID, a TASK_FAILED update was sent and received successfully. So, it seems that the problem is in processes spawned by my executor, but I don't fully understand why this happens. Any ideas why a process would do some work (either 1%, just creating a log file, or 99%, doing everything but not exiting) and then get stuck?

On Fri, Jan 23, 2015 at 1:01 PM, Alex Rukletsov a...@mesosphere.io wrote: Itamar, beyond checking master and slave logs, could you please verify that your executor does send the TASK_FINISHED update? You may want to add some logging and then check the executor log. Mesos guarantees the delivery of status updates, so I suspect the problem is on the executor's side.
On Wed, Jan 21, 2015 at 6:58 PM, Sharma Podila spod...@netflix.com wrote: Have you checked the mesos-slave and mesos-master logs for that task id? There should be logs in there for task state updates, including FINISHED. There can be specific cases where sometimes the task status is not reliably sent to your scheduler (due to mesos-master restarts, leader election changes, etc.). There is task reconciliation support in Mesos. A periodic call to reconcile tasks from the scheduler can be helpful. There are also newer enhancements coming to task reconciliation. In the meantime, there are other strategies such as the one I use, which is periodic heartbeats from my custom executor to my scheduler (out of band). Timeouts for task runtimes are similar to heartbeats, except that you need a priori knowledge of all tasks' runtimes. Task runtime limits are not supported inherently, as far as I know. Your executor can implement them, and that may be one simple way to do it. That could also be a good way to implement the shell's rlimit*, in general.

On Wed, Jan 21, 2015 at 1:22 AM, Itamar Ostricher ita...@yowza3d.com wrote: I'm using a custom internal framework, loosely based on MesosSubmit. The phenomenon I'm seeing is something like this: 1. Task X is assigned to slave S. 2. I know this task should run for ~10 minutes. 3. On the master dashboard, I see that task X is in the Running state for several *hours*. 4. I SSH into slave S, and see that task X is *not* running. According to the local logs on that slave, task X finished a long time ago, and seemed to finish OK. 5. According to the scheduler logs, it never got any update from task X after the Staging->Running update. The phenomenon occurs pretty often, but it's not consistent or deterministic. I'd appreciate your input on how to go about debugging it, and/or implementing a workaround to avoid wasted resources.
I'm pretty sure the executor on the slave sends the TASK_FINISHED status update (how can I verify that beyond my own logging?). I'm pretty sure the scheduler never receives that update (again, how can I verify that beyond my own logging?). I have no idea if the master got the update and passed it through (how can I check that?). My scheduler and executor are written in Python. As for a workaround: setting a timeout on a task should do the trick. I did not see any timeout field in the TaskInfo message. Does Mesos support the concept of per-task timeouts? Or should I implement my own task tracking and timeout mechanism in the scheduler?
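Since TaskInfo indeed has no timeout field in this era of Mesos, the scheduler-side tracking asked about above could be a small deadline table like the following sketch. The bookkeeping class is invented for illustration; only the idea (scheduler records a deadline per launched task and periodically sweeps for overdue ones) comes from the thread.

```python
# Sketch of scheduler-side per-task timeout tracking, since Mesos (as of
# this thread, ~0.21/0.22) has no timeout field in TaskInfo. The class name
# and API are illustrative assumptions, not a Mesos feature.
import time


class DeadlineTracker:
    def __init__(self):
        self._deadlines = {}  # task_id -> absolute deadline (epoch seconds)

    def launched(self, task_id, timeout_secs):
        """Record a deadline when the task is launched."""
        self._deadlines[task_id] = time.time() + timeout_secs

    def finished(self, task_id):
        """Forget the task on any terminal status update."""
        self._deadlines.pop(task_id, None)

    def overdue(self, now=None):
        """Return task ids whose deadline has passed."""
        now = time.time() if now is None else now
        return [t for t, d in self._deadlines.items() if d < now]
```

On a timer (or each offer cycle) the scheduler would then act on `overdue()`, presumably by killing each overdue task via the driver and, as Sharma suggests, periodically reconciling task state with the master to resync anything lost to master restarts or failovers.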
Re: Unable to follow Sandbox links from Mesos UI.
Dan, you should be able to view file contents just by clicking on the link.

On Thu, Jan 22, 2015 at 9:57 PM, Dan Dong dongda...@gmail.com wrote: Yes, --hostname solves the problem. Now I can see all the files there, like stdout, stderr etc., but when I click on e.g. stdout, it pops up a new blank window (pailer.html) without the content of the file (9KB size). Although it provides a Download link beside it, it would be much more convenient if one could view stdout and stderr directly. Is this normal, or is there still a problem in my environment? Thanks! Cheers, Dan

2015-01-22 11:33 GMT-06:00 Adam Bordelon a...@mesosphere.io: Try the --hostname parameters for master/slave. If you want to be extra explicit about the IP (e.g. publish the public IP instead of the private one in a cloud environment), you can also set the --ip parameter on master/slave.

On Thu, Jan 22, 2015 at 8:43 AM, Dan Dong dongda...@gmail.com wrote: Thanks Ryan, yes, from the machine where the browser is, the slave hostnames could not be resolved, so that's the cause of the failure, but it can reach them by IP address (I don't think the sysadmin would like to add those VM entries to /etc/hosts on the server). I tried to change the masters and slaves of Mesos to IP addresses instead of hostnames, but the UI still points to the hostnames of the slaves. Is there a way to make Mesos use only the IP addresses of the master and slaves? Cheers, Dan

2015-01-22 9:48 GMT-06:00 Ryan Thomas r.n.tho...@gmail.com: It is a request from your browser session, not from the master, that goes to the slaves - so in order to view the sandbox you need to ensure that the machine your browser is on can resolve and route to the masters _and_ the slaves. The master doesn't proxy the sandbox requests through itself (yet) - they are made directly from your browser instance to the slaves. Make sure you can resolve the slaves from the machine you're browsing the UI on.
Cheers, ryan

On 22 January 2015 at 15:42, Dan Dong dongda...@gmail.com wrote: Thank you all, the master and slaves can resolve each other's hostnames and ssh login without a password, and firewalls have been switched off on all the machines too. So I'm confused about what would block the UI from pulling this info from the slaves. Cheers, Dan

2015-01-21 16:35 GMT-06:00 Cody Maloney c...@mesosphere.io: Also see https://issues.apache.org/jira/browse/MESOS-2129 if you want to track progress on changing this. Unfortunately it is on hold for me at the moment to fix. Cody

On Wed, Jan 21, 2015 at 2:07 PM, Ryan Thomas r.n.tho...@gmail.com wrote: Hey Dan, The UI will attempt to pull that info directly from the slave, so you need to make sure the host is resolvable and routable from your browser. Cheers, Ryan From my phone

On Wednesday, 21 January 2015, Dan Dong dongda...@gmail.com wrote: Hi, All, When I try to access a sandbox in the Mesos UI, I see the following info (the same error appears for every slave sandbox): Failed to connect to slave '20150115-144719-3205108908-5050-4552-S0' on 'centos-2.local:5051'. Potential reasons: The slave's hostname, 'centos-2.local', is not accessible from your network. The slave's port, '5051', is not accessible from your network. I checked that: slave centos-2.local can be logged into from any machine in the cluster without a password via ssh centos-2.local; port 5051 on slave centos-2.local can be reached from the master via telnet centos-2.local 5051. The stdout and stderr files are there under each slave's /tmp/mesos/..., but it seems the Mesos UI just cannot access them. (And both the master and slaves are in the same network IP range.) Should I open any port on the slaves? Any hint what the problem is here? Cheers, Dan
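To make the UI emit IP-based links as Dan asks, the `--hostname` flag mentioned by Adam can simply be set to the IP itself on each daemon. A sketch of the invocations, with placeholder addresses (192.0.2.x) and a `--work_dir` that may differ in your setup:

```shell
# Sketch: advertise IPs instead of hostnames so the browser can reach the
# slaves directly (sandbox requests go browser -> slave, never through the
# master). All addresses below are placeholders.
mesos-master --ip=192.0.2.10 --hostname=192.0.2.10 \
             --work_dir=/var/lib/mesos

mesos-slave  --master=192.0.2.10:5050 \
             --ip=192.0.2.21 --hostname=192.0.2.21
```

Whatever value `--hostname` carries is what the master's web UI hands to pailer.html, so setting it to a routable IP sidesteps the DNS resolution problem entirely.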
Re: Trying to debug an issue in mesos task tracking
Itamar, beyond checking master and slave logs, could you please verify that your executor does send the TASK_FINISHED update? You may want to add some logging and then check the executor log. Mesos guarantees the delivery of status updates, so I suspect the problem is on the executor's side.

On Wed, Jan 21, 2015 at 6:58 PM, Sharma Podila spod...@netflix.com wrote: Have you checked the mesos-slave and mesos-master logs for that task id? There should be logs in there for task state updates, including FINISHED. There can be specific cases where sometimes the task status is not reliably sent to your scheduler (due to mesos-master restarts, leader election changes, etc.). There is task reconciliation support in Mesos. A periodic call to reconcile tasks from the scheduler can be helpful. There are also newer enhancements coming to task reconciliation. In the meantime, there are other strategies such as the one I use, which is periodic heartbeats from my custom executor to my scheduler (out of band). Timeouts for task runtimes are similar to heartbeats, except that you need a priori knowledge of all tasks' runtimes. Task runtime limits are not supported inherently, as far as I know. Your executor can implement them, and that may be one simple way to do it. That could also be a good way to implement the shell's rlimit*, in general.

On Wed, Jan 21, 2015 at 1:22 AM, Itamar Ostricher ita...@yowza3d.com wrote: I'm using a custom internal framework, loosely based on MesosSubmit. The phenomenon I'm seeing is something like this: 1. Task X is assigned to slave S. 2. I know this task should run for ~10 minutes. 3. On the master dashboard, I see that task X is in the Running state for several *hours*. 4. I SSH into slave S, and see that task X is *not* running. According to the local logs on that slave, task X finished a long time ago, and seemed to finish OK. 5. According to the scheduler logs, it never got any update from task X after the Staging->Running update.
The phenomenon occurs pretty often, but it's not consistent or deterministic. I'd appreciate your input on how to go about debugging it, and/or implementing a workaround to avoid wasted resources. I'm pretty sure the executor on the slave sends the TASK_FINISHED status update (how can I verify that beyond my own logging?). I'm pretty sure the scheduler never receives that update (again, how can I verify that beyond my own logging?). I have no idea if the master got the update and passed it through (how can I check that?). My scheduler and executor are written in Python. As for a workaround: setting a timeout on a task should do the trick. I did not see any timeout field in the TaskInfo message. Does Mesos support the concept of per-task timeouts? Or should I implement my own task tracking and timeout mechanism in the scheduler?