Re: Mesos cluster resources utilization
Yaron, I meant by comparing the available info. You could query Marathon's /v2/apps endpoint to get the list of pending tasks and the resources requested for each of them, and you could check the Mesos master and slave /statistics.json to see the total amount of unallocated resources. From those two you can estimate how many additional resources you need for how many instances (may need unique hosts) of pending tasks. Then you would have to map this onto a request in a (cloud) provisioning tool for X more nodes with Y resources each.
Alternatively, you could use this same information, along with some notion of relative priority, to kill off (and scale down) lower priority tasks until you have enough resources to satisfy your higher priority tasks.
On Mon, May 4, 2015 at 10:32 AM, Tim Chen t...@mesosphere.io wrote:
Hi Yaron, Marathon itself has its own REST endpoint you can hit (/v2/apps) that will return all the apps and tasks information, so you can see how many of the apps are launched and how many are still pending. Tim
On Mon, May 4, 2015 at 5:28 AM, Yaron Rosenbaum yaron.rosenb...@gmail.com wrote:
Hi Adam, For example, with Marathon - how can I get the list of pending tasks? And by “how many additional nodes you would need to satisfy them” - do you mean by comparing the two? Or are there statistics for that too? Thanks (Y)
On May 3, 2015, at 10:10 AM, Adam Bordelon a...@mesosphere.io wrote:
Yaron, You could use the /statistics.json endpoints to monitor the cpu/memory allocation across your cluster, even on individual nodes. Only individual frameworks know their own pending tasks and how many additional resources you would need to satisfy them. Given these pieces of information, you should be able to trigger your own auto-provisioning mechanism.
On Fri, May 1, 2015 at 11:18 AM, Yaron Rosenbaum yaron.rosenb...@gmail.com wrote:
Hi, Is there a way in mesos / marathon to know that tasks cannot be assigned due to lack of resources? Or in other words - when to add mesos-slaves to the cluster? Or even more specifically, what amount of resources is missing (or in excess) given the current tasks and slaves? Thanks (Y)
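The comparison Adam describes in this thread (pending demand from Marathon versus unallocated resources from Mesos) could be sketched roughly as below. This is only an illustration under stated assumptions: the field names `instances`, `tasksRunning`, `cpus` and `mem` are assumed to be present on each app in the /v2/apps response, the `total` and `used` dictionaries are placeholders for whatever cluster-wide figures you read from the Mesos statistics, and all function names are hypothetical.

```python
import math


def pending_demand(apps):
    """Sum the cpus/mem requested by app instances that are not running yet.

    `apps` is assumed to be the "apps" list from Marathon's /v2/apps reply.
    """
    cpus = mem = 0.0
    for app in apps:
        pending = max(app.get("instances", 0) - app.get("tasksRunning", 0), 0)
        cpus += pending * app.get("cpus", 0.0)
        mem += pending * app.get("mem", 0.0)
    return {"cpus": cpus, "mem": mem}


def shortfall(demand, total, used):
    """Resources still missing once the unallocated capacity is used up.

    `total` and `used` are placeholder dicts with "cpus" and "mem" keys.
    """
    return {
        key: max(demand[key] - (total[key] - used[key]), 0.0)
        for key in ("cpus", "mem")
    }


def nodes_needed(missing, node_cpus, node_mem):
    """Number of identical nodes (X nodes with Y resources) to cover it."""
    return max(
        math.ceil(missing["cpus"] / node_cpus),
        math.ceil(missing["mem"] / node_mem),
    )
```

Note that this works on cluster-wide totals only. It ignores fragmentation (a task must fit on a single node) and constraints such as unique hosts, so the result is best read as a lower bound on the nodes to request.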
Re: Apache Mesos Community Sync
Friendly reminder that the community sync is happening today. Same time, same doc (https://docs.google.com/document/d/153CUCj5LOJCFAVpdDZC7COJDwKh9RDjxaTA0S7lzwDA/edit#), same deal.
On Wed, Apr 1, 2015 at 3:18 AM, Adam Bordelon a...@mesosphere.io wrote:
Reminder: We're having another Mesos Developer Community Sync this Thursday, April 2nd from 3-5pm Pacific.
Agenda: https://docs.google.com/document/d/153CUCj5LOJCFAVpdDZC7COJDwKh9RDjxaTA0S7lzwDA/edit?usp=sharing
To Join: follow the BlueJeans instructions from the recurring meeting invite at the start of this thread.
On Fri, Mar 6, 2015 at 11:11 AM, Vinod Kone vinodk...@apache.org wrote:
Hi folks, We are planning to do monthly Mesos community meetings. Tentatively these are scheduled to occur on the 1st Thursday of every month at 3 PM PST. See below for details to join the meeting remotely. This is a forum to ask questions about and discuss upcoming features, process, etc. Everyone is welcome to join. Feel free to add items to the agenda for the next meeting here: https://docs.google.com/document/d/153CUCj5LOJCFAVpdDZC7COJDwKh9RDjxaTA0S7lzwDA/edit?usp=sharing. Cheers,
On Thu, Mar 5, 2015 at 11:23 AM, Vinod Kone via Blue Jeans Network inv...@bluejeans.com wrote:
Vinod Kone vi...@twitter.com has invited you to a video meeting.
Meeting Title: Apache Mesos Community Sync
Meeting Time: Every 4th week on Thursday, from March 5, 2015, 3 p.m. PST / 2 hrs
Join Meeting: https://bluejeans.com/272369669?ll=eng=mrsxmqdnmvzw64zomfygcy3imuxg64th
Connecting directly from a room system? 1) Dial: 199.48.152.152 or bjn.vc 2) Enter Meeting ID: 272369669 -or- use the pairing code
Just want to dial in? (all numbers: http://bluejeans.com/premium-numbers) 1) Dial +1 408 740 7256, +1 888 240 2560 (US Toll Free), or +1 408 317 9253 (Alternate Number) 2) Enter Meeting ID: 272369669
Description: We will try BlueJeans VC this time for our monthly community sync. If BlueJeans *doesn't* work out we will use the Google Hangout link (https://plus.google.com/hangouts/_/twitter.com/mesos-sync) instead. *Note:* No moderator is required to start this meeting.
preventing registry failures from happening in mesos-master?
I know we're supposed to run the mesos daemons under supervision (i.e., bring them back up automatically if they fail). But I'm interested in not having the mesos-master fail at all, especially a failure in the registry / replicated_log, which I am already a little scared of.
Situation:
- Mesos version: 0.20.1
- 30 mesos-slave hosts (on bare metal) - originally had 30, now have 39
- 3 mesos-master hosts (on VMs)
- 5 zookeepers (on bare metal)
Problems during slave addition:
(1) Brought up 1 brand new slave; this caused the acting master to die with this error: *Failed to admit slave ... Failed to update 'registry': Failed to perform store within 5secs*
(2) 11 minutes later, brought up 8 more brand new slaves; this caused the new acting master to die with this error: *Failed to admit slave ... Failed to update 'registry': version mismatch*
I'm even more afraid of the registry now. :( Is it likely that there's something fundamentally wrong in my configuration and/or setup that would make the registry so fragile? I was guessing that running the mesos-master on VMs might be bad and could have led to the initial error about the store not completing within 5 seconds. But the latter problem is just baffling to me. Everything *seems* ok right now. Maybe. Hopefully. Thanks! - Erik
Re: Debugging hadoop-mesos
Thanks guys, this was helpful. I started the job tracker as a service, but apparently I never started the task tracker (or it failed to start and I didn't notice). I started it after Haosdent's message, but wasn't able to see any difference and I kept poking around. After I made some changes, the VM wouldn't boot, so my OCD got the better of me and I reinstalled everything from scratch. There are just too many moving parts to hassle you guys with an imperfect install on my end.
This time through, I felt a lot more confident using the Mesosphere RPMs, but I couldn't find the best way to get things launched. https://docs.mesosphere.com/reference/packages/ has a Last-Modified of Fri, 01 May 2015 18:46:10 GMT (one week ago), but the RHEL 6 RPMs don't have any init.d service descriptions as the packages page would indicate. For now, I just launched them manually, but I would like to get the machine to load everything on boot as services.
At this point, I have tested Mesos with:
mesos-execute --master=localhost:5050 --name=test-exec --command="sleep 10"
The only problem there is that localhost doesn't seem to be good enough for my install; it needs to be the FQDN. But it works and the job flows through the UI.
Now, back to a hadoop job. When I try the job now, the logs show the following stream of repeated messages:
2015-05-07 17:52:53,124 INFO org.apache.hadoop.mapred.ResourcePolicy: Satisfied map and reduce slots needed.
2015-05-07 17:52:53,340 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
[Repeated a few times a second for five seconds]
2015-05-07 17:49:08,914 INFO org.apache.hadoop.mapred.ResourcePolicy: JobTracker Status
Pending Map Tasks: 4
Pending Reduce Tasks: 1
Running Map Tasks: 0
Running Reduce Tasks: 0
Idle Map Slots: 0
Idle Reduce Slots: 0
Inactive Map Slots: 4 (launched but no hearbeat yet)
Inactive Reduce Slots: 1 (launched but no hearbeat yet)
Needed Map Slots: 0
Needed Reduce Slots: 0
Unhealthy Trackers: 0
This looks close. What's the best way to get a JDWP port set up to break in this code (i.e. learning to fish...)?
best, Brian
On May 7, 2015, at 12:11 PM, Adam Bordelon a...@mesosphere.io wrote:
From the mesos-master log and the JT log, it doesn't look like the MesosScheduler ever registered with Mesos, which should mean that it wouldn't start any TTs or map/reduce tasks. However, your `ps` output does seem to show a tasktracker running. Did you start that yourself (or automatically as a system service)?
On Wed, May 6, 2015 at 9:32 AM, haosdent haosd...@gmail.com wrote:
Did you start the tasktracker successfully?
On Wed, May 6, 2015 at 11:32 PM, Brian Topping brian.topp...@gmail.com wrote:
Hi all, I'm happy to report that I'm very close to getting 2.6.0-cdh5.4.0 integrated against Mesos 0.22.1 with the hadoop-mesos 0.10 code on Github. Hoping someone might have a few minutes to parse what I've got here and suggest something to try. https://gist.github.com/briantopping/0dfd0777ff4ce5a81219 hopefully has all the data necessary between the console output of the client run, the mesos master and slave console, the XML configuration of the JT and the output that was generated by it. Please let me know if I've left something out. I iterated a few times getting all the errors from missing paths or libraries sorted out, but the example client ultimately just sits waiting forever at map 0% reduce 0%. Any input kindly appreciated! Brian
-- Best Regards, Haosdent Huang
Re: Debugging hadoop-mesos
Thanks Tom! I do see activity in the cluster:
1. mesos-master.WARNING log -- sequence of repeated messages being generated:
W0507 18:10:21.794231 11729 master.cpp:2661] Cannot kill task Task_Tracker_34 of framework 20150507-164120-272093962-5050-11711-0003 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at scheduler-2fed30f4-bbbe-47a5-a587-42202c792150@10.211.55.16:35914 because it is unknown; performing reconciliation
2. The mesos-slave.WARNING log shows the following from about the time that the job was launched:
W0507 17:42:50.385308 11753 slave.cpp:1783] Cannot shut down unknown framework 20150507-164120-272093962-5050-11711-0004
3. mesos-master.INFO log -- sequence of repeated messages being generated:
I0507 18:18:40.512228 11730 master.cpp:3760] Sending 1 offers to framework 20150507-164120-272093962-5050-11711-0003 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at scheduler-2fed30f4-bbbe-47a5-a587-42202c792150@10.211.55.16:35914
I0507 18:18:40.514377 11729 master.cpp:2273] Processing ACCEPT call for offers: [ 20150507-164120-272093962-5050-11711-O556 ] on slave 20150507-164120-272093962-5050-11711-S0 at slave(1)@10.211.55.16:5051 (10.211.55.16) for framework 20150507-164120-272093962-5050-11711-0003 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at scheduler-2fed30f4-bbbe-47a5-a587-42202c792150@10.211.55.16:35914
I0507 18:18:40.515120 11729 hierarchical.hpp:648] Recovered cpus(*):6; mem(*):2803; disk(*):45148; ports(*):[31000-32000] (total allocatable: cpus(*):6; mem(*):2803; disk(*):45148; ports(*):[31000-32000]) on slave 20150507-164120-272093962-5050-11711-S0 from framework 20150507-164120-272093962-5050-11711-0003
I0507 18:18:41.798447 11724 http.cpp:516] HTTP request for '/master/state.json'
4. mesos-slave.INFO has nothing but resource allocation messages showing current disk usage.
5. The UI shows several terminated frameworks and one active (the one above). But the detail screen for that framework says there are no active or completed tasks.
Does this help?
On May 7, 2015, at 6:05 PM, Tom Arnfeld t...@duedil.com wrote:
Hi Brian, At this point you should see the TT attempting to be launched via Mesos. The "launched but no heartbeat yet" count tells us that the framework has accepted resources for 4 slots but the TT hasn't actually come up yet. Do you see the task in your Mesos cluster UI, and is there anything interesting in the task logs?
-- Tom Arnfeld, Developer // DueDil
Re: cpu hard limit for docker containerizer?
Thanks Tim, I'll take a look if I can help. -- Thanks, Chengwei
cpu hard limit for docker containerizer?
Hi List, I see mesos-slave has a `--cgroups_enable_cfs` option to enable a CFS hard CPU limit, which may be really helpful for running online and offline jobs within a single Mesos cluster, since some offline jobs are very CPU-bound. However, after a quick read through the source code, I saw that `--cgroups_enable_cfs` is only used by the *mesos* containerizer. Is there a plan to reuse this in the *docker* containerizer? Please correct me if I am wrong. Thanks in advance! -- Thanks, Chengwei
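For readers unfamiliar with the distinction behind this question, the arithmetic below sketches the difference between a soft CPU limit (cgroup `cpu.shares`) and a CFS hard limit (`cpu.cfs_quota_us` per `cpu.cfs_period_us`). It is illustrative only and not taken from the Mesos or Docker source: the 100 ms period is the Linux default, the 1024-shares-per-CPU weight is an assumed convention, and the function names are hypothetical.

```python
# Illustrative arithmetic only; not taken from the Mesos or Docker source.
CFS_PERIOD_US = 100_000  # Linux default cpu.cfs_period_us (100 ms)
SHARES_PER_CPU = 1024    # assumed conventional cpu.shares weight per CPU


def cpu_shares(cpus):
    """Soft limit: a relative weight that only matters under contention."""
    return int(cpus * SHARES_PER_CPU)


def cfs_quota_us(cpus, period_us=CFS_PERIOD_US):
    """Hard limit: CPU time in microseconds the cgroup may use per period."""
    return int(cpus * period_us)
```

With shares alone, a task asking for 0.5 CPUs can still use every idle core on the host. With a quota, the same task is throttled once it has consumed half of each period, which is what keeps CPU-bound offline jobs from crowding out online ones.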
Re: cpu hard limit for docker containerizer?
Hi Chengwei, It's a known issue and there is an open JIRA (MESOS-2154) and also an open reviewboard that hasn't been updated for a while. I'd like this to go into 0.23 if we can get to it; if you'd like to pick up the reviewboard, feel free to do so. Tim
Brigade :: Powered By Mesos
We're utilizing Mesos within our organization for multiple projects. Anyone with access please feel free to add us to the https://mesos.apache.org/documentation/latest/powered-by-mesos/ page. Cheers! John Miller Engineer | www.brigade.com