Re: Mesos cluster resources utilization

2015-05-07 Thread Adam Bordelon
Yaron, I meant by comparing the available info. You could query Marathon's
/v2/apps endpoint to get the list of pending tasks and the resources
requested for each of them, and you could check the Mesos master and slave
/statistics.json to see the total amount of unallocated resources to
estimate how many additional resources you need for how many instances (may
need unique hosts) of pending tasks. Then you would have to map this onto a
request in a (cloud) provisioning tool for X more nodes with Y resources
each.

Alternatively, you could use this same information, along with some notion
of relative priority to kill off (and scale down) lower priority tasks
until you have enough resources to satisfy your higher priority tasks.
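
A minimal sketch of the comparison described above, assuming the pending task requests (for example from Marathon's /v2/apps) and the cluster's unallocated resources (from the Mesos endpoints) have already been fetched and reduced to plain numbers. The function names, the dict shapes and the node size are hypothetical and are not part of any Marathon or Mesos API.

```python
import math


def resource_shortfall(pending, unallocated):
    """Return how much of each resource is missing to place all pending tasks.

    pending: list of dicts such as {"cpus": 1.0, "mem": 512, "instances": 3}
    unallocated: dict such as {"cpus": 2.0, "mem": 1024}
    Both shapes are illustrative assumptions, not an API response format.
    """
    needed = {}
    for task in pending:
        count = task.get("instances", 1)
        for name in unallocated:
            needed[name] = needed.get(name, 0) + task.get(name, 0) * count
    # A negative difference means spare capacity, so clamp at zero.
    return {
        name: max(0, needed.get(name, 0) - unallocated[name])
        for name in unallocated
    }


def nodes_needed(shortfall, node_size):
    """Number of extra nodes of the given size needed to cover the shortfall."""
    return max(
        (math.ceil(shortfall[name] / node_size[name]) for name in shortfall),
        default=0,
    )


if __name__ == "__main__":
    pending = [
        {"cpus": 1.0, "mem": 512, "instances": 4},
        {"cpus": 0.5, "mem": 256, "instances": 2},
    ]
    shortfall = resource_shortfall(pending, {"cpus": 2.0, "mem": 4096})
    print(shortfall, nodes_needed(shortfall, {"cpus": 2, "mem": 4096}))
```

This treats the cluster as one pool, so the result is only a lower bound: a task needs all of its resources on a single host, and unique-host constraints can require more nodes than the totals suggest.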

On Mon, May 4, 2015 at 10:32 AM, Tim Chen t...@mesosphere.io wrote:

 Hi Yaron,

 Marathon itself has its own REST endpoint you can hit (/v2/apps) that will
 return to you all the apps and tasks information, so you can see how many
 of the apps are launched and how many are still pending.

 Tim

 On Mon, May 4, 2015 at 5:28 AM, Yaron Rosenbaum yaron.rosenb...@gmail.com
  wrote:

 Hi Adam,

 For example, with Marathon - how can I get the list of pending tasks?
 And by “how many additional nodes you would need to satisfy them” - do you
 mean by comparing the two? Or are there statistics for that too?

 Thanks

 (Y)

 On May 3, 2015, at 10:10 AM, Adam Bordelon a...@mesosphere.io wrote:

 Yaron,

 You could use the /statistics.json endpoints to monitor the cpu/memory
 allocation across your cluster, even on individual nodes.
 Only individual frameworks know their own pending tasks and how many
 additional resources you would need to satisfy them.
 Given these pieces of information, you should be able to trigger your own
 auto-provisioning mechanism.

 On Fri, May 1, 2015 at 11:18 AM, Yaron Rosenbaum 
 yaron.rosenb...@gmail.com wrote:

 Hi

 Is there a way in mesos / marathon to know that tasks cannot be assigned
 due to lack of resources? or in other words - when to add mesos-slaves to
 the cluster?

 Or even more specifically, what amount of resources are missing (or in
 excess) given the current tasks and slaves?

 Thanks
 (Y)







Re: Apache Mesos Community Sync

2015-05-07 Thread Vinod Kone
Friendly reminder that the community sync is happening today. Same time, same
doc
https://docs.google.com/document/d/153CUCj5LOJCFAVpdDZC7COJDwKh9RDjxaTA0S7lzwDA/edit#,
same deal.

On Wed, Apr 1, 2015 at 3:18 AM, Adam Bordelon a...@mesosphere.io wrote:

 Reminder: We're having another Mesos Developer Community Sync this
 Thursday, April 2nd from 3-5pm Pacific.

 Agenda:

 https://docs.google.com/document/d/153CUCj5LOJCFAVpdDZC7COJDwKh9RDjxaTA0S7lzwDA/edit?usp=sharing
 To Join: follow the BlueJeans instructions from the recurring meeting
 invite at the start of this thread.

 On Fri, Mar 6, 2015 at 11:11 AM, Vinod Kone vinodk...@apache.org wrote:

  Hi folks,
 
  We are planning to do monthly Mesos community meetings. Tentatively these
  are scheduled to occur on 1st Thursday of every month at 3 PM PST. See
  below for details to join the meeting remotely.
 
  This is a forum to ask questions/discuss about upcoming features, process
  etc. Everyone is welcome to join. Feel free to add items to the agenda for
  the next meeting here:
  https://docs.google.com/document/d/153CUCj5LOJCFAVpdDZC7COJDwKh9RDjxaTA0S7lzwDA/edit?usp=sharing
 
  Cheers,
 
  On Thu, Mar 5, 2015 at 11:23 AM, Vinod Kone via Blue Jeans Network 
  inv...@bluejeans.com wrote:
 
   Vinod Kone vi...@twitter.com has invited you to a video meeting.

   Meeting Title: Apache Mesos Community Sync
   Meeting Time: Every 4th week on Thursday • from March 5, 2015 • 3 p.m. PST / 2 hrs
   Join Meeting: https://bluejeans.com/272369669?ll=eng=mrsxmqdnmvzw64zomfygcy3imuxg64th

   Connecting directly from a room system?
   1) Dial: 199.48.152.152 or bjn.vc
   2) Enter Meeting ID: 272369669 -or- use the pairing code

   Just want to dial in? (all numbers: http://bluejeans.com/premium-numbers)
   1) Direct-dial with my iPhone (+14087407256,,#272369669#,#) or
      +1 408 740 7256
      +1 888 240 2560 (US Toll Free)
      +1 408 317 9253 (Alternate Number)
   2) Enter Meeting ID: 272369669

   Description:
   We will try BlueJeans VC this time for our monthly community sync.

   If BlueJeans *doesn't* work out we will use the Google Hangout link
   (https://plus.google.com/hangouts/_/twitter.com/mesos-sync) instead.
   *Note:* No moderator is required to start this meeting.

   First time joining a Blue Jeans Video Meeting?
   http://bluejeans.com/support
   Want to test your video connection?
   http://bluejeans.com/111
 



preventing registry failures from happening in mesos-master?

2015-05-07 Thread Erik Weathers
I know we're supposed to run the mesos daemons under supervision (i.e.,
bring them back up automatically if they fail).   But I'm interested in not
having the mesos-master fail at all, especially a failure in the registry /
replicated_log, which I am already a little scared of.

Situation:

   - Mesos version: 0.20.1
   - 30 mesos-slave hosts (on bare metal)
      - originally had 30, now have 39
   - 3 mesos-master hosts (on VMs)
   - 5 zookeepers (on bare metal)

Problems during slave addition:

(1) Brought up 1 brand new slave, this caused the acting master to die with
this error:

*Failed to admit slave ... Failed to update 'registry': Failed to perform
store within 5secs*


(2) 11 minutes later, brought up 8 more brand new slaves, this caused the
new acting master to die with this error:

*Failed to admit slave ... Failed to update 'registry': version mismatch*


I'm even more afraid of the registry now. :(  Is it likely that
there's something fundamentally wrong in my configuration and/or setup that
would make the registry so fragile?  I was guessing that running
the mesos-master on VMs might be bad and lead to the initial error about the
store not completing within 5 seconds.  But the latter problem is just
baffling to me.  Everything *seems* ok right now.  Maybe.  Hopefully.

Thanks!

- Erik


Re: Debugging hadoop-mesos

2015-05-07 Thread Brian Topping
Thanks guys, this was helpful. I started the job tracker as a service, but 
apparently I never started the task tracker (or it failed to start and I didn't 
notice). I started it after Haosdent's message, but wasn't able to see any 
difference and I kept poking around.

After I made some changes and the VM wouldn't boot, my OCD got the better of me 
and I reinstalled everything from scratch. There are just too many moving parts 
to hassle you guys with an imperfect install on my end.

This time through, I felt a lot more confident using the Mesosphere RPMs, but 
I couldn't find the best way to get things launched. 
https://docs.mesosphere.com/reference/packages/ has a Last-Modified of Fri, 
01 May 2015 18:46:10 GMT (one week ago), but the RHEL 6 RPMs don't have any 
init.d service descriptions as the packages page would indicate. For now, I 
just launched them manually, but would like to get the machine to completely 
load on boot as services.

At this point, I have tested Mesos with:

    mesos-execute --master=localhost:5050 --name=test-exec --command="sleep 10"

The only problem there is that localhost doesn't seem to be good enough for my 
install; the master address needs to be the FQDN. With the FQDN it works and 
the job flows through the UI.

Now, back to a hadoop job. When I try the job now, the logs show the following 
stream of repeated messages:

 2015-05-07 17:52:53,124 INFO org.apache.hadoop.mapred.ResourcePolicy: Satisfied map and reduce slots needed.
 2015-05-07 17:52:53,340 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
 [Repeated a few times a second for five seconds]
 2015-05-07 17:49:08,914 INFO org.apache.hadoop.mapred.ResourcePolicy: JobTracker Status
       Pending Map Tasks: 4
    Pending Reduce Tasks: 1
       Running Map Tasks: 0
    Running Reduce Tasks: 0
          Idle Map Slots: 0
       Idle Reduce Slots: 0
      Inactive Map Slots: 4 (launched but no hearbeat yet)
   Inactive Reduce Slots: 1 (launched but no hearbeat yet)
        Needed Map Slots: 0
     Needed Reduce Slots: 0
      Unhealthy Trackers: 0

This looks close.

What's the best way to get a JDWP port set up to break in this code (i.e. 
learning to fish...)?

best, Brian


 On May 7, 2015, at 12:11 PM, Adam Bordelon a...@mesosphere.io wrote:
 
 From the mesos-master log and the JT log, it doesn't look like the 
 MesosScheduler ever registered with Mesos, which should mean that it wouldn't 
 start any TTs or map/reduce tasks. However, your `ps` output does seem to 
 show a tasktracker running. Did you start that yourself (or automatically as 
 a system service)?
 
 On Wed, May 6, 2015 at 9:32 AM, haosdent haosd...@gmail.com wrote:
 Did you start the tasktracker successfully?
 
 On Wed, May 6, 2015 at 11:32 PM, Brian Topping brian.topp...@gmail.com wrote:
 Hi all, I'm happy to report that I'm very close to getting 2.6.0-cdh5.4.0 
 integrated against Mesos 0.22.1 with the hadoop-mesos 0.10 code on Github. 
 Hoping someone might have a few minutes to parse what I've got here and 
 suggest something to try.
 
 https://gist.github.com/briantopping/0dfd0777ff4ce5a81219 hopefully has all 
 the data necessary between the console output of the client run, the mesos 
 master and slave console, the XML configuration of the JT and the output that 
 was generated by it. Please let me know if I've left something out.
 
 I iterated a few times getting all the errors from missing paths or libraries 
 sorted out, but the example client ultimately just sits waiting forever at 
 map 0% reduce 0%.
 
 Any input kindly appreciated!
 
 Brian
 
 
 
 --
 Best Regards,
 Haosdent Huang
 





Re: Debugging hadoop-mesos

2015-05-07 Thread Brian Topping
Thanks Tom! I do see activity in the cluster:

1. mesos-master.WARNING log -- sequence of repeat messages being generated:

 W0507 18:10:21.794231 11729 master.cpp:2661] Cannot kill task Task_Tracker_34 of framework 20150507-164120-272093962-5050-11711-0003 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at scheduler-2fed30f4-bbbe-47a5-a587-42202c792150@10.211.55.16:35914 because it is unknown; performing reconciliation

2. mesos-slave.WARNING log -- from about the time that the job was launched:

 W0507 17:42:50.385308 11753 slave.cpp:1783] Cannot shut down unknown framework 20150507-164120-272093962-5050-11711-0004

3. mesos-master.INFO log -- sequence of repeat messages being generated:

 I0507 18:18:40.512228 11730 master.cpp:3760] Sending 1 offers to framework 20150507-164120-272093962-5050-11711-0003 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at scheduler-2fed30f4-bbbe-47a5-a587-42202c792150@10.211.55.16:35914
 I0507 18:18:40.514377 11729 master.cpp:2273] Processing ACCEPT call for offers: [ 20150507-164120-272093962-5050-11711-O556 ] on slave 20150507-164120-272093962-5050-11711-S0 at slave(1)@10.211.55.16:5051 (10.211.55.16) for framework 20150507-164120-272093962-5050-11711-0003 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at scheduler-2fed30f4-bbbe-47a5-a587-42202c792150@10.211.55.16:35914
 I0507 18:18:40.515120 11729 hierarchical.hpp:648] Recovered cpus(*):6; mem(*):2803; disk(*):45148; ports(*):[31000-32000] (total allocatable: cpus(*):6; mem(*):2803; disk(*):45148; ports(*):[31000-32000]) on slave 20150507-164120-272093962-5050-11711-S0 from framework 20150507-164120-272093962-5050-11711-0003
 I0507 18:18:41.798447 11724 http.cpp:516] HTTP request for '/master/state.json'

4. mesos-slave.INFO has nothing but resource allocation messages showing 
current disk usage.

5. The UI shows several terminated frameworks and one active (the one above). 
But the detail screen for that framework says there are no active or completed 
tasks.

Does this help?

 On May 7, 2015, at 6:05 PM, Tom Arnfeld t...@duedil.com wrote:
 
 Hi Brian,
 
 At this point you should see the TT attempting to be launched via Mesos. The 
 "launched but no heartbeat yet" count tells us that the framework has 
 accepted resources for 4 slots but the TT hasn't actually come up yet.
 
 Do you see the task in your Mesos cluster UI, and is there anything 
 interesting in the task logs?
 
 --
 
 Tom Arnfeld
 Developer // DueDil
 
 (+44) 7525940046
 25 Christopher Street, London, EC2A 2BS
 
 

Re: cpu hard limit for docker containerizer?

2015-05-07 Thread Chengwei Yang
Thanks Tim, I'll take a look if I can help.

-- 
Thanks,
Chengwei

On Thu, May 07, 2015 at 09:56:35PM -0700, Tim Chen wrote:
 Hi Chengwei,
 
 It's a known issue; there is an open JIRA (MESOS-2154) and also an open
 reviewboard that hasn't been updated for a while.
 
 I'd like this to go into 0.23 if we can get to it. If you'd like to pick up
 the reviewboard, feel free to do so.
 
 Tim
 
 On Thu, May 7, 2015 at 7:21 PM, Chengwei Yang chengwei.yang...@gmail.com
 wrote:
 
 Hi List,
 
 I see mesos-slave has a `--cgroups_enable_cfs` option to enable a CFS hard CPU
 limit. That may be really helpful for running online and offline jobs within
 a single Mesos cluster, since some offline jobs are very CPU-bound.
 
 However, after a quick read through the source code, I saw that
 `--cgroups_enable_cfs` is only used by the *mesos* containerizer. Is there a
 plan to reuse this in the *docker* containerizer?
 
 Please correct me if I was wrong, thanks in advance!

 --
 Thanks,
 Chengwei
 
 


cpu hard limit for docker containerizer?

2015-05-07 Thread Chengwei Yang
Hi List,

I see mesos-slave has a `--cgroups_enable_cfs` option to enable a CFS hard CPU
limit. That may be really helpful for running online and offline jobs within a
single Mesos cluster, since some offline jobs are very CPU-bound.

However, after a quick read through the source code, I saw that
`--cgroups_enable_cfs` is only used by the *mesos* containerizer. Is there a
plan to reuse this in the *docker* containerizer?
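
For context, a CFS hard limit is expressed in the cgroup cpu controller as a quota of CPU time per scheduling period (`cpu.cfs_quota_us` and `cpu.cfs_period_us`). The sketch below shows the arithmetic for turning a fractional CPU allocation into such a quota. The 100 ms period and the 1 ms minimum quota are assumptions chosen for illustration, and the function name is hypothetical rather than anything taken from the Mesos source.

```python
CFS_PERIOD_US = 100_000   # assumed scheduling period: 100 ms
MIN_CFS_QUOTA_US = 1_000  # assumed floor for the quota: 1 ms


def cfs_quota_us(cpus, period_us=CFS_PERIOD_US):
    """CPU time in microseconds that a container may use per period."""
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    # Scale the period by the CPU allocation, never going below the floor.
    return max(MIN_CFS_QUOTA_US, int(round(cpus * period_us)))


if __name__ == "__main__":
    for cpus in (0.5, 2, 0.001):
        print(cpus, cfs_quota_us(cpus))
```

Under these assumptions a task allocated 0.5 CPUs gets 50 ms of CPU time per 100 ms period and is throttled once that is used up, even if the host is otherwise idle. That is the difference from `cpu.shares`, which only sets a relative weight under contention.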

Please correct me if I was wrong, thanks in advance!

-- 
Thanks,
Chengwei




Re: cpu hard limit for docker containerizer?

2015-05-07 Thread Tim Chen
Hi Chengwei,

It's a known issue; there is an open JIRA (MESOS-2154) and also an open
reviewboard that hasn't been updated for a while.

I'd like this to go into 0.23 if we can get to it. If you'd like to pick
up the reviewboard, feel free to do so.

Tim

On Thu, May 7, 2015 at 7:21 PM, Chengwei Yang chengwei.yang...@gmail.com
wrote:

 Hi List,

 I see mesos-slave has a `--cgroups_enable_cfs` option to enable a CFS hard CPU
 limit. That may be really helpful for running online and offline jobs within a
 single Mesos cluster, since some offline jobs are very CPU-bound.

 However, after a quick read through the source code, I saw that
 `--cgroups_enable_cfs` is only used by the *mesos* containerizer. Is there a
 plan to reuse this in the *docker* containerizer?

 Please correct me if I was wrong, thanks in advance!

 --
 Thanks,
 Chengwei



Brigade :: Powered By Mesos

2015-05-07 Thread John Miller
We're utilizing Mesos within our organization for multiple projects.

Anyone with access please feel free to add us to the
https://mesos.apache.org/documentation/latest/powered-by-mesos/ page.

Cheers!


John Miller
Engineer | www.brigade.com