Re: [DISCUSS] Renaming Mesos Slave

2015-06-08 Thread Brian Topping
The moment it costs money for deployments to change these names, I'm +1 no 
change  — keep master/slave.

https://mail-archives.apache.org/mod_mbox/mesos-user/201506.mbox/%3c556f52ce.1050...@tampabay.rr.com%3e
kind of summarizes it for me.

 On Jun 9, 2015, at 4:55 AM, Lawrence Rau larry...@mac.com wrote:
 
 +1 no change  — keep master/slave
 
 
 On Jun 8, 2015, at 4:17 PM, Steven Schlansker sschlans...@opentable.com 
 wrote:
 
 
 On Jun 8, 2015, at 1:12 AM, Aaron Carey aca...@ilm.com wrote:
 
 I've been following this thread with interest. It draws a lot of parallels 
 with similar problems my wife faces as a teacher (and I imagine this 
 happens in other government/public sector organisations; earlier in this 
 thread James pointed me to an interesting Wikipedia article which suggested 
 this also happens occasionally in software, e.g. County of Los Angeles in 
 2003). Every few years teachers are told to change the words used to 
 describe various things related to kids with minority backgrounds, from 
 underprivileged families or with disabilities and so on, usually to stop 
 other children from using them as derogatory terms or insults. It works for 
 a while and then the pupils catch on and start using the new words and the 
 cycle repeats.
 
 I guess the point I'm trying to make here is that if you do decide to 
 change the naming of master/slave because some naughty programmers in the 
 community have been using the terms offensively, you better make damn sure 
 you choose new terms which aren't likely to cause offence in the future and 
 require the whole renaming process to run again. Which is why I'm voting 
 for:
 
 +1 Gru/Minion
 
 Which then is great right up until Universal Pictures sues the Apache 
 foundation to get Gru changed.  Plus master/slave is immediately obvious 
 to anyone working in software.  I had to search the web to even figure out 
 what Gru was, and then it was not even the first result... ( 
 http://en.wikipedia.org/wiki/Main_Intelligence_Directorate_%28Russia%29 )
 
 
 There could also be another option: These terms are all being used to 
 describe a master/slave relationship, the mesos master is in charge, it 
 assigns work to the slaves and ensures that they carry it out. I'd suggest 
 that whatever you call this pair, the relationship will always be one of 
 domination and servitude. Perhaps what is really needed here is to get rid 
 of the concept of a master altogether and re-architect mesos so all nodes 
 in the cluster are equal and reach a consensus together about work 
 distribution and so on?
 
 I propose all processes, regardless of function, should be mesos-comrade 
 to ensure none of them feel slighted :)
 
 
 
 From: Nikolay Borodachev [nbo...@adobe.com]
 Sent: 06 June 2015 04:34
 To: user@mesos.apache.org
 Subject: RE: 答复: [DISCUSS] Renaming Mesos Slave
 
 +1 master/slave – no need to change
 
 From: Sam Salisbury [mailto:samsalisb...@gmail.com]
 Sent: Friday, June 05, 2015 8:31 AM
 To: user@mesos.apache.org
 Subject: Re: 答复: [DISCUSS] Renaming Mesos Slave
 
 Master/Minion +1
 
 On 5 June 2015 at 15:14, CCAAT cc...@tampabay.rr.com wrote:
 
 "+1 master/slave, no change needed" is the same as
 "master/slave", i.e. keep the nomenclature as it currently is.
 
 This means keep the name 'master' and keep the name 'slave'.
 
 
 Are you applying fuzzy math or kalman filters to your summations below?
 
 It looks to me, tallying things up, Master is kept as it is
 and 'Slave' is kept as it is. There did not seem to be any consensus
 on the new names if the pair names are updated. Or you can vote separately 
 on each name? On a real ballot, you enter the choices,
 vote according to your needs, tally the results and publish them.
 Applying a 'fuzzy filter' to what has occurred in this debate so far
 is ridiculous.
 
 Why not repost the question like this or something on a more fair
 voting preference:
 
 
 Please vote for your favourite Name-pair in Mesos, for what is currently
 Master-Slave. Note Master-Slave is the no change vote option.
 
 [] Master-Slave
 [] Mesos-Slave
 [] Mesos-Minion
 [] Master-Minion
 [] Master-Follower
 [] Mesos-Follower
 [] Master-worker
 [] Mesos-worker
 [] etc etc
 
 -
 
 
 Tally the result and go from there.
 James
 
 
 
 
 On 06/05/2015 04:27 AM, Adam Bordelon wrote:
 Wow, what a response! Allow me to attempt to summarize the sentiment so far.
 
 Let's start with the implicit question,
 _0. Should we rename Mesos Slave?_
 +1 (Explicit approval) 12, including 7 from JIRA
 +0.5 (Implicit approval, suggested alternate name) 18
 -0.5 (Some disapproval, wouldn't block it) 5, including 1 from JIRA
 -1 (Strong disapproval) 16
 
 _1. What should we call the Mesos Slave node/host/machine?_
 Worker: +10, -2
 Agent: +6
 Follower (+Leader): +4, -1
 Minion: +2, -1
 Drone (+Director/Queen): +2
 

Re: Debugging hadoop-mesos

2015-05-08 Thread Brian Topping
Thanks Haosdent, I've updated 
https://gist.github.com/briantopping/311960f8e5454dbe9aab with the output 
logs of what I am currently seeing. I've edited them for size; the message 
"INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: 
http://10.211.55.16:50060" appeared a few thousand times in the logs. The 
configuration I have is probably still broken; 50060 is a Jetty port that 
returns a Cloudera string when telnetting to it.

The errors I saw below were apparently the result of building against the older 
version of CDH. When I updated the hadoop-mesos POM to match my deployment 
version, the incorrectly calculated slots problem in my previous message was 
resolved.

My current problem is a Hadoop logging problem and nothing to do with Mesos, so 
I didn't post. I changed hadoop.log.dir=/var/log/hadoop in 
/etc/hadoop/conf.pseudo.mr1/log4j.properties, but it didn't make any 
difference. Just getting back into it now.

 On May 8, 2015, at 1:56 PM, haosdent haosd...@gmail.com wrote:
 
 Could you post the logs from the executors which run the jobtracker and 
 tasktrackers? It would be helpful to find the cause of this problem.
 
 On Fri, May 8, 2015 at 3:05 AM, Brian Topping brian.topp...@gmail.com wrote:
 I think there's something weird here:
   cpus: offered 2.0 needed at least 1.0
   mem : offered 1724.0 needed at least 1024.0
   disk: offered 44124.0 needed at least 1024.0
   ports:  at least 2 (sufficient)
 
 Am I misreading this? All of the requirements seem to be met.
 
 Presumably it's this code from o.a.h.mapred.ResourcePolicyVariable:
 
 int slots = mapSlotsMax + reduceSlotsMax;
 slots = (int) Math.min(slots, (cpus - containerCpus) / slotCpus);
 slots = (int) Math.min(slots, (mem - containerMem) / slotMem);
 slots = (int) Math.min(slots, (disk - containerDisk) / slotDisk);
 
 // Is this offer too small for even the minimum slots?
 if (slots < 1) {
   return false;
 }
 
 Not exactly sure what this is doing.
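
One way to read the snippet above is to run it with concrete numbers. The sketch below is a stand-alone copy of the arithmetic, not the hadoop-mesos source: the class name `SlotEstimate`, the slot maximums (4 map, 1 reduce) and every container and per-slot size are made-up values chosen for illustration. Under those assumptions an offer of cpus 2.0, mem 1724.0 and disk 44124.0 yields zero slots, because once the assumed 1024 of container memory is subtracted, the remaining 700 is less than one assumed 1024 slot. It shows how an offer can look sufficient and still be refused; it is not a diagnosis of this particular install.

```java
// Stand-alone sketch of the slot arithmetic quoted above.
// Every container and per-slot size here is a hypothetical value,
// not a default taken from hadoop-mesos.
final class SlotEstimate {

    // Same sequence of min() steps as the quoted snippet.
    static int slots(int mapSlotsMax, int reduceSlotsMax,
                     double cpus, double mem, double disk,
                     double containerCpus, double containerMem, double containerDisk,
                     double slotCpus, double slotMem, double slotDisk) {
        int slots = mapSlotsMax + reduceSlotsMax;
        slots = (int) Math.min(slots, (cpus - containerCpus) / slotCpus);
        slots = (int) Math.min(slots, (mem - containerMem) / slotMem);
        slots = (int) Math.min(slots, (disk - containerDisk) / slotDisk);
        return slots;
    }

    public static void main(String[] args) {
        // Offer values come from the log; everything else is assumed.
        int n = slots(4, 1,
                      2.0, 1724.0, 44124.0,
                      1.0, 1024.0, 1024.0,
                      1.0, 1024.0, 1024.0);
        // The offer is refused when fewer than one slot fits.
        System.out.println("slots = " + n + ", offer usable = " + (n >= 1));
    }
}
```

The cast to `int` truncates after every step, so a fractional result such as 0.68 becomes 0 and stays 0 for the remaining resources.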
 
 Sorry for the noise.
 
 
 On May 7, 2015, at 6:32 PM, Brian Topping brian.topp...@gmail.com wrote:
 
 Presumably https://gist.github.com/briantopping/311960f8e5454dbe9aab has 
 some more information necessary at this point... sorry for the omission.
 
 On May 7, 2015, at 6:05 PM, Tom Arnfeld t...@duedil.com wrote:
 
 Hi Brian,
 
 At this point you should see the TT attempting to be launched via Mesos. 
 The "launched but no heartbeat yet" count tells us that the framework has 
 accepted resources for 4 slots but the TT hasn't actually come up yet.
 
 Do you see the task in your Mesos cluster UI, and is there anything 
 interesting in the task logs?
 
 --
 
 Tom Arnfeld
 Developer // DueDil
 
 (+44) 7525940046
 25 Christopher Street, London, EC2A 2BS
 
 
 On Thu, May 7, 2015 at 12:01 PM, Brian Topping brian.topp...@gmail.com wrote:
 
 Thanks guys, this was helpful. I started the job tracker as a service, but 
 apparently I never started the task tracker (or it failed to start and I 
 didn't notice). I started it after Haosdent's message, but wasn't able to 
 see any difference and I kept poking around.
 
 After making some changes and the VM wouldn't boot, my OCD got the better 
 of me and I reinstalled everything from scratch. There are just too many 
 moving parts to hassle you guys with an imperfect install on my end.
 
 This time through, I felt a lot more confident to use the Mesosphere RPMs, 
 but I couldn't find the best way to get things launched. 
 https://docs.mesosphere.com/reference/packages/ has a Last-Modified of 
 Fri, 01 May 2015 18:46:10 GMT (one week ago), but the RHEL 6 RPMs don't 
 have any init.d service descriptions as the packages page would indicate. 
 For now, I just launched them manually, but would like to get the machine 
 to completely load on boot as services.
 
 At this point, I have tested Mesos with:
 
 mesos-execute --master=localhost:5050 --name=test-exec --command="sleep 10"
 
 The only problem there is it seems that localhost isn't good enough for 
 my install, it needs to be the FQDN, but it works and the job flows through 
 the UI.
 
 Now, back to a hadoop job. When I try the job now, the logs show the 
 following stream of repeated messages:
 
 2015-05-07 17:52:53,124 INFO org.apache.hadoop.mapred.ResourcePolicy: 
 Satisfied map and reduce slots needed.
 2015-05-07 17:52:53,340 INFO org.apache.hadoop.mapred.MesosScheduler: 
 Unknown/exited TaskTracker: http://10.211.55.16:50060.
 [Repeated a few times a second for five seconds]
 2015-05-07 17:49:08,914 INFO org.apache.hadoop.mapred.ResourcePolicy: 
 JobTracker Status
   Pending Map Tasks: 4
Pending Reduce Tasks: 1
   Running Map

Re: Debugging hadoop-mesos

2015-05-08 Thread Brian Topping
That's correct, but /usr/lib/hadoop/logs doesn't even exist. It should be 
logging to /var/log/hadoop.
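
A quick way to tell "missing" apart from "no permission" for the two directories mentioned here is a small check like the one below. It is only a sketch: the class name `LogDirCheck` is made up, and it reports what the current process can see, so it would need to run as the same user as the TaskTracker to be meaningful.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Hadoop or Mesos.
final class LogDirCheck {

    // Returns a short status for the directory as seen by the current user.
    static String describe(Path dir) {
        if (!Files.exists(dir)) {
            return "missing";
        }
        if (!Files.isDirectory(dir)) {
            return "not a directory";
        }
        if (!Files.isWritable(dir)) {
            return "not writable";
        }
        return "ok";
    }

    public static void main(String[] args) {
        // Defaults are the two paths from this thread; pass others as arguments.
        String[] candidates = args.length > 0
                ? args
                : new String[] {"/usr/lib/hadoop/logs", "/var/log/hadoop"};
        for (String candidate : candidates) {
            System.out.println(candidate + ": " + describe(Paths.get(candidate)));
        }
    }
}
```

A result of "missing" for /usr/lib/hadoop/logs would match what is described above, and would suggest the process is not picking up the configured log directory rather than lacking rights on it.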

 On May 8, 2015, at 2:38 PM, haosdent haosd...@gmail.com wrote:
 
 Seems you don't have permission for this directory:
 
 java.io.IOException: Could not create job user log directory: 
 file:/usr/lib/hadoop/logs/userlogs/job_201505080220_0001
 
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
 
 
 On Fri, May 8, 2015 at 3:32 PM, Brian Topping brian.topp...@gmail.com wrote:
 Thanks Haosdent, I've updated 
 https://gist.github.com/briantopping/311960f8e5454dbe9aab with the output 
 logs of what I am currently seeing. I've edited them for size; the message 
 "INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: 
 http://10.211.55.16:50060" appeared a few thousand times in the logs. The 
 configuration I have is probably still broken; 50060 is a Jetty port that 
 returns a Cloudera string when telnetting to it.
 
 The errors I saw below were apparently the result of building against the 
 older version of CDH. When I updated the hadoop-mesos POM to match my 
 deployment version, the incorrectly calculated slots problem in my previous 
 message was resolved.
 
 My current problem is a Hadoop logging problem and nothing to do with Mesos, 
 so I didn't post. I changed hadoop.log.dir=/var/log/hadoop in 
 /etc/hadoop/conf.pseudo.mr1/log4j.properties, but it didn't make any 
 difference. Just getting back into it now.
 
 On May 8, 2015, at 1:56 PM, haosdent haosd...@gmail.com wrote:
 
 Could you post the logs from the executors which run the jobtracker and 
 tasktrackers? It would be helpful to find the cause of this problem.
 
 On Fri, May 8, 2015 at 3:05 AM, Brian Topping brian.topp...@gmail.com wrote:
 I think there's something weird here:
   cpus: offered 2.0 needed at least 1.0
   mem : offered 1724.0 needed at least 1024.0
   disk: offered 44124.0 needed at least 1024.0
   ports:  at least 2 (sufficient)
 
 Am I misreading this? All of the requirements seem to be met.
 
 Presumably it's this code from o.a.h.mapred.ResourcePolicyVariable:
 
 int slots = mapSlotsMax + reduceSlotsMax;
 slots = (int) Math.min(slots, (cpus - containerCpus) / slotCpus);
 slots = (int) Math.min(slots, (mem - containerMem) / slotMem);
 slots = (int) Math.min(slots, (disk - containerDisk) / slotDisk);
 
 // Is this offer too small for even the minimum slots?
 if (slots < 1) {
   return false;
 }
 
 Not exactly sure what this is doing.
 
 Sorry for the noise.
 
 
 On May 7, 2015, at 6:32 PM, Brian Topping brian.topp...@gmail.com wrote:
 
 Presumably https://gist.github.com/briantopping/311960f8e5454dbe9aab has 
 some more information necessary at this point... sorry for the omission.
 
 On May 7, 2015, at 6:05 PM, Tom Arnfeld t...@duedil.com wrote:
 
 Hi Brian,
 
 At this point you should see the TT attempting to be launched via Mesos. 
 The "launched but no heartbeat yet" count tells us that the framework has 
 accepted resources for 4 slots but the TT hasn't actually come up yet.
 
 Do you see the task in your Mesos cluster UI, and is there anything 
 interesting in the task logs?
 
 --
 
 Tom Arnfeld
 Developer // DueDil
 
 (+44) 7525940046
 25 Christopher Street, London, EC2A 2BS
 
 
 On Thu, May 7, 2015 at 12:01 PM, Brian Topping brian.topp...@gmail.com wrote:
 
 Thanks guys, this was helpful. I started the job tracker as a service, but 
 apparently I never started the task tracker (or it failed to start and I 
 didn't notice). I started it after Haosdent's message, but wasn't able to 
 see any difference and I kept poking around.
 
 After making some changes and the VM wouldn't boot, my OCD got the better 
 of me and I reinstalled everything from scratch. There are just too many 
 moving parts to hassle you guys with an imperfect install on my end.
 
 This time through, I felt a lot more confident to use the Mesosphere RPMs, 
 but I couldn't find the best way to get things launched. 
 https://docs.mesosphere.com/reference/packages/ has a Last-Modified of 
 Fri, 01 May 2015 18:46:10 GMT (one week ago), but the RHEL 6 RPMs don't 
 have any init.d service descriptions as the packages page would indicate. 
 For now, I just launched them manually, but would like to get the machine 
 to completely load on boot as services.
 
 At this point, I have tested Mesos with:
 
 mesos-execute --master=localhost:5050 --name=test-exec --command="sleep 10"
 
 The only problem there is it seems that localhost isn't good enough

Re: Debugging hadoop-mesos

2015-05-08 Thread Brian Topping
Indeed, this was all that was left to get jobs working, thanks!

Last thing I need to do for initial setup is get rid of the thousands of these 
messages, about three or four per second. I'm running against 
2.6.0-mr1-cdh5.4.0, maybe there was a change to the API semantics.

 2015-05-08 03:33:24,421 INFO org.apache.hadoop.mapred.MesosScheduler: 
 Unknown/exited TaskTracker: http://10.211.55.16:50060.
 2015-05-08 03:33:24,724 INFO org.apache.hadoop.mapred.MesosScheduler: 
 Unknown/exited TaskTracker: http://10.211.55.16:50060.
 2015-05-08 03:33:25,028 INFO org.apache.hadoop.mapred.MesosScheduler: 
 Unknown/exited TaskTracker: http://10.211.55.16:50060.
 2015-05-08 03:33:25,331 INFO org.apache.hadoop.mapred.MesosScheduler: 
 Unknown/exited TaskTracker: http://10.211.55.16:50060.
 2015-05-08 03:33:25,636 INFO org.apache.hadoop.mapred.MesosScheduler: 
 Unknown/exited TaskTracker: http://10.211.55.16:50060.
 2015-05-08 03:33:25,940 INFO org.apache.hadoop.mapred.MesosScheduler: 
 Unknown/exited TaskTracker: http://10.211.55.16:50060.
 2015-05-08 03:33:26,243 INFO org.apache.hadoop.mapred.MesosScheduler: 
 Unknown/exited TaskTracker: http://10.211.55.16:50060.
 2015-05-08 03:33:26,546 INFO org.apache.hadoop.mapred.MesosScheduler: 
 Unknown/exited TaskTracker: http://10.211.55.16:50060.
 2015-05-08 03:33:26,850 INFO org.apache.hadoop.mapred.MesosScheduler: 
 Unknown/exited TaskTracker: http://10.211.55.16:50060.
 2015-05-08 03:33:27,153 INFO org.apache.hadoop.mapred.MesosScheduler: 
 Unknown/exited TaskTracker: http://10.211.55.16:50060.
 2015-05-08 03:33:27,456 INFO org.apache.hadoop.mapred.MesosScheduler: 
 Unknown/exited TaskTracker: http://10.211.55.16:50060.
 
 On May 8, 2015, at 2:47 PM, haosdent haosd...@gmail.com wrote:
 
 I think you could temporarily export HADOOP_LOG_DIR=/tmp and try again.
 
 On Fri, May 8, 2015 at 3:43 PM, Brian Topping brian.topp...@gmail.com wrote:
 Mesos runs as root, hadoop runs as a separate user.
 
 On May 8, 2015, at 2:41 PM, haosdent haosd...@gmail.com wrote:
 
 Do you run everything as root?
 
 On Fri, May 8, 2015 at 3:38 PM, haosdent haosd...@gmail.com wrote:
 Seems you don't have permission for this directory:
 
 java.io.IOException: Could not create job user log directory: 
 file:/usr/lib/hadoop/logs/userlogs/job_201505080220_0001
 
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
 
 
 On Fri, May 8, 2015 at 3:32 PM, Brian Topping brian.topp...@gmail.com wrote:
 Thanks Haosdent, I've updated 
 https://gist.github.com/briantopping/311960f8e5454dbe9aab with the output 
 logs of what I am currently seeing. I've edited them for size; the message 
 "INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: 
 http://10.211.55.16:50060" appeared a few thousand times in the logs. The 
 configuration I have is probably still broken; 50060 is a Jetty port that 
 returns a Cloudera string when telnetting to it.
 
 The errors I saw below were apparently the result of building against the 
 older version of CDH. When I updated the hadoop-mesos POM to match my 
 deployment version, the incorrectly calculated slots problem in my 
 previous message was resolved.
 
 My current problem is a Hadoop logging problem and nothing to do with Mesos, 
 so I didn't post. I changed hadoop.log.dir=/var/log/hadoop in 
 /etc/hadoop/conf.pseudo.mr1/log4j.properties, but it didn't make any 
 difference. Just getting back into it now.
 
 On May 8, 2015, at 1:56 PM, haosdent haosd...@gmail.com wrote:
 
 Could you post the logs from the executors which run the jobtracker and 
 tasktrackers? It would be helpful to find the cause of this problem.
 
 On Fri, May 8, 2015 at 3:05 AM, Brian Topping brian.topp...@gmail.com wrote:
 I think there's something weird here:
   cpus: offered 2.0 needed at least 1.0
   mem : offered 1724.0 needed at least 1024.0
   disk: offered 44124.0 needed at least 1024.0
   ports:  at least 2 (sufficient)
 
 Am I misreading this? All of the requirements seem to be met.
 
 Presumably it's this code from o.a.h.mapred.ResourcePolicyVariable:
 
 int slots = mapSlotsMax + reduceSlotsMax;
 slots = (int) Math.min(slots, (cpus - containerCpus) / slotCpus);
 slots = (int) Math.min(slots, (mem - containerMem) / slotMem);
 slots = (int) Math.min(slots, (disk - containerDisk) / slotDisk);
 
 // Is this offer too small for even the minimum slots?
 if (slots < 1) {
   return false;
 }
 
 Not exactly sure what this is doing.
 
 Sorry for the noise.
 
 
 On May 7, 2015, at 6:32 PM, Brian Topping brian.topp...@gmail.com wrote:
 
 Presumably https://gist.github.com/briantopping/311960f8e5454dbe9aab 
 https

RESOLVED -- Re: Debugging hadoop-mesos

2015-05-08 Thread Brian Topping
Ok, I stared at the code for a long time and came up with 
https://github.com/mesos/hadoop/pull/55. It probably should have been 
two PRs, one for the cleanups and method shuffling and another for the meat 
of the changes, sorry about that. The PR itself should have a decent 
description; please feel free to ask questions or critique it in the PR.

It seems like the build needs help with unit testing and release process. I 
think there's going to need to be a CI build that can build for various 
versions of CDH and assign the version to an artifact classifier before they 
can be easily managed on central. I'm happy to pitch in on these if anyone is 
interested. Testing this kind of code is a little tricky, but it generally 
results in better patterns when it's all finished.

Thanks for all of your help!! I'm looking forward to starting what I came to 
this stack to work on :)

Brian

 On May 8, 2015, at 3:06 PM, Brian Topping brian.topp...@gmail.com wrote:
 
 Indeed, this was all that was left to get jobs working, thanks!
 
 Last thing I need to do for initial setup is get rid of the thousands of 
 these messages, about three or four per second. I'm running against 
 2.6.0-mr1-cdh5.4.0, maybe there was a change to the API semantics.
 
 2015-05-08 03:33:24,421 INFO org.apache.hadoop.mapred.MesosScheduler: 
 Unknown/exited TaskTracker: http://10.211.55.16:50060.
 2015-05-08 03:33:24,724 INFO org.apache.hadoop.mapred.MesosScheduler: 
 Unknown/exited TaskTracker: http://10.211.55.16:50060.
 2015-05-08 03:33:25,028 INFO org.apache.hadoop.mapred.MesosScheduler: 
 Unknown/exited TaskTracker: http://10.211.55.16:50060.
 2015-05-08 03:33:25,331 INFO org.apache.hadoop.mapred.MesosScheduler: 
 Unknown/exited TaskTracker: http://10.211.55.16:50060.
 2015-05-08 03:33:25,636 INFO org.apache.hadoop.mapred.MesosScheduler: 
 Unknown/exited TaskTracker: http://10.211.55.16:50060.
 2015-05-08 03:33:25,940 INFO org.apache.hadoop.mapred.MesosScheduler: 
 Unknown/exited TaskTracker: http://10.211.55.16:50060.
 2015-05-08 03:33:26,243 INFO org.apache.hadoop.mapred.MesosScheduler: 
 Unknown/exited TaskTracker: http://10.211.55.16:50060.
 2015-05-08 03:33:26,546 INFO org.apache.hadoop.mapred.MesosScheduler: 
 Unknown/exited TaskTracker: http://10.211.55.16:50060.
 2015-05-08 03:33:26,850 INFO org.apache.hadoop.mapred.MesosScheduler: 
 Unknown/exited TaskTracker: http://10.211.55.16:50060.
 2015-05-08 03:33:27,153 INFO org.apache.hadoop.mapred.MesosScheduler: 
 Unknown/exited TaskTracker: http://10.211.55.16:50060.
 2015-05-08 03:33:27,456 INFO org.apache.hadoop.mapred.MesosScheduler: 
 Unknown/exited TaskTracker: http://10.211.55.16:50060.
 
 On May 8, 2015, at 2:47 PM, haosdent haosd...@gmail.com wrote:
 
 I think you could temporarily export HADOOP_LOG_DIR=/tmp and try again.
 
 On Fri, May 8, 2015 at 3:43 PM, Brian Topping brian.topp...@gmail.com wrote:
 Mesos runs as root, hadoop runs as a separate user.
 
 On May 8, 2015, at 2:41 PM, haosdent haosd...@gmail.com wrote:
 
 Do you run everything as root?
 
 On Fri, May 8, 2015 at 3:38 PM, haosdent haosd...@gmail.com wrote:
 Seems you don't have permission for this directory:
 
 java.io.IOException: Could not create job user log directory: 
 file:/usr/lib/hadoop/logs/userlogs/job_201505080220_0001
 
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
 
 
 On Fri, May 8, 2015 at 3:32 PM, Brian Topping brian.topp...@gmail.com wrote:
 Thanks Haosdent, I've updated 
 https://gist.github.com/briantopping/311960f8e5454dbe9aab with the output 
 logs of what I am currently seeing. I've edited them for size; the message 
 "INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: 
 http://10.211.55.16:50060" appeared a few thousand times in the logs. The 
 configuration I have is probably still broken; 50060 is a Jetty port that 
 returns a Cloudera string when telnetting to it.
 
 The errors I saw below were apparently the result of building against the 
 older version of CDH. When I updated the hadoop-mesos POM to match my 
 deployment version, the incorrectly calculated slots problem in my 
 previous message was resolved.
 
 My current problem is a Hadoop logging problem and nothing to do with 
 Mesos, so I didn't post. I changed hadoop.log.dir=/var/log/hadoop in 
 /etc/hadoop/conf.pseudo.mr1

Re: Debugging hadoop-mesos

2015-05-07 Thread Brian Topping
Thanks guys, this was helpful. I started the job tracker as a service, but 
apparently I never started the task tracker (or it failed to start and I didn't 
notice). I started it after Haosdent's message, but wasn't able to see any 
difference and I kept poking around.

After making some changes and the VM wouldn't boot, my OCD got the better of me 
and I reinstalled everything from scratch. There are just too many moving parts 
to hassle you guys with an imperfect install on my end.

This time through, I felt a lot more confident to use the Mesosphere RPMs, but 
I couldn't find the best way to get things launched. 
https://docs.mesosphere.com/reference/packages/ has a Last-Modified of Fri, 
01 May 2015 18:46:10 GMT (one week ago), but the RHEL 6 RPMs don't have any 
init.d service descriptions as the packages page would indicate. For now, I 
just launched them manually, but would like to get the machine to completely 
load on boot as services.

At this point, I have tested Mesos with:

mesos-execute --master=localhost:5050 --name=test-exec --command="sleep 10"

The only problem there is it seems that localhost isn't good enough for my 
install, it needs to be the FQDN, but it works and the job flows through the UI.

Now, back to a hadoop job. When I try the job now, the logs show the following 
stream of repeated messages:

 2015-05-07 17:52:53,124 INFO org.apache.hadoop.mapred.ResourcePolicy: 
 Satisfied map and reduce slots needed.
 2015-05-07 17:52:53,340 INFO org.apache.hadoop.mapred.MesosScheduler: 
 Unknown/exited TaskTracker: http://10.211.55.16:50060.
 [Repeated a few times a second for five seconds]
 2015-05-07 17:49:08,914 INFO org.apache.hadoop.mapred.ResourcePolicy: 
 JobTracker Status
       Pending Map Tasks: 4
    Pending Reduce Tasks: 1
       Running Map Tasks: 0
    Running Reduce Tasks: 0
          Idle Map Slots: 0
       Idle Reduce Slots: 0
      Inactive Map Slots: 4 (launched but no hearbeat yet)
   Inactive Reduce Slots: 1 (launched but no hearbeat yet)
        Needed Map Slots: 0
     Needed Reduce Slots: 0
      Unhealthy Trackers: 0

This looks close.

What's the best way to get a JDWP port set up to break in this code (i.e. 
learning to fish...)?

best, Brian
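
On the JDWP question: the standard JVM option for a debug listener is `-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=8000`, where port 8000 is an arbitrary choice. Where that option has to go for the JobTracker in this install (for example a `HADOOP_JOBTRACKER_OPTS` line in hadoop-env.sh) is an assumption and depends on how the service is launched. The sketch below, with the made-up class name `JdwpCheck`, only shows how a JVM can confirm that it was started with such an option; the same test could be applied to the arguments a running daemon reports.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper, not part of Hadoop or Mesos.
final class JdwpCheck {

    // True if any JVM argument requests the JDWP debug agent,
    // in either the current (-agentlib) or the legacy (-Xrunjdwp) form.
    static boolean hasJdwpAgent(List<String> jvmArgs) {
        for (String arg : jvmArgs) {
            if (arg.startsWith("-agentlib:jdwp") || arg.startsWith("-Xrunjdwp")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Arguments the current JVM was started with (excludes main() args).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("JDWP agent requested: " + hasJdwpAgent(jvmArgs));
    }
}
```

With `suspend=n` the process starts normally and a debugger can attach later; `suspend=y` would hold the JVM until a debugger connects, which helps when the code of interest runs during startup.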


 On May 7, 2015, at 12:11 PM, Adam Bordelon a...@mesosphere.io wrote:
 
 From the mesos-master log and the JT log, it doesn't look like the 
 MesosScheduler ever registered with Mesos, which should mean that it wouldn't 
 start any TTs or map/reduce tasks. However, your `ps` output does seem to 
 show a tasktracker running. Did you start that yourself (or automatically as 
 a system service)?
 
 On Wed, May 6, 2015 at 9:32 AM, haosdent haosd...@gmail.com wrote:
 Did the tasktracker start successfully?
 
 On Wed, May 6, 2015 at 11:32 PM, Brian Topping brian.topp...@gmail.com wrote:
 Hi all, I'm happy to report that I'm very close to getting 2.6.0-cdh5.4.0 
 integrated against Mesos 0.22.1 with the hadoop-mesos 0.10 code on Github. 
 Hoping someone might have a few minutes to parse what I've got here and 
 suggest something to try.
 
 https://gist.github.com/briantopping/0dfd0777ff4ce5a81219 hopefully has all 
 the data necessary between the console output of the client run, the mesos 
 master and slave console, the XML configuration of the JT and the output that 
 was generated by it. Please let me know if I've left something out.
 
 I iterated a few times getting all the errors from missing paths or libraries 
 sorted out, but the example client ultimately just sits waiting forever at 
 map 0% reduce 0%.
 
 Any input kindly appreciated!
 
 Brian
 
 
 
 --
 Best Regards,
 Haosdent Huang
 





Re: Debugging hadoop-mesos

2015-05-07 Thread Brian Topping
Thanks Tom! I do see activity in the cluster:

1. mesos-master.WARNING log -- sequence of repeat messages being generated:

 W0507 18:10:21.794231 11729 master.cpp:2661] Cannot kill task Task_Tracker_34 
 of framework 20150507-164120-272093962-5050-11711-0003 (Hadoop: (RPC port: 
 9001, WebUI port: 50030)) at 
 scheduler-2fed30f4-bbbe-47a5-a587-42202c792150@10.211.55.16:35914 because it 
 is unknown; performing reconciliation

2. The mesos-slave.WARNING log shows W0507 17:42:50.385308 11753 
slave.cpp:1783] Cannot shut down unknown framework 
20150507-164120-272093962-5050-11711-0004 from about the time that the job was 
launched.

3. mesos-master.INFO log -- sequence of repeat messages being generated :

 I0507 18:18:40.512228 11730 master.cpp:3760] Sending 1 offers to framework 
 20150507-164120-272093962-5050-11711-0003 (Hadoop: (RPC port: 9001, WebUI 
 port: 50030)) at 
 scheduler-2fed30f4-bbbe-47a5-a587-42202c792150@10.211.55.16:35914
 I0507 18:18:40.514377 11729 master.cpp:2273] Processing ACCEPT call for 
 offers: [ 20150507-164120-272093962-5050-11711-O556 ] on slave 
 20150507-164120-272093962-5050-11711-S0 at slave(1)@10.211.55.16:5051 
 (10.211.55.16) for framework 20150507-164120-272093962-5050-11711-0003 
 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at 
 scheduler-2fed30f4-bbbe-47a5-a587-42202c792150@10.211.55.16:35914
 I0507 18:18:40.515120 11729 hierarchical.hpp:648] Recovered cpus(*):6; 
 mem(*):2803; disk(*):45148; ports(*):[31000-32000] (total allocatable: 
 cpus(*):6; mem(*):2803; disk(*):45148; ports(*):[31000-32000]) on slave 
 20150507-164120-272093962-5050-11711-S0 from framework 
 20150507-164120-272093962-5050-11711-0003
 I0507 18:18:41.798447 11724 http.cpp:516] HTTP request for 
 '/master/state.json'

4. mesos-slave.INFO has nothing but resource allocation messages showing 
current disk usage.

5. The UI shows several terminated frameworks and one active (the one above). 
But the detail screen for that framework says there are no active or completed 
tasks.

Does this help?

 On May 7, 2015, at 6:05 PM, Tom Arnfeld t...@duedil.com wrote:
 
 Hi Brian,
 
 At this point you should see the TT attempting to be launched via Mesos. The 
 "launched but no heartbeat yet" count tells us that the framework has 
 accepted resources for 4 slots but the TT hasn't actually come up yet.
 
 Do you see the task in your Mesos cluster UI, and is there anything 
 interesting in the task logs?
 
 --
 
 Tom Arnfeld
 Developer // DueDil
 
 (+44) 7525940046
 25 Christopher Street, London, EC2A 2BS
 
 
 On Thu, May 7, 2015 at 12:01 PM, Brian Topping brian.topp...@gmail.com wrote:
 
 Thanks guys, this was helpful. I started the job tracker as a service, but 
 apparently I never started the task tracker (or it failed to start and I 
 didn't notice). I started it after Haosdent's message, but wasn't able to see 
 any difference and I kept poking around.
 
 After I made some changes and the VM wouldn't boot, my OCD got the better of 
 me and I reinstalled everything from scratch. There are just too many moving 
 parts to hassle you guys with an imperfect install on my end.
 
 This time through, I felt a lot more confident using the Mesosphere RPMs, 
 but I couldn't find the best way to get things launched. 
 https://docs.mesosphere.com/reference/packages/ has a Last-Modified of Fri, 
 01 May 2015 18:46:10 GMT (one week ago), but the RHEL 6 RPMs don't have any 
 init.d service descriptions as the packages page would indicate. For now, I 
 just launched them manually, but would like to get the machine to completely 
 load on boot as services.
 
 At this point, I have tested Mesos with:
 
   mesos-execute --master=localhost:5050 --name=test-exec --command="sleep 10"
 
 The only problem there is it seems that localhost isn't good enough for my 
 install, it needs to be the FQDN, but it works and the job flows through the 
 UI.
 
 Now, back to a hadoop job. When I try the job now, the logs show the 
 following stream of repeated messages:
 
 2015-05-07 17:52:53,124 INFO org.apache.hadoop.mapred.ResourcePolicy: 
 Satisfied map and reduce slots needed.
 2015-05-07 17:52:53,340 INFO org.apache.hadoop.mapred.MesosScheduler: 
 Unknown/exited TaskTracker: http://10.211.55.16:50060.
 [Repeated a few times a second for five seconds]
 2015-05-07 17:49:08,914 INFO org.apache.hadoop.mapred.ResourcePolicy: 
 JobTracker Status
       Pending Map Tasks: 4
    Pending Reduce Tasks: 1
       Running Map Tasks: 0
    Running Reduce Tasks: 0
          Idle Map Slots: 0
       Idle Reduce Slots: 0
      Inactive Map Slots: 4 (launched but no hearbeat yet)
   Inactive Reduce Slots: 1 (launched but no hearbeat yet)
        Needed Map Slots: 0
     Needed Reduce Slots: 0
      Unhealthy Trackers: 0
 
 This looks close.
 
 What's the best way to get a JDWP port set up to break in this code (i.e. 
 learning to fish

Re: SEGV in 'make check'

2015-04-30 Thread Brian Topping
Also, I just checked this in the 0.22.1-RC6 and had the same problem.

 On Apr 30, 2015, at 9:27 PM, Brian Topping brian.topp...@gmail.com wrote:
 
 Greetings all, I'm having a problem with my first attempts building Mesos. I 
 started the other day with CentOS 7 and quickly realized it was far better to 
 be using 6.6. I've created a machine, but it's crashing in 'make check'.
 
 Till Toenshoff was kind enough to give me some leads in JIRA on what to do 
 next, but it didn't change the stack trace to include symbols.
 
 https://gist.github.com/briantopping/51197bad452dd3b3277c has the dump of 
 what I've done, the first file shows everything done in an empty build 
 directory and the crash at the end. The second file there is a dump of the 
 machine configuration -- the uname output, /proc/cpuinfo and all the 
 installed RPM packages.
 
 I guess the first question is: why didn't ../configure --enable-debug 
 produce proper symbols in the generated stack trace? Anyone have 
 suggestions on what I can try?
 
 Cheers, Brian



signature.asc
Description: Message signed with OpenPGP using GPGMail


Re: SEGV in 'make check'

2015-04-30 Thread Brian Topping
Getting closer. After finding 
http://garyzhu.net/notes/CentOS7-Systemd-Mesos-Marathon.html, I set up another 
new CentOS 7 machine and got a lot further on the compile this time, with 
symbols too! This is with 0.22.1-RC6, CentOS Linux release 7.1.1503, kernel 
3.10.0-229.1.2.el7.x86_64.

Output from the last test during make check:

 [----------] 1 test from PerfEventIsolatorTest
 [ RUN      ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample
 F0430 14:03:34.169455 13504 isolator_tests.cpp:710] CHECK_SOME(isolator): 
 Failed to create PerfEvent isolator, invalid events: { cycles, task-clock }
 *** Check failure stack trace: ***
 @ 0x7f4c7ecea4ca  google::LogMessage::Fail()
 @ 0x7f4c7ecea429  google::LogMessage::SendToLog()
 @ 0x7f4c7ece9e3a  google::LogMessage::Flush()
 @ 0x7f4c7ececb6e  google::LogMessageFatal::~LogMessageFatal()
 @       0xa265b2  _CheckFatal::~_CheckFatal()
 @       0xc93e88  mesos::internal::tests::PerfEventIsolatorTest_ROOT_CGROUPS_Sample_Test::TestBody()
 @      0x1135c4f  testing::internal::HandleSehExceptionsInMethodIfSupported()
 @      0x1130e0a  testing::internal::HandleExceptionsInMethodIfSupported()
 @      0x1119233  testing::Test::Run()
 @      0x1119956  testing::TestInfo::Run()
 @      0x1119ede  testing::TestCase::Run()
 @      0x111ec5a  testing::internal::UnitTestImpl::RunAllTests()
 @      0x1136ac1  testing::internal::HandleSehExceptionsInMethodIfSupported()
 @      0x1131afb  testing::internal::HandleExceptionsInMethodIfSupported()
 @      0x111db0a  testing::UnitTest::Run()
 @       0xd28155  main
 @ 0x7f4c7a741af5  __libc_start_main
 @       0x8fae89  (unknown)

Full results and machine configuration at 
https://gist.github.com/briantopping/ac4f320bcc24e14328cd.

Not sure where to go with this, any insight appreciated!
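
The failure message names the events cycles and task-clock. One way to see whether the perf tool on this machine accepts them, independent of the Mesos test suite, is sketched below. This is an illustration, not part of the original message: it assumes a perf binary on PATH and only reports what perf does. Whether that matches the check the isolator performs is an assumption, and the helper names are made up.

```python
import shutil
import subprocess


def perf_stat_command(events):
    """Build a `perf stat` invocation counting `events` while running `true`."""
    return ["perf", "stat", "-e", ",".join(events), "true"]


def probe_events(events):
    """Run perf once and report what happened, without interpreting it.

    Returns None when no perf binary is on PATH, otherwise a tuple of
    (exit status, perf's combined stdout/stderr as text).
    """
    if shutil.which("perf") is None:
        return None
    result = subprocess.run(
        perf_stat_command(events),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # perf stat reports on stderr by default
        universal_newlines=True,
    )
    return result.returncode, result.stdout
```

Calling probe_events(["cycles", "task-clock"]) as root gives three things worth ruling out before digging into the test itself: a missing perf binary, a non-zero exit status, or `<not supported>` printed next to an event.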


 On Apr 30, 2015, at 11:33 PM, Brian Topping brian.topp...@gmail.com wrote:
 
 Also, I just checked this in the 0.22.1-RC6 and had the same problem.
 
 On Apr 30, 2015, at 9:27 PM, Brian Topping brian.topp...@gmail.com wrote:
 
 Greetings all, I'm having a problem with my first attempts building Mesos. I 
 started the other day with CentOS 7 and quickly realized it was far better 
 to be using 6.6. I've created a machine, but it's crashing in 'make check'.
 
 Till Toenshoff was kind enough to give me some leads in JIRA on what to do 
 next, but it didn't change the stack trace to include symbols.
 
 https://gist.github.com/briantopping/51197bad452dd3b3277c has the dump of 
 what I've done, the first file shows everything done in an empty build 
 directory and the crash at the end. The second file there is a dump of the 
 machine configuration -- the uname output, /proc/cpuinfo and all the 
 installed RPM packages.
 
 I guess the first question is: why didn't ../configure --enable-debug 
 produce proper symbols in the generated stack trace? Anyone have 
 suggestions on what I can try?
 
 Cheers, Brian
 


