Re: [DISCUSS] Renaming Mesos Slave
The moment it costs money for deployments to change these names, I'm +1 no change - keep master/slave.

https://mail-archives.apache.org/mod_mbox/mesos-user/201506.mbox/%3c556f52ce.1050...@tampabay.rr.com%3e kind of summarizes it for me.

On Jun 9, 2015, at 4:55 AM, Lawrence Rau larry...@mac.com wrote:

+1 no change - keep master/slave

On Jun 8, 2015, at 4:17 PM, Steven Schlansker sschlans...@opentable.com wrote:

On Jun 8, 2015, at 1:12 AM, Aaron Carey aca...@ilm.com wrote:

I've been following this thread with interest; it draws a lot of parallels with similar problems my wife faces as a teacher (and I imagine this happens in other government/public sector organisations; earlier in this thread James pointed me to an interesting Wikipedia article which suggested this also happens occasionally in software, e.g. County of Los Angeles in 2003).

Every few years teachers are told to change the words used to describe various things related to kids with minority backgrounds, from underprivileged families or with disabilities and so on, usually to stop other children from using them as derogatory terms or insults. It works for a while and then the pupils catch on and start using the new words and the cycle repeats.

I guess the point I'm trying to make here is that if you do decide to change the naming of master/slave because some naughty programmers in the community have been using the terms offensively, you better make damn sure you choose new terms which aren't likely to cause offence in the future and require the whole renaming process to run again.

Which is why I'm voting for: +1 Gru/Minion

Which then is great right up until Universal Pictures sues the Apache foundation to get Gru changed. Plus master/slave is immediately obvious to anyone working in software. I had to search the web to even figure out what Gru was, and then it was not even the first result... ( http://en.wikipedia.org/wiki/Main_Intelligence_Directorate_%28Russia%29 )

There could also be another option: These terms are all being used to describe a master/slave relationship; the mesos master is in charge, it assigns work to the slaves and ensures that they carry it out. I'd suggest that whatever you call this pair, the relationship will always be one of domination and servitude. Perhaps what is really needed here is to get rid of the concept of a master altogether and re-architect mesos so all nodes in the cluster are equal and reach a consensus together about work distribution and so on?

I propose all processes, regardless of function, should be mesos-comrade to ensure none of them feel slighted :)

From: Nikolay Borodachev [nbo...@adobe.com]
Sent: 06 June 2015 04:34
To: user@mesos.apache.org
Subject: RE: 答复: [DISCUSS] Renaming Mesos Slave

+1 master/slave – no need to change

From: Sam Salisbury [mailto:samsalisb...@gmail.com]
Sent: Friday, June 05, 2015 8:31 AM
To: user@mesos.apache.org
Subject: Re: 答复: [DISCUSS] Renaming Mesos Slave

Master/Minion +1

On 5 June 2015 at 15:14, CCAAT cc...@tampabay.rr.com wrote:

"+1 master/slave, no change needed" is the same as master/slave, i.e. keep the nomenclature as it currently is. This means keep the name 'master' and keep the name 'slave'.

Are you applying fuzzy math or kalman filters to your summations below? It looks to me, tallying things up, Master is kept as it is and 'Slave' is kept as it is. There did not seem to be any consensus on the new names if the pair names are updated. Or you can vote separately on each name?

On a real ballot, you enter the choices, vote according to your needs, tally the results and publish them. Applying a 'fuzzy filter' to what has occurred in this debate so far is ridiculous.

Why not repost the question like this or something on a more fair voting preference:

Please vote for your favourite Name-pair in Mesos, for what is currently Master-Slave. Note Master-Slave is the no change vote option.

[] Master-Slave
[] Mesos-Slave
[] Mesos-Minion
[] Master-Minion
[] Master-Follower
[] Mesos-Follower
[] Master-worker
[] Mesos-worker
[] etc etc

- Tally the result and go from there.

James

On 06/05/2015 04:27 AM, Adam Bordelon wrote:

Wow, what a response! Allow me to attempt to summarize the sentiment so far.

Let's start with the implicit question,

_0. Should we rename Mesos Slave?_

+1 (Explicit approval): 12, including 7 from JIRA
+0.5 (Implicit approval, suggested alternate name): 18
-0.5 (Some disapproval, wouldn't block it): 5, including 1 from JIRA
-1 (Strong disapproval): 16

_1. What should we call the Mesos Slave node/host/machine?_

Worker: +10, -2
Agent: +6
Follower (+Leader): +4, -1
Minion: +2, -1
Drone (+Director/Queen): +2
Re: Debugging hadoop-mesos
Thanks Haosdent, I've updated https://gist.github.com/briantopping/311960f8e5454dbe9aab with the output logs of what I am currently seeing. I've edited them for size; the message

INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060

appeared a few thousand times in the logs. The configuration I have is probably still broken; 50060 is a Jetty port that returns a Cloudera string when telnetting to it.

The errors I saw below were apparently the result of building against the older version of CDH. When I updated the hadoop-mesos POM to match my deployment version, the incorrectly calculated slots problem in my previous message was resolved.

My current problem is a Hadoop logging problem and nothing to do with Mesos, so I didn't post. I changed hadoop.log.dir=/var/log/hadoop in /etc/hadoop/conf.pseudo.mr1/log4j.properties, but it didn't make any difference. Just getting back into it now.

On May 8, 2015, at 1:56 PM, haosdent haosd...@gmail.com wrote:

Could you post the logs from the executors which run the jobtracker and tasktrackers? It would be helpful to find the cause of this problem.

On Fri, May 8, 2015 at 3:05 AM, Brian Topping brian.topp...@gmail.com wrote:

I think there's something weird here:

cpus: offered 2.0 needed at least 1.0
mem : offered 1724.0 needed at least 1024.0
disk: offered 44124.0 needed at least 1024.0
ports: at least 2 (sufficient)

Am I misreading this? All of the requirements seem to be met. Presumably it's this code from o.a.h.mapred.ResourcePolicyVariable:

    int slots = mapSlotsMax + reduceSlotsMax;
    slots = (int) Math.min(slots, (cpus - containerCpus) / slotCpus);
    slots = (int) Math.min(slots, (mem - containerMem) / slotMem);
    slots = (int) Math.min(slots, (disk - containerDisk) / slotDisk);

    // Is this offer too small for even the minimum slots?
    if (slots < 1) {
      return false;
    }

Not exactly sure what this is doing.
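The slot check quoted above can be traced by hand. The following is a minimal sketch, not the hadoop-mesos source: the class name SlotEstimate and the method slotsFor are hypothetical, the offer values are the ones from the log, and the overhead and per-slot sizes (1 CPU, 1024 MB memory, 1024 MB disk each) plus the maximum of 5 slots are assumptions chosen only to illustrate the arithmetic. Under those assumptions the memory term is the limiting one: (1724 - 1024) / 1024 is about 0.68, which truncates to 0 slots, so an offer can be rejected even though every "needed at least" figure looks satisfied.

```java
class SlotEstimate {
    /**
     * Mirrors the quoted check: how many slots fit in an offer once the
     * fixed container overhead has been subtracted from each resource.
     */
    static int slotsFor(int maxSlots,
                        double cpus, double mem, double disk,
                        double containerCpus, double containerMem, double containerDisk,
                        double slotCpus, double slotMem, double slotDisk) {
        int slots = maxSlots;
        // Each cast truncates toward zero, so a fractional slot counts as none.
        slots = (int) Math.min(slots, (cpus - containerCpus) / slotCpus);
        slots = (int) Math.min(slots, (mem - containerMem) / slotMem);
        slots = (int) Math.min(slots, (disk - containerDisk) / slotDisk);
        return slots;
    }

    public static void main(String[] args) {
        // Offer values come from the log; every other number is an ASSUMPTION.
        int slots = slotsFor(5,
                             2.0, 1724.0, 44124.0,   // offered cpus, mem, disk
                             1.0, 1024.0, 1024.0,    // assumed container overhead
                             1.0, 1024.0, 1024.0);   // assumed per-slot size
        System.out.println("slots = " + slots + ", offer usable: " + (slots >= 1));
    }
}
```

With different overhead or slot sizes the result changes, so this only shows how the minimum over the three resources is taken; the thread itself attributes the miscalculation to building against an older CDH version.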
Sorry for the noise.

On May 7, 2015, at 6:32 PM, Brian Topping brian.topp...@gmail.com wrote:

Presumably https://gist.github.com/briantopping/311960f8e5454dbe9aab has some more information necessary at this point... sorry for the omission.

On May 7, 2015, at 6:05 PM, Tom Arnfeld t...@duedil.com wrote:

Hi Brian,

At this point you should see the TT attempting to be launched via Mesos. The "launched but no heartbeat yet" count tells us that the framework has accepted resources for 4 slots but the TT hasn't actually come up yet. Do you see the task in your Mesos cluster UI, and is there anything interesting in the task logs?

--
Tom Arnfeld
Developer // DueDil
(+44) 7525940046
25 Christopher Street, London, EC2A 2BS

On Thu, May 7, 2015 at 12:01 PM, Brian Topping brian.topp...@gmail.com wrote:

Thanks guys, this was helpful. I started the job tracker as a service, but apparently I never started the task tracker (or it failed to start and I didn't notice). I started it after Haosdent's message, but wasn't able to see any difference and I kept poking around. After I made some changes and the VM wouldn't boot, my OCD got the better of me and I reinstalled everything from scratch. There are just too many moving parts to hassle you guys with an imperfect install on my end.

This time through, I felt a lot more confident using the Mesosphere RPMs, but I couldn't find the best way to get things launched. https://docs.mesosphere.com/reference/packages/ has a Last-Modified of Fri, 01 May 2015 18:46:10 GMT (one week ago), but the RHEL 6 RPMs don't have any init.d service descriptions as the packages page would indicate. For now, I just launched them manually, but would like to get the machine to completely load on boot as services.
Re: Debugging hadoop-mesos
That's correct, but /usr/lib/hadoop/logs doesn't even exist. It should be logging to /var/log/hadoop.

On May 8, 2015, at 2:38 PM, haosdent haosd...@gmail.com wrote:

Seems you don't have permission for this directory:

java.io.IOException: Could not create job user log directory: file:/usr/lib/hadoop/logs/userlogs/job_201505080220_0001
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)

On Fri, May 8, 2015 at 3:32 PM, Brian Topping brian.topp...@gmail.com wrote:

Thanks Haosdent, I've updated https://gist.github.com/briantopping/311960f8e5454dbe9aab with the output logs of what I am currently seeing. I've edited them for size; the message

INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060

appeared a few thousand times in the logs. The configuration I have is probably still broken; 50060 is a Jetty port that returns a Cloudera string when telnetting to it.

The errors I saw below were apparently the result of building against the older version of CDH. When I updated the hadoop-mesos POM to match my deployment version, the incorrectly calculated slots problem in my previous message was resolved.

My current problem is a Hadoop logging problem and nothing to do with Mesos, so I didn't post. I changed hadoop.log.dir=/var/log/hadoop in /etc/hadoop/conf.pseudo.mr1/log4j.properties, but it didn't make any difference. Just getting back into it now.

On May 8, 2015, at 1:56 PM, haosdent haosd...@gmail.com wrote:

Could you post the logs from the executors which run the jobtracker and tasktrackers? It would be helpful to find the cause of this problem.
Re: Debugging hadoop-mesos
Indeed, this was all that was left to get jobs working, thanks!

Last thing I need to do for initial setup is get rid of the thousands of these messages, about three or four per second. I'm running against 2.6.0-mr1-cdh5.4.0; maybe there was a change to the API semantics.

2015-05-08 03:33:24,421 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
2015-05-08 03:33:24,724 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
2015-05-08 03:33:25,028 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
2015-05-08 03:33:25,331 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
2015-05-08 03:33:25,636 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
2015-05-08 03:33:25,940 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
2015-05-08 03:33:26,243 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
2015-05-08 03:33:26,546 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
2015-05-08 03:33:26,850 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
2015-05-08 03:33:27,153 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
2015-05-08 03:33:27,456 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.

On May 8, 2015, at 2:47 PM, haosdent haosd...@gmail.com wrote:

I think you could export HADOOP_LOG_DIR=/tmp as a temporary workaround and try again.

On Fri, May 8, 2015 at 3:43 PM, Brian Topping brian.topp...@gmail.com wrote:

Mesos runs as root, hadoop runs as a separate user.

On May 8, 2015, at 2:41 PM, haosdent haosd...@gmail.com wrote:

Do you run everything as root?

On Fri, May 8, 2015 at 3:38 PM, haosdent haosd...@gmail.com wrote:

Seems you don't have permission for this directory:

java.io.IOException: Could not create job user log directory: file:/usr/lib/hadoop/logs/userlogs/job_201505080220_0001
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)

On Fri, May 8, 2015 at 3:32 PM, Brian Topping brian.topp...@gmail.com wrote:

Thanks Haosdent, I've updated https://gist.github.com/briantopping/311960f8e5454dbe9aab with the output logs of what I am currently seeing. I've edited them for size; the message

INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060

appeared a few thousand times in the logs. The configuration I have is probably still broken; 50060 is a Jetty port that returns a Cloudera string when telnetting to it.

The errors I saw below were apparently the result of building against the older version of CDH. When I updated the hadoop-mesos POM to match my deployment version, the incorrectly calculated slots problem in my previous message was resolved.

My current problem is a Hadoop logging problem and nothing to do with Mesos, so I didn't post. I changed hadoop.log.dir=/var/log/hadoop in /etc/hadoop/conf.pseudo.mr1/log4j.properties, but it didn't make any difference. Just getting back into it now.

On May 8, 2015, at 1:56 PM, haosdent haosd...@gmail.com wrote:

Could you post the logs from the executors which run the jobtracker and tasktrackers? It would be helpful to find the cause of this problem.
RESOLVED -- Re: Debugging hadoop-mesos
Ok, I stared at the code for a long time and came up with https://github.com/mesos/hadoop/pull/55. It probably should have been two separate PRs, one for the cleanups and method shuffling and another for the meat of the changes; sorry about that. The PR itself should have a decent description, so please feel free to ask questions or critique it in the PR.

It seems like the build needs help with unit testing and the release process. I think there's going to need to be a CI build that can build for various versions of CDH and assign the version to an artifact classifier before they can be easily managed on central. I'm happy to pitch in on these if anyone is interested. Testing this kind of code is a little tricky, but it generally results in better patterns when it's all finished.

Thanks for all of your help!! I'm looking forward to starting what I came to this stack to work on :)

Brian

On May 8, 2015, at 3:06 PM, Brian Topping brian.topp...@gmail.com wrote:

Indeed, this was all that was left to get jobs working, thanks!

Last thing I need to do for initial setup is get rid of the thousands of these messages, about three or four per second. I'm running against 2.6.0-mr1-cdh5.4.0; maybe there was a change to the API semantics.

2015-05-08 03:33:24,421 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
2015-05-08 03:33:24,724 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
2015-05-08 03:33:25,028 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
2015-05-08 03:33:25,331 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
Re: Debugging hadoop-mesos
Thanks guys, this was helpful. I started the job tracker as a service, but apparently I never started the task tracker (or it failed to start and I didn't notice). I started it after Haosdent's message, but wasn't able to see any difference and I kept poking around. After I made some changes and the VM wouldn't boot, my OCD got the better of me and I reinstalled everything from scratch. There are just too many moving parts to hassle you guys with an imperfect install on my end.

This time through, I felt a lot more confident using the Mesosphere RPMs, but I couldn't find the best way to get things launched. https://docs.mesosphere.com/reference/packages/ has a Last-Modified of Fri, 01 May 2015 18:46:10 GMT (one week ago), but the RHEL 6 RPMs don't have any init.d service descriptions as the packages page would indicate. For now, I just launched them manually, but would like to get the machine to completely load on boot as services.

At this point, I have tested Mesos with:

mesos-execute --master=localhost:5050 --name=test-exec --command="sleep 10"

The only problem there is it seems that localhost isn't good enough for my install; it needs to be the FQDN, but it works and the job flows through the UI.

Now, back to a hadoop job. When I try the job now, the logs show the following stream of repeated messages:

2015-05-07 17:52:53,124 INFO org.apache.hadoop.mapred.ResourcePolicy: Satisfied map and reduce slots needed.
2015-05-07 17:52:53,340 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
[Repeated a few times a second for five seconds]
2015-05-07 17:49:08,914 INFO org.apache.hadoop.mapred.ResourcePolicy: JobTracker Status
      Pending Map Tasks: 4
   Pending Reduce Tasks: 1
      Running Map Tasks: 0
   Running Reduce Tasks: 0
         Idle Map Slots: 0
      Idle Reduce Slots: 0
     Inactive Map Slots: 4 (launched but no hearbeat yet)
  Inactive Reduce Slots: 1 (launched but no hearbeat yet)
       Needed Map Slots: 0
    Needed Reduce Slots: 0
     Unhealthy Trackers: 0

This looks close. What's the best way to get a JDWP port set up to break in this code (i.e. learning to fish...)?

best, Brian

On May 7, 2015, at 12:11 PM, Adam Bordelon a...@mesosphere.io wrote:

From the mesos-master log and the JT log, it doesn't look like the MesosScheduler ever registered with Mesos, which should mean that it wouldn't start any TTs or map/reduce tasks. However, your `ps` output does seem to show a tasktracker running. Did you start that yourself (or automatically as a system service)?

On Wed, May 6, 2015 at 9:32 AM, haosdent haosd...@gmail.com wrote:

Did you start the tasktracker successfully?

On Wed, May 6, 2015 at 11:32 PM, Brian Topping brian.topp...@gmail.com wrote:

Hi all,

I'm happy to report that I'm very close to getting 2.6.0-cdh5.4.0 integrated against Mesos 0.22.1 with the hadoop-mesos 0.10 code on Github. Hoping someone might have a few minutes to parse what I've got here and suggest something to try.

https://gist.github.com/briantopping/0dfd0777ff4ce5a81219 hopefully has all the data necessary between the console output of the client run, the mesos master and slave console, the XML configuration of the JT and the output that was generated by it. Please let me know if I've left something out.

I iterated a few times getting all the errors from missing paths or libraries sorted out, but the example client ultimately just sits waiting forever at map 0% reduce 0%. Any input kindly appreciated!

Brian

--
Best Regards,
Haosdent Huang
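On the JDWP question raised earlier in this message, here is a minimal sketch, assuming the goal is to attach a remote debugger to a JVM over a socket. The class JdwpFlag, its methods, and port 5005 are hypothetical names and values picked for illustration; the -agentlib:jdwp option itself is the standard JVM debug agent flag. Where the option has to be placed so that the JobTracker or TaskTracker JVM picks it up depends on the install and is not shown here.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

class JdwpFlag {
    /** Builds the JDWP agent option for a socket listener on the given port. */
    static String agentOption(int port, boolean suspend) {
        // suspend=y makes the JVM wait for a debugger before running any code.
        return "-agentlib:jdwp=transport=dt_socket,server=y,suspend="
                + (suspend ? "y" : "n") + ",address=" + port;
    }

    /** Returns true if any of the given JVM arguments loads the JDWP agent. */
    static boolean hasJdwpAgent(List<String> jvmArgs) {
        for (String arg : jvmArgs) {
            if (arg.startsWith("-agentlib:jdwp") || arg.startsWith("-Xrunjdwp")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Port 5005 is an arbitrary choice for this sketch.
        System.out.println(agentOption(5005, false));
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("JDWP agent active in this JVM: " + hasJdwpAgent(jvmArgs));
    }
}
```

Once a JVM has been started with that option, an IDE's remote debug configuration pointed at the chosen host and port should be able to set breakpoints in the scheduler code; suspend=y is worth trying when the code of interest runs during startup.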
Re: Debugging hadoop-mesos
Thanks Tom! I do see activity in the cluster:

1. mesos-master.WARNING log -- sequence of repeated messages being generated:

W0507 18:10:21.794231 11729 master.cpp:2661] Cannot kill task Task_Tracker_34 of framework 20150507-164120-272093962-5050-11711-0003 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at scheduler-2fed30f4-bbbe-47a5-a587-42202c792150@10.211.55.16:35914 because it is unknown; performing reconciliation

2. The mesos-slave.WARNING log shows

W0507 17:42:50.385308 11753 slave.cpp:1783] Cannot shut down unknown framework 20150507-164120-272093962-5050-11711-0004

from about the time that the job was launched.

3. mesos-master.INFO log -- sequence of repeated messages being generated:

I0507 18:18:40.512228 11730 master.cpp:3760] Sending 1 offers to framework 20150507-164120-272093962-5050-11711-0003 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at scheduler-2fed30f4-bbbe-47a5-a587-42202c792150@10.211.55.16:35914
I0507 18:18:40.514377 11729 master.cpp:2273] Processing ACCEPT call for offers: [ 20150507-164120-272093962-5050-11711-O556 ] on slave 20150507-164120-272093962-5050-11711-S0 at slave(1)@10.211.55.16:5051 (10.211.55.16) for framework 20150507-164120-272093962-5050-11711-0003 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at scheduler-2fed30f4-bbbe-47a5-a587-42202c792150@10.211.55.16:35914
I0507 18:18:40.515120 11729 hierarchical.hpp:648] Recovered cpus(*):6; mem(*):2803; disk(*):45148; ports(*):[31000-32000] (total allocatable: cpus(*):6; mem(*):2803; disk(*):45148; ports(*):[31000-32000]) on slave 20150507-164120-272093962-5050-11711-S0 from framework 20150507-164120-272093962-5050-11711-0003
I0507 18:18:41.798447 11724 http.cpp:516] HTTP request for '/master/state.json'

4. mesos-slave.INFO has nothing but resource allocation messages showing current disk usage.

5. The UI shows several terminated frameworks and one active (the one above). But the detail screen for that framework says there are no active or completed tasks.

Does this help?
Re: SEGV in 'make check'
Also, I just checked this in the 0.22.1-RC6 and had the same problem.

On Apr 30, 2015, at 9:27 PM, Brian Topping brian.topp...@gmail.com wrote:

Greetings all,

I'm having a problem with my first attempts at building Mesos. I started the other day with CentOS 7 and quickly realized it was far better to be using 6.6. I've created a machine, but it's crashing in 'make check'. Till Toenshoff was kind enough to give me some leads in JIRA on what to do next, but they didn't change the stack trace to include symbols.

https://gist.github.com/briantopping/51197bad452dd3b3277c has the dump of what I've done. The first file shows everything done in an empty build directory and the crash at the end. The second file is a dump of the machine configuration: the uname output, /proc/cpuinfo and all the installed RPM packages.

I guess the first question is: why didn't ../configure --enable-debug produce proper symbols in the stack trace? Does anyone have suggestions on what I can try?

Cheers, Brian
Re: SEGV in 'make check'
Getting closer. After finding http://garyzhu.net/notes/CentOS7-Systemd-Mesos-Marathon.html, I set up another new CentOS 7 machine, got a lot further on the compile this time, symbols too! This is with 0.22.1-RC6, CentOS Linux release 7.1.1503, kernel 3.10.0-229.1.2.el7.x86_64.

Output from the last test during a make check:

    [--] 1 test from PerfEventIsolatorTest
    [ RUN ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample
    F0430 14:03:34.169455 13504 isolator_tests.cpp:710] CHECK_SOME(isolator): Failed to create PerfEvent isolator, invalid events: { cycles, task-clock }
    *** Check failure stack trace: ***
    @ 0x7f4c7ecea4ca google::LogMessage::Fail()
    @ 0x7f4c7ecea429 google::LogMessage::SendToLog()
    @ 0x7f4c7ece9e3a google::LogMessage::Flush()
    @ 0x7f4c7ececb6e google::LogMessageFatal::~LogMessageFatal()
    @ 0xa265b2 _CheckFatal::~_CheckFatal()
    @ 0xc93e88 mesos::internal::tests::PerfEventIsolatorTest_ROOT_CGROUPS_Sample_Test::TestBody()
    @ 0x1135c4f testing::internal::HandleSehExceptionsInMethodIfSupported()
    @ 0x1130e0a testing::internal::HandleExceptionsInMethodIfSupported()
    @ 0x1119233 testing::Test::Run()
    @ 0x1119956 testing::TestInfo::Run()
    @ 0x1119ede testing::TestCase::Run()
    @ 0x111ec5a testing::internal::UnitTestImpl::RunAllTests()
    @ 0x1136ac1 testing::internal::HandleSehExceptionsInMethodIfSupported()
    @ 0x1131afb testing::internal::HandleExceptionsInMethodIfSupported()
    @ 0x111db0a testing::UnitTest::Run()
    @ 0xd28155 main
    @ 0x7f4c7a741af5 __libc_start_main
    @ 0x8fae89 (unknown)

Full results and machine configuration at https://gist.github.com/briantopping/ac4f320bcc24e14328cd. Not sure where to go with this, any insight appreciated!

On Apr 30, 2015, at 11:33 PM, Brian Topping brian.topp...@gmail.com wrote:

Also, I just checked this in the 0.22.1-RC6 and had the same problem.
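One way to narrow down the PerfEventIsolatorTest failure reported in this thread might be to check, outside the test suite, whether the perf tool on the machine accepts each event named in the error. The sketch below is an illustration only and not something proposed in the thread: it pulls the event names out of the "invalid events: { ... }" message and builds one "perf stat -e <event> -- true" command per event. The helper names are hypothetical, it assumes perf is installed, and it makes no claim about how Mesos itself validates events.

```python
import re
import shlex

# Matches the event list in a message such as:
#   "... invalid events: { cycles, task-clock }"
_EVENTS_RE = re.compile(r"invalid events:\s*\{([^}]*)\}")


def invalid_events(log_line):
    """Return the event names listed in an 'invalid events: { ... }' message."""
    match = _EVENTS_RE.search(log_line)
    if match is None:
        return []
    return [name.strip() for name in match.group(1).split(",") if name.strip()]


def perf_probe_commands(events):
    """Build one 'perf stat' argv per event so each can be probed separately."""
    return [["perf", "stat", "-e", event, "--", "true"] for event in events]


if __name__ == "__main__":
    line = (
        "F0430 14:03:34.169455 13504 isolator_tests.cpp:710] CHECK_SOME(isolator): "
        "Failed to create PerfEvent isolator, invalid events: { cycles, task-clock }"
    )
    # Print the commands rather than running them, since perf may be absent.
    for argv in perf_probe_commands(invalid_events(line)):
        print(shlex.join(argv))
```

Running the printed commands by hand and reading perf's own output for each event separates "perf cannot use this event on this machine" from a problem inside the test itself.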