Re: Debugging hadoop-mesos

Brian Topping Fri, 08 May 2015 00:36:07 -0700

Thanks Haosdent, I've updated 
https://gist.github.com/briantopping/311960f8e5454dbe9aab with the output 
logs of what I am currently seeing. I've edited them for size: the message 
"INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: 
http://10.211.55.16:50060" appeared a few thousand times in the logs. The 
configuration I have is probably still broken; 50060 is a Jetty port that 
returns a Cloudera string when I telnet to it.


The errors I saw below were apparently the result of building against an older 
version of CDH. When I updated the hadoop-mesos POM to match my deployment 
version, the incorrectly calculated "slots" problem from my previous message 
was resolved.
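
For anyone following along, the slot arithmetic quoted further down in this 
thread can be sketched in isolation. This is only a minimal sketch, not the 
hadoop-mesos source: the class name SlotEstimate, the method slotsForOffer, 
the slot cap of 5, and the container overhead and per-slot sizes are all 
assumptions chosen for illustration. Only the offer values (2.0 cpus, 1724.0 
mem, 44124.0 disk) come from the log.

```java
// Hypothetical stand-alone sketch of the quoted arithmetic; not the hadoop-mesos source.
final class SlotEstimate {

    /**
     * Caps the slot count by what remains of each offered resource after
     * subtracting the (assumed) TaskTracker container overhead.
     */
    static int slotsForOffer(int maxSlots,
                             double cpus, double mem, double disk,
                             double containerCpus, double containerMem, double containerDisk,
                             double slotCpus, double slotMem, double slotDisk) {
        int slots = maxSlots;
        // Each cast truncates toward zero, so a leftover smaller than one slot gives 0.
        slots = (int) Math.min(slots, (cpus - containerCpus) / slotCpus);
        slots = (int) Math.min(slots, (mem - containerMem) / slotMem);
        slots = (int) Math.min(slots, (disk - containerDisk) / slotDisk);
        return slots;
    }

    public static void main(String[] args) {
        // Offer values are from the log in this thread; everything else is assumed.
        int smallSlots = slotsForOffer(5, 2.0, 1724.0, 44124.0,
                1.0, 1024.0, 1024.0,    // assumed container overhead
                1.0, 512.0, 1024.0);    // assumed per-slot cost (512 MB memory)
        int largeSlots = slotsForOffer(5, 2.0, 1724.0, 44124.0,
                1.0, 1024.0, 1024.0,    // same assumed container overhead
                1.0, 1024.0, 1024.0);   // assumed per-slot cost (1024 MB memory)
        System.out.println("slots with 512 MB per slot:  " + smallSlots);
        System.out.println("slots with 1024 MB per slot: " + largeSlots);
    }
}
```

With these assumed values the first call yields 1 slot and the second yields 0, 
which would trip the "slots < 1" check even though the offer exceeds each 
"needed at least" figure on its own. Whether the real defaults behave this way 
depends on the configured overhead and per-slot sizes.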

My current problem is a Hadoop logging problem and has nothing to do with Mesos, so 
I didn't post. I set hadoop.log.dir=/var/log/hadoop in 
/etc/hadoop/conf.pseudo.mr1/log4j.properties, but it didn't make any 
difference. I'm just getting back into it now.

> On May 8, 2015, at 1:56 PM, haosdent wrote:
> 
> Could you post the logs from the executors that run the JobTracker and 
> TaskTrackers? That would help to find the cause of this problem.
> 
> On Fri, May 8, 2015 at 3:05 AM, Brian Topping wrote:
> I think there's something weird here:
>>   cpus: offered 2.0 needed at least 1.0
>>   mem : offered 1724.0 needed at least 1024.0
>>   disk: offered 44124.0 needed at least 1024.0
>>   ports:  at least 2 (sufficient)
> 
> Am I misreading this? All of the requirements seem to be met.
> 
> Presumably it's this code from o.a.h.mapred.ResourcePolicyVariable:
> 
>> int slots = mapSlotsMax + reduceSlotsMax;
>> slots = (int) Math.min(slots, (cpus - containerCpus) / slotCpus);
>> slots = (int) Math.min(slots, (mem - containerMem) / slotMem);
>> slots = (int) Math.min(slots, (disk - containerDisk) / slotDisk);
>> 
>> // Is this offer too small for even the minimum slots?
>> if (slots < 1) {
>>   return false;
>> }
> 
> Not exactly sure what this is doing.
> 
> Sorry for the noise.
> 
>> 
>> On May 7, 2015, at 6:32 PM, Brian Topping wrote:
>> 
>> Presumably https://gist.github.com/briantopping/311960f8e5454dbe9aab has some 
>> more of the information needed at this point. Sorry for the omission.
>> 
>>> On May 7, 2015, at 6:05 PM, Tom Arnfeld wrote:
>>> 
>>> Hi Brian,
>>> 
>>> At this point you should see the TT attempting to be launched via Mesos. 
>>> The "launched but not heartbeat yet" count tells us that the framework has 
>>> accepted resources for 4 slots but the TT hasn't actually come up yet.
>>> 
>>> Do you see the task in your Mesos cluster UI, and is there anything 
>>> interesting in the task logs?
>>> 
>>> --
>>> 
>>> Tom Arnfeld
>>> Developer // DueDil
>>> 
>>> (+44) 7525940046
>>> 25 Christopher Street, London, EC2A 2BS
>>> 
>>> 
>>> On Thu, May 7, 2015 at 12:01 PM, Brian Topping wrote:
>>> 
>>> Thanks guys, this was helpful. I started the job tracker as a service, but 
>>> apparently I never started the task tracker (or it failed to start and I 
>>> didn't notice). I started it after Haosdent's message, but wasn't able to 
>>> see any difference and I kept poking around.
>>> 
>>> After I made some changes and the VM wouldn't boot, my OCD got the better 
>>> of me and I reinstalled everything from scratch. There are just too many 
>>> moving parts to hassle you guys with an imperfect install on my end.
>>> 
>>> This time through, I felt a lot more confident to use the Mesosphere RPMs, 
>>> but I couldn't find the best way to get things launched. 
>>> https://docs.mesosphere.com/reference/packages/ has a Last-Modified of 
>>> Fri, 01 May 2015 18:46:10 GMT (one week ago), but the RHEL 6 RPMs don't 
>>> have any init.d service descriptions as the packages page would indicate. 
>>> For now, I just launched them manually, but would like to get the machine 
>>> to completely load on boot as services.
>>> 
>>> At this point, I have tested Mesos with:
>>> 
>>>     mesos-execute --master="localhost:5050" --name="test-exec" --command="sleep 10"
>>> 
>>> The only problem there is that "localhost" doesn't seem to be good enough 
>>> for my install; it needs to be the FQDN. With the FQDN it works and the job 
>>> flows through the UI.
>>> 
>>> Now, back to a hadoop job. When I try the job now, the logs show the 
>>> following stream of repeated messages:
>>> 
>>>> 2015-05-07 17:52:53,124 INFO org.apache.hadoop.mapred.ResourcePolicy: 
>>>> Satisfied map and reduce slots needed.
>>>> 2015-05-07 17:52:53,340 INFO org.apache.hadoop.mapred.MesosScheduler: 
>>>> Unknown/exited TaskTracker: http://10.211.55.16:50060.
>>>> [Repeated a few times a second for five seconds]
>>>> 2015-05-07 17:49:08,914 INFO org.apache.hadoop.mapred.ResourcePolicy: 
>>>> JobTracker Status
>>>>       Pending Map Tasks: 4
>>>>    Pending Reduce Tasks: 1
>>>>       Running Map Tasks: 0
>>>>    Running Reduce Tasks: 0
>>>>          Idle Map Slots: 0
>>>>       Idle Reduce Slots: 0
>>>>      Inactive Map Slots: 4 (launched but no hearbeat yet)
>>>>   Inactive Reduce Slots: 1 (launched but no hearbeat yet)
>>>>        Needed Map Slots: 0
>>>>     Needed Reduce Slots: 0
>>>>      Unhealthy Trackers: 0
>>> 
>>> This looks close.
>>> 
>>> What's the best way to get a JDWP port set up to break in this code (i.e. 
>>> learning to fish...)?
>>> 
>>> best, Brian
>>> 
>>> 
>>>> On May 7, 2015, at 12:11 PM, Adam Bordelon wrote:
>>>> 
>>>> From the mesos-master log and the JT log, it doesn't look like the 
>>>> MesosScheduler ever registered with Mesos, which should mean that it 
>>>> wouldn't start any TTs or map/reduce tasks. However, your `ps` output does 
>>>> seem to show a tasktracker running. Did you start that yourself (or 
>>>> automatically as a system service)?
>>>> 
>>>> On Wed, May 6, 2015 at 9:32 AM, haosdent wrote:
>>>> Did the tasktracker start successfully?
>>>> 
>>>> On Wed, May 6, 2015 at 11:32 PM, Brian Topping wrote:
>>>> Hi all, I'm happy to report that I'm very close to getting 2.6.0-cdh5.4.0 
>>>> integrated against Mesos 0.22.1 with the hadoop-mesos 0.10 code on Github. 
>>>> Hoping someone might have a few minutes to parse what I've got here and 
>>>> suggest something to try.
>>>> 
>>>> https://gist.github.com/briantopping/0dfd0777ff4ce5a81219 hopefully has 
>>>> all the data necessary between the console output of the client run, the 
>>>> mesos master and slave console, the XML configuration of the JT and the 
>>>> output that was generated by it. Please let me know if I've left something 
>>>> out.
>>>> 
>>>> I iterated a few times getting all the errors from missing paths or 
>>>> libraries sorted out, but the example client ultimately just sits waiting 
>>>> forever at "map 0% reduce 0%".
>>>> 
>>>> Any input kindly appreciated!
>>>> 
>>>> Brian
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Best Regards,
>>>> Haosdent Huang
>>>> 
>>> 
>>> 
>> 
> 
> 
> 
> 
> --
> Best Regards,
> Haosdent Huang
