Re: Debugging hadoop-mesos

Brian Topping Thu, 07 May 2015 04:27:08 -0700

Thanks Tom! I do see activity in the cluster:

1. mesos-master.WARNING log -- a sequence of repeated messages is being generated:


> W0507 18:10:21.794231 11729 master.cpp:2661] Cannot kill task Task_Tracker_34 
> of framework 20150507-164120-272093962-5050-11711-0003 (Hadoop: (RPC port: 
> 9001, WebUI port: 50030)) at 
> [email protected]:35914 because it 
> is unknown; performing reconciliation

2. The mesos-slave.WARNING log shows "W0507 17:42:50.385308 11753 
slave.cpp:1783] Cannot shut down unknown framework 
20150507-164120-272093962-5050-11711-0004" from about the time that the job was 
launched.

3. mesos-master.INFO log -- a sequence of repeated messages is being generated:

> I0507 18:18:40.512228 11730 master.cpp:3760] Sending 1 offers to framework 
> 20150507-164120-272093962-5050-11711-0003 (Hadoop: (RPC port: 9001, WebUI 
> port: 50030)) at 
> [email protected]:35914
> I0507 18:18:40.514377 11729 master.cpp:2273] Processing ACCEPT call for 
> offers: [ 20150507-164120-272093962-5050-11711-O556 ] on slave 
> 20150507-164120-272093962-5050-11711-S0 at slave(1)@10.211.55.16:5051 
> (10.211.55.16) for framework 20150507-164120-272093962-5050-11711-0003 
> (Hadoop: (RPC port: 9001, WebUI port: 50030)) at 
> [email protected]:35914
> I0507 18:18:40.515120 11729 hierarchical.hpp:648] Recovered cpus(*):6; 
> mem(*):2803; disk(*):45148; ports(*):[31000-32000] (total allocatable: 
> cpus(*):6; mem(*):2803; disk(*):45148; ports(*):[31000-32000]) on slave 
> 20150507-164120-272093962-5050-11711-S0 from framework 
> 20150507-164120-272093962-5050-11711-0003
> I0507 18:18:41.798447 11724 http.cpp:516] HTTP request for 
> '/master/state.json'

4. mesos-slave.INFO has nothing but resource allocation messages showing 
current disk usage.

5. The UI shows several terminated frameworks and one active (the one above). 
But the detail screen for that framework says there are no active or completed 
tasks.
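
A rough way to see which of these repeated messages dominates is to tally the glog lines by severity and source location. This is only a sketch: the function name tally_glog is made up for illustration, and the log path in the trailing comment is an assumption that depends on the --log_dir in use.

```shell
#!/bin/sh
# Tally glog-format lines by severity and source location.
# Reads log text on stdin; prints "count severity file:line", most frequent first.
# tally_glog is a hypothetical helper name, not part of Mesos.
tally_glog() {
  awk '
    # glog lines look like: W0507 18:10:21.794231 11729 master.cpp:2661] message
    /^[IWEF][0-9][0-9][0-9][0-9] / {
      loc = $4
      sub(/\]$/, "", loc)            # drop the trailing "]" after file:line
      counts[substr($1, 1, 1) " " loc]++
    }
    END {
      for (key in counts) print counts[key], key
    }
  ' | sort -k1,1nr -k2
}

# Example (the path is an assumption; adjust to your --log_dir):
#   tally_glog < /var/log/mesos/mesos-master.WARNING
```

Lines that do not start with a glog severity letter and date (wrapped continuations, for instance) are skipped rather than counted.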

Does this help?

> On May 7, 2015, at 6:05 PM, Tom Arnfeld <[email protected]> wrote:
> 
> Hi Brian,
> 
> At this point you should see the TT attempting to be launched via Mesos. The 
> "launched but not heartbeat yet" count tells us that the framework has 
> accepted resources for 4 slots but the TT hasn't actually come up yet.
> 
> Do you see the task in your Mesos cluster UI, and is there anything 
> interesting in the task logs?
> 
> --
> 
> Tom Arnfeld
> Developer // DueDil
> 
> (+44) 7525940046
> 25 Christopher Street, London, EC2A 2BS
> 
> 
> On Thu, May 7, 2015 at 12:01 PM, Brian Topping <[email protected]> wrote:
> 
> Thanks guys, this was helpful. I started the job tracker as a service, but 
> apparently I never started the task tracker (or it failed to start and I 
> didn't notice). I started it after Haosdent's message, but wasn't able to see 
> any difference and I kept poking around.
> 
> After I made some changes and the VM wouldn't boot, my OCD got the better of 
> me and I reinstalled everything from scratch. There are just too many moving 
> parts to hassle you guys with an imperfect install on my end.
> 
> This time through, I felt a lot more confident to use the Mesosphere RPMs, 
> but I couldn't find the best way to get things launched. 
> https://docs.mesosphere.com/reference/packages/ has a Last-Modified of Fri, 
> 01 May 2015 18:46:10 GMT (one week ago), but the RHEL 6 RPMs don't have any 
> init.d service descriptions as the packages page would indicate. For now, I 
> just launched them manually, but would like to get the machine to completely 
> load on boot as services.
> 
> At this point, I have tested Mesos with:
> 
>       mesos-execute --master="localhost:5050" --name="test-exec" --command="sleep 10"
> 
> The only problem there is that "localhost" doesn't seem to be good enough for 
> my install; it needs to be the FQDN. With the FQDN it works and the job flows 
> through the UI.
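
One way to avoid hard-coding the FQDN in the test command is to build the master address from the host's own fully qualified name. A minimal sketch, assuming `hostname -f` returns the name the master is reachable under; master_addr is a made-up helper name.

```shell
#!/bin/sh
# Print "<fqdn>:<port>" for this host.
# master_addr is a hypothetical helper; the port defaults to 5050,
# the port used in the mesos-execute command above.
master_addr() {
  port="${1:-5050}"
  printf '%s:%s' "$(hostname -f)" "$port"
}

# Usage (same flags as the command above, only the master address changes):
#   mesos-execute --master="$(master_addr)" --name="test-exec" --command="sleep 10"
```

If `hostname -f` prints a short name or fails on a given machine, the address has to be supplied by hand instead.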
> 
> Now, back to a hadoop job. When I try the job now, the logs show the 
> following stream of repeated messages:
> 
>> 2015-05-07 17:52:53,124 INFO org.apache.hadoop.mapred.ResourcePolicy: 
>> Satisfied map and reduce slots needed.
>> 2015-05-07 17:52:53,340 INFO org.apache.hadoop.mapred.MesosScheduler: 
>> Unknown/exited TaskTracker: http://10.211.55.16:50060.
>> [Repeated a few times a second for five seconds]
>> 2015-05-07 17:49:08,914 INFO org.apache.hadoop.mapred.ResourcePolicy: 
>> JobTracker Status
>>       Pending Map Tasks: 4
>>    Pending Reduce Tasks: 1
>>       Running Map Tasks: 0
>>    Running Reduce Tasks: 0
>>          Idle Map Slots: 0
>>       Idle Reduce Slots: 0
>>      Inactive Map Slots: 4 (launched but no hearbeat yet)
>>   Inactive Reduce Slots: 1 (launched but no hearbeat yet)
>>        Needed Map Slots: 0
>>     Needed Reduce Slots: 0
>>      Unhealthy Trackers: 0
> 
> This looks close.
> 
> What's the best way to get a JDWP port set up to break in this code (i.e. 
> learning to fish...)?
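
One possible approach, not confirmed anywhere in this thread: pass the standard JVM JDWP agent flag to the JobTracker, assuming the MesosScheduler runs inside the JobTracker JVM and that the JobTracker is started by scripts that source hadoop-env.sh and honor HADOOP_JOBTRACKER_OPTS. The helper name jdwp_flag and port 5005 are arbitrary choices for illustration.

```shell
#!/bin/sh
# Build a JDWP agent flag for a given port (standard -agentlib:jdwp syntax).
# server=y makes the JVM listen for a debugger; suspend=n lets it start
# without waiting for one to attach.
# jdwp_flag is a hypothetical helper name.
jdwp_flag() {
  port="$1"
  printf '%s' "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=${port}"
}

# Assumed usage in hadoop-env.sh (port 5005 is arbitrary):
#   export HADOOP_JOBTRACKER_OPTS="$HADOOP_JOBTRACKER_OPTS $(jdwp_flag 5005)"
```

After a JobTracker restart, a debugger can attach to that port on the JobTracker host. Using suspend=y instead would hold the JVM at startup until the debugger connects, which helps when the code of interest runs early.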
> 
> best, Brian
> 
> 
>> On May 7, 2015, at 12:11 PM, Adam Bordelon <[email protected]> wrote:
>> 
>> From the mesos-master log and the JT log, it doesn't look like the 
>> MesosScheduler ever registered with Mesos, which should mean that it 
>> wouldn't start any TTs or map/reduce tasks. However, your `ps` output does 
>> seem to show a tasktracker running. Did you start that yourself (or 
>> automatically as a system service)?
>> 
>> On Wed, May 6, 2015 at 9:32 AM, haosdent <[email protected]> wrote:
>> Did you start the tasktracker successfully?
>> 
>> On Wed, May 6, 2015 at 11:32 PM, Brian Topping <[email protected]> wrote:
>> Hi all, I'm happy to report that I'm very close to getting 2.6.0-cdh5.4.0 
>> integrated against Mesos 0.22.1 with the hadoop-mesos 0.10 code on Github. 
>> Hoping someone might have a few minutes to parse what I've got here and 
>> suggest something to try.
>> 
>> https://gist.github.com/briantopping/0dfd0777ff4ce5a81219 hopefully has 
>> all the data necessary between the console output of the client run, the 
>> mesos master and slave console, the XML configuration of the JT and the 
>> output that was generated by it. Please let me know if I've left something 
>> out.
>> 
>> I iterated a few times getting all the errors from missing paths or 
>> libraries sorted out, but the example client ultimately just sits waiting 
>> forever at "map 0% reduce 0%".
>> 
>> Any input kindly appreciated!
>> 
>> Brian
>> 
>> 
>> 
>> --
>> Best Regards,
>> Haosdent Huang
>> 
> 
