Re: Debugging hadoop-mesos

Brian Topping Fri, 08 May 2015 00:42:49 -0700

That's correct, but /usr/lib/hadoop/logs doesn't even exist. It should be 
logging to /var/log/hadoop.
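
For anyone following along, a quick way to confirm which of the two 
directories exists and is writable is a small check like the one below. It is 
only a sketch: the class name LogDirCheck and the describe helper are made up 
for illustration, and the result only means something if it is run as the 
same user that runs the TaskTracker.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Hadoop or hadoop-mesos.
class LogDirCheck {
    // Reports the state of a directory as seen by the current process user.
    static String describe(Path dir) {
        if (!Files.exists(dir)) {
            return "missing";
        }
        if (!Files.isDirectory(dir)) {
            return "not a directory";
        }
        if (!Files.isWritable(dir)) {
            return "not writable";
        }
        return "ok";
    }

    public static void main(String[] args) {
        // Defaults are the two paths mentioned in this thread; pass others as arguments.
        String[] candidates = args.length > 0
                ? args
                : new String[] {"/usr/lib/hadoop/logs", "/var/log/hadoop"};
        for (String candidate : candidates) {
            System.out.println(candidate + ": " + describe(Paths.get(candidate)));
        }
    }
}
```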


> On May 8, 2015, at 2:38 PM, haosdent <[email protected]> wrote:
> 
> It seems you don't have permission to create this directory:
> 
> java.io.IOException: Could not create job user log directory: file:/usr/lib/hadoop/logs/userlogs/job_201505080220_0001
> 
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
> 
> 
> On Fri, May 8, 2015 at 3:32 PM, Brian Topping <[email protected]> wrote:
> Thanks Haosdent, I've updated 
> https://gist.github.com/briantopping/311960f8e5454dbe9aab with the output 
> logs of what I am currently seeing. I've edited them for size: the message 
> "INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: 
> http://10.211.55.16:50060" appeared a few thousand times in the logs. The 
> configuration I have is probably still broken; 50060 is a Jetty port that 
> returns a Cloudera string when I telnet to it.
> 
> The errors I saw below were apparently the result of building against the 
> older version of CDH. When I updated the hadoop-mesos POM to match my 
> deployment version, the incorrectly calculated "slots" problem from my 
> previous message was resolved.
> 
> My current problem is a Hadoop logging problem and has nothing to do with 
> Mesos, so I didn't post it. I set hadoop.log.dir=/var/log/hadoop in 
> /etc/hadoop/conf.pseudo.mr1/log4j.properties, but it didn't make any 
> difference. I'm just getting back into it now.
> 
>> On May 8, 2015, at 1:56 PM, haosdent <[email protected]> wrote:
>> 
>> Could you post the logs from the executors that run the JobTracker and 
>> TaskTrackers? They would help to find the cause of this problem.
>> 
>> On Fri, May 8, 2015 at 3:05 AM, Brian Topping <[email protected]> wrote:
>> I think there's something weird here:
>>>   cpus: offered 2.0 needed at least 1.0
>>>   mem : offered 1724.0 needed at least 1024.0
>>>   disk: offered 44124.0 needed at least 1024.0
>>>   ports:  at least 2 (sufficient)
>> 
>> Am I misreading this? All of the requirements seem to be met.
>> 
>> Presumably it's this code from o.a.h.mapred.ResourcePolicyVariable:
>> 
>>> int slots = mapSlotsMax + reduceSlotsMax;
>>> slots = (int) Math.min(slots, (cpus - containerCpus) / slotCpus);
>>> slots = (int) Math.min(slots, (mem - containerMem) / slotMem);
>>> slots = (int) Math.min(slots, (disk - containerDisk) / slotDisk);
>>> 
>>> // Is this offer too small for even the minimum slots?
>>> if (slots < 1) {
>>>   return false;
>>> }
>> 
>> Not exactly sure what this is doing.
>> 
>> Sorry for the noise.
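
One reading of the quoted calculation, as a sketch: the container's own 
overhead is subtracted from the offer before dividing by the per-slot size, 
so an offer can meet every "needed at least" figure and still yield zero 
slots. Only the offer figures below come from the log above. The container 
and slot sizes (1 CPU, 1024 memory, 1024 disk each) and the cap of 5 slots 
are assumptions for illustration, not values read from the posted 
configuration, and SlotCheck is a made-up class name.

```java
// Hypothetical stand-alone reproduction of the quoted slot arithmetic.
class SlotCheck {
    // ASSUMED sizes; the real values depend on the JobTracker configuration.
    static final double CONTAINER_CPUS = 1.0;
    static final double CONTAINER_MEM = 1024.0;
    static final double CONTAINER_DISK = 1024.0;
    static final double SLOT_CPUS = 1.0;
    static final double SLOT_MEM = 1024.0;
    static final double SLOT_DISK = 1024.0;

    // Same sequence of min() calls and int casts as the quoted snippet.
    static int slotsFor(int maxSlots, double cpus, double mem, double disk) {
        int slots = maxSlots;
        slots = (int) Math.min(slots, (cpus - CONTAINER_CPUS) / SLOT_CPUS);
        slots = (int) Math.min(slots, (mem - CONTAINER_MEM) / SLOT_MEM);
        slots = (int) Math.min(slots, (disk - CONTAINER_DISK) / SLOT_DISK);
        return slots;
    }

    public static void main(String[] args) {
        // Offer figures from the log: cpus 2.0, mem 1724.0, disk 44124.0.
        int slots = slotsFor(5, 2.0, 1724.0, 44124.0);
        System.out.println("slots = " + slots
                + (slots < 1 ? " (offer too small)" : " (offer usable)"));
    }
}
```

Under these assumed sizes the memory term is (1724 - 1024) / 1024, roughly 
0.68, which the int cast truncates to 0, so the offer is rejected even though 
1724 is larger than 1024.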
>> 
>>> 
>>> On May 7, 2015, at 6:32 PM, Brian Topping <[email protected]> wrote:
>>> 
>>> Presumably https://gist.github.com/briantopping/311960f8e5454dbe9aab has 
>>> the additional information needed at this point... sorry for the omission.
>>> 
>>>> On May 7, 2015, at 6:05 PM, Tom Arnfeld <[email protected]> wrote:
>>>> 
>>>> Hi Brian,
>>>> 
>>>> At this point you should see the TT attempting to be launched via Mesos. 
>>>> The "launched but not heartbeat yet" count tells us that the framework has 
>>>> accepted resources for 4 slots but the TT hasn't actually come up yet.
>>>> 
>>>> Do you see the task in your Mesos cluster UI, and is there anything 
>>>> interesting in the task logs?
>>>> 
>>>> --
>>>> 
>>>> Tom Arnfeld
>>>> Developer // DueDil
>>>> 
>>>> (+44) 7525940046
>>>> 25 Christopher Street, London, EC2A 2BS
>>>> 
>>>> 
>>>> On Thu, May 7, 2015 at 12:01 PM, Brian Topping <[email protected]> wrote:
>>>> 
>>>> Thanks guys, this was helpful. I started the job tracker as a service, but 
>>>> apparently I never started the task tracker (or it failed to start and I 
>>>> didn't notice). I started it after Haosdent's message, but wasn't able to 
>>>> see any difference and I kept poking around.
>>>> 
>>>> After I made some changes and the VM wouldn't boot, my OCD got the better 
>>>> of me and I reinstalled everything from scratch. There are just too many 
>>>> moving parts to hassle you guys with an imperfect install on my end.
>>>> 
>>>> This time through, I felt a lot more confident using the Mesosphere RPMs, 
>>>> but I couldn't find the best way to get things launched. 
>>>> https://docs.mesosphere.com/reference/packages/ has a Last-Modified of 
>>>> Fri, 01 May 2015 18:46:10 GMT (one week ago), but the RHEL 6 RPMs don't 
>>>> have any init.d service descriptions as the packages page would indicate. 
>>>> For now, I just launched them manually, but I would like to get the 
>>>> machine to load everything on boot as services.
>>>> 
>>>> At this point, I have tested Mesos with:
>>>> 
>>>>    mesos-execute --master="localhost:5050" --name="test-exec" --command="sleep 10"
>>>> 
>>>> The only problem there is that "localhost" doesn't seem to be good enough 
>>>> for my install; it needs to be the FQDN. With that change it works and 
>>>> the job flows through the UI.
>>>> 
>>>> Now, back to a hadoop job. When I try the job now, the logs show the 
>>>> following stream of repeated messages:
>>>> 
>>>>> 2015-05-07 17:52:53,124 INFO org.apache.hadoop.mapred.ResourcePolicy: Satisfied map and reduce slots needed.
>>>>> 2015-05-07 17:52:53,340 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
>>>>> [Repeated a few times a second for five seconds]
>>>>> 2015-05-07 17:49:08,914 INFO org.apache.hadoop.mapred.ResourcePolicy: JobTracker Status
>>>>>       Pending Map Tasks: 4
>>>>>    Pending Reduce Tasks: 1
>>>>>       Running Map Tasks: 0
>>>>>    Running Reduce Tasks: 0
>>>>>          Idle Map Slots: 0
>>>>>       Idle Reduce Slots: 0
>>>>>      Inactive Map Slots: 4 (launched but no hearbeat yet)
>>>>>   Inactive Reduce Slots: 1 (launched but no hearbeat yet)
>>>>>        Needed Map Slots: 0
>>>>>     Needed Reduce Slots: 0
>>>>>      Unhealthy Trackers: 0
>>>> 
>>>> This looks close.
>>>> 
>>>> What's the best way to get a JDWP port set up to break in this code (i.e. 
>>>> learning to fish...)?
>>>> 
>>>> best, Brian
>>>> 
>>>> 
>>>>> On May 7, 2015, at 12:11 PM, Adam Bordelon <[email protected]> wrote:
>>>>> 
>>>>> From the mesos-master log and the JT log, it doesn't look like the 
>>>>> MesosScheduler ever registered with Mesos, which should mean that it 
>>>>> wouldn't start any TTs or map/reduce tasks. However, your `ps` output 
>>>>> does seem to show a tasktracker running. Did you start that yourself (or 
>>>>> automatically as a system service)?
>>>>> 
>>>>> On Wed, May 6, 2015 at 9:32 AM, haosdent <[email protected]> wrote:
>>>>> Did you start the tasktracker successfully?
>>>>> 
>>>>> On Wed, May 6, 2015 at 11:32 PM, Brian Topping <[email protected]> wrote:
>>>>> Hi all, I'm happy to report that I'm very close to getting Hadoop 
>>>>> 2.6.0-cdh5.4.0 integrated against Mesos 0.22.1 with the hadoop-mesos 
>>>>> 0.10 code on GitHub. Hoping someone might have a few minutes to parse 
>>>>> what I've got here and suggest something to try.
>>>>> 
>>>>> https://gist.github.com/briantopping/0dfd0777ff4ce5a81219 hopefully has 
>>>>> all the data necessary between the console output of the client run, the 
>>>>> mesos master and slave console, the XML configuration of the JT and the 
>>>>> output that was generated by it. Please let me know if I've left 
>>>>> something out.
>>>>> 
>>>>> I iterated a few times getting all the errors from missing paths or 
>>>>> libraries sorted out, but the example client ultimately just sits waiting 
>>>>> forever at "map 0% reduce 0%".
>>>>> 
>>>>> Any input kindly appreciated!
>>>>> 
>>>>> Brian
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Best Regards,
>>>>> Haosdent Huang
>>>>> 
>>>> 
>>>> 
>>> 
>> 
>> 
>> 
>> 
>> --
>> Best Regards,
>> Haosdent Huang
> 
> 
> 
> 
> --
> Best Regards,
> Haosdent Huang
