OK, I stared at the code for a long time and came up with 
https://github.com/mesos/hadoop/pull/55. In hindsight it should have been 
two PRs, one for the cleanups and method shuffling and one for the meat of 
the changes; sorry about that. The PR itself should have a decent 
description. Please feel free to ask questions or critique it in the PR.

It seems like the build needs help with unit testing and the release process. 
I think there will need to be a CI build that builds against the various 
versions of CDH and assigns each version to an artifact classifier before the 
artifacts can be easily managed on Central. I'm happy to pitch in on these if 
anyone is interested. Testing this kind of code is a little tricky, but it 
generally results in better patterns when it's all finished.

Thanks for all of your help!! I'm looking forward to starting what I came to 
this stack to work on :)

Brian

> On May 8, 2015, at 3:06 PM, Brian Topping <[email protected]> wrote:
> 
> Indeed, this was all that was left to get jobs working, thanks!
> 
> The last thing I need to do for initial setup is get rid of the thousands of 
> these messages, which appear about three or four times per second. I'm 
> running against 2.6.0-mr1-cdh5.4.0; maybe there was a change to the API 
> semantics.
> 
>> 2015-05-08 03:33:24,421 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
>> 2015-05-08 03:33:24,724 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
>> 2015-05-08 03:33:25,028 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
>> 2015-05-08 03:33:25,331 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
>> 2015-05-08 03:33:25,636 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
>> 2015-05-08 03:33:25,940 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
>> 2015-05-08 03:33:26,243 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
>> 2015-05-08 03:33:26,546 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
>> 2015-05-08 03:33:26,850 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
>> 2015-05-08 03:33:27,153 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
>> 2015-05-08 03:33:27,456 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
>> 
>> On May 8, 2015, at 2:47 PM, haosdent <[email protected] 
>> <mailto:[email protected]>> wrote:
>> 
>> I think you could temporarily export HADOOP_LOG_DIR=/tmp and try again.
>> 
>> On Fri, May 8, 2015 at 3:43 PM, Brian Topping <[email protected] 
>> <mailto:[email protected]>> wrote:
>> Mesos runs as root; Hadoop runs as a separate user.
>> 
>>> On May 8, 2015, at 2:41 PM, haosdent <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> 
>>> Do you run everything as root?
>>> 
>>> On Fri, May 8, 2015 at 3:38 PM, haosdent <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> It seems you don't have permission for this directory:
>>> 
>>> java.io.IOException: Could not create job user log directory: 
>>> file:/usr/lib/hadoop/logs/userlogs/job_201505080220_0001
>>> 
>>>     at javax.security.auth.Subject.doAs(Subject.java:415)
>>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
>>> 
>>> 
>>> On Fri, May 8, 2015 at 3:32 PM, Brian Topping <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> Thanks Haosdent, I've updated 
>>> https://gist.github.com/briantopping/311960f8e5454dbe9aab with the output 
>>> logs of what I am currently seeing. I've edited them for size; the message 
>>> "INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: 
>>> http://10.211.55.16:50060" appeared a few thousand times in the logs. The 
>>> configuration I have is probably still broken; 50060 is a Jetty port that 
>>> returns a Cloudera string when telnetting to it.
>>> 
>>> The errors I saw below were apparently the result of building against the 
>>> older version of CDH. When I updated the hadoop-mesos POM to match my 
>>> deployment version, the incorrectly calculated "slots" problem in my 
>>> previous message was resolved.
>>> 
>>> My current problem is a Hadoop logging problem that has nothing to do with 
>>> Mesos, so I didn't post. I set hadoop.log.dir=/var/log/hadoop in 
>>> /etc/hadoop/conf.pseudo.mr1/log4j.properties, but it didn't make any 
>>> difference. Just getting back into it now.
>>> 
>>>> On May 8, 2015, at 1:56 PM, haosdent <[email protected] 
>>>> <mailto:[email protected]>> wrote:
>>>> 
>>>> Could you post the logs from the executors that run the jobtracker and 
>>>> tasktrackers? It would help to find the cause of this problem.
>>>> 
>>>> On Fri, May 8, 2015 at 3:05 AM, Brian Topping <[email protected] 
>>>> <mailto:[email protected]>> wrote:
>>>> I think there's something weird here:
>>>>>   cpus: offered 2.0 needed at least 1.0
>>>>>   mem : offered 1724.0 needed at least 1024.0
>>>>>   disk: offered 44124.0 needed at least 1024.0
>>>>>   ports:  at least 2 (sufficient)
>>>> 
>>>> Am I misreading this? All of the requirements seem to be met.
>>>> 
>>>> Presumably it's this code from o.a.h.mapred.ResourcePolicyVariable:
>>>> 
>>>>> int slots = mapSlotsMax + reduceSlotsMax;
>>>>> slots = (int) Math.min(slots, (cpus - containerCpus) / slotCpus);
>>>>> slots = (int) Math.min(slots, (mem - containerMem) / slotMem);
>>>>> slots = (int) Math.min(slots, (disk - containerDisk) / slotDisk);
>>>>> 
>>>>> // Is this offer too small for even the minimum slots?
>>>>> if (slots < 1) {
>>>>>   return false;
>>>>> }
>>>> 
>>>> Not exactly sure what this is doing.
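A minimal sketch of what the quoted snippet appears to compute, assuming it runs once per resource offer. The container and per-slot values used below (1.0 for CPUs, 1024 for memory, 1024 for disk) are assumptions chosen to match the "needed at least" figures in the log, not values confirmed from the hadoop-mesos configuration. The slot maximums of 4 and 1 are likewise assumed, and `SlotSketch` and `slotsForOffer` are hypothetical names. Under those assumptions, an offer can clear every "at least" minimum and still leave room for less than one slot:

```java
public class SlotSketch {
    // Mirrors the arithmetic quoted from ResourcePolicyVariable.
    // Every argument passed in main() is an assumed, illustrative value.
    static int slotsForOffer(int mapSlotsMax, int reduceSlotsMax,
                             double cpus, double mem, double disk,
                             double containerCpus, double containerMem, double containerDisk,
                             double slotCpus, double slotMem, double slotDisk) {
        int slots = mapSlotsMax + reduceSlotsMax;
        // Subtract the fixed container overhead first, then divide what is
        // left by the per-slot cost; the cast truncates toward zero.
        slots = (int) Math.min(slots, (cpus - containerCpus) / slotCpus);
        slots = (int) Math.min(slots, (mem - containerMem) / slotMem);
        slots = (int) Math.min(slots, (disk - containerDisk) / slotDisk);
        return slots;
    }

    public static void main(String[] args) {
        // Offer from the log: cpus 2.0, mem 1724.0, disk 44124.0.
        int slots = slotsForOffer(4, 1,
                2.0, 1724.0, 44124.0,
                1.0, 1024.0, 1024.0,   // assumed container overhead
                1.0, 1024.0, 1024.0);  // assumed per-slot cost
        System.out.println("slots = " + slots);
        System.out.println("offer accepted = " + (slots >= 1));
    }
}
```

With these assumed values the memory term is (1724 - 1024) / 1024, roughly 0.68, which truncates to 0 slots, so the `slots < 1` check would reject the offer even though the logged minimums look satisfied.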
>>>> 
>>>> Sorry for the noise.
>>>> 
>>>>> 
>>>>> On May 7, 2015, at 6:32 PM, Brian Topping <[email protected] 
>>>>> <mailto:[email protected]>> wrote:
>>>>> 
>>>>> Presumably https://gist.github.com/briantopping/311960f8e5454dbe9aab has 
>>>>> some more of the information needed at this point... sorry for the 
>>>>> omission.
>>>>> 
>>>>>> On May 7, 2015, at 6:05 PM, Tom Arnfeld <[email protected] 
>>>>>> <mailto:[email protected]>> wrote:
>>>>>> 
>>>>>> Hi Brian,
>>>>>> 
>>>>>> At this point you should see the TT attempting to be launched via Mesos. 
>>>>>> The "launched but not heartbeat yet" count tells us that the framework 
>>>>>> has accepted resources for 4 slots but the TT hasn't actually come up 
>>>>>> yet.
>>>>>> 
>>>>>> Do you see the task in your Mesos cluster UI, and is there anything 
>>>>>> interesting in the task logs?
>>>>>> 
>>>>>> --
>>>>>> 
>>>>>> Tom Arnfeld
>>>>>> Developer // DueDil
>>>>>> 
>>>>>> (+44) 7525940046
>>>>>> 25 Christopher Street, London, EC2A 2BS
>>>>>> 
>>>>>> 
>>>>>> On Thu, May 7, 2015 at 12:01 PM, Brian Topping <[email protected] 
>>>>>> <mailto:[email protected]>> wrote:
>>>>>> 
>>>>>> Thanks guys, this was helpful. I started the job tracker as a service, 
>>>>>> but apparently I never started the task tracker (or it failed to start 
>>>>>> and I didn't notice). I started it after Haosdent's message, but wasn't 
>>>>>> able to see any difference and I kept poking around.
>>>>>> 
>>>>>> After I made some changes, the VM wouldn't boot, so my OCD got the 
>>>>>> better of me and I reinstalled everything from scratch. There are just 
>>>>>> too many moving parts to hassle you guys with an imperfect install on my 
>>>>>> end.
>>>>>> 
>>>>>> This time through, I felt a lot more confident using the Mesosphere 
>>>>>> RPMs, but I couldn't find the best way to get things launched. 
>>>>>> https://docs.mesosphere.com/reference/packages/ has a Last-Modified of 
>>>>>> Fri, 01 May 2015 18:46:10 GMT (one week ago), but the RHEL 6 RPMs don't 
>>>>>> have any init.d service descriptions as the packages page would 
>>>>>> indicate. For now, I just launched them manually, but I would like to 
>>>>>> get everything on the machine to start as services on boot.
>>>>>> 
>>>>>> At this point, I have tested Mesos with:
>>>>>> 
>>>>>>  mesos-execute --master="localhost:5050" --name="test-exec" --command="sleep 10"
>>>>>> 
>>>>>> The only problem there is that "localhost" doesn't seem to be good 
>>>>>> enough for my install; it needs to be the FQDN. But it works and the 
>>>>>> job flows through the UI.
>>>>>> 
>>>>>> Now, back to a hadoop job. When I try the job now, the logs show the 
>>>>>> following stream of repeated messages:
>>>>>> 
>>>>>>> 2015-05-07 17:52:53,124 INFO org.apache.hadoop.mapred.ResourcePolicy: Satisfied map and reduce slots needed.
>>>>>>> 2015-05-07 17:52:53,340 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
>>>>>>> [Repeated a few times a second for five seconds]
>>>>>>> 2015-05-07 17:49:08,914 INFO org.apache.hadoop.mapred.ResourcePolicy: JobTracker Status
>>>>>>>       Pending Map Tasks: 4
>>>>>>>    Pending Reduce Tasks: 1
>>>>>>>       Running Map Tasks: 0
>>>>>>>    Running Reduce Tasks: 0
>>>>>>>          Idle Map Slots: 0
>>>>>>>       Idle Reduce Slots: 0
>>>>>>>      Inactive Map Slots: 4 (launched but no hearbeat yet)
>>>>>>>   Inactive Reduce Slots: 1 (launched but no hearbeat yet)
>>>>>>>        Needed Map Slots: 0
>>>>>>>     Needed Reduce Slots: 0
>>>>>>>      Unhealthy Trackers: 0
>>>>>> 
>>>>>> This looks close.
>>>>>> 
>>>>>> What's the best way to get a JDWP port set up to break in this code 
>>>>>> (i.e. learning to fish...)?
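On the JDWP question, one common approach is to start the target JVM with the standard `-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005` option and attach a remote debugger to that port. How the option reaches the JobTracker JVM depends on the install; passing it through `HADOOP_JOBTRACKER_OPTS` in hadoop-env.sh is an assumption here, not something confirmed in this thread, and port 5005 is an arbitrary choice. The sketch below uses a hypothetical helper, `JdwpCheck`, to confirm from inside a JVM that the agent option was actually picked up:

```java
import java.lang.management.ManagementFactory;
import java.util.List;

public class JdwpCheck {
    // Returns true if any JVM argument enables the JDWP debug agent,
    // in either the current (-agentlib:jdwp) or legacy (-Xrunjdwp) form.
    static boolean hasJdwpAgent(List<String> jvmArgs) {
        for (String arg : jvmArgs) {
            if (arg.startsWith("-agentlib:jdwp") || arg.startsWith("-Xrunjdwp")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Arguments the running JVM was started with (excludes main() args).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("JDWP agent enabled: " + hasJdwpAgent(jvmArgs));
    }
}
```

With `suspend=n` the JVM starts normally and waits for a debugger to attach at any time; `suspend=y` would block startup until one connects, which can help when the code of interest runs early.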
>>>>>> 
>>>>>> best, Brian
>>>>>> 
>>>>>> 
>>>>>>> On May 7, 2015, at 12:11 PM, Adam Bordelon <[email protected] 
>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>> 
>>>>>>> From the mesos-master log and the JT log, it doesn't look like the 
>>>>>>> MesosScheduler ever registered with Mesos, which should mean that it 
>>>>>>> wouldn't start any TTs or map/reduce tasks. However, your `ps` output 
>>>>>>> does seem to show a tasktracker running. Did you start that yourself 
>>>>>>> (or automatically as a system service)?
>>>>>>> 
>>>>>>> On Wed, May 6, 2015 at 9:32 AM, haosdent <[email protected] 
>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>> Did you start the tasktracker successfully?
>>>>>>> 
>>>>>>> On Wed, May 6, 2015 at 11:32 PM, Brian Topping <[email protected] 
>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>> Hi all, I'm happy to report that I'm very close to getting 
>>>>>>> 2.6.0-cdh5.4.0 integrated against Mesos 0.22.1 with the hadoop-mesos 
>>>>>>> 0.10 code on Github. Hoping someone might have a few minutes to parse 
>>>>>>> what I've got here and suggest something to try.
>>>>>>> 
>>>>>>> https://gist.github.com/briantopping/0dfd0777ff4ce5a81219 hopefully 
>>>>>>> has all the data necessary between the console output of the client 
>>>>>>> run, the Mesos master and slave console, the XML configuration of the 
>>>>>>> JT and the output that was generated by it. Please let me know if I've 
>>>>>>> left something out.
>>>>>>> 
>>>>>>> I iterated a few times getting all the errors from missing paths or 
>>>>>>> libraries sorted out, but the example client ultimately just sits 
>>>>>>> waiting forever at "map 0% reduce 0%".
>>>>>>> 
>>>>>>> Any input kindly appreciated!
>>>>>>> 
>>>>>>> Brian
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> Best Regards,
>>>>>>> Haosdent Huang
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Best Regards,
>>>> Haosdent Huang
>>> 
>>> 
>>> 
>>> 
>>> --
>>> Best Regards,
>>> Haosdent Huang
>>> 
>>> 
>>> 
>>> --
>>> Best Regards,
>>> Haosdent Huang
>> 
>> 
>> 
>> 
>> --
>> Best Regards,
>> Haosdent Huang
> 
