Hi Dmitry,

Did you check the MR AM logs to see if the node was blacklisted for too many 
container failures?
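
In case it helps, here is a minimal sketch for scanning a saved copy of the AM log. It assumes the relevant lines contain the word "blacklist" together with the hostname; I have not checked the exact wording against 2.7.1 output. The function name, the hostnames and the sample lines are all made up for illustration.

```python
def find_blacklist_lines(lines, host):
    """Return log lines that mention both blacklisting and the given host."""
    host = host.lower()
    hits = []
    for line in lines:
        lowered = line.lower()
        # Case-insensitive match, because the exact message wording is an assumption.
        if "blacklist" in lowered and host in lowered:
            hits.append(line.rstrip("\n"))
    return hits


# Hypothetical sample lines, NOT taken from a real AM log.
sample = [
    "2015-09-23 00:55:01,000 INFO [Thread-1] example: 3 failures on node node042.example.com",
    "2015-09-23 00:55:01,001 INFO [Thread-1] example: Blacklisted host node042.example.com",
    "2015-09-23 00:55:02,000 INFO [Thread-1] example: Blacklisted host node007.example.com",
]

hits = find_blacklist_lines(sample, "node042.example.com")
for hit in hits:
    print(hit)
```

To run it against a real log, pass an open file object instead of the sample list, using whatever path your AM log was saved to.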

-Varun



On 9/23/15, 12:26 PM, "Dmitry Sivachenko" <[email protected]> wrote:

>
>> On 23 Sep 2015, at 7:02, Naganarasimha G R (Naga) 
>> <[email protected]> wrote:
>> 
>> Hi Dmitry,
>> This seems to be an interesting case; I would like some more clarification 
>> in this regard:
>> 1. How many NMs are there? Is it a heterogeneous cluster, or do all the 
>> nodes have the same resource capacity? With 3000 cores and identical 
>> configs I would expect around 100 nodes; am I correct?
>
>
>I have 1 NN (and 1 SNN).
>To be precise, I have 113 32-core machines assigned to run jobs (113*32=3616 
>total VCores)
>
>
>> 2. How many applications are running and how many have finished 
>> (basically, how many are available in the RM)? By 35000, do you mean 
>> finished and running applications?
>
>There was 1 application running at that time (with 35000 map tasks).
>
>
>> 3. Do tasks get assigned after some time? Also, is it only this host that 
>> is not getting containers assigned, or are other hosts not getting any 
>> either?
>
>
>This machine was excluded from running tasks for that job. It got tasks 
>assigned after almost 1.5 hours, when the first job (during which the machine 
>failed) had finished and the next job was started; see the timestamps:
>
>
>
>2015-09-23 01:06:24,656 INFO  [main] nodemanager.NodeStatusUpdaterImpl 
>(NodeStatusUpdaterImpl.java:registerWithRM(311)) - Notifying ContainerManager 
>to unblock new container-requests
>2015-09-23 02:29:33,301 INFO  [Socket Reader #1 for port 10007] ipc.Server 
>(Server.java:saslProcess(1316)) - Auth successful for 
>appattempt_1441808341485_1975_000001 (auth:SIMPLE)
>
>
>The previous job (during which that node rebooted) did not run any more tasks 
>on this host.
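
For reference, the gap between the two log lines quoted above can be worked out with a short sketch like this one. It assumes every line starts with a 23-character timestamp in the layout shown in the excerpts; the helper name parse_log_ts is made up for illustration.

```python
from datetime import datetime

# Timestamp layout as it appears at the start of the quoted log lines.
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_log_ts(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS,mmm' timestamp of a log line."""
    return datetime.strptime(line[:23], LOG_TS_FORMAT)


# NodeManager re-registered with the RM after the reboot.
registered = parse_log_ts(
    "2015-09-23 01:06:24,656 INFO  [main] nodemanager.NodeStatusUpdaterImpl"
)
# First successful auth from an application attempt on that node.
first_auth = parse_log_ts(
    "2015-09-23 02:29:33,301 INFO  [Socket Reader #1 for port 10007] ipc.Server"
)

idle = first_auth - registered
print(idle)  # 1:23:08.645000
```

So the node sat idle for roughly 1 hour 23 minutes between registering and serving the next application attempt.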
>
>
>> 
>> I suspect this issue might be similar to YARN-3990, hence the above 
>> questions. You can also check the RM logs and let us know whether you see 
>> logs similar to those below:
>> 2015-07-29 19:39:03,416 | INFO  | AsyncDispatcher event handler | Size of 
>> event-queue is 14000 | AsyncDispatcher.java:235
>> 2015-07-29 19:39:03,417 | INFO  | AsyncDispatcher event handler | Size of 
>> event-queue is 15000 | AsyncDispatcher.java:235
>
>
>There were 2 of these:
>2015-09-23 00:54:39,623 INFO  [AsyncDispatcher event handler] 
>event.AsyncDispatcher (AsyncDispatcher.java:handle(235)) - Size of event-queue 
>is 1000
>2015-09-23 01:06:24,623 INFO  [AsyncDispatcher event handler] 
>event.AsyncDispatcher (AsyncDispatcher.java:handle(235)) - Size of event-queue 
>is 1000
>
>
>What do these mean?
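
To compare the reported queue sizes with the sample Naga posted, something like the following sketch could be run over the RM log. It only assumes that each line starts with a timestamp and contains the text "Size of event-queue is N", as in both excerpts in this thread (with the wrapped lines joined back together); the function and variable names are made up for illustration.

```python
import re

# Matches both log layouts quoted in this thread (bracketed and pipe-separated).
EVENT_QUEUE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}).*Size of event-queue is (\d+)"
)


def event_queue_sizes(lines):
    """Return (timestamp, queue_size) pairs for every event-queue log line."""
    sizes = []
    for line in lines:
        match = EVENT_QUEUE_RE.search(line)
        if match:
            sizes.append((match.group(1), int(match.group(2))))
    return sizes


# The two lines from Dmitry's RM log, with the email line wrapping undone.
rm_log = [
    "2015-09-23 00:54:39,623 INFO  [AsyncDispatcher event handler] "
    "event.AsyncDispatcher (AsyncDispatcher.java:handle(235)) - Size of event-queue is 1000",
    "2015-09-23 01:06:24,623 INFO  [AsyncDispatcher event handler] "
    "event.AsyncDispatcher (AsyncDispatcher.java:handle(235)) - Size of event-queue is 1000",
]
# The two sample lines from Naga's mail.
naga_sample = [
    "2015-07-29 19:39:03,416 | INFO  | AsyncDispatcher event handler | "
    "Size of event-queue is 14000 | AsyncDispatcher.java:235",
    "2015-07-29 19:39:03,417 | INFO  | AsyncDispatcher event handler | "
    "Size of event-queue is 15000 | AsyncDispatcher.java:235",
]

rm_peak = max(size for _, size in event_queue_sizes(rm_log))
sample_peak = max(size for _, size in event_queue_sizes(naga_sample))
print(rm_peak, sample_peak)  # 1000 15000
```

In the excerpts posted here the queue size is reported as 1000 both times, well below the 14000 to 15000 in Naga's sample.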
>
>
>> 
>> 
>> Regards,
>> + Naga
>> 
>> 
>> From: Dmitry Sivachenko [[email protected]]
>> Sent: Wednesday, September 23, 2015 03:57
>> To: [email protected]
>> Subject: node remains unused after reboot
>> 
>> Hello!
>> 
>> I am using hadoop-2.7.1. I have a large map job running (total cores 
>> available on the cluster about 3000, total tasks 35000).
>> In the middle of this process one server reboots.
>> 
>> After the reboot, the nodemanager starts successfully and registers with 
>> the resource manager:
>> 2015-09-23 01:06:24,656 INFO  [main] nodemanager.NodeStatusUpdaterImpl 
>> (NodeStatusUpdaterImpl.java:registerWithRM(311)) - Notifying 
>> ContainerManager to unblock new container-requests
>> 
>> In the YARN web interface I see this host as active, but VCores used 
>> remains zero (see screenshot).
>> But the map job mentioned is still running and has about 12000 pending 
>> tasks.
>> 
>> Why does this host not receive tasks to run?
>> 
>> PS: I recently upgraded from 2.4.1 and I did not notice such a problem with 
>> 2.4.1: new tasks were spawning immediately after reboot.
>> 
>> Thanks!
>> 
>> 
>> 
>> 
>> <Screen Shot 2015-09-23 at 1.22.10.png>
>
