Hi Dmitry,

Did you check the MR AM logs to see if the node was blacklisted for too many container failures?
-Varun

On 9/23/15, 12:26 PM, "Dmitry Sivachenko" <[email protected]> wrote:

>
>> On Sep 23, 2015, at 7:02, Naganarasimha G R (Naga)
>> <[email protected]> wrote:
>>
>> Hi Dmitry,
>> This seems to be an interesting case; I would like some more clarification:
>> 1. How many NMs? Is it a heterogeneous cluster, or do all the nodes have the
>> same resource capacity? With 3000 cores and identical configs I would expect
>> around 100 nodes; am I correct?
>
>
>I have 1 NN (and 1 SNN).
>To be precise, I have 113 32-core machines assigned to run jobs (113*32=3616
>total VCores).
>
>
>> 2. How many applications are running and how many have finished
>> (basically, available in RM)? By 35000, do you mean finished and running
>> applications?
>
>There was 1 application running at that time (with 35000 map tasks).
>
>
>> 3. Do tasks get assigned after some time? Also, is it only this host that
>> is not getting containers, or are no other hosts getting any containers
>> assigned either?
>
>
>This machine was excluded from running tasks for that job. It got tasks
>assigned after almost 1.5 hours, when the first job (during which the machine
>failed) finished and the next job started; see the timestamps:
>
>
>
>2015-09-23 01:06:24,656 INFO [main] nodemanager.NodeStatusUpdaterImpl
>(NodeStatusUpdaterImpl.java:registerWithRM(311)) - Notifying ContainerManager
>to unblock new container-requests
>2015-09-23 02:29:33,301 INFO [Socket Reader #1 for port 10007] ipc.Server
>(Server.java:saslProcess(1316)) - Auth successful for
>appattempt_1441808341485_1975_000001 (auth:SIMPLE)
>
>
>The previous job (during which that node rebooted) did not run any more tasks
>on this host.
>
>
>>
>> I suspect this issue might be similar to YARN-3990, hence the above
>> questions. Further, you can check the RM logs and let us know whether you
>> see logs similar to the ones below:
>> 2015-07-29 19:39:03,416 | INFO | AsyncDispatcher event handler | Size of
>> event-queue is 14000 | AsyncDispatcher.java:235
>> 2015-07-29 19:39:03,417 | INFO | AsyncDispatcher event handler | Size of
>> event-queue is 15000 | AsyncDispatcher.java:235
>
>
>There were 2 of these:
>2015-09-23 00:54:39,623 INFO [AsyncDispatcher event handler]
>event.AsyncDispatcher (AsyncDispatcher.java:handle(235)) - Size of event-queue
>is 1000
>2015-09-23 01:06:24,623 INFO [AsyncDispatcher event handler]
>event.AsyncDispatcher (AsyncDispatcher.java:handle(235)) - Size of event-queue
>is 1000
>
>
>What do these mean?
>
>
>>
>>
>> Regards,
>> + Naga
>>
>>
>> From: Dmitry Sivachenko [[email protected]]
>> Sent: Wednesday, September 23, 2015 03:57
>> To: [email protected]
>> Subject: node remains unused after reboot
>>
>> Hello!
>>
>> I am using hadoop-2.7.1. I have a large map job running (total cores
>> available on the cluster about 3000, total tasks 35000).
>> In the middle of this process one server reboots.
>>
>> After the reboot, the nodemanager starts successfully and registers with
>> the resource manager:
>> 2015-09-23 01:06:24,656 INFO [main] nodemanager.NodeStatusUpdaterImpl
>> (NodeStatusUpdaterImpl.java:registerWithRM(311)) - Notifying
>> ContainerManager to unblock new container-requests
>>
>> In the YARN web interface I see this host as active, but VCores used
>> remains zero (see screenshot).
>> But the map job mentioned is still running and has about 12000 pending
>> tasks.
>>
>> Why does this host not receive tasks to run?
>>
>> PS: I recently upgraded from 2.4.1 and I did not notice such a problem with
>> 2.4.1: new tasks were spawning immediately after reboot.
>>
>> Thanks!
>>
>>
>>
>>
>> <Screen Shot 2015-09-23 at 1.22.10.png>
>
