Hi Vinod,

Thanks for the link. I went through it, and it looks like the OOM killer picks the process with the highest oom_score. I have tried to capture the oom_score of all the YARN daemon processes after each run of my application. The first time I captured these details, I see that the NameNode was killed whereas the NodeManager had the highest score. So I don't know if it is really the OOM killer that killed it!
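The per-daemon scores in the attachment can be captured with a small loop over /proc. This is a minimal sketch; the name/PID pairs are hardcoded from the attached log and would need to match the live daemons on your box (e.g. from jps):

```shell
#!/bin/sh
# Print the current oom_score of each daemon, or note that it is gone.
# PIDs below are from the attached run log; substitute your own.
for entry in "NameNode:9813" "DataNode:9927" "ResourceManager:10270"; do
  name=${entry%%:*}
  pid=${entry##*:}
  if [ -r "/proc/$pid/oom_score" ]; then
    printf '%s (pid %s): oom_score %s\n' "$name" "$pid" "$(cat "/proc/$pid/oom_score")"
  else
    printf '%s (pid %s): process gone\n' "$name" "$pid"
  fi
done
```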
Please see the output of my run attached, which also has the output of the free command after each run. The free output doesn't show any exhaustion of system memory either. One more thing I have done today is add audit rules for each of the daemons to capture all their system calls, and in the audit log I see futex() calls occurring in the killed daemon processes. I don't know whether that call causes the daemon to die, or why it happens.

Thanks,
Kishore

On Wed, Dec 18, 2013 at 12:31 AM, Vinod Kumar Vavilapalli <vino...@hortonworks.com> wrote:

> That's good info. It is more than likely that it is the OOM killer. See
> http://stackoverflow.com/questions/726690/who-killed-my-process-and-why for
> example.
>
> Thanks,
> +Vinod
>
> On Dec 17, 2013, at 1:26 AM, Krishna Kishore Bonagiri <write2kish...@gmail.com> wrote:
>
> Hi Jeff,
>
> I have run the resource manager in the foreground without nohup, and here
> are the messages when it was killed. It says it was "Killed" but doesn't say
> why!
>
> 13/12/17 03:14:54 INFO capacity.CapacityScheduler: Application
> appattempt_1387266015651_0258_000001 released container
> container_1387266015651_0258_01_000003 on node: host: isredeng:36576
> #containers=2 available=7936 used=256 with event: FINISHED
> 13/12/17 03:14:54 INFO rmcontainer.RMContainerImpl:
> container_1387266015651_0258_01_000005 Container Transitioned from ACQUIRED
> to RUNNING
> Killed
>
> Thanks,
> Kishore
>
> On Mon, Dec 16, 2013 at 11:10 PM, Jeff Stuckman <stuck...@umd.edu> wrote:
>
>> What if you open the daemons in a "screen" session rather than running
>> them in the background -- for example, run "yarn resourcemanager". Then you
>> can see exactly when they terminate, and hopefully why.
>>
>> *From:* Krishna Kishore Bonagiri
>> *Sent:* Monday, December 16, 2013 6:20 AM
>> *To:* user@hadoop.apache.org
>> *Reply To:* user@hadoop.apache.org
>> *Subject:* Re: Yarn -- one of the daemons getting killed
>>
>> Hi Vinod,
>>
>> Yes, I am running on Linux.
>>
>> I was actually searching for a corresponding message in
>> /var/log/messages to confirm that OOM killed my daemons, but could not find
>> any such messages there! According to the following link, if it is a memory
>> issue I should see a message even if the OOM killer is disabled, but I
>> don't see one.
>>
>> http://www.redhat.com/archives/taroon-list/2007-August/msg00006.html
>>
>> And is memory consumption higher in the case of a two-node cluster than a
>> single-node one? Also, I see this problem only when I give "*" as the node
>> name.
>>
>> One other thing I suspected was the allowed number of user processes, so
>> I increased that to 31000 from 1024, but that didn't help either.
>>
>> Thanks,
>> Kishore
>>
>> On Fri, Dec 13, 2013 at 11:51 PM, Vinod Kumar Vavilapalli <vino...@hortonworks.com> wrote:
>>
>>> Yes, that is what I suspect. That is why I asked if everything is on a
>>> single node. If you are running Linux, the Linux OOM killer may be shooting
>>> things down. When it happens, you will see something like "killed process"
>>> in the system's syslog.
>>>
>>> Thanks,
>>> +Vinod
>>>
>>> On Dec 13, 2013, at 4:52 AM, Krishna Kishore Bonagiri <write2kish...@gmail.com> wrote:
>>>
>>> Vinod,
>>>
>>> One more thing I observed is that my Client, which submits Application
>>> Masters one after another continuously, also gets killed sometimes. So it
>>> is always one of the Java processes that gets killed. Does that indicate
>>> some excessive memory usage by them, or something like that, that is
>>> causing them to die? If so, how can we resolve this kind of issue?
>>>
>>> Thanks,
>>> Kishore
>>>
>>> On Fri, Dec 13, 2013 at 10:16 AM, Krishna Kishore Bonagiri <write2kish...@gmail.com> wrote:
>>>
>>>> No, I am running on a 2-node cluster.
>>>>
>>>> On Fri, Dec 13, 2013 at 1:52 AM, Vinod Kumar Vavilapalli <vino...@hortonworks.com> wrote:
>>>>
>>>>> Is all of this on a single node?
>>>>>
>>>>> Thanks,
>>>>> +Vinod
>>>>>
>>>>> On Dec 12, 2013, at 3:26 AM, Krishna Kishore Bonagiri <write2kish...@gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>> I am running a small application on YARN (2.2.0) in a loop of 500
>>>>> iterations, and while doing so one of the daemons -- the NodeManager, the
>>>>> ResourceManager, or the DataNode -- gets killed (I mean it disappears) at
>>>>> a random point. I see no information in the corresponding log files. How
>>>>> can I find out why this is happening?
>>>>>
>>>>> And one more observation: this happens only when I use "*" for the node
>>>>> name in the container requests; when I used a specific node name,
>>>>> everything was fine.
>>>>>
>>>>> Thanks,
>>>>> Kishore
>>>>>
>>>>> CONFIDENTIALITY NOTICE
>>>>> NOTICE: This message is intended for the use of the individual or
>>>>> entity to which it is addressed and may contain information that is
>>>>> confidential, privileged and exempt from disclosure under applicable law.
>>>>> If the reader of this message is not the intended recipient, you are
>>>>> hereby notified that any printing, copying, dissemination, distribution,
>>>>> disclosure or forwarding of this communication is strictly prohibited. If
>>>>> you have received this communication in error, please contact the sender
>>>>> immediately and delete it from your system. Thank You.
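To confirm or rule out the OOM killer independently of the oom_score readings, the kernel log is the place to look; a quick sketch of the checks (the log path assumes a RHEL-style syslog, so adjust for your distro):

```shell
# An OOM kill leaves kernel-log lines like "Out of memory: Kill process <pid> ...".
dmesg | grep -iE 'out of memory|killed process' \
  || echo 'no OOM entries in dmesg'
grep -i 'killed process' /var/log/messages 2>/dev/null \
  || echo 'no OOM entries in /var/log/messages'

# The per-user process limit discussed above (the one raised from 1024 to 31000):
ulimit -u
```

If both greps come up empty right after a daemon disappears, the OOM killer is unlikely to be the culprit, which would point back at a JVM-level or external kill.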
Attachment: daemon PIDs, free output, and oom_scores after each run

Daemon PIDs:
  NameNode            9813
  DataNode            9927
  Secondary NameNode  10121
  ResourceManager     10270
  NodeManager         10385
  YarnClient          12312

free output after each run (kB; Mem total 6126376, shared 0, and Swap 2064376 total / 0 used were constant throughout):

Run  Mem used  Mem free  buffers  cached   -/+ used  -/+ free
  1  3318456   2807920   472732   1026524  1819200   4307176
  2  3239696   2886680   472740   1026476  1740480   4385896
  3  3243988   2882388   472748   1026548  1744692   4381684
  4  3332568   2793808   472756   1026792  1833020   4293356
  5  3335824   2790552   472768   1026900  1836156   4290220
  6  3336588   2789788   472776   1027036  1836776   4289600
  7  3265408   2860968   472784   1027020  1765604   4360772
  8  3362320   2764056   472824   1027332  1862164   4264212
  9  3282480   2843896   472828   1027284  1782368   4344008
 10  3293224   2833152   472828   1027372  1793024   4333352
 11  3378036   2748340   472840   1027628  1877568   4248808
 12  3294520   2831856   472840   1027552  1794128   4332248
 13  3379516   2746860   472844   1027788  1878884   4247492
 14  3382160   2744216   472844   1027948  1881368   4245008
 15  3381820   2744556   472848   1028028  1880944   4245432
 16  3298772   2827604   472852   1027984  1797936   4328440
 17  3298608   2827768   472852   1028052  1797704   4328672
 18  3382572   2743804   472852   1028284  1881436   4244940
 19  3300952   2825424   472856   1028220  1799876   4326500
 20  3383520   2742856   472860   1028464  1882196   4244180
 21  3301672   2824704   472864   1028420  1800388   4325988
 22  3386056   2740320   472872   1028656  1884528   4241848
 23  3303488   2822888   472872   1028604  1802012   4324364
 24  3119852   3006524   472876   1028680  1618296   4508080
 25  3116444   3009932   472980   1028968  1614496   4511880
 26  3128948   2997428   473076   1029272  1626600   4499776
 27  (remainder of the attachment truncated)

oom_scores after each run (constant within the run ranges shown):

Runs   ResourceManager  NodeManager  DataNode  NameNode  Secondary NN  YarnClient
 1-15  586795           587731       315495    316132    312514        581112
16-23  586795           294077       315495    316132    312514        581112
24     586795           294077       315495    (gone)*   312514        581112
25     586795           294077       315533    (gone)*   312514        581112
26     293397           294077       315533    (gone)*   312514        581112

* From Run 24 onward the NameNode's score could not be read:
  cat: /proc/9813/oom_score: No such file or directory
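As a sanity check on the attachment, the "-/+ buffers/cache" row printed by free is just Mem used minus buffers minus cached (the cache and buffers are reclaimable, so they don't count as real pressure); Run 1's numbers reproduce it:

```shell
# Run 1 values from the attachment, in kB.
used=3318456
buffers=472732
cached=1026524

# free's "-/+ buffers/cache: used" column excludes page cache and buffers,
# which the kernel can reclaim under memory pressure.
app_used=$((used - buffers - cached))
echo "application-used memory: ${app_used} kB"
```

This matches the 1819200 kB reported for Run 1, and with over 4 GB reclaimable on every run, the free output is consistent with no system-wide memory exhaustion.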