Hi Vinod,
 Thanks for the link. I went through it, and it looks like the OOM killer
picks the process with the highest oom_score. I have tried to capture the
oom_score for all the YARN daemon processes after each run of my
application. The first time I captured these details, I saw that the
name node was killed even though the Node Manager had the highest score. So I
don't know if it is really the OOM killer that killed it!
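The capture step was along these lines (a sketch: the PIDs are the ones printed at the top of the attached output; substitute your own, e.g. from 'jps'):

```shell
#!/bin/sh
# After each run, dump the kernel's oom_score for every daemon PID.
# A process that no longer has a /proc entry has died.
for pid in 9813 9927 10121 10270 10385 12312; do
  if [ -r "/proc/$pid/oom_score" ]; then
    echo "$pid: oom_score=$(cat /proc/$pid/oom_score)"
  else
    echo "$pid: gone (no /proc/$pid/oom_score)"
  fi
done
```

Note that oom_score only says which process the OOM killer *would* prefer if it ran; it does not prove that it ran.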

 Please see the output of my runs attached, which also includes the output
of the free command after each run. The free output doesn't show any
exhaustion of system memory either.

Also, one more thing I have done today: I added audit rules for
each of the daemons to capture all their system calls. In the audit log,
I see the futex() system call occurring in the killed daemon processes. I don't
know whether it could cause the daemon to die, or why that call happens...
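For reference, the audit setup was roughly as follows (a sketch, not the exact rules: it needs root, the PID is a placeholder, and the -k key is just a label I chose; one rule per daemon):

```shell
# Record every syscall made by one daemon process (here, a
# hypothetical NodeManager PID), tagged with a search key:
auditctl -a always,exit -S all -F pid=10385 -k yarn-nodemanager

# After the daemon dies, pull its last recorded syscalls by key:
ausearch -k yarn-nodemanager | tail -50
```

(futex() showing up in such a trace is expected for any multithreaded JVM process, so on its own it is not evidence of the cause of death.)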


Thanks,
Kishore


On Wed, Dec 18, 2013 at 12:31 AM, Vinod Kumar Vavilapalli <
vino...@hortonworks.com> wrote:

> That's good info. It is more than likely that it is the OOM killer. See
> http://stackoverflow.com/questions/726690/who-killed-my-process-and-why
> for example.
>
> Thanks,
> +Vinod
>
> On Dec 17, 2013, at 1:26 AM, Krishna Kishore Bonagiri <
> write2kish...@gmail.com> wrote:
>
> Hi Jeff,
>
> I have run the resource manager in the foreground without nohup, and here
> are the messages when it was killed; it says "Killed" but doesn't say
> why!
>
> 13/12/17 03:14:54 INFO capacity.CapacityScheduler: Application
> appattempt_1387266015651_0258_000001 released container
> container_1387266015651_0258_01_000003 on node: host: isredeng:36576
> #containers=2 available=7936 used=256 with event: FINISHED
> 13/12/17 03:14:54 INFO rmcontainer.RMContainerImpl:
> container_1387266015651_0258_01_000005 Container Transitioned from ACQUIRED
> to RUNNING
> Killed
>
>
> Thanks,
> Kishore
>
>
> On Mon, Dec 16, 2013 at 11:10 PM, Jeff Stuckman <stuck...@umd.edu> wrote:
>
>>  What if you open the daemons in a "screen" session rather than running
>> them in the background -- for example, run "yarn resourcemanager". Then you
>> can see exactly when they terminate, and hopefully why.
>>
>>    *From: *Krishna Kishore Bonagiri
>> *Sent: *Monday, December 16, 2013 6:20 AM
>> *To: *user@hadoop.apache.org
>> *Reply To: *user@hadoop.apache.org
>> *Subject: *Re: Yarn -- one of the daemons getting killed
>>
>>  Hi Vinod,
>>
>>   Yes, I am running on Linux.
>>
>> I was actually searching for a corresponding message in
>> /var/log/messages to confirm that OOM killed my daemons, but could not find
>> any corresponding messages there! According to the following link, it looks
>> like if it is a memory issue, I should see a message even if OOM is
>> disabled, but I don't see it.
>>
>>  http://www.redhat.com/archives/taroon-list/2007-August/msg00006.html
>>
>>    Also, is memory consumption higher on a two-node cluster than on a
>> single-node one? I see this problem only when I give "*" as the node
>> name.
>>
>>    One other thing I suspected was the allowed number of user processes;
>> I increased it from 1024 to 31000, but that didn't help either.
>>
>>  Thanks,
>> Kishore
>>
>>
>> On Fri, Dec 13, 2013 at 11:51 PM, Vinod Kumar Vavilapalli <
>> vino...@hortonworks.com> wrote:
>>
>>> Yes, that is what I suspect. That is why I asked if everything is on a
>>> single node. If you are running Linux, the Linux OOM killer may be shooting
>>> things down. When it happens, you will see something like "killed process"
>>> in the system's syslog.
>>>
>>>    Thanks,
>>> +Vinod
>>>
>>>  On Dec 13, 2013, at 4:52 AM, Krishna Kishore Bonagiri <
>>> write2kish...@gmail.com> wrote:
>>>
>>>  Vinod,
>>>
>>>   One more thing I observed is that my Client, which submits Application
>>> Masters one after another continuously, also gets killed sometimes. So it is
>>> always one of the Java processes that gets killed. Does that indicate
>>> excessive memory usage by them, or something like that, which is causing
>>> them to die? If so, how can we resolve this kind of issue?
>>>
>>>  Thanks,
>>> Kishore
>>>
>>>
>>> On Fri, Dec 13, 2013 at 10:16 AM, Krishna Kishore Bonagiri <
>>> write2kish...@gmail.com> wrote:
>>>
>>>> No, I am running on 2 node cluster.
>>>>
>>>>
>>>> On Fri, Dec 13, 2013 at 1:52 AM, Vinod Kumar Vavilapalli <
>>>> vino...@hortonworks.com> wrote:
>>>>
>>>>> Is all of this on a single node?
>>>>>
>>>>>   Thanks,
>>>>> +Vinod
>>>>>
>>>>>  On Dec 12, 2013, at 3:26 AM, Krishna Kishore Bonagiri <
>>>>> write2kish...@gmail.com> wrote:
>>>>>
>>>>>  Hi,
>>>>>   I am running a small application on YARN (2.2.0) in a loop of 500
>>>>> times, and while doing so one of the daemons (node manager, resource
>>>>> manager, or data node) is getting killed (I mean, disappearing) at a random
>>>>> point. I see no information in the corresponding log files. How can I find
>>>>> out why this is happening?
>>>>>
>>>>>   And, one more observation: this is happening only when I am
>>>>> using "*" for the node name in the container requests; when I use a
>>>>> specific node name, everything is fine.
>>>>>
>>>>>  Thanks,
>>>>> Kishore
>>>>>
>>>>>
>>>>>
>>>>> CONFIDENTIALITY NOTICE
>>>>> NOTICE: This message is intended for the use of the individual or
>>>>> entity to which it is addressed and may contain information that is
>>>>> confidential, privileged and exempt from disclosure under applicable law.
>>>>> If the reader of this message is not the intended recipient, you are 
>>>>> hereby
>>>>> notified that any printing, copying, dissemination, distribution,
>>>>> disclosure or forwarding of this communication is strictly prohibited. If
>>>>> you have received this communication in error, please contact the sender
>>>>> immediately and delete it from your system. Thank You.
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>
>
>
Pid of name node is  9813
Pid of data node is  9927
Pid of secondary name node is  10121
Pid of resourcemanager is  10270
Pid of nodemanager is  10385
Pid of YarnClient is  12312
Run 1
 


             total       used       free     shared    buffers     cached
Mem:       6126376    3318456    2807920          0     472732    1026524
-/+ buffers/cache:    1819200    4307176
Swap:      2064376          0    2064376

oom_scores are 
ResourceManager's:
586795
NodeManager's:
587731
DataNode's:
315495
NameNode's:
316132
Secondary Name Node's
312514
Yarn Client's:
581112
Run 2
 


             total       used       free     shared    buffers     cached
Mem:       6126376    3239696    2886680          0     472740    1026476
-/+ buffers/cache:    1740480    4385896
Swap:      2064376          0    2064376

oom_scores are 
ResourceManager's:
586795
NodeManager's:
587731
DataNode's:
315495
NameNode's:
316132
Secondary Name Node's
312514
Yarn Client's:
581112
Run 3
 


             total       used       free     shared    buffers     cached
Mem:       6126376    3243988    2882388          0     472748    1026548
-/+ buffers/cache:    1744692    4381684
Swap:      2064376          0    2064376

oom_scores are 
ResourceManager's:
586795
NodeManager's:
587731
DataNode's:
315495
NameNode's:
316132
Secondary Name Node's
312514
Yarn Client's:
581112
Run 4
 


             total       used       free     shared    buffers     cached
Mem:       6126376    3332568    2793808          0     472756    1026792
-/+ buffers/cache:    1833020    4293356
Swap:      2064376          0    2064376

oom_scores are 
ResourceManager's:
586795
NodeManager's:
587731
DataNode's:
315495
NameNode's:
316132
Secondary Name Node's
312514
Yarn Client's:
581112
Run 5
 


             total       used       free     shared    buffers     cached
Mem:       6126376    3335824    2790552          0     472768    1026900
-/+ buffers/cache:    1836156    4290220
Swap:      2064376          0    2064376

oom_scores are 
ResourceManager's:
586795
NodeManager's:
587731
DataNode's:
315495
NameNode's:
316132
Secondary Name Node's
312514
Yarn Client's:
581112
Run 6
 


             total       used       free     shared    buffers     cached
Mem:       6126376    3336588    2789788          0     472776    1027036
-/+ buffers/cache:    1836776    4289600
Swap:      2064376          0    2064376

oom_scores are 
ResourceManager's:
586795
NodeManager's:
587731
DataNode's:
315495
NameNode's:
316132
Secondary Name Node's
312514
Yarn Client's:
581112
Run 7
 


             total       used       free     shared    buffers     cached
Mem:       6126376    3265408    2860968          0     472784    1027020
-/+ buffers/cache:    1765604    4360772
Swap:      2064376          0    2064376

oom_scores are 
ResourceManager's:
586795
NodeManager's:
587731
DataNode's:
315495
NameNode's:
316132
Secondary Name Node's
312514
Yarn Client's:
581112
Run 8
 


             total       used       free     shared    buffers     cached
Mem:       6126376    3362320    2764056          0     472824    1027332
-/+ buffers/cache:    1862164    4264212
Swap:      2064376          0    2064376

oom_scores are 
ResourceManager's:
586795
NodeManager's:
587731
DataNode's:
315495
NameNode's:
316132
Secondary Name Node's
312514
Yarn Client's:
581112
Run 9
 


             total       used       free     shared    buffers     cached
Mem:       6126376    3282480    2843896          0     472828    1027284
-/+ buffers/cache:    1782368    4344008
Swap:      2064376          0    2064376

oom_scores are 
ResourceManager's:
586795
NodeManager's:
587731
DataNode's:
315495
NameNode's:
316132
Secondary Name Node's
312514
Yarn Client's:
581112
Run 10
 


             total       used       free     shared    buffers     cached
Mem:       6126376    3293224    2833152          0     472828    1027372
-/+ buffers/cache:    1793024    4333352
Swap:      2064376          0    2064376

oom_scores are 
ResourceManager's:
586795
NodeManager's:
587731
DataNode's:
315495
NameNode's:
316132
Secondary Name Node's
312514
Yarn Client's:
581112
Run 11
 


             total       used       free     shared    buffers     cached
Mem:       6126376    3378036    2748340          0     472840    1027628
-/+ buffers/cache:    1877568    4248808
Swap:      2064376          0    2064376

oom_scores are 
ResourceManager's:
586795
NodeManager's:
587731
DataNode's:
315495
NameNode's:
316132
Secondary Name Node's
312514
Yarn Client's:
581112
Run 12
 


             total       used       free     shared    buffers     cached
Mem:       6126376    3294520    2831856          0     472840    1027552
-/+ buffers/cache:    1794128    4332248
Swap:      2064376          0    2064376

oom_scores are 
ResourceManager's:
586795
NodeManager's:
587731
DataNode's:
315495
NameNode's:
316132
Secondary Name Node's
312514
Yarn Client's:
581112
Run 13
 


             total       used       free     shared    buffers     cached
Mem:       6126376    3379516    2746860          0     472844    1027788
-/+ buffers/cache:    1878884    4247492
Swap:      2064376          0    2064376

oom_scores are 
ResourceManager's:
586795
NodeManager's:
587731
DataNode's:
315495
NameNode's:
316132
Secondary Name Node's
312514
Yarn Client's:
581112
Run 14
 


             total       used       free     shared    buffers     cached
Mem:       6126376    3382160    2744216          0     472844    1027948
-/+ buffers/cache:    1881368    4245008
Swap:      2064376          0    2064376

oom_scores are 
ResourceManager's:
586795
NodeManager's:
587731
DataNode's:
315495
NameNode's:
316132
Secondary Name Node's
312514
Yarn Client's:
581112
Run 15
 


             total       used       free     shared    buffers     cached
Mem:       6126376    3381820    2744556          0     472848    1028028
-/+ buffers/cache:    1880944    4245432
Swap:      2064376          0    2064376

oom_scores are 
ResourceManager's:
586795
NodeManager's:
587731
DataNode's:
315495
NameNode's:
316132
Secondary Name Node's
312514
Yarn Client's:
581112
Run 16
 


             total       used       free     shared    buffers     cached
Mem:       6126376    3298772    2827604          0     472852    1027984
-/+ buffers/cache:    1797936    4328440
Swap:      2064376          0    2064376

oom_scores are 
ResourceManager's:
586795
NodeManager's:
294077
DataNode's:
315495
NameNode's:
316132
Secondary Name Node's
312514
Yarn Client's:
581112
Run 17
 


             total       used       free     shared    buffers     cached
Mem:       6126376    3298608    2827768          0     472852    1028052
-/+ buffers/cache:    1797704    4328672
Swap:      2064376          0    2064376

oom_scores are 
ResourceManager's:
586795
NodeManager's:
294077
DataNode's:
315495
NameNode's:
316132
Secondary Name Node's
312514
Yarn Client's:
581112
Run 18
 


             total       used       free     shared    buffers     cached
Mem:       6126376    3382572    2743804          0     472852    1028284
-/+ buffers/cache:    1881436    4244940
Swap:      2064376          0    2064376

oom_scores are 
ResourceManager's:
586795
NodeManager's:
294077
DataNode's:
315495
NameNode's:
316132
Secondary Name Node's
312514
Yarn Client's:
581112
Run 19
 


             total       used       free     shared    buffers     cached
Mem:       6126376    3300952    2825424          0     472856    1028220
-/+ buffers/cache:    1799876    4326500
Swap:      2064376          0    2064376

oom_scores are 
ResourceManager's:
586795
NodeManager's:
294077
DataNode's:
315495
NameNode's:
316132
Secondary Name Node's
312514
Yarn Client's:
581112
Run 20
 


             total       used       free     shared    buffers     cached
Mem:       6126376    3383520    2742856          0     472860    1028464
-/+ buffers/cache:    1882196    4244180
Swap:      2064376          0    2064376

oom_scores are 
ResourceManager's:
586795
NodeManager's:
294077
DataNode's:
315495
NameNode's:
316132
Secondary Name Node's
312514
Yarn Client's:
581112
Run 21
 


             total       used       free     shared    buffers     cached
Mem:       6126376    3301672    2824704          0     472864    1028420
-/+ buffers/cache:    1800388    4325988
Swap:      2064376          0    2064376

oom_scores are 
ResourceManager's:
586795
NodeManager's:
294077
DataNode's:
315495
NameNode's:
316132
Secondary Name Node's
312514
Yarn Client's:
581112
Run 22
 


             total       used       free     shared    buffers     cached
Mem:       6126376    3386056    2740320          0     472872    1028656
-/+ buffers/cache:    1884528    4241848
Swap:      2064376          0    2064376

oom_scores are 
ResourceManager's:
586795
NodeManager's:
294077
DataNode's:
315495
NameNode's:
316132
Secondary Name Node's
312514
Yarn Client's:
581112
Run 23
 


             total       used       free     shared    buffers     cached
Mem:       6126376    3303488    2822888          0     472872    1028604
-/+ buffers/cache:    1802012    4324364
Swap:      2064376          0    2064376

oom_scores are 
ResourceManager's:
586795
NodeManager's:
294077
DataNode's:
315495
NameNode's:
316132
Secondary Name Node's
312514
Yarn Client's:
581112
Run 24
 


             total       used       free     shared    buffers     cached
Mem:       6126376    3119852    3006524          0     472876    1028680
-/+ buffers/cache:    1618296    4508080
Swap:      2064376          0    2064376

oom_scores are 
ResourceManager's:
586795
NodeManager's:
294077
DataNode's:
315495
NameNode's:
cat: /proc/9813/oom_score: No such file or directory
Secondary Name Node's
312514
Yarn Client's:
581112
Run 25
 


             total       used       free     shared    buffers     cached
Mem:       6126376    3116444    3009932          0     472980    1028968
-/+ buffers/cache:    1614496    4511880
Swap:      2064376          0    2064376

oom_scores are 
ResourceManager's:
586795
NodeManager's:
294077
DataNode's:
315533
NameNode's:
cat: /proc/9813/oom_score: No such file or directory
Secondary Name Node's
312514
Yarn Client's:
581112
Run 26
 


             total       used       free     shared    buffers     cached
Mem:       6126376    3128948    2997428          0     473076    1029272
-/+ buffers/cache:    1626600    4499776
Swap:      2064376          0    2064376

oom_scores are 
ResourceManager's:
293397
NodeManager's:
294077
DataNode's:
315533
NameNode's:
cat: /proc/9813/oom_score: No such file or directory
Secondary Name Node's
312514
Yarn Client's:
581112
Run 27
 

