Re: RECEIVED SIGNAL 15: SIGTERM

2015-07-13 Thread Ewan Higgs

Konstantinos,
Sure, if you have a resource leak then the collector can't free up 
memory and the process will use more memory. Time to break out the 
profiler and see where the memory is going.


The usual suspects are handles to resources (open file streams, sockets, 
etc) kept in containers (arrays, lists, etc). If they're in a container, 
they can't be collected. Another one is keeping handlers in a container 
which may keep an internal handle to an open resource. If the handler 
refers to an open resource and the handler (aka listener, aka observer) 
is in a container, then the underlying resource can't be collected. 
Use a profiler to find out where the memory is going.
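To make it concrete, here's a minimal sketch of the pattern (hypothetical names, nothing from this thread): a stream parked in a long-lived container stays strongly reachable, so neither it nor the handle behind it can ever be collected.

    import java.io.FileInputStream
    import scala.collection.mutable.ArrayBuffer

    object LeakSketch {
      // Long-lived container: anything added here stays strongly reachable.
      val openStreams = ArrayBuffer.empty[FileInputStream]

      // Leaky: the stream is parked in the buffer and never closed, so the
      // collector can never reclaim it or the native file handle behind it.
      def leaky(path: String): Unit =
        openStreams += new FileInputStream(path)

      // Fine: open, use and close in the same scope; nothing keeps a reference.
      def ok(path: String): Int = {
        val in = new FileInputStream(path)
        try in.read() finally in.close()
      }
    }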


FWIW, hitting 1 million or 5 million inodes is going to be a likely 
bottleneck (profile to check). Consider bundling the files up into 
archives that you access together if you find the file system to be a 
bottleneck here.
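As a rough sketch of the bundling idea (the paths are made up, not from this thread), you could read the small files once and write them back out as a SequenceFile, so later jobs hit a handful of large files instead of millions of inodes:

    import org.apache.spark.{SparkConf, SparkContext}

    object BundleSmallFiles {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("bundle-small-files"))

        // (filename, contents) for every small file under the input directory.
        val files = sc.binaryFiles("hdfs:///data/small-files/*")
          .mapValues(_.toArray())   // materialise each file as a byte array

        // One SequenceFile directory instead of millions of tiny files.
        files.saveAsSequenceFile("hdfs:///data/bundled")

        sc.stop()
      }
    }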


HDFS, for example, was designed for larger files. Even if you're not 
using HDFS, millions of small files are kryptonite for parallel file 
systems (Panasas, Lustre, GPFS, etc).


Old Cloudera blog post, but may be relevant here:
http://blog.cloudera.com/blog/2009/02/the-small-files-problem/


-Ewan

On 13/07/15 10:19, Konstantinos Kougios wrote:
I do have other non-xml tasks and I was getting the same SIGTERM on all of them. I think the issue might be due to me processing small files via binaryFiles or wholeTextFiles. Initially I had issues with Xmx memory because I got more than 1 mil files (and on one occasion it was 5 mil files). I sorted that out by processing them in batches of 32k. But then this started happening. I've set the memoryOverhead to 4g for most of the tasks and it is ok now. But 4g is too much for tasks that process small files. I do have 32 threads per executor on some tasks, but 32 MB for stack & thread overhead should do. Maybe the issue is sockets or some memory leak in the network communication.
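A rough sketch of that batching approach (the directory is illustrative; the batch size and the 4g overhead, expressed in MB for spark.yarn.executor.memoryOverhead, are the values mentioned above):

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.{SparkConf, SparkContext}

    object BatchedSmallFiles {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("batched-small-files")
          // Off-heap headroom for the YARN container, in MB.
          .set("spark.yarn.executor.memoryOverhead", "4096")
        val sc = new SparkContext(conf)

        // List the input files on the driver, then process them in fixed-size
        // batches so no single job holds millions of partitions at once.
        val fs = FileSystem.get(sc.hadoopConfiguration)
        val allFiles = fs.listStatus(new Path("hdfs:///data/small-files"))
          .map(_.getPath.toString)

        allFiles.grouped(32000).foreach { batch =>
          val rdd = sc.binaryFiles(batch.mkString(","))
          // ... parse the batch and write results out ...
          println(s"processed batch of ${batch.length} files, ${rdd.count()} records")
        }

        sc.stop()
      }
    }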


On 13/07/15 09:15, Ewan Higgs wrote:

It depends on how large the xml files are and how you're processing them.

If you're using <!ENTITY> declarations then you don't need a very large piece of xml to consume a lot of memory, e.g. the billion laughs xml:

https://en.wikipedia.org/wiki/Billion_laughs
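If the xml can't be trusted, a defensive parser setup helps; a minimal JAXP sketch (standard javax.xml API, not something specific to this thread) that refuses DOCTYPE declarations so entity bombs fail fast:

    import javax.xml.parsers.DocumentBuilderFactory

    object SafeXml {
      // Reject DTDs entirely, so <!ENTITY> expansion bombs like
      // billion laughs fail immediately instead of eating memory.
      def hardenedFactory(): DocumentBuilderFactory = {
        val f = DocumentBuilderFactory.newInstance()
        f.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true)
        f.setExpandEntityReferences(false)
        f.setXIncludeAware(false)
        f
      }

      def parse(path: String) =
        hardenedFactory().newDocumentBuilder().parse(new java.io.File(path))
    }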

-Ewan

On 13/07/15 10:11, Konstantinos Kougios wrote:
it was the memoryOverhead. It runs ok with more of that, but do you know which libraries could affect this? I find it strange that it needs 4g for a task that processes some xml files. The tasks themselves require less Xmx.


Cheers

On 13/07/15 06:29, Jong Wook Kim wrote:

Based on my experience, YARN containers can get SIGTERM when

- it produces too many logs and uses up the hard drive
- it uses more off-heap memory than what is given by the 
spark.yarn.executor.memoryOverhead configuration. It might be due 
to too many classes loaded (less than MaxPermGen but more than 
memoryOverhead), or some other off-heap memory allocated by a 
networking library, etc.
- it opens too many file descriptors, which you can check under 
/proc/<executor JVM's pid>/fd/ on the executor node (see the sketch below)
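A quick way to check the file-descriptor point from inside the executor JVM itself, as a Linux-only sketch (each entry under /proc/self/fd is one open descriptor):

    import java.io.File

    object FdCount {
      // Count the descriptors currently open by this JVM. Returns -1 if
      // /proc isn't available (non-Linux).
      def openFds(): Int =
        Option(new File("/proc/self/fd").list()).map(_.length).getOrElse(-1)

      def main(args: Array[String]): Unit =
        println(s"open file descriptors: ${openFds()}")
    }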


Does any of these apply to your situation?

Jong Wook

On Jul 7, 2015, at 19:16, Kostas Kougios 
kostas.koug...@googlemail.com wrote:


I am still receiving these weird sigterms on the executors. The 
driver claims

it lost the executor, the executor receives a SIGTERM (from whom???)

It doesn't seem a memory-related issue, though increasing memory takes the job a bit further or completes it. But why? There is no memory pressure on either the driver or the executor. And nothing in the logs indicating so.

driver:

15/07/07 10:47:04 INFO scheduler.TaskSetManager: Starting task 
14762.0 in
stage 0.0 (TID 14762, cruncher03.stratified, PROCESS_LOCAL, 13069 
bytes)
15/07/07 10:47:04 INFO scheduler.TaskSetManager: Finished task 
14517.0 in
stage 0.0 (TID 14517) in 15950 ms on cruncher03.stratified 
(14507/42240)
15/07/07 10:47:04 INFO yarn.ApplicationMaster$AMEndpoint: Driver 
terminated

or disconnected! Shutting down. cruncher05.stratified:32976
15/07/07 10:47:04 ERROR cluster.YarnClusterScheduler: Lost 
executor 1 on

cruncher05.stratified: remote Rpc client disassociated
15/07/07 10:47:04 INFO scheduler.TaskSetManager: Re-queueing tasks 
for 1

from TaskSet 0.0
15/07/07 10:47:04 INFO yarn.ApplicationMaster$AMEndpoint: Driver 
terminated

or disconnected! Shutting down. cruncher05.stratified:32976
15/07/07 10:47:04 WARN remote.ReliableDeliverySupervisor: 
Association with
remote system 
[akka.tcp://sparkExecutor@cruncher05.stratified:32976] has
failed, address is now gated for [5000] ms. Reason is: 
[Disassociated].


15/07/07 10:47:04 WARN scheduler.TaskSetManager: Lost task 14591.0 
in stage
0.0 (TID 14591, cruncher05.stratified): ExecutorLostFailure 
(executor 1

lost)

gc log for driver, it doesn't look like it ran out of memory:

2015-07-07T10:45:19.887+0100: [GC (Allocation Failure)
1764131K->1391211K(3393024K), 0.0102839 secs]
2015-07-07T10:46:00.934+0100: [GC (Allocation Failure)

Re: RECEIVED SIGNAL 15: SIGTERM

2015-07-13 Thread Konstantinos Kougios
Yes, YARN was terminating the executor because the off-heap memory limit 
was exceeded.


On 13/07/15 06:55, Ruslan Dautkhanov wrote:

 the executor receives a SIGTERM (from whom???)

From YARN Resource Manager.

Check if YARN fair scheduler preemption and/or speculative execution are 
turned on; if so, it's quite possible and not a bug.
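One way to check the Spark side of this from the job itself, as a sketch (spark.speculation is the standard property; fair-scheduler preemption lives in YARN's own config and isn't shown here):

    import org.apache.spark.SparkConf

    object SpeculationCheck {
      def main(args: Array[String]): Unit = {
        // Picks up whatever spark-submit passed in via --conf / defaults file.
        val conf = new SparkConf()
        // Speculative execution launches duplicate attempts of slow tasks; per
        // the advice above, check it (and YARN preemption) before treating the
        // executor's SIGTERM as a bug.
        println(s"spark.speculation = ${conf.getBoolean("spark.speculation", false)}")
      }
    }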



--
Ruslan Dautkhanov

On Sun, Jul 12, 2015 at 11:29 PM, Jong Wook Kim jongw...@nyu.edu wrote:


Based on my experience, YARN containers can get SIGTERM when

- it produces too many logs and uses up the hard drive
- it uses more off-heap memory than what is given by the
spark.yarn.executor.memoryOverhead configuration. It might be due
to too many classes loaded (less than MaxPermGen but more than
memoryOverhead), or some other off-heap memory allocated by a
networking library, etc.
- it opens too many file descriptors, which you can check under
/proc/<executor JVM's pid>/fd/ on the executor node

Does any of these apply to your situation?

Jong Wook


On Jul 7, 2015, at 19:16, Kostas Kougios
kostas.koug...@googlemail.com wrote:

I am still receiving these weird sigterms on the executors. The
driver claims
it lost the executor, the executor receives a SIGTERM (from whom???)

It doesn't seem a memory-related issue, though increasing memory takes the job a bit further or completes it. But why? There is no memory pressure on either the driver or the executor. And nothing in the logs indicating so.

driver:

15/07/07 10:47:04 INFO scheduler.TaskSetManager: Starting task
14762.0 in
stage 0.0 (TID 14762, cruncher03.stratified, PROCESS_LOCAL, 13069
bytes)
15/07/07 10:47:04 INFO scheduler.TaskSetManager: Finished task
14517.0 in
stage 0.0 (TID 14517) in 15950 ms on cruncher03.stratified
(14507/42240)
15/07/07 10:47:04 INFO yarn.ApplicationMaster$AMEndpoint: Driver
terminated
or disconnected! Shutting down. cruncher05.stratified:32976
15/07/07 10:47:04 ERROR cluster.YarnClusterScheduler: Lost
executor 1 on
cruncher05.stratified: remote Rpc client disassociated
15/07/07 10:47:04 INFO scheduler.TaskSetManager: Re-queueing
tasks for 1
from TaskSet 0.0
15/07/07 10:47:04 INFO yarn.ApplicationMaster$AMEndpoint: Driver
terminated
or disconnected! Shutting down. cruncher05.stratified:32976
15/07/07 10:47:04 WARN remote.ReliableDeliverySupervisor:
Association with
remote system
[akka.tcp://sparkExecutor@cruncher05.stratified:32976] has
failed, address is now gated for [5000] ms. Reason is:
[Disassociated].

15/07/07 10:47:04 WARN scheduler.TaskSetManager: Lost task
14591.0 in stage
0.0 (TID 14591, cruncher05.stratified): ExecutorLostFailure
(executor 1
lost)

gc log for driver, it doesn't look like it ran out of memory:

2015-07-07T10:45:19.887+0100: [GC (Allocation Failure)
1764131K->1391211K(3393024K), 0.0102839 secs]
2015-07-07T10:46:00.934+0100: [GC (Allocation Failure)
1764971K->1391867K(3405312K), 0.0099062 secs]
2015-07-07T10:46:45.252+0100: [GC (Allocation Failure)
1782011K->1392596K(3401216K), 0.0167572 secs]

executor:

15/07/07 10:47:03 INFO executor.Executor: Running task 14750.0 in
stage 0.0
(TID 14750)
15/07/07 10:47:03 INFO spark.CacheManager: Partition
rdd_493_14750 not
found, computing it
15/07/07 10:47:03 ERROR executor.CoarseGrainedExecutorBackend:
RECEIVED
SIGNAL 15: SIGTERM
15/07/07 10:47:03 INFO storage.DiskBlockManager: Shutdown hook called

executor gc log (no out-of-memory, as it seems):
2015-07-07T10:47:02.332+0100: [GC (GCLocker Initiated GC)
24696750K->23712939K(33523712K), 0.0416640 secs]
2015-07-07T10:47:02.598+0100: [GC (GCLocker Initiated GC)
24700520K->23722043K(33523712K), 0.0391156 secs]
2015-07-07T10:47:02.862+0100: [GC (Allocation Failure)
24709182K->23726510K(33518592K), 0.0390784 secs]





--
View this message in context:

http://apache-spark-user-list.1001560.n3.nabble.com/RECEIVED-SIGNAL-15-SIGTERM-tp23668.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org








Re: RECEIVED SIGNAL 15: SIGTERM

2015-07-13 Thread Konstantinos Kougios
it was the memoryOverhead. It runs ok with more of that, but do you know which libraries could affect this? I find it strange that it needs 4g for a task that processes some xml files. The tasks themselves require less Xmx.


Cheers

On 13/07/15 06:29, Jong Wook Kim wrote:

Based on my experience, YARN containers can get SIGTERM when

- it produces too many logs and uses up the hard drive
- it uses more off-heap memory than what is given by the 
spark.yarn.executor.memoryOverhead configuration. It might be due to 
too many classes loaded (less than MaxPermGen but more than 
memoryOverhead), or some other off-heap memory allocated by a networking 
library, etc.
- it opens too many file descriptors, which you can check under 
/proc/<executor JVM's pid>/fd/ on the executor node


Does any of these apply to your situation?

Jong Wook

On Jul 7, 2015, at 19:16, Kostas Kougios 
kostas.koug...@googlemail.com wrote:


I am still receiving these weird sigterms on the executors. The 
driver claims

it lost the executor, the executor receives a SIGTERM (from whom???)

It doesn't seem a memory-related issue, though increasing memory takes the job a bit further or completes it. But why? There is no memory pressure on either the driver or the executor. And nothing in the logs indicating so.

driver:

15/07/07 10:47:04 INFO scheduler.TaskSetManager: Starting task 14762.0 in
stage 0.0 (TID 14762, cruncher03.stratified, PROCESS_LOCAL, 13069 bytes)
15/07/07 10:47:04 INFO scheduler.TaskSetManager: Finished task 14517.0 in
stage 0.0 (TID 14517) in 15950 ms on cruncher03.stratified (14507/42240)
15/07/07 10:47:04 INFO yarn.ApplicationMaster$AMEndpoint: Driver 
terminated

or disconnected! Shutting down. cruncher05.stratified:32976
15/07/07 10:47:04 ERROR cluster.YarnClusterScheduler: Lost executor 1 on
cruncher05.stratified: remote Rpc client disassociated
15/07/07 10:47:04 INFO scheduler.TaskSetManager: Re-queueing tasks for 1
from TaskSet 0.0
15/07/07 10:47:04 INFO yarn.ApplicationMaster$AMEndpoint: Driver 
terminated

or disconnected! Shutting down. cruncher05.stratified:32976
15/07/07 10:47:04 WARN remote.ReliableDeliverySupervisor: Association 
with

remote system [akka.tcp://sparkExecutor@cruncher05.stratified:32976] has
failed, address is now gated for [5000] ms. Reason is: [Disassociated].

15/07/07 10:47:04 WARN scheduler.TaskSetManager: Lost task 14591.0 in 
stage

0.0 (TID 14591, cruncher05.stratified): ExecutorLostFailure (executor 1
lost)

gc log for driver, it doesn't look like it ran out of memory:

2015-07-07T10:45:19.887+0100: [GC (Allocation Failure)
1764131K->1391211K(3393024K), 0.0102839 secs]
2015-07-07T10:46:00.934+0100: [GC (Allocation Failure)
1764971K->1391867K(3405312K), 0.0099062 secs]
2015-07-07T10:46:45.252+0100: [GC (Allocation Failure)
1782011K->1392596K(3401216K), 0.0167572 secs]

executor:

15/07/07 10:47:03 INFO executor.Executor: Running task 14750.0 in 
stage 0.0

(TID 14750)
15/07/07 10:47:03 INFO spark.CacheManager: Partition rdd_493_14750 not
found, computing it
15/07/07 10:47:03 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED
SIGNAL 15: SIGTERM
15/07/07 10:47:03 INFO storage.DiskBlockManager: Shutdown hook called

executor gc log (no out-of-memory, as it seems):
2015-07-07T10:47:02.332+0100: [GC (GCLocker Initiated GC)
24696750K->23712939K(33523712K), 0.0416640 secs]
2015-07-07T10:47:02.598+0100: [GC (GCLocker Initiated GC)
24700520K->23722043K(33523712K), 0.0391156 secs]
2015-07-07T10:47:02.862+0100: [GC (Allocation Failure)
24709182K->23726510K(33518592K), 0.0390784 secs]





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/RECEIVED-SIGNAL-15-SIGTERM-tp23668.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org








Re: RECEIVED SIGNAL 15: SIGTERM

2015-07-13 Thread Konstantinos Kougios
I do have other non-xml tasks and I was getting the same SIGTERM on all 
of them. I think the issue might be due to me processing small files via 
binaryFiles or wholeTextFiles. Initially I had issues with Xmx memory 
because I got more than 1 mil files (and on one occasion it was 5 mil 
files). I sorted that out by processing them in batches of 32k. But then 
this started happening. I've set the memoryOverhead to 4g for most of 
the tasks and it is ok now. But 4g is too much for tasks that process 
small files. I do have 32 threads per executor on some tasks, but 32 MB 
for stack & thread overhead should do. Maybe the issue is sockets or 
some memory leak in the network communication.


On 13/07/15 09:15, Ewan Higgs wrote:

It depends on how large the xml files are and how you're processing them.

If you're using <!ENTITY> declarations then you don't need a very large piece of 
xml to consume a lot of memory, e.g. the billion laughs xml:

https://en.wikipedia.org/wiki/Billion_laughs

-Ewan

On 13/07/15 10:11, Konstantinos Kougios wrote:
it was the memoryOverhead. It runs ok with more of that, but do you 
know which libraries could affect this? I find it strange that it 
needs 4g for a task that processes some xml files. The tasks themselves 
require less Xmx.


Cheers

On 13/07/15 06:29, Jong Wook Kim wrote:

Based on my experience, YARN containers can get SIGTERM when

- it produces too many logs and uses up the hard drive
- it uses more off-heap memory than what is given by the 
spark.yarn.executor.memoryOverhead configuration. It might be due to 
too many classes loaded (less than MaxPermGen but more than 
memoryOverhead), or some other off-heap memory allocated by a 
networking library, etc.
- it opens too many file descriptors, which you can check under 
/proc/<executor JVM's pid>/fd/ on the executor node


Does any of these apply to your situation?

Jong Wook

On Jul 7, 2015, at 19:16, Kostas Kougios 
kostas.koug...@googlemail.com wrote:


I am still receiving these weird sigterms on the executors. The 
driver claims

it lost the executor, the executor receives a SIGTERM (from whom???)

It doesn't seem a memory-related issue, though increasing memory takes the job a bit further or completes it. But why? There is no memory pressure on either the driver or the executor. And nothing in the logs indicating so.

driver:

15/07/07 10:47:04 INFO scheduler.TaskSetManager: Starting task 
14762.0 in
stage 0.0 (TID 14762, cruncher03.stratified, PROCESS_LOCAL, 13069 
bytes)
15/07/07 10:47:04 INFO scheduler.TaskSetManager: Finished task 
14517.0 in
stage 0.0 (TID 14517) in 15950 ms on cruncher03.stratified 
(14507/42240)
15/07/07 10:47:04 INFO yarn.ApplicationMaster$AMEndpoint: Driver 
terminated

or disconnected! Shutting down. cruncher05.stratified:32976
15/07/07 10:47:04 ERROR cluster.YarnClusterScheduler: Lost executor 
1 on

cruncher05.stratified: remote Rpc client disassociated
15/07/07 10:47:04 INFO scheduler.TaskSetManager: Re-queueing tasks 
for 1

from TaskSet 0.0
15/07/07 10:47:04 INFO yarn.ApplicationMaster$AMEndpoint: Driver 
terminated

or disconnected! Shutting down. cruncher05.stratified:32976
15/07/07 10:47:04 WARN remote.ReliableDeliverySupervisor: 
Association with
remote system 
[akka.tcp://sparkExecutor@cruncher05.stratified:32976] has

failed, address is now gated for [5000] ms. Reason is: [Disassociated].

15/07/07 10:47:04 WARN scheduler.TaskSetManager: Lost task 14591.0 
in stage

0.0 (TID 14591, cruncher05.stratified): ExecutorLostFailure (executor 1
lost)

gc log for driver, it doesn't look like it ran out of memory:

2015-07-07T10:45:19.887+0100: [GC (Allocation Failure)
1764131K->1391211K(3393024K), 0.0102839 secs]
2015-07-07T10:46:00.934+0100: [GC (Allocation Failure)
1764971K->1391867K(3405312K), 0.0099062 secs]
2015-07-07T10:46:45.252+0100: [GC (Allocation Failure)
1782011K->1392596K(3401216K), 0.0167572 secs]

executor:

15/07/07 10:47:03 INFO executor.Executor: Running task 14750.0 in 
stage 0.0

(TID 14750)
15/07/07 10:47:03 INFO spark.CacheManager: Partition rdd_493_14750 not
found, computing it
15/07/07 10:47:03 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED
SIGNAL 15: SIGTERM
15/07/07 10:47:03 INFO storage.DiskBlockManager: Shutdown hook called

executor gc log (no out-of-memory, as it seems):
2015-07-07T10:47:02.332+0100: [GC (GCLocker Initiated GC)
24696750K->23712939K(33523712K), 0.0416640 secs]
2015-07-07T10:47:02.598+0100: [GC (GCLocker Initiated GC)
24700520K->23722043K(33523712K), 0.0391156 secs]
2015-07-07T10:47:02.862+0100: [GC (Allocation Failure)
24709182K->23726510K(33518592K), 0.0390784 secs]





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/RECEIVED-SIGNAL-15-SIGTERM-tp23668.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

For additional commands, 

Re: RECEIVED SIGNAL 15: SIGTERM

2015-07-12 Thread Jong Wook Kim
Based on my experience, YARN containers can get SIGTERM when 

- it produces too many logs and uses up the hard drive
- it uses more off-heap memory than what is given by the 
spark.yarn.executor.memoryOverhead configuration. It might be due to too many 
classes loaded (less than MaxPermGen but more than memoryOverhead), or some 
other off-heap memory allocated by a networking library, etc.
- it opens too many file descriptors, which you can check under 
/proc/<executor JVM's pid>/fd/ on the executor node

Does any of these apply to your situation?

Jong Wook

 On Jul 7, 2015, at 19:16, Kostas Kougios kostas.koug...@googlemail.com 
 wrote:
 
 I am still receiving these weird sigterms on the executors. The driver claims
 it lost the executor, the executor receives a SIGTERM (from whom???)
 
 It doesn't seem a memory-related issue, though increasing memory takes the
 job a bit further or completes it. But why? There is no memory pressure on
 either the driver or the executor. And nothing in the logs indicating so.
 
 driver:
 
 15/07/07 10:47:04 INFO scheduler.TaskSetManager: Starting task 14762.0 in
 stage 0.0 (TID 14762, cruncher03.stratified, PROCESS_LOCAL, 13069 bytes)
 15/07/07 10:47:04 INFO scheduler.TaskSetManager: Finished task 14517.0 in
 stage 0.0 (TID 14517) in 15950 ms on cruncher03.stratified (14507/42240)
 15/07/07 10:47:04 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated
 or disconnected! Shutting down. cruncher05.stratified:32976
 15/07/07 10:47:04 ERROR cluster.YarnClusterScheduler: Lost executor 1 on
 cruncher05.stratified: remote Rpc client disassociated
 15/07/07 10:47:04 INFO scheduler.TaskSetManager: Re-queueing tasks for 1
 from TaskSet 0.0
 15/07/07 10:47:04 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated
 or disconnected! Shutting down. cruncher05.stratified:32976
 15/07/07 10:47:04 WARN remote.ReliableDeliverySupervisor: Association with
 remote system [akka.tcp://sparkExecutor@cruncher05.stratified:32976] has
 failed, address is now gated for [5000] ms. Reason is: [Disassociated].
 
 15/07/07 10:47:04 WARN scheduler.TaskSetManager: Lost task 14591.0 in stage
 0.0 (TID 14591, cruncher05.stratified): ExecutorLostFailure (executor 1
 lost)
 
 gc log for driver, it doesn't look like it ran out of memory:
 
 2015-07-07T10:45:19.887+0100: [GC (Allocation Failure) 
 1764131K->1391211K(3393024K), 0.0102839 secs]
 2015-07-07T10:46:00.934+0100: [GC (Allocation Failure) 
 1764971K->1391867K(3405312K), 0.0099062 secs]
 2015-07-07T10:46:45.252+0100: [GC (Allocation Failure) 
 1782011K->1392596K(3401216K), 0.0167572 secs]
 
 executor:
 
 15/07/07 10:47:03 INFO executor.Executor: Running task 14750.0 in stage 0.0
 (TID 14750)
 15/07/07 10:47:03 INFO spark.CacheManager: Partition rdd_493_14750 not
 found, computing it
 15/07/07 10:47:03 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED
 SIGNAL 15: SIGTERM
 15/07/07 10:47:03 INFO storage.DiskBlockManager: Shutdown hook called
 
 executor gc log (no out-of-memory, as it seems):
 2015-07-07T10:47:02.332+0100: [GC (GCLocker Initiated GC) 
 24696750K->23712939K(33523712K), 0.0416640 secs]
 2015-07-07T10:47:02.598+0100: [GC (GCLocker Initiated GC) 
 24700520K->23722043K(33523712K), 0.0391156 secs]
 2015-07-07T10:47:02.862+0100: [GC (Allocation Failure) 
 24709182K->23726510K(33518592K), 0.0390784 secs]
 
 
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/RECEIVED-SIGNAL-15-SIGTERM-tp23668.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 



Re: RECEIVED SIGNAL 15: SIGTERM

2015-07-12 Thread Ruslan Dautkhanov
 the executor receives a SIGTERM (from whom???)

From YARN Resource Manager.

Check if YARN fair scheduler preemption and/or speculative execution are
turned on; if so, it's quite possible and not a bug.



-- 
Ruslan Dautkhanov

On Sun, Jul 12, 2015 at 11:29 PM, Jong Wook Kim jongw...@nyu.edu wrote:

 Based on my experience, YARN containers can get SIGTERM when

 - it produces too many logs and uses up the hard drive
 - it uses more off-heap memory than what is given by the
 spark.yarn.executor.memoryOverhead configuration. It might be due to too
 many classes loaded (less than MaxPermGen but more than memoryOverhead), or
 some other off-heap memory allocated by a networking library, etc.
 - it opens too many file descriptors, which you can check under
 /proc/<executor JVM's pid>/fd/ on the executor node

 Does any of these apply to your situation?

 Jong Wook

 On Jul 7, 2015, at 19:16, Kostas Kougios kostas.koug...@googlemail.com
 wrote:

 I am still receiving these weird sigterms on the executors. The driver
 claims
 it lost the executor, the executor receives a SIGTERM (from whom???)

 It doesn't seem a memory-related issue, though increasing memory takes the
 job a bit further or completes it. But why? There is no memory pressure on
 either the driver or the executor. And nothing in the logs indicating so.

 driver:

 15/07/07 10:47:04 INFO scheduler.TaskSetManager: Starting task 14762.0 in
 stage 0.0 (TID 14762, cruncher03.stratified, PROCESS_LOCAL, 13069 bytes)
 15/07/07 10:47:04 INFO scheduler.TaskSetManager: Finished task 14517.0 in
 stage 0.0 (TID 14517) in 15950 ms on cruncher03.stratified (14507/42240)
 15/07/07 10:47:04 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated
 or disconnected! Shutting down. cruncher05.stratified:32976
 15/07/07 10:47:04 ERROR cluster.YarnClusterScheduler: Lost executor 1 on
 cruncher05.stratified: remote Rpc client disassociated
 15/07/07 10:47:04 INFO scheduler.TaskSetManager: Re-queueing tasks for 1
 from TaskSet 0.0
 15/07/07 10:47:04 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated
 or disconnected! Shutting down. cruncher05.stratified:32976
 15/07/07 10:47:04 WARN remote.ReliableDeliverySupervisor: Association with
 remote system [akka.tcp://sparkExecutor@cruncher05.stratified:32976] has
 failed, address is now gated for [5000] ms. Reason is: [Disassociated].

 15/07/07 10:47:04 WARN scheduler.TaskSetManager: Lost task 14591.0 in stage
 0.0 (TID 14591, cruncher05.stratified): ExecutorLostFailure (executor 1
 lost)

 gc log for driver, it doesn't look like it ran out of memory:

 2015-07-07T10:45:19.887+0100: [GC (Allocation Failure)
 1764131K->1391211K(3393024K), 0.0102839 secs]
 2015-07-07T10:46:00.934+0100: [GC (Allocation Failure)
 1764971K->1391867K(3405312K), 0.0099062 secs]
 2015-07-07T10:46:45.252+0100: [GC (Allocation Failure)
 1782011K->1392596K(3401216K), 0.0167572 secs]

 executor:

 15/07/07 10:47:03 INFO executor.Executor: Running task 14750.0 in stage 0.0
 (TID 14750)
 15/07/07 10:47:03 INFO spark.CacheManager: Partition rdd_493_14750 not
 found, computing it
 15/07/07 10:47:03 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED
 SIGNAL 15: SIGTERM
 15/07/07 10:47:03 INFO storage.DiskBlockManager: Shutdown hook called

 executor gc log (no out-of-memory, as it seems):
 2015-07-07T10:47:02.332+0100: [GC (GCLocker Initiated GC)
 24696750K->23712939K(33523712K), 0.0416640 secs]
 2015-07-07T10:47:02.598+0100: [GC (GCLocker Initiated GC)
 24700520K->23722043K(33523712K), 0.0391156 secs]
 2015-07-07T10:47:02.862+0100: [GC (Allocation Failure)
 24709182K->23726510K(33518592K), 0.0390784 secs]





 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/RECEIVED-SIGNAL-15-SIGTERM-tp23668.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org