Re: Spark executor killed without apparent reason

2016-03-03 Thread Saisai Shao
If it is due to a heartbeat problem and the driver explicitly killed the
executors, there should be some driver log entries mentioning it, so you
could check the driver log. The container (executor) logs are also useful:
if the container was killed, there will be some signal-related log entries,
like SIGTERM.
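As a rough sketch (not something from this thread), a SparkListener
registered on the driver can also surface the removal reason directly,
since SparkListenerExecutorRemoved carries the reason string the scheduler
saw:

import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorRemoved}

// Minimal listener: log every executor removal with the scheduler's reason,
// e.g. a heartbeat timeout or a YARN-initiated kill. The class name is made
// up for illustration.
class ExecutorRemovalLogger extends SparkListener {
  override def onExecutorRemoved(removed: SparkListenerExecutorRemoved): Unit = {
    println(s"Executor ${removed.executorId} removed at ${removed.time}: ${removed.reason}")
  }
}

// Usage on the driver, given an existing SparkContext `sc`:
// sc.addSparkListener(new ExecutorRemovalLogger)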

On Thu, Mar 3, 2016 at 4:00 PM, Nirav Patel  wrote:

> There was nothing in the nodemanager logs that indicated why the container
> was killed.
>
> Here's my guess: since the killed executors were experiencing heavy GC
> activity (full GCs) before they died, they most likely failed to respond to
> heartbeats to the driver or the nodemanager and were killed because of it.
>
> This is more relevant to the issue:
> 16/02/24 11:11:47 ERROR server.TransportChannelHandler: Connection to
> maprnode5 has been quiet for 12 ms while there are outstanding
> requests. Assuming connection is dead; please adjust spark.network.timeout
> if this is wrong.
>
> The following has nothing to do with this; it was raised because I manually
> killed the application at some point after too many executors had been killed.
> " ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM"
>
> Thanks
>
> On Wed, Mar 2, 2016 at 8:22 AM, Nirav Patel  wrote:
>
>> I think that was due to manually killing the application. ExecutorLost
>> started around 04:46:21 and the application was manually killed around
>> 05:54:41.
>>
>>
>> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01
>> 05:54:41,500 INFO
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>> Stopping container with container Id:
>> *container_1456722312951_0450_01_01*
>>
>> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01
>> 05:54:41,500 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger:
>> USER=xactly IP=10.250.70.119 OPERATION=Stop Container Request
>> TARGET=ContainerManageImpl RESULT=SUCCESS
>> APPID=application_1456722312951_0450 CONTAINERID=
>> *container_1456722312951_0450_01_01*
>>
>> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01
>> 05:54:41,500 INFO
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>> Container *container_1456722312951_0450_01_01* transitioned from
>> RUNNING to KILLING
>>
>> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01
>> 05:54:41,501 INFO
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>> Cleaning up container *container_1456722312951_0450_01_01*
>>
>> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01
>> 05:54:41,507 WARN
>> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit
>> code from container *container_1456722312951_0450_01_01* is : 143
>>
>> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01
>> 05:54:41,520 INFO
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>> Container *container_1456722312951_0450_01_01* transitioned from
>> KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
>>
>> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01
>> 05:54:41,557 INFO
>> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor:
>> Deleting absolute path :
>> /tmp/hadoop-xactly/nm-local-dir/usercache/xactly/appcache/application_1456722312951_0450/
>> *container_1456722312951_0450_01_01*
>>
>> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01
>> 05:54:41,558 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger:
>> USER=xactly OPERATION=Container Finished - Killed TARGET=ContainerImpl
>> RESULT=SUCCESS APPID=application_1456722312951_0450 CONTAINERID=
>> *container_1456722312951_0450_01_01*
>>
>> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01
>> 05:54:41,558 INFO
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>> Container *container_1456722312951_0450_01_01* transitioned from
>> CONTAINER_CLEANEDUP_AFTER_KILL to DONE
>>
>> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01
>> 05:54:41,566 INFO
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
>> Removing *container_1456722312951_0450_01_01* from application
>> application_1456722312951_0450
>>
>> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01
>> 05:54:41,567 INFO
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
>> Considering container *container_1456722312951_0450_01_01* for
>> log-aggregation
>>
>> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01
>> 05:54:41,567 INFO org.apache.spark.network.yarn.YarnShuffleService:
>> Stopping container *container_1456722312951_0450_01_01*
>>
>> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01
>> 05:54:42,504 INFO
>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: 

Re: Spark executor killed without apparent reason

2016-03-03 Thread Nirav Patel
There was nothing in the nodemanager logs that indicated why the container
was killed.

Here's my guess: since the killed executors were experiencing heavy GC
activity (full GCs) before they died, they most likely failed to respond to
heartbeats to the driver or the nodemanager and were killed because of it.

This is more relevant to the issue:
16/02/24 11:11:47 ERROR server.TransportChannelHandler: Connection to
maprnode5 has been quiet for 12 ms while there are outstanding
requests. Assuming connection is dead; please adjust spark.network.timeout
if this is wrong.
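
If GC pauses really are behind the missed heartbeats, the knobs involved
would be spark.network.timeout, spark.executor.heartbeatInterval and
spark.yarn.executor.memoryOverhead; a rough sketch, with purely illustrative
values rather than anything tested here:

import org.apache.spark.SparkConf

// Illustrative values only; the right numbers depend on the workload.
val conf = new SparkConf()
  .set("spark.network.timeout", "600s")               // default 120s
  .set("spark.executor.heartbeatInterval", "30s")     // default 10s
  .set("spark.yarn.executor.memoryOverhead", "3072")  // in MB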

The following has nothing to do with this; it was raised because I manually
killed the application at some point after too many executors had been killed.
" ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM"

Thanks

On Wed, Mar 2, 2016 at 8:22 AM, Nirav Patel  wrote:

> I think that was due to manually killing the application. ExecutorLost
> started around 04:46:21 and the application was manually killed around
> 05:54:41.
>
>
> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01
> 05:54:41,500 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
> Stopping container with container Id:
> *container_1456722312951_0450_01_01*
>
> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01
> 05:54:41,500 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger:
> USER=xactly IP=10.250.70.119 OPERATION=Stop Container Request
> TARGET=ContainerManageImpl RESULT=SUCCESS
> APPID=application_1456722312951_0450 CONTAINERID=
> *container_1456722312951_0450_01_01*
>
> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01
> 05:54:41,500 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
> Container *container_1456722312951_0450_01_01* transitioned from
> RUNNING to KILLING
>
> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01
> 05:54:41,501 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
> Cleaning up container *container_1456722312951_0450_01_01*
>
> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01
> 05:54:41,507 WARN
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit
> code from container *container_1456722312951_0450_01_01* is : 143
>
> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01
> 05:54:41,520 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
> Container *container_1456722312951_0450_01_01* transitioned from
> KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
>
> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01
> 05:54:41,557 INFO
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor:
> Deleting absolute path :
> /tmp/hadoop-xactly/nm-local-dir/usercache/xactly/appcache/application_1456722312951_0450/
> *container_1456722312951_0450_01_01*
>
> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01
> 05:54:41,558 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger:
> USER=xactly OPERATION=Container Finished - Killed TARGET=ContainerImpl
> RESULT=SUCCESS APPID=application_1456722312951_0450 CONTAINERID=
> *container_1456722312951_0450_01_01*
>
> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01
> 05:54:41,558 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
> Container *container_1456722312951_0450_01_01* transitioned from
> CONTAINER_CLEANEDUP_AFTER_KILL to DONE
>
> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01
> 05:54:41,566 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
> Removing *container_1456722312951_0450_01_01* from application
> application_1456722312951_0450
>
> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01
> 05:54:41,567 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
> Considering container *container_1456722312951_0450_01_01* for
> log-aggregation
>
> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01
> 05:54:41,567 INFO org.apache.spark.network.yarn.YarnShuffleService:
> Stopping container *container_1456722312951_0450_01_01*
>
> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01
> 05:54:42,504 INFO
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed
> completed containers from NM context:
> [container_1456722312951_0450_01_27,
> container_1456722312951_0450_01_30,
> *container_1456722312951_0450_01_01*]
>
> yarn-xactly-nodemanager-hdn4.xactlycorporation.local.log:2016-03-01
> 05:54:42,529 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
> Uploading logs for container *container_1456722312951_0450_01_01*.
> Current good log dirs are 

Re: Spark executor killed without apparent reason

2016-03-01 Thread Ted Yu
Using pastebin seems to be better.

The attachment may not go through.

FYI

On Tue, Mar 1, 2016 at 6:07 PM, Jeff Zhang  wrote:

> Do you mind attaching the whole yarn app log?
>
> On Wed, Mar 2, 2016 at 10:03 AM, Nirav Patel 
> wrote:
>
>> Hi Ryan,
>>
>> I did search for "OutOfMemoryError" earlier and again just now, but it
>> doesn't indicate anything else.
>>
>> Another thing is that the job fails at the "saveAsHadoopDataset" call on a
>> huge RDD. Most of the executors fail at this stage. I don't understand that
>> either, because it should be a straightforward write to the filesystem. All
>> the complex joins and logic were in previous stages, which all ran
>> successfully.
>>
>> Thanks
>> Nirav
>>
>> On Wed, Mar 2, 2016 at 2:22 AM, Shixiong(Ryan) Zhu <
>> shixi...@databricks.com> wrote:
>>
>>> Could you search for "OutOfMemoryError" in the executor logs? It could be
>>> "OutOfMemoryError: Direct buffer memory" or something else.
>>>
>>> On Tue, Mar 1, 2016 at 6:23 AM, Nirav Patel 
>>> wrote:
>>>
 Hi,

 We are using Spark 1.5.2 on YARN. We have a Spark application using about
 15 GB of executor memory and 1500 MB of memory overhead. However, at a
 certain stage we notice high GC time (almost the same as the task time).
 These executors are bound to get killed at some point. However, the
 nodemanager and resource manager logs don't indicate a failure due to
 'beyond physical/virtual memory limits', nor do I see any 'heap space' or
 'gc overhead exceeded' errors in the executor logs. Some of these high-GC
 executors get killed eventually, but I can't seem to find the reason. Based
 on the application logs, it seems the executor didn't respond to the driver
 for a long period of time and the connection was reset.

 Following are logs from 'yarn logs -applicationId appId_1232_xxx'


 16/02/24 11:09:47 INFO executor.Executor: Finished task 224.0 in stage
 8.0 (TID 15318). 2099 bytes result sent to driver
 16/02/24 11:09:47 INFO executor.CoarseGrainedExecutorBackend: Got
 assigned task 15333
 16/02/24 11:09:47 INFO executor.Executor: Running task 239.0 in stage
 8.0 (TID 15333)
 16/02/24 11:09:47 INFO storage.ShuffleBlockFetcherIterator: Getting 125
 non-empty blocks out of 3007 blocks
 16/02/24 11:09:47 INFO storage.ShuffleBlockFetcherIterator: Started 14
 remote fetches in 10 ms
 16/02/24 11:11:47 ERROR server.TransportChannelHandler: Connection to
 maprnode5 has been quiet for 12 ms while there are outstanding
 requests. Assuming connection is dead; please adjust spark.network.timeout
 if this is wrong.
 16/02/24 11:11:47 ERROR client.TransportResponseHandler: Still have 1
 requests outstanding when connection from maprnode5 is closed
 16/02/24 11:11:47 ERROR shuffle.OneForOneBlockFetcher: Failed while
 starting block fetches
 java.io.IOException: Connection from maprnode5 closed
 at
 org.apache.spark.network.client.TransportResponseHandler.channelUnregistered(TransportResponseHandler.java:104)
 at
 org.apache.spark.network.server.TransportChannelHandler.channelUnregistered(TransportChannelHandler.java:91)
 at
 io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
 at
 io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
 at
 io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
 at
 io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
 at
 io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
 at
 io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
 at
 io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
 at
 io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
 at
 io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
 at
 io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
 at
 io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
 at
 io.netty.channel.DefaultChannelPipeline.fireChannelUnregistered(DefaultChannelPipeline.java:739)
 at
 io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:659)
 at
 io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
 

Re: Spark executor killed without apparent reason

2016-03-01 Thread Jeff Zhang
Do you mind attaching the whole yarn app log?

On Wed, Mar 2, 2016 at 10:03 AM, Nirav Patel  wrote:

> Hi Ryan,
>
> I did search for "OutOfMemoryError" earlier and again just now, but it
> doesn't indicate anything else.
>
> Another thing is that the job fails at the "saveAsHadoopDataset" call on a
> huge RDD. Most of the executors fail at this stage. I don't understand that
> either, because it should be a straightforward write to the filesystem. All
> the complex joins and logic were in previous stages, which all ran
> successfully.
>
> Thanks
> Nirav
>
> On Wed, Mar 2, 2016 at 2:22 AM, Shixiong(Ryan) Zhu <
> shixi...@databricks.com> wrote:
>
>> Could you search for "OutOfMemoryError" in the executor logs? It could be
>> "OutOfMemoryError: Direct buffer memory" or something else.
>>
>> On Tue, Mar 1, 2016 at 6:23 AM, Nirav Patel 
>> wrote:
>>
>>> Hi,
>>>
>>> We are using Spark 1.5.2 on YARN. We have a Spark application using about
>>> 15 GB of executor memory and 1500 MB of memory overhead. However, at a
>>> certain stage we notice high GC time (almost the same as the task time).
>>> These executors are bound to get killed at some point. However, the
>>> nodemanager and resource manager logs don't indicate a failure due to
>>> 'beyond physical/virtual memory limits', nor do I see any 'heap space' or
>>> 'gc overhead exceeded' errors in the executor logs. Some of these high-GC
>>> executors get killed eventually, but I can't seem to find the reason. Based
>>> on the application logs, it seems the executor didn't respond to the driver
>>> for a long period of time and the connection was reset.
>>>
>>> Following are logs from 'yarn logs -applicationId appId_1232_xxx'
>>>
>>>
>>> 16/02/24 11:09:47 INFO executor.Executor: Finished task 224.0 in stage
>>> 8.0 (TID 15318). 2099 bytes result sent to driver
>>> 16/02/24 11:09:47 INFO executor.CoarseGrainedExecutorBackend: Got
>>> assigned task 15333
>>> 16/02/24 11:09:47 INFO executor.Executor: Running task 239.0 in stage
>>> 8.0 (TID 15333)
>>> 16/02/24 11:09:47 INFO storage.ShuffleBlockFetcherIterator: Getting 125
>>> non-empty blocks out of 3007 blocks
>>> 16/02/24 11:09:47 INFO storage.ShuffleBlockFetcherIterator: Started 14
>>> remote fetches in 10 ms
>>> 16/02/24 11:11:47 ERROR server.TransportChannelHandler: Connection to
>>> maprnode5 has been quiet for 12 ms while there are outstanding
>>> requests. Assuming connection is dead; please adjust spark.network.timeout
>>> if this is wrong.
>>> 16/02/24 11:11:47 ERROR client.TransportResponseHandler: Still have 1
>>> requests outstanding when connection from maprnode5 is closed
>>> 16/02/24 11:11:47 ERROR shuffle.OneForOneBlockFetcher: Failed while
>>> starting block fetches
>>> java.io.IOException: Connection from maprnode5 closed
>>> at
>>> org.apache.spark.network.client.TransportResponseHandler.channelUnregistered(TransportResponseHandler.java:104)
>>> at
>>> org.apache.spark.network.server.TransportChannelHandler.channelUnregistered(TransportChannelHandler.java:91)
>>> at
>>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
>>> at
>>> io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
>>> at
>>> io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
>>> at
>>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
>>> at
>>> io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
>>> at
>>> io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
>>> at
>>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
>>> at
>>> io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
>>> at
>>> io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
>>> at
>>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
>>> at
>>> io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
>>> at
>>> io.netty.channel.DefaultChannelPipeline.fireChannelUnregistered(DefaultChannelPipeline.java:739)
>>> at
>>> io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:659)
>>> at
>>> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
>>> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
>>> at
>>> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>>> at java.lang.Thread.run(Thread.java:744)
>>> 

Re: Spark executor killed without apparent reason

2016-03-01 Thread Nirav Patel
Hi Ryan,

I did search for "OutOfMemoryError" earlier and again just now, but it
doesn't indicate anything else.

Another thing is that the job fails at the "saveAsHadoopDataset" call on a
huge RDD. Most of the executors fail at this stage. I don't understand that
either, because it should be a straightforward write to the filesystem. All
the complex joins and logic were in previous stages, which all ran
successfully.
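
Note that even though saveAsHadoopDataset itself only writes, its stage
still has to fetch the shuffle output of the earlier joins, which is where
the fetch errors in the logs show up. The "Retrying fetch (1/3) ... after
5000 ms" lines correspond to the default shuffle retry settings; a rough
sketch of loosening them (illustrative values, not a confirmed fix):

import org.apache.spark.SparkConf

// Illustrative only: give shuffle fetches more headroom while executors sit
// in long GC pauses.
val conf = new SparkConf()
  .set("spark.shuffle.io.maxRetries", "10")  // default 3  -> the "(1/3)" in the log
  .set("spark.shuffle.io.retryWait", "15s")  // default 5s -> the "5000 ms" in the log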

Thanks
Nirav

On Wed, Mar 2, 2016 at 2:22 AM, Shixiong(Ryan) Zhu 
wrote:

> Could you search for "OutOfMemoryError" in the executor logs? It could be
> "OutOfMemoryError: Direct buffer memory" or something else.
>
> On Tue, Mar 1, 2016 at 6:23 AM, Nirav Patel  wrote:
>
>> Hi,
>>
>> We are using Spark 1.5.2 on YARN. We have a Spark application using about
>> 15 GB of executor memory and 1500 MB of memory overhead. However, at a
>> certain stage we notice high GC time (almost the same as the task time).
>> These executors are bound to get killed at some point. However, the
>> nodemanager and resource manager logs don't indicate a failure due to
>> 'beyond physical/virtual memory limits', nor do I see any 'heap space' or
>> 'gc overhead exceeded' errors in the executor logs. Some of these high-GC
>> executors get killed eventually, but I can't seem to find the reason. Based
>> on the application logs, it seems the executor didn't respond to the driver
>> for a long period of time and the connection was reset.
>>
>> Following are logs from 'yarn logs -applicationId appId_1232_xxx'
>>
>>
>> 16/02/24 11:09:47 INFO executor.Executor: Finished task 224.0 in stage
>> 8.0 (TID 15318). 2099 bytes result sent to driver
>> 16/02/24 11:09:47 INFO executor.CoarseGrainedExecutorBackend: Got
>> assigned task 15333
>> 16/02/24 11:09:47 INFO executor.Executor: Running task 239.0 in stage 8.0
>> (TID 15333)
>> 16/02/24 11:09:47 INFO storage.ShuffleBlockFetcherIterator: Getting 125
>> non-empty blocks out of 3007 blocks
>> 16/02/24 11:09:47 INFO storage.ShuffleBlockFetcherIterator: Started 14
>> remote fetches in 10 ms
>> 16/02/24 11:11:47 ERROR server.TransportChannelHandler: Connection to
>> maprnode5 has been quiet for 12 ms while there are outstanding
>> requests. Assuming connection is dead; please adjust spark.network.timeout
>> if this is wrong.
>> 16/02/24 11:11:47 ERROR client.TransportResponseHandler: Still have 1
>> requests outstanding when connection from maprnode5 is closed
>> 16/02/24 11:11:47 ERROR shuffle.OneForOneBlockFetcher: Failed while
>> starting block fetches
>> java.io.IOException: Connection from maprnode5 closed
>> at
>> org.apache.spark.network.client.TransportResponseHandler.channelUnregistered(TransportResponseHandler.java:104)
>> at
>> org.apache.spark.network.server.TransportChannelHandler.channelUnregistered(TransportChannelHandler.java:91)
>> at
>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
>> at
>> io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
>> at
>> io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
>> at
>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
>> at
>> io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
>> at
>> io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
>> at
>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
>> at
>> io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
>> at
>> io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
>> at
>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
>> at
>> io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
>> at
>> io.netty.channel.DefaultChannelPipeline.fireChannelUnregistered(DefaultChannelPipeline.java:739)
>> at
>> io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:659)
>> at
>> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
>> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
>> at
>> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>> at java.lang.Thread.run(Thread.java:744)
>> 16/02/24 11:11:47 INFO shuffle.RetryingBlockFetcher: Retrying fetch (1/3)
>> for 6 outstanding blocks after 5000 ms
>> 16/02/24 11:11:52 INFO client.TransportClientFactory: Found inactive
>> connection to maprnode5, creating a new 

Re: Spark executor killed without apparent reason

2016-03-01 Thread Shixiong(Ryan) Zhu
Could you search for "OutOfMemoryError" in the executor logs? It could be
"OutOfMemoryError: Direct buffer memory" or something else.

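If it does turn out to be "Direct buffer memory", one knob to look at (a
rough sketch only, not a confirmed fix) is the JVM's direct-memory cap for
the executors, since the Netty transport allocates off-heap buffers:

import org.apache.spark.SparkConf

// Illustrative only: size the off-heap direct-buffer pool explicitly instead
// of relying on the JVM default (roughly equal to -Xmx).
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions", "-XX:MaxDirectMemorySize=2g")
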
On Tue, Mar 1, 2016 at 6:23 AM, Nirav Patel  wrote:

> Hi,
>
> We are using Spark 1.5.2 on YARN. We have a Spark application using about
> 15 GB of executor memory and 1500 MB of memory overhead. However, at a
> certain stage we notice high GC time (almost the same as the task time).
> These executors are bound to get killed at some point. However, the
> nodemanager and resource manager logs don't indicate a failure due to
> 'beyond physical/virtual memory limits', nor do I see any 'heap space' or
> 'gc overhead exceeded' errors in the executor logs. Some of these high-GC
> executors get killed eventually, but I can't seem to find the reason. Based
> on the application logs, it seems the executor didn't respond to the driver
> for a long period of time and the connection was reset.
>
> Following are logs from 'yarn logs -applicationId appId_1232_xxx'
>
>
> 16/02/24 11:09:47 INFO executor.Executor: Finished task 224.0 in stage 8.0
> (TID 15318). 2099 bytes result sent to driver
> 16/02/24 11:09:47 INFO executor.CoarseGrainedExecutorBackend: Got assigned
> task 15333
> 16/02/24 11:09:47 INFO executor.Executor: Running task 239.0 in stage 8.0
> (TID 15333)
> 16/02/24 11:09:47 INFO storage.ShuffleBlockFetcherIterator: Getting 125
> non-empty blocks out of 3007 blocks
> 16/02/24 11:09:47 INFO storage.ShuffleBlockFetcherIterator: Started 14
> remote fetches in 10 ms
> 16/02/24 11:11:47 ERROR server.TransportChannelHandler: Connection to
> maprnode5 has been quiet for 12 ms while there are outstanding
> requests. Assuming connection is dead; please adjust spark.network.timeout
> if this is wrong.
> 16/02/24 11:11:47 ERROR client.TransportResponseHandler: Still have 1
> requests outstanding when connection from maprnode5 is closed
> 16/02/24 11:11:47 ERROR shuffle.OneForOneBlockFetcher: Failed while
> starting block fetches
> java.io.IOException: Connection from maprnode5 closed
> at
> org.apache.spark.network.client.TransportResponseHandler.channelUnregistered(TransportResponseHandler.java:104)
> at
> org.apache.spark.network.server.TransportChannelHandler.channelUnregistered(TransportChannelHandler.java:91)
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
> at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
> at
> io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
> at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
> at
> io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
> at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
> at
> io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
> at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
> at
> io.netty.channel.DefaultChannelPipeline.fireChannelUnregistered(DefaultChannelPipeline.java:739)
> at
> io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:659)
> at
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
> at
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> at java.lang.Thread.run(Thread.java:744)
> 16/02/24 11:11:47 INFO shuffle.RetryingBlockFetcher: Retrying fetch (1/3)
> for 6 outstanding blocks after 5000 ms
> 16/02/24 11:11:52 INFO client.TransportClientFactory: Found inactive
> connection to maprnode5, creating a new one.
> 16/02/24 11:12:16 WARN server.TransportChannelHandler: Exception in
> connection from maprnode5
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> at sun.nio.ch.IOUtil.read(IOUtil.java:192)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
> at
> 

Spark executor killed without apparent reason

2016-03-01 Thread Nirav Patel
Hi,

We are using Spark 1.5.2 on YARN. We have a Spark application using about
15 GB of executor memory and 1500 MB of memory overhead. However, at a
certain stage we notice high GC time (almost the same as the task time).
These executors are bound to get killed at some point. However, the
nodemanager and resource manager logs don't indicate a failure due to
'beyond physical/virtual memory limits', nor do I see any 'heap space' or
'gc overhead exceeded' errors in the executor logs. Some of these high-GC
executors get killed eventually, but I can't seem to find the reason. Based
on the application logs, it seems the executor didn't respond to the driver
for a long period of time and the connection was reset.
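
For reference, the setup described above roughly corresponds to a
configuration like the following (a sketch only; the GC-logging flags are an
illustrative addition to confirm the long full-GC pauses in the executor
stdout):

import org.apache.spark.SparkConf

// Sketch of the described setup; values and GC-logging flags are illustrative.
val conf = new SparkConf()
  .set("spark.executor.memory", "15g")
  .set("spark.yarn.executor.memoryOverhead", "1500")  // in MB
  .set("spark.executor.extraJavaOptions",
    "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")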

Following are logs from 'yarn logs -applicationId appId_1232_xxx'


16/02/24 11:09:47 INFO executor.Executor: Finished task 224.0 in stage 8.0
(TID 15318). 2099 bytes result sent to driver
16/02/24 11:09:47 INFO executor.CoarseGrainedExecutorBackend: Got assigned
task 15333
16/02/24 11:09:47 INFO executor.Executor: Running task 239.0 in stage 8.0
(TID 15333)
16/02/24 11:09:47 INFO storage.ShuffleBlockFetcherIterator: Getting 125
non-empty blocks out of 3007 blocks
16/02/24 11:09:47 INFO storage.ShuffleBlockFetcherIterator: Started 14
remote fetches in 10 ms
16/02/24 11:11:47 ERROR server.TransportChannelHandler: Connection to
maprnode5 has been quiet for 12 ms while there are outstanding
requests. Assuming connection is dead; please adjust spark.network.timeout
if this is wrong.
16/02/24 11:11:47 ERROR client.TransportResponseHandler: Still have 1
requests outstanding when connection from maprnode5 is closed
16/02/24 11:11:47 ERROR shuffle.OneForOneBlockFetcher: Failed while
starting block fetches
java.io.IOException: Connection from maprnode5 closed
at
org.apache.spark.network.client.TransportResponseHandler.channelUnregistered(TransportResponseHandler.java:104)
at
org.apache.spark.network.server.TransportChannelHandler.channelUnregistered(TransportChannelHandler.java:91)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
at
io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
at
io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
at
io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
at
io.netty.channel.DefaultChannelPipeline.fireChannelUnregistered(DefaultChannelPipeline.java:739)
at
io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:659)
at
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
at
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:744)
16/02/24 11:11:47 INFO shuffle.RetryingBlockFetcher: Retrying fetch (1/3)
for 6 outstanding blocks after 5000 ms
16/02/24 11:11:52 INFO client.TransportClientFactory: Found inactive
connection to maprnode5, creating a new one.
16/02/24 11:12:16 WARN server.TransportChannelHandler: Exception in
connection from maprnode5
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at
io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
at
io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at
io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
at
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at