Re: Accessing log for lost executors

2016-12-02 Thread Benyi Wang
Usually your executors were killed by YARN for exceeding memory limits. You
can check the NodeManager's log to see whether your application got killed, or
use the command "yarn logs -applicationId <application ID>" to download the logs.

On Thu, Dec 1, 2016 at 10:30 PM, Nisrina Luthfiyati <
nisrina.luthfiy...@gmail.com> wrote:

> Hi all,
>
> I'm trying to troubleshoot an ExecutorLostFailure issue.
> In the Spark UI I noticed that the executors tab only lists active executors. Is
> there any way I can see the logs for dead executors so that I can find
> out why they are dead/lost?
> I'm using Spark 1.5.2 on YARN 2.7.1.
>
> Thanks!
> Nisrina
>


Accessing log for lost executors

2016-12-01 Thread Nisrina Luthfiyati
Hi all,

I'm trying to troubleshoot an ExecutorLostFailure issue.
In the Spark UI I noticed that the executors tab only lists active executors. Is
there any way I can see the logs for dead executors so that I can find
out why they are dead/lost?
I'm using Spark 1.5.2 on YARN 2.7.1.

Thanks!
Nisrina


Lost executors failed job unable to execute spark examples Triangle Count (Analytics triangles)

2016-02-16 Thread Ovidiu-Cristian MARCU
Hi,

I am able to run the Triangle Count example with some smaller graphs, but when I
use http://snap.stanford.edu/data/com-Friendster.html I am not able to get the
job to finish successfully. For some reason Spark loses its executors.
No matter how I configure Spark (1.5) I just receive errors; the last
configuration I’ve used ran for some time and then gave executor-lost errors.
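(For context on the timeouts in the errors below: in Spark 1.5 the executor
heartbeat timeout and spark.rpc.askTimeout both fall back to spark.network.timeout,
so one illustrative knob is, e.g. via spark-submit:

  --conf spark.network.timeout=300s

The 300s value is only an example; if executors are dying from memory pressure,
raising timeouts alone will not fix the job.)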

Some exceptions/errors I got:

ERROR LiveListenerBus: Listener JobProgressListener threw an exception
java.lang.NullPointerException
at 
org.apache.spark.ui.jobs.JobProgressListener$$anonfun$onTaskEnd$1.apply(JobProgressListener.scala:362)
at 
org.apache.spark.ui.jobs.JobProgressListener$$anonfun$onTaskEnd$1.apply(JobProgressListener.scala:361)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45)
at 
org.apache.spark.ui.jobs.JobProgressListener.onTaskEnd(JobProgressListener.scala:361)
at 
org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
at 
org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at 
org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at 
org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:56)
at 
org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:79)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1136)
at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)

ERROR MapOutputTracker: Missing an output location for shuffle 1

ERROR TaskSchedulerImpl: Lost executor 29 on 172.16.96.49: worker lost
ERROR TaskSchedulerImpl: Lost executor 24 on 172.16.96.39: worker lost

16/02/16 12:41:47 WARN HeartbeatReceiver: Removing executor 8 with no recent 
heartbeats: 168312 ms exceeds timeout 120000 ms
16/02/16 12:41:47 ERROR TaskSchedulerImpl: Lost executor 8 on 172.16.96.53: 
Executor heartbeat timed out after 168312 ms

16/02/16 12:41:47 ERROR TaskSchedulerImpl: Lost executor 9 on 172.16.96.9: 
Executor heartbeat timed out after 163671 ms

16/02/16 12:53:53 ERROR TaskSetManager: Task 9 in stage 6.2 failed 4 times; 
aborting job

16/02/16 12:54:42 ERROR ContextCleaner: Error cleaning broadcast 19
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 
seconds]. This timeout is controlled by spark.rpc.askTimeout
at 
org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcEnv.scala:214)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:229)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:225)
at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcEnv.scala:242)
at 
org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:136)
at 
org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:228)
at 
org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
at 
org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:67)
at 
org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:214)
at 
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:170)
at 
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:161)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:161)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1136)
at 
org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:154)
at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:67)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 
seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala

Re: PySpark Lost Executors

2015-11-19 Thread Ross.Cramblit
Thank you Ted and Sandy for getting me pointed in the right direction. From the 
logs:

WARN yarn.YarnAllocator: Container killed by YARN for exceeding memory limits. 
25.4 GB of 25.3 GB physical memory used. Consider boosting 
spark.yarn.executor.memoryOverhead.
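For reference, that property takes a value in megabytes in Spark 1.x and can be
passed as a spark-submit flag; the 3072 below is illustrative only:

  --conf spark.yarn.executor.memoryOverhead=3072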


On Nov 19, 2015, at 12:20 PM, Ted Yu <yuzhih...@gmail.com> wrote:

Here are the parameters related to log aggregation :


<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>2592000</value>
</property>
<property>
  <name>yarn.nodemanager.log-aggregation.compression-type</name>
  <value>gz</value>
</property>
<property>
  <name>yarn.nodemanager.log-aggregation.debug-enabled</name>
  <value>false</value>
</property>
<property>
  <name>yarn.nodemanager.log-aggregation.num-log-files-per-app</name>
  <value>30</value>
</property>
<property>
  <name>yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds</name>
  <value>-1</value>
</property>


On Thu, Nov 19, 2015 at 8:14 AM, <ross.cramb...@thomsonreuters.com> wrote:
Hmm I guess I do not - I get 'application_1445957755572_0176 does not have any 
log files.’ Where can I enable log aggregation?
On Nov 19, 2015, at 11:07 AM, Ted Yu <yuzhih...@gmail.com> wrote:

Do you have YARN log aggregation enabled ?

You can try retrieving log for the container using the following command:

yarn logs -applicationId application_1445957755572_0176 -containerId 
container_1445957755572_0176_01_03

Cheers

On Thu, Nov 19, 2015 at 8:02 AM, <ross.cramb...@thomsonreuters.com> wrote:
I am running Spark 1.5.2 on Yarn. My job consists of a number of SparkSQL 
transforms on a JSON data set that I load into a data frame. The data set is 
not large (~100GB) and most stages execute without any issues. However, some 
more complex stages tend to lose executors/nodes regularly. What would cause 
this to happen? The logs don’t give too much information -

15/11/19 15:53:43 ERROR YarnScheduler: Lost executor 2 on 
ip-10-0-0-136.ec2.internal: Yarn deallocated the executor 2 (container 
container_1445957755572_0176_01_03)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 142.0 in stage 33.0 (TID 8331, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 133.0 in stage 33.0 (TID 8322, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 79.0 in stage 33.0 (TID 8268, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 141.0 in stage 33.0 (TID 8330, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 123.0 in stage 33.0 (TID 8312, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 162.0 in stage 33.0 (TID 8351, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 153.0 in stage 33.0 (TID 8342, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 120.0 in stage 33.0 (TID 8309, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 149.0 in stage 33.0 (TID 8338, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 134.0 in stage 33.0 (TID 8323, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
[Stage 33:===> (117 + 50) / 
200]15/11/19 15:53:46 WARN ReliableDeliverySupervisor: Association with remote 
system [akka.tcp://sparkExecutor@ip-10-0-0-136.ec2.internal:60275] has failed, 
address is now gated for [5000] ms. Reason: [Disassociated]

 - Followed by a list of lost tasks on each executor.






Re: PySpark Lost Executors

2015-11-19 Thread Ted Yu
Here are the parameters related to log aggregation :


<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>2592000</value>
</property>
<property>
  <name>yarn.nodemanager.log-aggregation.compression-type</name>
  <value>gz</value>
</property>
<property>
  <name>yarn.nodemanager.log-aggregation.debug-enabled</name>
  <value>false</value>
</property>
<property>
  <name>yarn.nodemanager.log-aggregation.num-log-files-per-app</name>
  <value>30</value>
</property>
<property>
  <name>yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds</name>
  <value>-1</value>
</property>


On Thu, Nov 19, 2015 at 8:14 AM,  wrote:

> Hmm I guess I do not - I get 'application_1445957755572_0176 does not
> have any log files.’ Where can I enable log aggregation?
>
> On Nov 19, 2015, at 11:07 AM, Ted Yu  wrote:
>
> Do you have YARN log aggregation enabled ?
>
> You can try retrieving log for the container using the following command:
>
> yarn logs -applicationId application_1445957755572_0176
>  -containerId container_1445957755572_0176_01_03
>
> Cheers
>
> On Thu, Nov 19, 2015 at 8:02 AM,  wrote:
>
>> I am running Spark 1.5.2 on Yarn. My job consists of a number of SparkSQL
>> transforms on a JSON data set that I load into a data frame. The data set
>> is not large (~100GB) and most stages execute without any issues. However,
>> some more complex stages tend to lose executors/nodes regularly. What would
>> cause this to happen? The logs don’t give too much information -
>>
>> 15/11/19 15:53:43 ERROR YarnScheduler: Lost executor 2 on
>> ip-10-0-0-136.ec2.internal: Yarn deallocated the executor 2 (container
>> container_1445957755572_0176_01_03)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 142.0 in stage 33.0 (TID
>> 8331, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 133.0 in stage 33.0 (TID
>> 8322, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 79.0 in stage 33.0 (TID
>> 8268, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 141.0 in stage 33.0 (TID
>> 8330, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 123.0 in stage 33.0 (TID
>> 8312, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 162.0 in stage 33.0 (TID
>> 8351, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 153.0 in stage 33.0 (TID
>> 8342, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 120.0 in stage 33.0 (TID
>> 8309, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 149.0 in stage 33.0 (TID
>> 8338, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 134.0 in stage 33.0 (TID
>> 8323, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> [Stage 33:===> (117 + 50)
>> / 200]15/11/19 15:53:46 WARN ReliableDeliverySupervisor: Association with
>> remote system [akka.tcp://sparkExecutor@ip-10-0-0-136.ec2.internal:60275]
>> has failed, address is now gated for [5000] ms. Reason: [Disassociated]
>>
>>  - Followed by a list of lost tasks on each executor.
>
>
>
>


Re: PySpark Lost Executors

2015-11-19 Thread Sandy Ryza
Hi Ross,

This is most likely occurring because YARN is killing containers for
exceeding physical memory limits.  You can make this less likely to happen
by bumping spark.yarn.executor.memoryOverhead to something higher than 10%
of your spark.executor.memory.
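Concretely, as spark-defaults.conf entries (the numbers are only an example, not
a recommendation for this particular job):

  spark.executor.memory               20g
  # default overhead is max(384 MB, 10% of executor memory); value is in MB in Spark 1.x
  spark.yarn.executor.memoryOverhead  3072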

-Sandy

On Thu, Nov 19, 2015 at 8:14 AM,  wrote:

> Hmm I guess I do not - I get 'application_1445957755572_0176 does not
> have any log files.’ Where can I enable log aggregation?
>
> On Nov 19, 2015, at 11:07 AM, Ted Yu  wrote:
>
> Do you have YARN log aggregation enabled ?
>
> You can try retrieving log for the container using the following command:
>
> yarn logs -applicationId application_1445957755572_0176
>  -containerId container_1445957755572_0176_01_03
>
> Cheers
>
> On Thu, Nov 19, 2015 at 8:02 AM,  wrote:
>
>> I am running Spark 1.5.2 on Yarn. My job consists of a number of SparkSQL
>> transforms on a JSON data set that I load into a data frame. The data set
>> is not large (~100GB) and most stages execute without any issues. However,
>> some more complex stages tend to lose executors/nodes regularly. What would
>> cause this to happen? The logs don’t give too much information -
>>
>> 15/11/19 15:53:43 ERROR YarnScheduler: Lost executor 2 on
>> ip-10-0-0-136.ec2.internal: Yarn deallocated the executor 2 (container
>> container_1445957755572_0176_01_03)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 142.0 in stage 33.0 (TID
>> 8331, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 133.0 in stage 33.0 (TID
>> 8322, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 79.0 in stage 33.0 (TID
>> 8268, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 141.0 in stage 33.0 (TID
>> 8330, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 123.0 in stage 33.0 (TID
>> 8312, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 162.0 in stage 33.0 (TID
>> 8351, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 153.0 in stage 33.0 (TID
>> 8342, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 120.0 in stage 33.0 (TID
>> 8309, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 149.0 in stage 33.0 (TID
>> 8338, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 134.0 in stage 33.0 (TID
>> 8323, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
>> [Stage 33:===> (117 + 50)
>> / 200]15/11/19 15:53:46 WARN ReliableDeliverySupervisor: Association with
>> remote system [akka.tcp://sparkExecutor@ip-10-0-0-136.ec2.internal:60275]
>> has failed, address is now gated for [5000] ms. Reason: [Disassociated]
>>
>>  - Followed by a list of lost tasks on each executor.
>
>
>
>


Re: PySpark Lost Executors

2015-11-19 Thread Ross.Cramblit
Hmm I guess I do not - I get 'application_1445957755572_0176 does not have any 
log files.’ Where can I enable log aggregation?
On Nov 19, 2015, at 11:07 AM, Ted Yu <yuzhih...@gmail.com> wrote:

Do you have YARN log aggregation enabled ?

You can try retrieving log for the container using the following command:

yarn logs -applicationId application_1445957755572_0176 -containerId 
container_1445957755572_0176_01_03

Cheers

On Thu, Nov 19, 2015 at 8:02 AM, <ross.cramb...@thomsonreuters.com> wrote:
I am running Spark 1.5.2 on Yarn. My job consists of a number of SparkSQL 
transforms on a JSON data set that I load into a data frame. The data set is 
not large (~100GB) and most stages execute without any issues. However, some 
more complex stages tend to lose executors/nodes regularly. What would cause 
this to happen? The logs don’t give too much information -

15/11/19 15:53:43 ERROR YarnScheduler: Lost executor 2 on 
ip-10-0-0-136.ec2.internal: Yarn deallocated the executor 2 (container 
container_1445957755572_0176_01_03)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 142.0 in stage 33.0 (TID 8331, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 133.0 in stage 33.0 (TID 8322, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 79.0 in stage 33.0 (TID 8268, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 141.0 in stage 33.0 (TID 8330, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 123.0 in stage 33.0 (TID 8312, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 162.0 in stage 33.0 (TID 8351, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 153.0 in stage 33.0 (TID 8342, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 120.0 in stage 33.0 (TID 8309, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 149.0 in stage 33.0 (TID 8338, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 134.0 in stage 33.0 (TID 8323, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
[Stage 33:===> (117 + 50) / 
200]15/11/19 15:53:46 WARN ReliableDeliverySupervisor: Association with remote 
system [akka.tcp://sparkExecutor@ip-10-0-0-136.ec2.internal:60275] has failed, 
address is now gated for [5000] ms. Reason: [Disassociated]

 - Followed by a list of lost tasks on each executor.




Re: PySpark Lost Executors

2015-11-19 Thread Ted Yu
Do you have YARN log aggregation enabled ?

You can try retrieving log for the container using the following command:

yarn logs -applicationId application_1445957755572_0176
 -containerId container_1445957755572_0176_01_03

Cheers

On Thu, Nov 19, 2015 at 8:02 AM,  wrote:

> I am running Spark 1.5.2 on Yarn. My job consists of a number of SparkSQL
> transforms on a JSON data set that I load into a data frame. The data set
> is not large (~100GB) and most stages execute without any issues. However,
> some more complex stages tend to lose executors/nodes regularly. What would
> cause this to happen? The logs don’t give too much information -
>
> 15/11/19 15:53:43 ERROR YarnScheduler: Lost executor 2 on
> ip-10-0-0-136.ec2.internal: Yarn deallocated the executor 2 (container
> container_1445957755572_0176_01_03)
> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 142.0 in stage 33.0 (TID
> 8331, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 133.0 in stage 33.0 (TID
> 8322, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 79.0 in stage 33.0 (TID
> 8268, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 141.0 in stage 33.0 (TID
> 8330, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 123.0 in stage 33.0 (TID
> 8312, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 162.0 in stage 33.0 (TID
> 8351, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 153.0 in stage 33.0 (TID
> 8342, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 120.0 in stage 33.0 (TID
> 8309, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 149.0 in stage 33.0 (TID
> 8338, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 134.0 in stage 33.0 (TID
> 8323, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
> [Stage 33:===> (117 + 50)
> / 200]15/11/19 15:53:46 WARN ReliableDeliverySupervisor: Association with
> remote system [akka.tcp://sparkExecutor@ip-10-0-0-136.ec2.internal:60275]
> has failed, address is now gated for [5000] ms. Reason: [Disassociated]
>
>  - Followed by a list of lost tasks on each executor.


PySpark Lost Executors

2015-11-19 Thread Ross.Cramblit
I am running Spark 1.5.2 on Yarn. My job consists of a number of SparkSQL 
transforms on a JSON data set that I load into a data frame. The data set is 
not large (~100GB) and most stages execute without any issues. However, some 
more complex stages tend to lose executors/nodes regularly. What would cause 
this to happen? The logs don’t give too much information - 

15/11/19 15:53:43 ERROR YarnScheduler: Lost executor 2 on 
ip-10-0-0-136.ec2.internal: Yarn deallocated the executor 2 (container 
container_1445957755572_0176_01_03)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 142.0 in stage 33.0 (TID 8331, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 133.0 in stage 33.0 (TID 8322, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 79.0 in stage 33.0 (TID 8268, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 141.0 in stage 33.0 (TID 8330, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 123.0 in stage 33.0 (TID 8312, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 162.0 in stage 33.0 (TID 8351, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 153.0 in stage 33.0 (TID 8342, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 120.0 in stage 33.0 (TID 8309, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 149.0 in stage 33.0 (TID 8338, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 134.0 in stage 33.0 (TID 8323, 
ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
[Stage 33:===> (117 + 50) / 
200]15/11/19 15:53:46 WARN ReliableDeliverySupervisor: Association with remote 
system [akka.tcp://sparkExecutor@ip-10-0-0-136.ec2.internal:60275] has failed, 
address is now gated for [5000] ms. Reason: [Disassociated]

 - Followed by a list of lost tasks on each executor.

Re: Lost executors

2014-11-20 Thread Pala M Muthaia
.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>   at org.apache.spark.scheduler.Task.run(Task.scala:51)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:724)
>
> 14/11/17 23:58:00 WARN ShortCircuitCache: ShortCircuitCache(0x71a8053d): 
> failed to load 1276010498_BP-1416824317-172.22.48.2-1387241776581
>
>
> However, in some of the nodes, it seems execution proceeded after the
> error, so the above could just be a transient error.
>
> Finally, in the driver logs, i was looking for hint on the decision to
> kill many executors, around the 00:18:25 timestamp when many tasks were
> killed across many executors, but i didn't find anything different.
>
>
>
> On Tue, Nov 18, 2014 at 1:59 PM, Sandy Ryza 
> wrote:
>
>> Hi Pala,
>>
>> Do you have access to your YARN NodeManager logs?  Are you able to check
>> whether they report killing any containers for exceeding memory limits?
>>
>> -Sandy
>>
>> On Tue, Nov 18, 2014 at 1:54 PM, Pala M Muthaia <
>> mchett...@rocketfuelinc.com> wrote:
>>
>>> Hi,
>>>
>>> I am using Spark 1.0.1 on Yarn 2.5, and doing everything through spark
>>> shell.
>>>
>>> I am running a job that essentially reads a bunch of HBase keys, looks
>>> up HBase data, and performs some filtering and aggregation. The job works
>>> fine in smaller datasets, but when i try to execute on the full dataset,
>>> the job never completes. The few symptoms i notice are:
>>>
>>> a. The job shows progress for a while and then starts throwing lots of
>>> the following errors:
>>>
>>> 2014-11-18 00:18:20,020 [spark-akka.actor.default-dispatcher-67] INFO
>>>  org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend - *Executor
>>> 906 disconnected, so removing it*
>>> 2014-11-18 00:18:20,020 [spark-akka.actor.default-dispatcher-67] ERROR
>>> org.apache.spark.scheduler.cluster.YarnClientClusterScheduler - *Lost
>>> executor 906 on : remote Akka client disassociated*
>>>
>>> 2014-11-18 16:52:02,283 [spark-akka.actor.default-dispatcher-22] WARN
>>>  org.apache.spark.storage.BlockManagerMasterActor - *Removing
>>> BlockManager BlockManagerId(9186, , 54600, 0) with no recent
>>> heart beats: 82313ms exceeds 45000ms*
>>>
>>> Looking at the logs, the job never recovers from these errors, and
>>> continues to show errors about lost executors and launching new executors,
>>> and this just continues for a long time.
>>>
>>> Could this be because the executors are running out of memory?
>>>
>>> In terms of memory usage, the intermediate data could be large (after
>>> the HBase lookup), but partial and fully aggregated data set size should be
>>> quite small - essentially a bunch of ids and counts (< 1 mil in total).
>>>
>>>
>>>
>>> b. In the Spark UI, i am seeing the following errors (redacted for
>>> brevity), not sure if they are transient or real issue:
>>>
>>> java.net.Soc

Re: Lost executors

2014-11-18 Thread Pala M Muthaia
(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)

14/11/17 23:58:00 WARN ShortCircuitCache:
ShortCircuitCache(0x71a8053d): failed to load
1276010498_BP-1416824317-172.22.48.2-1387241776581


However, in some of the nodes, it seems execution proceeded after the
error, so the above could just be a transient error.

Finally, in the driver logs, i was looking for hint on the decision to kill
many executors, around the 00:18:25 timestamp when many tasks were killed
across many executors, but i didn't find anything different.



On Tue, Nov 18, 2014 at 1:59 PM, Sandy Ryza  wrote:

> Hi Pala,
>
> Do you have access to your YARN NodeManager logs?  Are you able to check
> whether they report killing any containers for exceeding memory limits?
>
> -Sandy
>
> On Tue, Nov 18, 2014 at 1:54 PM, Pala M Muthaia <
> mchett...@rocketfuelinc.com> wrote:
>
>> Hi,
>>
>> I am using Spark 1.0.1 on Yarn 2.5, and doing everything through spark
>> shell.
>>
>> I am running a job that essentially reads a bunch of HBase keys, looks up
>> HBase data, and performs some filtering and aggregation. The job works fine
>> in smaller datasets, but when i try to execute on the full dataset, the job
>> never completes. The few symptoms i notice are:
>>
>> a. The job shows progress for a while and then starts throwing lots of
>> the following errors:
>>
>> 2014-11-18 00:18:20,020 [spark-akka.actor.default-dispatcher-67] INFO
>>  org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend - *Executor
>> 906 disconnected, so removing it*
>> 2014-11-18 00:18:20,020 [spark-akka.actor.default-dispatcher-67] ERROR
>> org.apache.spark.scheduler.cluster.YarnClientClusterScheduler - *Lost
>> executor 906 on : remote Akka client disassociated*
>>
>> 2014-11-18 16:52:02,283 [spark-akka.actor.default-dispatcher-22] WARN
>>  org.apache.spark.storage.BlockManagerMasterActor - *Removing
>> BlockManager BlockManagerId(9186, , 54600, 0) with no recent
>> heart beats: 82313ms exceeds 45000ms*
>>
>> Looking at the logs, the job never recovers from these errors, and
>> continues to show errors about lost executors and launching new executors,
>> and this just continues for a long time.
>>
>> Could this be because the executors are running out of memory?
>>
>> In terms of memory usage, the intermediate data could be large (after the
>> HBase lookup), but partial and fully aggregated data set size should be
>> quite small - essentially a bunch of ids and counts (< 1 mil in total).
>>
>>
>>
>> b. In the Spark UI, i am seeing the following errors (redacted for
>> brevity), not sure if they are transient or real issue:
>>
>> java.net.SocketTimeoutException (java.net.SocketTimeoutException: Read timed 
>> out}
>> ...
>> org.apache.spark.util.Utils$.fetchFile(Utils.scala:349)
>> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:330)
>> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:328)
>> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>> ...
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> java.lang.Thread.run(Thread.java:724)
>>
>>
>>
>>
>> I was trying to get more data to investigate but haven't been able to
>> figure out how to enable logging on the executors. The Spark UI appears
>> stuck and i only see driver side logs in the jobhistory directory specified
>> in the job.
>>
>>
>> Thanks,
>> pala
>>
>>
>>
>


Re: Lost executors

2014-11-18 Thread Sandy Ryza
Hi Pala,

Do you have access to your YARN NodeManager logs?  Are you able to check
whether they report killing any containers for exceeding memory limits?
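The NodeManager line to look for is usually of the form "is running beyond
physical memory limits ... Killing container"; for example (the log path is an
assumption and varies by distribution):

  grep -i "beyond physical memory" /var/log/hadoop-yarn/*nodemanager*.log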

-Sandy

On Tue, Nov 18, 2014 at 1:54 PM, Pala M Muthaia  wrote:

> Hi,
>
> I am using Spark 1.0.1 on Yarn 2.5, and doing everything through spark
> shell.
>
> I am running a job that essentially reads a bunch of HBase keys, looks up
> HBase data, and performs some filtering and aggregation. The job works fine
> in smaller datasets, but when i try to execute on the full dataset, the job
> never completes. The few symptoms i notice are:
>
> a. The job shows progress for a while and then starts throwing lots of the
> following errors:
>
> 2014-11-18 00:18:20,020 [spark-akka.actor.default-dispatcher-67] INFO
>  org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend - *Executor
> 906 disconnected, so removing it*
> 2014-11-18 00:18:20,020 [spark-akka.actor.default-dispatcher-67] ERROR
> org.apache.spark.scheduler.cluster.YarnClientClusterScheduler - *Lost
> executor 906 on : remote Akka client disassociated*
>
> 2014-11-18 16:52:02,283 [spark-akka.actor.default-dispatcher-22] WARN
>  org.apache.spark.storage.BlockManagerMasterActor - *Removing
> BlockManager BlockManagerId(9186, , 54600, 0) with no recent
> heart beats: 82313ms exceeds 45000ms*
>
> Looking at the logs, the job never recovers from these errors, and
> continues to show errors about lost executors and launching new executors,
> and this just continues for a long time.
>
> Could this be because the executors are running out of memory?
>
> In terms of memory usage, the intermediate data could be large (after the
> HBase lookup), but partial and fully aggregated data set size should be
> quite small - essentially a bunch of ids and counts (< 1 mil in total).
>
>
>
> b. In the Spark UI, i am seeing the following errors (redacted for
> brevity), not sure if they are transient or real issue:
>
> java.net.SocketTimeoutException (java.net.SocketTimeoutException: Read timed 
> out}
> ...
> org.apache.spark.util.Utils$.fetchFile(Utils.scala:349)
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:330)
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:328)
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> ...
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:724)
>
>
>
>
> I was trying to get more data to investigate but haven't been able to
> figure out how to enable logging on the executors. The Spark UI appears
> stuck and i only see driver side logs in the jobhistory directory specified
> in the job.
>
>
> Thanks,
> pala
>
>
>


Lost executors

2014-11-18 Thread Pala M Muthaia
Hi,

I am using Spark 1.0.1 on Yarn 2.5, and doing everything through spark
shell.

I am running a job that essentially reads a bunch of HBase keys, looks up
HBase data, and performs some filtering and aggregation. The job works fine
in smaller datasets, but when i try to execute on the full dataset, the job
never completes. The few symptoms i notice are:

a. The job shows progress for a while and then starts throwing lots of the
following errors:

2014-11-18 00:18:20,020 [spark-akka.actor.default-dispatcher-67] INFO
 org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend - *Executor
906 disconnected, so removing it*
2014-11-18 00:18:20,020 [spark-akka.actor.default-dispatcher-67] ERROR
org.apache.spark.scheduler.cluster.YarnClientClusterScheduler - *Lost
executor 906 on : remote Akka client disassociated*

2014-11-18 16:52:02,283 [spark-akka.actor.default-dispatcher-22] WARN
 org.apache.spark.storage.BlockManagerMasterActor - *Removing BlockManager
BlockManagerId(9186, , 54600, 0) with no recent heart beats:
82313ms exceeds 45000ms*

Looking at the logs, the job never recovers from these errors, and
continues to show errors about lost executors and launching new executors,
and this just continues for a long time.

Could this be because the executors are running out of memory?

In terms of memory usage, the intermediate data could be large (after the
HBase lookup), but partial and fully aggregated data set size should be
quite small - essentially a bunch of ids and counts (< 1 mil in total).



b. In the Spark UI, i am seeing the following errors (redacted for
brevity), not sure if they are transient or real issue:

java.net.SocketTimeoutException (java.net.SocketTimeoutException: Read
timed out}
...
org.apache.spark.util.Utils$.fetchFile(Utils.scala:349)
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:330)
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:328)
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
...
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:724)




I was trying to get more data to investigate but haven't been able to
figure out how to enable logging on the executors. The Spark UI appears
stuck and i only see driver side logs in the jobhistory directory specified
in the job.


Thanks,
pala


Re: Lost executors

2014-08-13 Thread Andrew Or
Hi Ravi,

Setting SPARK_MEMORY doesn't do anything. I believe you confused it with
SPARK_MEM, which is now deprecated. You should set SPARK_EXECUTOR_MEMORY
instead, or "spark.executor.memory" as a config in
conf/spark-defaults.conf. Assuming you haven't set the executor memory
through a different mechanism, your executors will quickly run out of
memory with the default of 512m.
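For example (the 8g figure is illustrative):

  # conf/spark-defaults.conf
  spark.executor.memory  8g

or, equivalently, export SPARK_EXECUTOR_MEMORY=8g in conf/spark-env.sh.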

Let me know if setting this does the job. If so, you can even persist the
RDDs to memory as well to get better performance, though this depends on
your workload.

-Andrew


2014-08-13 11:38 GMT-07:00 rpandya :

> I'm running Spark 1.0.1 with SPARK_MEMORY=60g, so 4 executors at that size
> would indeed run out of memory (the machine has 110GB). And in fact they
> would get repeatedly restarted and killed until eventually Spark gave up.
>
> I'll try with a smaller limit, but it'll be a while - somehow my HDFS got
> seriously corrupted so I need to rebuild my HDP cluster...
>
> Thanks,
>
> Ravi
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Lost-executors-tp11722p12050.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Lost executors

2014-08-13 Thread rpandya
I'm running Spark 1.0.1 with SPARK_MEMORY=60g, so 4 executors at that size
would indeed run out of memory (the machine has 110GB). And in fact they
would get repeatedly restarted and killed until eventually Spark gave up.

I'll try with a smaller limit, but it'll be a while - somehow my HDFS got
seriously corrupted so I need to rebuild my HDP cluster...

Thanks,

Ravi



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Lost-executors-tp11722p12050.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Lost executors

2014-08-13 Thread Andrew Or
To add to the pile of information we're asking you to provide, what version
of Spark are you running?


2014-08-13 11:11 GMT-07:00 Shivaram Venkataraman :

> If the JVM heap size is close to the memory limit the OS sometimes kills
> the process under memory pressure. I've usually found that lowering the
> executor memory size helps.
>
> Shivaram
>
>
> On Wed, Aug 13, 2014 at 11:01 AM, Matei Zaharia 
> wrote:
>
>> What is your Spark executor memory set to? (You can see it in Spark's web
>> UI at http://:4040 under the executors tab). One thing to be
>> aware of is that the JVM never really releases memory back to the OS, so it
>> will keep filling up to the maximum heap size you set. Maybe 4 executors
>> with that much heap are taking a lot of the memory.
>>
>> Persist as DISK_ONLY should indeed stream data from disk, so I don't
>> think that will be a problem.
>>
>> Matei
>>
>> On August 13, 2014 at 6:49:11 AM, rpandya (r...@iecommerce.com) wrote:
>>
>> After a lot of grovelling through logs, I found out that the Nagios
>> monitor
>> process detected that the machine was almost out of memory, and killed
>> the
>> SNAP executor process.
>>
>> So why is the machine running out of memory? Each node has 128GB of RAM,
>> 4
>> executors, about 40GB of data. It did run out of memory if I tried to
>> cache() the RDD, but I would hope that persist() is implemented so that
>> it
>> would stream to disk without trying to materialize too much data in RAM.
>>
>> Ravi
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Lost-executors-tp11722p12032.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Lost executors

2014-08-13 Thread Shivaram Venkataraman
If the JVM heap size is close to the memory limit the OS sometimes kills
the process under memory pressure. I've usually found that lowering the
executor memory size helps.

Shivaram


On Wed, Aug 13, 2014 at 11:01 AM, Matei Zaharia 
wrote:

> What is your Spark executor memory set to? (You can see it in Spark's web
> UI at http://:4040 under the executors tab). One thing to be
> aware of is that the JVM never really releases memory back to the OS, so it
> will keep filling up to the maximum heap size you set. Maybe 4 executors
> with that much heap are taking a lot of the memory.
>
> Persist as DISK_ONLY should indeed stream data from disk, so I don't think
> that will be a problem.
>
> Matei
>
> On August 13, 2014 at 6:49:11 AM, rpandya (r...@iecommerce.com) wrote:
>
> After a lot of grovelling through logs, I found out that the Nagios
> monitor
> process detected that the machine was almost out of memory, and killed the
> SNAP executor process.
>
> So why is the machine running out of memory? Each node has 128GB of RAM, 4
> executors, about 40GB of data. It did run out of memory if I tried to
> cache() the RDD, but I would hope that persist() is implemented so that it
> would stream to disk without trying to materialize too much data in RAM.
>
> Ravi
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Lost-executors-tp11722p12032.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Lost executors

2014-08-13 Thread Matei Zaharia
What is your Spark executor memory set to? (You can see it in Spark's web UI at 
http://:4040 under the executors tab). One thing to be aware of is that 
the JVM never really releases memory back to the OS, so it will keep filling up 
to the maximum heap size you set. Maybe 4 executors with that much heap are 
taking a lot of the memory.

Persist as DISK_ONLY should indeed stream data from disk, so I don't think that 
will be a problem.
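A minimal sketch of that in Scala (rdd is a stand-in for whatever RDD is being
persisted):

  import org.apache.spark.storage.StorageLevel
  // DISK_ONLY keeps partitions on local disk and streams them back when read,
  // rather than holding deserialized objects in the JVM heap
  val persisted = rdd.persist(StorageLevel.DISK_ONLY)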

Matei

On August 13, 2014 at 6:49:11 AM, rpandya (r...@iecommerce.com) wrote:

After a lot of grovelling through logs, I found out that the Nagios monitor 
process detected that the machine was almost out of memory, and killed the 
SNAP executor process. 

So why is the machine running out of memory? Each node has 128GB of RAM, 4 
executors, about 40GB of data. It did run out of memory if I tried to 
cache() the RDD, but I would hope that persist() is implemented so that it 
would stream to disk without trying to materialize too much data in RAM. 

Ravi 



-- 
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Lost-executors-tp11722p12032.html
 
Sent from the Apache Spark User List mailing list archive at Nabble.com. 

- 
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
For additional commands, e-mail: user-h...@spark.apache.org 



Re: Lost executors

2014-08-13 Thread rpandya
After a lot of grovelling through logs, I found out that the Nagios monitor
process detected that the machine was almost out of memory, and killed the
SNAP executor process.

So why is the machine running out of memory? Each node has 128GB of RAM, 4
executors, about 40GB of data. It did run out of memory if I tried to
cache() the RDD, but I would hope that persist() is implemented so that it
would stream to disk without trying to materialize too much data in RAM.

Ravi



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Lost-executors-tp11722p12032.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Lost executors

2014-08-08 Thread rpandya
Hi Avishek,

I'm running on a manual cluster setup, and all the code is Scala. The load
averages don't seem high when I see these failures (about 12 on a 16-core
machine).

Ravi



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Lost-executors-tp11722p11819.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Lost executors

2014-08-08 Thread Avishek Saha
Same here Ravi. See my post on a similar thread.

Are you running on YARN client?
On Aug 7, 2014 2:56 PM, "rpandya"  wrote:

> I'm running into a problem with executors failing, and it's not clear
> what's
> causing it. Any suggestions on how to diagnose & fix it would be
> appreciated.
>
> There are a variety of errors in the logs, and I don't see a consistent
> triggering error. I've tried varying the number of executors per machine
> (1/4/16 per 16-core/128GB machine w/200GB free disk) and it still fails.
>
> The relevant code is:
> val reads = fastqAsText.mapPartitionsWithIndex(runner.mapReads(_, _,
> seqDictBcast.value))
> val result = reads.coalesce(numMachines * coresPerMachine * 4,
> true).persist(StorageLevel.DISK_ONLY_2)
> log.info("SNAP output DebugString:\n" + result.toDebugString)
> log.info("produced " + result.count + " reads")
>
> The toDebugString output is:
> 2014-08-07 18:50:43 INFO  SnapInputStage:198 - SNAP output DebugString:
> MappedRDD[10] at coalesce at SnapInputStage.scala:197 (640 partitions)
>   CoalescedRDD[9] at coalesce at SnapInputStage.scala:197 (640 partitions)
> ShuffledRDD[8] at coalesce at SnapInputStage.scala:197 (640 partitions)
>   MapPartitionsRDD[7] at coalesce at SnapInputStage.scala:197 (10
> partitions)
> MapPartitionsRDD[6] at mapPartitionsWithIndex at
> SnapInputStage.scala:195 (10 partitions)
>   MappedRDD[4] at map at SnapInputStage.scala:188 (10 partitions)
> CoalescedRDD[3] at coalesce at SnapInputStage.scala:188 (10
> partitions)
>   NewHadoopRDD[2] at newAPIHadoopFile at
> SnapInputStage.scala:182 (3003 partitions)
>
> The 10-partition stage works fine, takes about 1.4 hours, reads 40GB and
> writes 25GB per task. The next 640-partition stage is where the failures
> occur.
>
> Here are the first few errors from a recent run (sorted by time):
> work/hpcraviplvm10/app-20140807185713-/14/stderr:   14/08/07 20:32:18
> ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
> work/hpcraviplvm10/app-20140807185713-/27/stderr:   14/08/07 20:32:18
> ERROR BlockFetcherIterator$BasicBlockFetcherIterator: Could not get
> block(s)
> from ConnectionManagerId(hpcraviplvm1,49545)
> work/hpcraviplvm1/app-20140807185713-/9/stderr: 14/08/07 20:32:18
>   ERROR
> ConnectionManager: Corresponding SendingConnectionManagerId not found
> work/hpcraviplvm2/app-20140807185713-/24/stderr:14/08/07 20:32:18
>   ERROR
> ConnectionManager: Corresponding SendingConnectionManagerId not found
> work/hpcraviplvm2/app-20140807185713-/36/stderr:14/08/07 20:32:18
>   ERROR
> SendingConnection: Exception while reading SendingConnection to
> ConnectionManagerId(hpcraviplvm1,49545)
> work/hpcraviplvma1/app-20140807185713-/26/stderr:   14/08/07 20:32:18
> ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
> work/hpcraviplvma2/app-20140807185713-/15/stderr:   14/08/07 20:32:18
> ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
> work/hpcraviplvma2/app-20140807185713-/18/stderr:   14/08/07 20:32:18
> ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
> work/hpcraviplvma2/app-20140807185713-/23/stderr:   14/08/07 20:32:18
> ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
> work/hpcraviplvma2/app-20140807185713-/33/stderr:   14/08/07 20:32:18
> ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
>
> Thanks,
>
> Ravi Pandya
> Microsoft Research
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Lost-executors-tp11722.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Lost executors

2014-08-07 Thread rpandya
I'm running into a problem with executors failing, and it's not clear what's
causing it. Any suggestions on how to diagnose & fix it would be
appreciated.

There are a variety of errors in the logs, and I don't see a consistent
triggering error. I've tried varying the number of executors per machine
(1/4/16 per 16-core/128GB machine w/200GB free disk) and it still fails.

The relevant code is:
val reads = fastqAsText.mapPartitionsWithIndex(runner.mapReads(_, _,
seqDictBcast.value))
val result = reads.coalesce(numMachines * coresPerMachine * 4,
true).persist(StorageLevel.DISK_ONLY_2)
log.info("SNAP output DebugString:\n" + result.toDebugString)
log.info("produced " + result.count + " reads")

The toDebugString output is:
2014-08-07 18:50:43 INFO  SnapInputStage:198 - SNAP output DebugString:
MappedRDD[10] at coalesce at SnapInputStage.scala:197 (640 partitions)
  CoalescedRDD[9] at coalesce at SnapInputStage.scala:197 (640 partitions)
ShuffledRDD[8] at coalesce at SnapInputStage.scala:197 (640 partitions)
  MapPartitionsRDD[7] at coalesce at SnapInputStage.scala:197 (10
partitions)
MapPartitionsRDD[6] at mapPartitionsWithIndex at
SnapInputStage.scala:195 (10 partitions)
  MappedRDD[4] at map at SnapInputStage.scala:188 (10 partitions)
CoalescedRDD[3] at coalesce at SnapInputStage.scala:188 (10
partitions)
  NewHadoopRDD[2] at newAPIHadoopFile at
SnapInputStage.scala:182 (3003 partitions)

The 10-partition stage works fine, takes about 1.4 hours, reads 40GB and
writes 25GB per task. The next 640-partition stage is where the failures
occur.

Here are the first few errors from a recent run (sorted by time):
work/hpcraviplvm10/app-20140807185713-/14/stderr:   14/08/07 20:32:18
ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
work/hpcraviplvm10/app-20140807185713-/27/stderr:   14/08/07 20:32:18
ERROR BlockFetcherIterator$BasicBlockFetcherIterator: Could not get block(s)
from ConnectionManagerId(hpcraviplvm1,49545)
work/hpcraviplvm1/app-20140807185713-/9/stderr: 14/08/07 20:32:18   
ERROR
ConnectionManager: Corresponding SendingConnectionManagerId not found
work/hpcraviplvm2/app-20140807185713-/24/stderr:14/08/07 20:32:18   
ERROR
ConnectionManager: Corresponding SendingConnectionManagerId not found
work/hpcraviplvm2/app-20140807185713-/36/stderr:14/08/07 20:32:18   
ERROR
SendingConnection: Exception while reading SendingConnection to
ConnectionManagerId(hpcraviplvm1,49545)
work/hpcraviplvma1/app-20140807185713-/26/stderr:   14/08/07 20:32:18
ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
work/hpcraviplvma2/app-20140807185713-/15/stderr:   14/08/07 20:32:18
ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
work/hpcraviplvma2/app-20140807185713-/18/stderr:   14/08/07 20:32:18
ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
work/hpcraviplvma2/app-20140807185713-/23/stderr:   14/08/07 20:32:18
ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
work/hpcraviplvma2/app-20140807185713-/33/stderr:   14/08/07 20:32:18
ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found

Thanks,

Ravi Pandya
Microsoft Research



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Lost-executors-tp11722.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Lost executors

2014-07-23 Thread Eric Friedman
And... PEBCAK

I mistakenly believed I had set PYSPARK_PYTHON to a Python 2.7 install, but it
pointed to a Python 2.6 install on the remote nodes, hence incompatible with
what the master was sending. I have set it to point to the correct version
everywhere and it works.
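For anyone else who hits this, the fix boils down to something like the following
(the path is illustrative):

  # must point at the same interpreter version on the driver and on every worker node
  export PYSPARK_PYTHON=/usr/bin/python2.7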

Apologies for the false alarm.


On Wed, Jul 23, 2014 at 8:40 PM, Eric Friedman 
wrote:

> hi Andrew,
>
> Thanks for your note.  Yes, I see a stack trace now.  It seems to be an
> issue with python interpreting a function I wish to apply to an RDD.  The
> stack trace is below.  The function is a simple factorial:
>
> def f(n):
>   if n == 1: return 1
>   return n * f(n-1)
>
> and I'm trying to use it like this:
>
> tf = sc.textFile(...)
> tf.map(lambda line: line and len(line)).map(f).collect()
>
> I get the following error, which does not occur if I use a built-in
> function, like math.sqrt
>
>  TypeError: __import__() argument 1 must be string, not X#
>
> stacktrace follows
>
>
>
> WARN TaskSetManager: Loss was due to
> org.apache.spark.api.python.PythonException
>
> org.apache.spark.api.python.PythonException: Traceback (most recent call
> last):
>
>   File
> "/hadoop/d11/yarn/nm/usercache/eric_d_friedman/filecache/26/spark-assembly-1.0.1-hadoop2.2.0.jar/pyspark/worker.py",
> line 77, in main
>
> serializer.dump_stream(func(split_index, iterator), outfile)
>
>   File
> "/hadoop/d11/yarn/nm/usercache/eric_d_friedman/filecache/26/spark-assembly-1.0.1-hadoop2.2.0.jar/pyspark/serializers.py",
> line 191, in dump_stream
>
> self.serializer.dump_stream(self._batched(iterator), stream)
>
>   File
> "/hadoop/d11/yarn/nm/usercache/eric_d_friedman/filecache/26/spark-assembly-1.0.1-hadoop2.2.0.jar/pyspark/serializers.py",
> line 123, in dump_stream
>
> for obj in iterator:
>
>   File
> "/hadoop/d11/yarn/nm/usercache/eric_d_friedman/filecache/26/spark-assembly-1.0.1-hadoop2.2.0.jar/pyspark/serializers.py",
> line 180, in _batched
>
> for item in iterator:
>
>   File "", line 2, in f
>
> TypeError: __import__() argument 1 must be string, not X#
>
>
>
>  at
> org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:115)
>
> at
> org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:145)
>
> at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:78)
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>
>
>
>
>
> On Wed, Jul 23, 2014 at 11:07 AM, Andrew Or  wrote:
>
>> Hi Eric,
>>
>> Have you checked the executor logs? It is possible they died because of
>> some exception, and the message you see is just a side effect.
>>
>> Andrew
>>
>>
>> 2014-07-23 8:27 GMT-07:00 Eric Friedman :
>>
>> I'm using spark 1.0.1 on a quite large cluster, with gobs of memory, etc.
>>>  Cluster resources are available to me via Yarn and I am seeing these
>>> errors quite often.
>>>
>>> ERROR YarnClientClusterScheduler: Lost executor 63 on : remote
>>> Akka client disassociated
>>>
>>>
>>> This is in an interactive shell session.  I don't know a lot about Yarn
>>> plumbing and am wondering if there's some constraint in play -- executors
>>> can't be idle for too long or they get cleared out.
>>>
>>>
>>> Any insights here?
>>>
>>
>>
>


Re: Lost executors

2014-07-23 Thread Eric Friedman
hi Andrew,

Thanks for your note.  Yes, I see a stack trace now.  It seems to be an
issue with python interpreting a function I wish to apply to an RDD.  The
stack trace is below.  The function is a simple factorial:

def f(n):
  if n == 1: return 1
  return n * f(n-1)

and I'm trying to use it like this:

tf = sc.textFile(...)
tf.map(lambda line: line and len(line)).map(f).collect()

I get the following error, which does not occur if I use a built-in
function, like math.sqrt

 TypeError: __import__() argument 1 must be string, not X#

stacktrace follows



WARN TaskSetManager: Loss was due to
org.apache.spark.api.python.PythonException

org.apache.spark.api.python.PythonException: Traceback (most recent call
last):

  File
"/hadoop/d11/yarn/nm/usercache/eric_d_friedman/filecache/26/spark-assembly-1.0.1-hadoop2.2.0.jar/pyspark/worker.py",
line 77, in main

serializer.dump_stream(func(split_index, iterator), outfile)

  File
"/hadoop/d11/yarn/nm/usercache/eric_d_friedman/filecache/26/spark-assembly-1.0.1-hadoop2.2.0.jar/pyspark/serializers.py",
line 191, in dump_stream

self.serializer.dump_stream(self._batched(iterator), stream)

  File
"/hadoop/d11/yarn/nm/usercache/eric_d_friedman/filecache/26/spark-assembly-1.0.1-hadoop2.2.0.jar/pyspark/serializers.py",
line 123, in dump_stream

for obj in iterator:

  File
"/hadoop/d11/yarn/nm/usercache/eric_d_friedman/filecache/26/spark-assembly-1.0.1-hadoop2.2.0.jar/pyspark/serializers.py",
line 180, in _batched

for item in iterator:

  File "", line 2, in f

TypeError: __import__() argument 1 must be string, not X#



 at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:115)

at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:145)

at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:78)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)





On Wed, Jul 23, 2014 at 11:07 AM, Andrew Or  wrote:

> Hi Eric,
>
> Have you checked the executor logs? It is possible they died because of
> some exception, and the message you see is just a side effect.
>
> Andrew
>
>
> 2014-07-23 8:27 GMT-07:00 Eric Friedman :
>
> I'm using spark 1.0.1 on a quite large cluster, with gobs of memory, etc.
>>  Cluster resources are available to me via Yarn and I am seeing these
>> errors quite often.
>>
>> ERROR YarnClientClusterScheduler: Lost executor 63 on : remote Akka
>> client disassociated
>>
>>
>> This is in an interactive shell session.  I don't know a lot about Yarn
>> plumbing and am wondering if there's some constraint in play -- executors
>> can't be idle for too long or they get cleared out.
>>
>>
>> Any insights here?
>>
>
>


Re: Lost executors

2014-07-23 Thread Andrew Or
Hi Eric,

Have you checked the executor logs? It is possible they died because of
some exception, and the message you see is just a side effect.

Andrew


2014-07-23 8:27 GMT-07:00 Eric Friedman :

> I'm using spark 1.0.1 on a quite large cluster, with gobs of memory, etc.
>  Cluster resources are available to me via Yarn and I am seeing these
> errors quite often.
>
> ERROR YarnClientClusterScheduler: Lost executor 63 on : remote Akka
> client disassociated
>
>
> This is in an interactive shell session.  I don't know a lot about Yarn
> plumbing and am wondering if there's some constraint in play -- executors
> can't be idle for too long or they get cleared out.
>
>
> Any insights here?
>


Lost executors

2014-07-23 Thread Eric Friedman
I'm using spark 1.0.1 on a quite large cluster, with gobs of memory, etc.
 Cluster resources are available to me via Yarn and I am seeing these
errors quite often.

ERROR YarnClientClusterScheduler: Lost executor 63 on : remote Akka
client disassociated


This is in an interactive shell session.  I don't know a lot about Yarn
plumbing and am wondering if there's some constraint in play -- executors
can't be idle for too long or they get cleared out.


Any insights here?