Re: 1TB shuffle failed with executor lost failure

2016-09-19 Thread Divya Gehlot
Exit code 52 comes from org.apache.spark.util.SparkExitCode, where it is
defined as val OOM = 52, i.e. the executor died with an OutOfMemoryError.
Refer to
https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/main/scala/org/apache/spark/util/SparkExitCode.scala
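
For quick triage, the relevant codes can be captured in a small lookup table
(a sketch; only a couple of the codes from the linked file are shown, and the
descriptions are paraphrased):

```python
# Map YARN container exit statuses to their SparkExitCode meaning.
# Values taken from org.apache.spark.util.SparkExitCode; this table is
# illustrative, not exhaustive.
SPARK_EXIT_CODES = {
    50: "UNCAUGHT_EXCEPTION (executor died with an uncaught exception)",
    52: "OOM (executor ran out of memory)",
}

def describe_exit_status(status):
    """Return a human-readable description of a Spark executor exit status."""
    return SPARK_EXIT_CODES.get(status, "unknown exit status %d" % status)
```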



On 19 September 2016 at 14:57, Cyanny LIANG  wrote:

> My job is a 1 TB join with a 10 GB table on Spark 1.6.1, run in YARN mode.
> [rest of quoted message trimmed; the full text appears in the original post
> below]
>


1TB shuffle failed with executor lost failure

2016-09-19 Thread Cyanny LIANG
My job is a 1 TB join with a 10 GB table on Spark 1.6.1, run in YARN mode:

*1. If I enable the external shuffle service, the error is:*
Job aborted due to stage failure: ShuffleMapStage 2 (writeToDirectory at
NativeMethodAccessorImpl.java:-2) has failed the maximum allowable number
of times: 4. Most recent failure reason:
org.apache.spark.shuffle.FetchFailedException: java.lang.RuntimeException:
Executor is not registered (appId=application_1473819702737_1239, execId=52)
at
org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:105)
at
org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:74)
at
org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:114)
at
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:87)
at
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:101)

*2. If I disable the shuffle service* and *set spark.executor.instances 80*,
the error is:
ExecutorLostFailure (executor 71 exited caused by one of the running tasks)
Reason: Container marked as failed:
container_1473819702737_1432_01_406847560 on host:
nmg01-spark-a0021.nmg01.baidu.com. Exit status: 52. Diagnostics: Exception
from container-launch: ExitCodeException exitCode=52:
ExitCodeException exitCode=52:

These errors occur during the shuffle stage.
My data is skewed: some ids have 400 million rows while others have only
1 million. Does anybody have ideas on how to solve this?
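
One common mitigation for this kind of key skew is to salt the hot keys
before the join: append a random suffix on the large side and replicate the
small side once per suffix. A plain-Python sketch of the idea (the function
names are made up for illustration; this is not Spark API):

```python
import random

NUM_SALTS = 4  # how many ways to split each hot key

def salt_big_side(key):
    """On the 1 TB side, spread a hot key across NUM_SALTS shuffle partitions."""
    return "%s#%d" % (key, random.randrange(NUM_SALTS))

def explode_small_side(key, value):
    """On the 10 GB side, replicate the row once per salt so every salted
    big-side key still finds its match after the join."""
    return [("%s#%d" % (key, i), value) for i in range(NUM_SALTS)]

pairs = explode_small_side("id42", "dim_row")
```

After the salted join, the suffix is stripped again; the cost is replicating
the small side NUM_SALTS times, which is usually cheap next to a skewed shuffle.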


*3. My config is:*
I use tungsten-sort in off-heap mode; in on-heap mode the OOM problem is
even more serious.

spark.driver.cores 4
spark.driver.memory 8g

# used in client mode
spark.yarn.am.memory 8g
spark.yarn.am.cores 4

spark.executor.memory 8g
spark.executor.cores 4
spark.yarn.executor.memoryOverhead 6144

spark.memory.offHeap.enabled true
spark.memory.offHeap.size 40
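
One thing worth noting in the config above: spark.memory.offHeap.size is
specified in bytes, so a value of 40 gives the executors essentially no
off-heap memory. A sketch of the same settings as spark-submit flags, with an
illustrative off-heap size (the values are examples, not recommendations):

```shell
spark-submit \
  --master yarn \
  --conf spark.executor.memory=8g \
  --conf spark.executor.cores=4 \
  --conf spark.yarn.executor.memoryOverhead=6144 \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=4294967296 \
  your-app.jar
```

(4294967296 bytes = 4 GB.)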

Best & Regards
Cyanny LIANG


Re: Executor Lost Failure

2015-09-29 Thread Nithin Asokan
Try increasing executor memory (--conf spark.executor.memory=3g or
--executor-memory 3g). Here is something I noted in your logs:

15/09/29 06:32:03 WARN MemoryStore: Failed to reserve initial memory
threshold of 1024.0 KB for computing block rdd_2_1813 in memory.
15/09/29 06:32:03 WARN MemoryStore: Not enough space to cache
rdd_2_1813 in memory!
(computed 840.0 B so far)

On Tue, Sep 29, 2015 at 11:02 AM Anup Sawant 
wrote:

> Hi all,
> Any idea why I am getting 'Executor heartbeat timed out'? I am fairly new
> to Spark, so I don't know much about its internals. The job had been
> running for a day or so on 102 GB of data with 40 workers.
> -Best,
> Anup.
>
> 15/09/29 06:32:03 ERROR TaskSchedulerImpl: Lost executor driver on
> localhost: Executor heartbeat timed out after 395987 ms
> [rest of quoted log trimmed; the full log appears in the original post below]
>

Re: Executor Lost Failure

2015-09-29 Thread Ted Yu
Can you list the spark-submit command line you used?

Thanks

On Tue, Sep 29, 2015 at 9:02 AM, Anup Sawant 
wrote:

> Hi all,
> Any idea why I am getting 'Executor heartbeat timed out'? I am fairly new
> to Spark, so I don't know much about its internals. The job had been
> running for a day or so on 102 GB of data with 40 workers.
> -Best,
> Anup.
>
> 15/09/29 06:32:03 ERROR TaskSchedulerImpl: Lost executor driver on
> localhost: Executor heartbeat timed out after 395987 ms
> [rest of quoted log trimmed; the full log appears in the original post below]

Executor Lost Failure

2015-09-29 Thread Anup Sawant
Hi all,
Any idea why I am getting 'Executor heartbeat timed out'? I am fairly new
to Spark, so I don't know much about its internals. The job had been
running for a day or so on 102 GB of data with 40 workers.
-Best,
Anup.

15/09/29 06:32:03 ERROR TaskSchedulerImpl: Lost executor driver on
localhost: Executor heartbeat timed out after 395987 ms
15/09/29 06:32:03 WARN MemoryStore: Failed to reserve initial memory
threshold of 1024.0 KB for computing block rdd_2_1813 in memory.
15/09/29 06:32:03 WARN MemoryStore: Not enough space to cache rdd_2_1813 in
memory! (computed 840.0 B so far)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1782.0 in stage 2713.0
(TID 9101184, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 ERROR TaskSetManager: Task 1782 in stage 2713.0 failed 1
times; aborting job
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1791.0 in stage 2713.0
(TID 9101193, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1800.0 in stage 2713.0
(TID 9101202, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1764.0 in stage 2713.0
(TID 9101166, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1773.0 in stage 2713.0
(TID 9101175, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1809.0 in stage 2713.0
(TID 9101211, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1794.0 in stage 2713.0
(TID 9101196, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1740.0 in stage 2713.0
(TID 9101142, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1803.0 in stage 2713.0
(TID 9101205, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1812.0 in stage 2713.0
(TID 9101214, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1785.0 in stage 2713.0
(TID 9101187, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1767.0 in stage 2713.0
(TID 9101169, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1776.0 in stage 2713.0
(TID 9101178, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1797.0 in stage 2713.0
(TID 9101199, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1779.0 in stage 2713.0
(TID 9101181, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1806.0 in stage 2713.0
(TID 9101208, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1788.0 in stage 2713.0
(TID 9101190, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1761.0 in stage 2713.0
(TID 9101163, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1755.0 in stage 2713.0
(TID 9101157, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1796.0 in stage 2713.0
(TID 9101198, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1778.0 in stage 2713.0
(TID 9101180, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1787.0 in stage 2713.0
(TID 9101189, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1805.0 in stage 2713.0
(TID 9101207, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1790.0 in stage 2713.0
(TID 9101192, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1781.0 in stage 2713.0
(TID 9101183, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1808.0 in stage 2713.0
(TID 9101210, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1799.0 in stage 2713.0
(TID 9101201, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1772.0 in stage 2713.0
(TID 9101174, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1763.0 in stage 2713.0
(TID 9101165, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1802.0 in stage 2713.0
(TID 9101204, localhost): ExecutorLostFailure (executor driver lost)
15/09/29 06:32:03 WARN TaskSetManager: Lost task 1748.0 in stage 2713.0
(TID 

Re: foreachRDD causing executor lost failure

2015-09-09 Thread Akhil Das
If you look a bit in the executor logs, you will see the exact reason
(most often an OOM, GC pressure, etc.). Instead of using foreach, try
mapPartitions or foreachPartition.

Thanks
Best Regards

On Tue, Sep 8, 2015 at 10:45 PM, Priya Ch <learnings.chitt...@gmail.com>
wrote:

> Hello All,
>
>  I am using foreachRDD in my code as -
>
>   dstream.foreachRDD { rdd =>
>     rdd.foreach { record =>
>       // look up the record in the Cassandra table
>       // save updated rows to the Cassandra table
>     }
>   }
>
> This foreachRDD is causing executor lost failure. What is the behavior of
> this foreachRDD?
>
> Thanks,
> Padma Ch
>
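
The per-partition pattern Akhil suggests amortizes per-record setup (such as
opening a database connection) across a whole partition. A plain-Python
sketch of the idea (foreach_partition and the fake connection here are
stand-ins, not Spark or Cassandra APIs):

```python
def foreach_partition(partitions, handle_partition):
    """Mimic RDD.foreachPartition: invoke the handler once per partition,
    passing an iterator over that partition's records."""
    for partition in partitions:
        handle_partition(iter(partition))

saved = []

def save_partition(records):
    connection = "open"  # one (fake) connection per partition, not per record
    for record in records:
        saved.append(record)  # look up / save via the shared connection
    # connection would be closed here, once per partition

foreach_partition([["a", "b"], ["c"]], save_partition)
```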


foreachRDD causing executor lost failure

2015-09-08 Thread Priya Ch
Hello All,

 I am using foreachRDD in my code as:

   dstream.foreachRDD { rdd =>
     rdd.foreach { record =>
       // look up the record in the Cassandra table
       // save updated rows to the Cassandra table
     }
   }

 This foreachRDD is causing executor lost failure. What is the behavior of
 this foreachRDD?

Thanks,
Padma Ch


Re: Executor lost failure

2015-09-01 Thread Andrew Duffy
If you're using YARN with Spark 1.3.1, you could be running into
https://issues.apache.org/jira/browse/SPARK-8119, although without more
information it's impossible to know.

On Tue, Sep 1, 2015 at 11:28 AM, Priya Ch <learnings.chitt...@gmail.com>
wrote:

> Hi All,
>
> I have a Spark Streaming application which writes processed results to
> Cassandra. In local mode the code seems to work fine, but the moment I
> start running in distributed mode on YARN, I see executor lost failures.
> I increased executor memory to occupy the entire node's memory, around
> 12 GB, but I still see the same issue.
>
> What could be the possible causes of executor lost failure?
>


Executor lost failure

2015-09-01 Thread Priya Ch
Hi All,

 I have a Spark Streaming application which writes processed results to
Cassandra. In local mode the code seems to work fine, but the moment I start
running in distributed mode on YARN, I see executor lost failures. I
increased executor memory to occupy the entire node's memory, around 12 GB,
but I still see the same issue.

What could be the possible causes of executor lost failure?


Re: Fwd: Executor Lost Failure

2014-11-11 Thread Ritesh Kumar Singh
Yes, I found the output in the web UI of the slave.

Thanks :)

On Tue, Nov 11, 2014 at 2:48 AM, Ankur Dave ankurd...@gmail.com wrote:

 At 2014-11-10 22:53:49 +0530, Ritesh Kumar Singh 
 riteshoneinamill...@gmail.com wrote:
 [quoted message trimmed; Ankur's full reply appears in the thread entry below]



Re: Executor Lost Failure

2014-11-10 Thread Ritesh Kumar Singh
On Mon, Nov 10, 2014 at 10:52 PM, Ritesh Kumar Singh 
riteshoneinamill...@gmail.com wrote:

 Tasks are now getting submitted, but some of them don't seem to do anything.
 For example, after opening the spark-shell, I load a text file from disk and
 try printing its contents as:

 sc.textFile("/path/to/file").foreach(println)

 It does not give me any output. While running this:

 sc.textFile("/path/to/file").count

 gives me the right number of lines in the text file.
 Not sure what the error is, but here is the console output for the print
 case:

 14/11/10 22:48:02 INFO MemoryStore: ensureFreeSpace(215230) called with
 curMem=709528, maxMem=463837593
 14/11/10 22:48:02 INFO MemoryStore: Block broadcast_6 stored as values in
 memory (estimated size 210.2 KB, free 441.5 MB)
 14/11/10 22:48:02 INFO MemoryStore: ensureFreeSpace(17239) called with
 curMem=924758, maxMem=463837593
 14/11/10 22:48:02 INFO MemoryStore: Block broadcast_6_piece0 stored as
 bytes in memory (estimated size 16.8 KB, free 441.5 MB)
 14/11/10 22:48:02 INFO BlockManagerInfo: Added broadcast_6_piece0 in
 memory on gonephishing.local:42648 (size: 16.8 KB, free: 442.3 MB)
 14/11/10 22:48:02 INFO BlockManagerMaster: Updated info of block
 broadcast_6_piece0
 14/11/10 22:48:02 INFO FileInputFormat: Total input paths to process : 1
 14/11/10 22:48:02 INFO SparkContext: Starting job: foreach at console:13
 14/11/10 22:48:02 INFO DAGScheduler: Got job 3 (foreach at console:13)
 with 2 output partitions (allowLocal=false)
 14/11/10 22:48:02 INFO DAGScheduler: Final stage: Stage 3(foreach at
 console:13)
 14/11/10 22:48:02 INFO DAGScheduler: Parents of final stage: List()
 14/11/10 22:48:02 INFO DAGScheduler: Missing parents: List()
 14/11/10 22:48:02 INFO DAGScheduler: Submitting Stage 3 (Desktop/mnd.txt
 MappedRDD[7] at textFile at console:13), which has no missing parents
 14/11/10 22:48:02 INFO MemoryStore: ensureFreeSpace(2504) called with
 curMem=941997, maxMem=463837593
 14/11/10 22:48:02 INFO MemoryStore: Block broadcast_7 stored as values in
 memory (estimated size 2.4 KB, free 441.4 MB)
 14/11/10 22:48:02 INFO MemoryStore: ensureFreeSpace(1602) called with
 curMem=944501, maxMem=463837593
 14/11/10 22:48:02 INFO MemoryStore: Block broadcast_7_piece0 stored as
 bytes in memory (estimated size 1602.0 B, free 441.4 MB)
 14/11/10 22:48:02 INFO BlockManagerInfo: Added broadcast_7_piece0 in
 memory on gonephishing.local:42648 (size: 1602.0 B, free: 442.3 MB)
 14/11/10 22:48:02 INFO BlockManagerMaster: Updated info of block
 broadcast_7_piece0
 14/11/10 22:48:02 INFO DAGScheduler: Submitting 2 missing tasks from Stage
 3 (Desktop/mnd.txt MappedRDD[7] at textFile at console:13)
 14/11/10 22:48:02 INFO TaskSchedulerImpl: Adding task set 3.0 with 2 tasks
 14/11/10 22:48:02 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID
 6, gonephishing.local, PROCESS_LOCAL, 1216 bytes)
 14/11/10 22:48:02 INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID
 7, gonephishing.local, PROCESS_LOCAL, 1216 bytes)
 14/11/10 22:48:02 INFO BlockManagerInfo: Added broadcast_7_piece0 in
 memory on gonephishing.local:48857 (size: 1602.0 B, free: 442.3 MB)
 14/11/10 22:48:02 INFO BlockManagerInfo: Added broadcast_6_piece0 in
 memory on gonephishing.local:48857 (size: 16.8 KB, free: 442.3 MB)
 14/11/10 22:48:02 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID
 6) in 308 ms on gonephishing.local (1/2)
 14/11/10 22:48:02 INFO DAGScheduler: Stage 3 (foreach at console:13)
 finished in 0.321 s
 14/11/10 22:48:02 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID
 7) in 315 ms on gonephishing.local (2/2)
 14/11/10 22:48:02 INFO SparkContext: Job finished: foreach at
 console:13, took 0.376602079 s
 14/11/10 22:48:02 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks
 have all completed, from pool

 ===



 On Mon, Nov 10, 2014 at 8:01 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Try adding the following configurations as well; they might work.

  spark.rdd.compress true

   spark.storage.memoryFraction 1
   spark.core.connection.ack.wait.timeout 600
   spark.akka.frameSize 50

 Thanks
 Best Regards

 On Mon, Nov 10, 2014 at 6:51 PM, Ritesh Kumar Singh 
 riteshoneinamill...@gmail.com wrote:

 Hi,

 I am trying to submit my application using spark-submit, with the following
 spark-defaults.conf params:

 spark.master spark://master-ip:7077
 spark.eventLog.enabled   true
 spark.serializer
 org.apache.spark.serializer.KryoSerializer
 spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value
 -Dnumbers=one two three

 ===
 But every time I am getting this error:

 14/11/10 18:39:17 ERROR TaskSchedulerImpl: Lost executor 1 on aa.local:
 remote Akka client disassociated
 14/11/10 18:39:17 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID
 1, aa.local): ExecutorLostFailure (executor lost)
 14/11/10 18:39:17 
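
For reference, the tuning settings Akhil suggests in the exchange above go
into conf/spark-defaults.conf. A sketch (values copied from the thread;
spark.akka.* and spark.storage.memoryFraction are legacy Spark 1.x settings
that were deprecated or removed in later releases):

```shell
cat >> conf/spark-defaults.conf <<'EOF'
spark.rdd.compress                     true
spark.storage.memoryFraction           1
spark.core.connection.ack.wait.timeout 600
spark.akka.frameSize                   50
EOF
```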

Fwd: Executor Lost Failure

2014-11-10 Thread Ritesh Kumar Singh
-- Forwarded message --
From: Ritesh Kumar Singh riteshoneinamill...@gmail.com
Date: Mon, Nov 10, 2014 at 10:52 PM
Subject: Re: Executor Lost Failure
To: Akhil Das ak...@sigmoidanalytics.com


Tasks are now getting submitted, but some of them don't seem to do anything.
For example, after opening the spark-shell, I load a text file from disk and
try printing its contents as:

sc.textFile("/path/to/file").foreach(println)

It does not give me any output. While running this:

sc.textFile("/path/to/file").count

gives me the right number of lines in the text file.
Not sure what the error is, but here is the console output for the print
case:

[console output and quoted exchange trimmed; identical to the reply
"Re: Executor Lost Failure" above]

Re: Fwd: Executor Lost Failure

2014-11-10 Thread Ankur Dave
At 2014-11-10 22:53:49 +0530, Ritesh Kumar Singh 
riteshoneinamill...@gmail.com wrote:
 Tasks are now getting submitted, but some of them don't seem to do anything.
 Like, after opening the spark-shell, I load a text file from disk and try
 printing its contents as:

sc.textFile("/path/to/file").foreach(println)

 It does not give me any output.

That's because foreach launches tasks on the slaves. When each task tries to 
print its lines, they go to the stdout file on the slave rather than to your 
console at the driver. You should see the file's contents in each of the 
slaves' stdout files in the web UI.

This only happens when running on a cluster. In local mode, all the tasks are 
running locally and can output to the driver, so foreach(println) is more 
useful.

Ankur
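
To see a sample of the file at the driver instead, the usual idiom is
rdd.take(n) (or collect() for small data), which pulls records back to the
driver before printing. A plain-Python sketch of what take does across
partitions:

```python
def take(partitions, n):
    """Mimic RDD.take: pull at most n records from the partitions back to
    the driver, scanning partitions in order."""
    out = []
    for partition in partitions:
        for record in partition:
            out.append(record)
            if len(out) == n:
                return out
    return out

# In Spark this would be roughly:
#   sc.textFile("/path/to/file").take(2).foreach(println)
first_lines = take([["line1", "line2"], ["line3"]], 2)
```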

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org