Setting Executor memory

2015-09-14 Thread Thomas Gerber
Hello, I was looking for guidelines on what value to set executor memory to (via spark.executor.memory, for example). This seems important to avoid OOMs during tasks, especially in no-swap environments (like AWS EMR clusters). This setting is really about the executor JVM heap. Hence, in
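
A minimal sketch of setting this property, with values chosen purely for illustration (the right heap size depends on instance RAM, cores per executor, and the memory left for the OS and off-heap overhead):

    import org.apache.spark.{SparkConf, SparkContext}

    // spark.executor.memory sizes the executor JVM heap only (-Xmx);
    // memory used outside the heap is not covered by this setting.
    val conf = new SparkConf()
      .setAppName("executor-memory-example")
      .set("spark.executor.memory", "8g") // illustrative value, not a recommendation

    val sc = new SparkContext(conf)

The same value can also be passed on the command line, e.g. spark-submit --executor-memory 8g.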

Cores per executors

2015-09-09 Thread Thomas Gerber
Hello, I was wondering how Spark enforces the use of *only* X cores per executor. Is it simply running at most Y tasks in parallel on each executor, where X = Y * spark.task.cpus? (This is what I understood from browsing TaskSchedulerImpl.) Which would mean the processing power used
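
A small sketch of the relationship described above, with illustrative numbers (not from the original thread): with 8 cores offered per executor and spark.task.cpus = 2, the scheduler runs at most 8 / 2 = 4 tasks concurrently on each executor.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("cores-per-executor-example")
      .set("spark.executor.cores", "8") // X: cores offered per executor
      .set("spark.task.cpus", "2")      // cores reserved by each task

    // max concurrent tasks per executor = executor cores / spark.task.cpus = 4
    val sc = new SparkContext(conf)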

Shuffle files lifecycle

2015-06-29 Thread Thomas Gerber
Hello, It is my understanding that shuffles are written to disk and that they act as checkpoints. I wonder if this is true only within a job, or across jobs. Please note that I use the words job and stage carefully here. 1. Can a shuffle created during JobN be used to skip many stages from
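
A sketch of the stage-skipping behavior being asked about (paths and data are illustrative, and an existing SparkContext sc is assumed, e.g. in the spark-shell): two jobs that reuse the same shuffle.

    val pairs   = sc.textFile("hdfs:///tmp/input").map(w => (w, 1))
    val reduced = pairs.reduceByKey(_ + _)   // introduces a shuffle

    reduced.count()          // Job 1: runs both stages and writes the shuffle files
    reduced.map(_._2).sum()  // Job 2: the map-side stage shows up as "skipped" in the UI
                             // because its shuffle output already exists on disk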

Re: Shuffle files lifecycle

2015-06-29 Thread Thomas Gerber
Ah, for #3, maybe this is what *rdd.checkpoint* does! https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD Thomas On Mon, Jun 29, 2015 at 7:12 PM, Thomas Gerber thomas.ger...@radius.com wrote: Hello, It is my understanding that shuffles are written to disk
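
For reference, a minimal sketch of RDD checkpointing (paths are illustrative; sc is an existing SparkContext). checkpoint() is lazy: the data is only written during the next action, and the RDD's lineage is truncated afterwards.

    sc.setCheckpointDir("hdfs:///tmp/checkpoints")

    val lengths = sc.textFile("hdfs:///tmp/input").map(_.length)
    lengths.checkpoint()
    lengths.count()   // triggers the checkpoint write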

Re: Shuffle files lifecycle

2015-06-29 Thread Thomas Gerber
stages in the job UI. They are periodically cleaned up based on the available space of the configured spark.local.dirs paths. From: Thomas Gerber Date: Monday, June 29, 2015 at 10:12 PM To: user Subject: Shuffle files lifecycle Hello, It is my understanding that shuffles are written
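
A sketch of pointing that scratch space at specific disks (paths are illustrative): the property itself is spark.local.dir and accepts a comma-separated list. Note that on a standalone cluster, SPARK_LOCAL_DIRS set in spark-env.sh on each worker takes precedence over this property.

    import org.apache.spark.SparkConf

    // directories used for shuffle files and spills
    val conf = new SparkConf()
      .set("spark.local.dir", "/mnt/spark,/mnt2/spark")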

Re: GraphX - ConnectedComponents (Pregel) - longer and longer interval between jobs

2015-06-29 Thread Thomas Gerber
of this RDD. Which means that when a job uses that RDD, the DAG stops at that RDD and does not look at its parents, as it doesn't have them anymore. It is very similar to saving your RDD and re-loading it as a fresh RDD. On Fri, Jun 26, 2015 at 9:14 AM, Thomas Gerber thomas.ger...@radius.com wrote

Re: GraphX - ConnectedComponents (Pregel) - longer and longer interval between jobs

2015-06-26 Thread Thomas Gerber
Note that this problem is probably NOT caused directly by GraphX, but GraphX reveals it because as you go further down the iterations, you get further and further away from a shuffle you can rely on. On Thu, Jun 25, 2015 at 7:43 PM, Thomas Gerber thomas.ger...@radius.com wrote: Hello, We run

Re: Error communicating with MapOutputTracker

2015-05-15 Thread Thomas Gerber
On Wed, Mar 4, 2015 at 12:30 PM, Thomas Gerber thomas.ger...@radius.com wrote: I meant spark.default.parallelism of course. On Wed, Mar 4, 2015 at 10:24 AM, Thomas Gerber thomas.ger...@radius.com wrote: Follow up: We re-retried, this time after *decreasing* spark.parallelism

Re: java.lang.OutOfMemoryError: unable to create new native thread

2015-03-24 Thread Thomas Gerber
the total amount of reserved memory (not necessarily resident memory) exceeds the memory of the system, it throws an OOM. I'm looking for material to back this up. Sorry for the initial vague response. Matthew On Tue, Mar 24, 2015 at 12:53 PM, Thomas Gerber thomas.ger...@radius.com wrote
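
One knob related to that reservation, as a sketch under the assumption that thread stack reservations are the culprit (the value is illustrative): each JVM thread reserves its stack up front, so lowering -Xss reduces the total reserved virtual memory per executor.

    import org.apache.spark.SparkConf

    // smaller per-thread stacks; too small a value causes StackOverflowError
    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions", "-Xss512k")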

java.lang.OutOfMemoryError: unable to create new native thread

2015-03-24 Thread Thomas Gerber
Hello, I am seeing various crashes in Spark on large jobs which all share a similar exception: java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:714) I increased nproc (i.e. ulimit -u) 10-fold, but it

Re: java.lang.OutOfMemoryError: unable to create new native thread

2015-03-24 Thread Thomas Gerber
Additional notes: I did not find anything wrong with the number of threads (ps -u USER -L | wc -l): around 780 on the master and 400 on executors. I am running on 100 r3.2xlarge. On Tue, Mar 24, 2015 at 12:38 PM, Thomas Gerber thomas.ger...@radius.com wrote: Hello, I am seeing various crashes
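
For completeness, the same kind of check can be done from inside the JVM (a small sketch, not from the original thread), which can help compare driver and executor thread counts without shelling out:

    import java.lang.management.ManagementFactory

    val threads = ManagementFactory.getThreadMXBean
    println(s"live threads: ${threads.getThreadCount}, peak: ${threads.getPeakThreadCount}")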

Re: Driver disassociated

2015-03-05 Thread Thomas Gerber
, 1000) Cheers On Wed, Mar 4, 2015 at 4:09 PM, Thomas Gerber thomas.ger...@radius.com wrote: Also, I was experiencing another problem which might be related: Error communicating with MapOutputTracker (see email in the ML today). I just thought I would mention it in case it is relevant

Error communicating with MapOutputTracker

2015-03-04 Thread Thomas Gerber
Hello, We are using Spark 1.2.1 on a very large cluster (100 c3.8xlarge workers). We use spark-submit to start an application. We got the following error, which leads to a failed stage: Job aborted due to stage failure: Task 3095 in stage 140.0 failed 4 times, most recent failure: Lost task

Re: Error communicating with MapOutputTracker

2015-03-04 Thread Thomas Gerber
on the number of tasks it can track? On Wed, Mar 4, 2015 at 8:15 AM, Thomas Gerber thomas.ger...@radius.com wrote: Hello, We are using Spark 1.2.1 on a very large cluster (100 c3.8xlarge workers). We use spark-submit to start an application. We got the following error which leads to a failed

Spark logs in standalone clusters

2015-03-04 Thread Thomas Gerber
Hello, I was wondering where all the log files were located on a standalone cluster: 1. the executor logs are in the work directory on each slave machine (stdout/stderr) - I've noticed that GC information is in stdout, and stage information in stderr - *Could we get more
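
On that point, a sketch of the standalone-mode rolling-log settings (values are illustrative): executors write stdout/stderr under each worker's work directory, and these properties roll and cap those files so they do not grow without bound.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.logs.rolling.strategy", "size")
      .set("spark.executor.logs.rolling.maxSize", "134217728")   // 128 MB per log file
      .set("spark.executor.logs.rolling.maxRetainedFiles", "8")  // older files are deleted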

Re: Error communicating with MapOutputTracker

2015-03-04 Thread Thomas Gerber
I meant spark.default.parallelism of course. On Wed, Mar 4, 2015 at 10:24 AM, Thomas Gerber thomas.ger...@radius.com wrote: Follow up: We re-retried, this time after *decreasing* spark.parallelism. It was set to 16000 before (5 times the number of cores in our cluster). It is now down
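
A sketch of the tuning direction described here (numbers are illustrative): 100 c3.8xlarge workers give roughly 3200 cores, and the common rule of thumb of 2-3 tasks per core would put the setting well below the 16000 originally used.

    import org.apache.spark.SparkConf

    // default number of partitions for shuffles when none is given explicitly
    val conf = new SparkConf()
      .set("spark.default.parallelism", "6400")  // ~2x the total core count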

Driver disassociated

2015-03-04 Thread Thomas Gerber
Hello, sometimes, in the *middle* of a job, the job stops (status is then seen as FINISHED in the master). There isn't anything wrong in the shell/submit output. When looking at the executor logs, I see logs like this: 15/03/04 21:24:51 INFO MapOutputTrackerWorker: Doing the fetch; tracker

Re: Driver disassociated

2015-03-04 Thread Thomas Gerber
Also, I was experiencing another problem which might be related: Error communicating with MapOutputTracker (see email in the ML today). I just thought I would mention it in case it is relevant. On Wed, Mar 4, 2015 at 4:07 PM, Thomas Gerber thomas.ger...@radius.com wrote: 1.2.1 Also, I

Re: Driver disassociated

2015-03-04 Thread Thomas Gerber
. Thanks, Thomas On Wed, Mar 4, 2015 at 3:21 PM, Ted Yu yuzhih...@gmail.com wrote: What release are you using ? SPARK-3923 went into 1.2.0 release. Cheers On Wed, Mar 4, 2015 at 1:39 PM, Thomas Gerber thomas.ger...@radius.com wrote: Hello, sometimes, in the *middle* of a job, the job

Re: Executors dropping all memory stored RDDs?

2015-02-24 Thread Thomas Gerber
of disk. So, in case someone else notices a behavior like this, make sure you check your cluster monitor (like Ganglia). On Wed, Jan 28, 2015 at 5:40 PM, Thomas Gerber thomas.ger...@radius.com wrote: Hello, I am storing RDDs with the MEMORY_ONLY_SER storage level during the run of a big job
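
For reference, a minimal sketch of the storage level mentioned above (paths are illustrative; sc is an existing SparkContext): serialized in-memory caching is compact but pays a deserialization cost on access, and unpersist frees the blocks explicitly.

    import org.apache.spark.storage.StorageLevel

    val cached = sc.textFile("hdfs:///tmp/input")
      .map(_.split(","))
      .persist(StorageLevel.MEMORY_ONLY_SER)

    cached.count()      // materializes the cached blocks
    // ... reuse `cached` across several jobs ...
    cached.unpersist()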

Shuffle Spill

2015-02-20 Thread Thomas Gerber
Hello, In a stage with many tasks, a few tasks have a large amount of shuffle spill. I scouted the web to understand shuffle spill, and I did not find any simple explanation of the spill mechanism. What I put together is: 1. shuffle spill can happen when the shuffle is
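
As a hedged sketch of the Spark 1.x knobs around that behavior (the values shown are the 1.x defaults): when the in-memory map used on either side of a shuffle outgrows its share of the heap, its contents are sorted and spilled to disk, then merged back later.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.shuffle.spill", "true")          // allow spilling instead of running out of memory
      .set("spark.shuffle.memoryFraction", "0.2")  // fraction of the heap for in-memory shuffle maps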