Hello,
I was looking for guidelines on what value to set executor memory to
(via spark.executor.memory for example).
This seems to be important for avoiding OOM during tasks, especially in
no-swap environments (like AWS EMR clusters).
This setting is really about the executor JVM heap. Hence, in
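Since spark.executor.memory only sizes the JVM heap, the process footprint on the box is larger. A minimal sizing sketch, assuming the YARN-style overhead rule of max(384 MB, 10% of the heap) that Spark 1.x used by default (check the configuration docs for your version):

```python
def total_executor_footprint_mb(executor_memory_mb):
    """Heap (spark.executor.memory) plus estimated off-heap overhead.

    Assumption: the max(384 MB, 10% of heap) overhead rule; your
    Spark version and cluster manager may differ.
    """
    overhead_mb = max(384, int(executor_memory_mb * 0.10))
    return executor_memory_mb + overhead_mb

# On a no-swap box, keep this under the machine's physical RAM,
# minus whatever the OS and other daemons need.
print(total_executor_footprint_mb(8192))  # 8 GB heap -> 9011 MB total
```

The point being: on a machine with no swap, budget for heap plus overhead plus OS, not just the heap.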
Hello,
I was wondering how Spark enforces using *only* X cores per
executor.
Is it simply running at most Y tasks in parallel on each executor, where X
= Y * spark.task.cpus? (This is what I understood from browsing
TaskSchedulerImpl).
Which would mean the processing power used
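The slot arithmetic described above can be sketched as follows (illustrative only, not Spark's actual scheduler code): an executor advertising X cores runs at most X / spark.task.cpus tasks concurrently.

```python
def max_concurrent_tasks(executor_cores, task_cpus=1):
    """How many tasks an executor can run in parallel, given
    spark.task.cpus (sketch of the slot accounting, not real Spark code)."""
    return executor_cores // task_cpus

print(max_concurrent_tasks(8))     # 8 one-cpu tasks in parallel
print(max_concurrent_tasks(8, 2))  # 4 two-cpu tasks in parallel
```

Note this is bookkeeping, not enforcement: nothing stops a task from spawning extra threads beyond its declared spark.task.cpus.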
Hello,
It is my understanding that shuffles are written to disk and that they act
as checkpoints.
I wonder if this is true only within a job, or across jobs. Please note
that I use the words job and stage carefully here.
1. Can a shuffle created during JobN be used to skip many stages from
Ah, for #3, maybe this is what *rdd.checkpoint* does!
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
Thomas
On Mon, Jun 29, 2015 at 7:12 PM, Thomas Gerber thomas.ger...@radius.com
wrote:
Hello,
It is my understanding that shuffles are written to disk
stages in the job UI. They are
periodically cleaned up based on the available space of the configured
spark.local.dirs paths.
From: Thomas Gerber
Date: Monday, June 29, 2015 at 10:12 PM
To: user
Subject: Shuffle files lifecycle
Hello,
It is my understanding that shuffles are written
of this RDD
Which means that when a job uses that RDD, the DAG stops at that RDD and
does not look at its parents, as it doesn't have them anymore. It is very
similar to saving your RDD and re-loading it as a fresh RDD.
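That lineage-truncation idea can be shown with a toy model (illustrative only, not Spark's RDD class): once a node is checkpointed, walking the DAG stops there instead of continuing back to its parents.

```python
class Node:
    """Toy stand-in for an RDD with a single parent (hypothetical)."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.checkpointed = False

    def lineage(self):
        """Walk ancestors until there is no parent or we hit a checkpoint."""
        chain = [self.name]
        node = self
        while node.parent is not None and not node.checkpointed:
            node = node.parent
            chain.append(node.name)
        return chain

a = Node("raw")
b = Node("parsed", parent=a)
c = Node("joined", parent=b)
print(c.lineage())   # ['joined', 'parsed', 'raw']
b.checkpointed = True
print(c.lineage())   # ['joined', 'parsed'] -- stops at the checkpoint
```

This matches the "re-loading it as a fresh RDD" intuition: after the checkpoint, recomputation never needs the original parents.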
On Fri, Jun 26, 2015 at 9:14 AM, Thomas Gerber thomas.ger...@radius.com
wrote
Note that this problem is probably NOT caused directly by GraphX, but
GraphX reveals it because, as you go further down the iterations, you get
further and further away from a shuffle you can rely on.
On Thu, Jun 25, 2015 at 7:43 PM, Thomas Gerber thomas.ger...@radius.com
wrote:
Hello,
We run
On Wed, Mar 4, 2015 at 12:30 PM, Thomas Gerber thomas.ger...@radius.com
wrote:
I meant spark.default.parallelism of course.
On Wed, Mar 4, 2015 at 10:24 AM, Thomas Gerber thomas.ger...@radius.com
wrote:
Follow up:
We re-retried, this time after *decreasing* spark.parallelism
the total amount of reserved memory
(not necessarily resident memory) exceeds the memory of the system, it
throws an OOM. I'm looking for material to back this up. Sorry for the
initial vague response.
Matthew
On Tue, Mar 24, 2015 at 12:53 PM, Thomas Gerber thomas.ger...@radius.com
wrote
Hello,
I am seeing various crashes in spark on large jobs which all share a
similar exception:
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:714)
I increased nproc (i.e. ulimit -u) 10-fold, but it
Additional notes:
I did not find anything wrong with the number of threads (ps -u USER -L |
wc -l): around 780 on the master and 400 on executors. I am running on 100
r3.2xlarge instances.
On Tue, Mar 24, 2015 at 12:38 PM, Thomas Gerber thomas.ger...@radius.com
wrote:
Hello,
I am seeing various crashes
, 1000)
Cheers
On Wed, Mar 4, 2015 at 4:09 PM, Thomas Gerber thomas.ger...@radius.com
wrote:
Also,
I was experiencing another problem which might be related:
Error communicating with MapOutputTracker (see email in the ML today).
I just thought I would mention it in case it is relevant
Hello,
We are using Spark 1.2.1 on a very large cluster (100 c3.8xlarge workers).
We use spark-submit to start an application.
We got the following error which leads to a failed stage:
Job aborted due to stage failure: Task 3095 in stage 140.0 failed 4
times, most recent failure: Lost task
on the number of tasks it can track?
On Wed, Mar 4, 2015 at 8:15 AM, Thomas Gerber thomas.ger...@radius.com
wrote:
Hello,
We are using Spark 1.2.1 on a very large cluster (100 c3.8xlarge workers).
We use spark-submit to start an application.
We got the following error which leads to a failed
Hello,
I was wondering where all the logs files were located on a standalone
cluster:
1. the executor logs are in the work directory on each slave machine
(stdout/stderr)
- I've noticed that GC information is in stdout, and stage information
in stderr
- *Could we get more
I meant spark.default.parallelism of course.
On Wed, Mar 4, 2015 at 10:24 AM, Thomas Gerber thomas.ger...@radius.com
wrote:
Follow up:
We re-retried, this time after *decreasing* spark.parallelism. It was set
to 16000 before (5 times the number of cores in our cluster). It is now
down
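For context on the numbers above: 16000 is 5 tasks per core on a 3200-core cluster, while the often-cited guideline in the Spark tuning docs is 2-3 tasks per core. A quick arithmetic sketch (the guideline is an assumption about what suits the workload, not a hard rule):

```python
def suggested_parallelism(num_workers, cores_per_worker, tasks_per_core=3):
    """Rule-of-thumb default parallelism: 2-3 tasks per core
    (assumption from the Spark tuning guide; tune for your workload)."""
    return num_workers * cores_per_worker * tasks_per_core

# 100 c3.8xlarge workers at 32 vCPUs each = 3200 cores.
print(suggested_parallelism(100, 32))  # 9600, vs. the 16000 (5x) above
```

Very high parallelism inflates per-stage scheduling and shuffle bookkeeping, which is presumably why decreasing it was worth trying here.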
Hello,
Sometimes, in the *middle* of a job, the job stops (its status is then seen
as FINISHED in the master).
There isn't anything wrong in the shell/submit output.
When looking at the executor logs, I see logs like this:
15/03/04 21:24:51 INFO MapOutputTrackerWorker: Doing the fetch; tracker
Also,
I was experiencing another problem which might be related:
Error communicating with MapOutputTracker (see email in the ML today).
I just thought I would mention it in case it is relevant.
On Wed, Mar 4, 2015 at 4:07 PM, Thomas Gerber thomas.ger...@radius.com
wrote:
1.2.1
Also, I
.
Thanks,
Thomas
On Wed, Mar 4, 2015 at 3:21 PM, Ted Yu yuzhih...@gmail.com wrote:
What release are you using ?
SPARK-3923 went into 1.2.0 release.
Cheers
On Wed, Mar 4, 2015 at 1:39 PM, Thomas Gerber thomas.ger...@radius.com
wrote:
Hello,
sometimes, in the *middle* of a job, the job
of disk.
So, in case someone else notices a behavior like this, make sure you check
your cluster monitor (like ganglia).
On Wed, Jan 28, 2015 at 5:40 PM, Thomas Gerber thomas.ger...@radius.com
wrote:
Hello,
I am storing RDDs with the MEMORY_ONLY_SER storage level during the run
of a big job
Hello,
I have a few tasks in a stage with lots of tasks that have a large amount
of shuffle spill.
I scouted the web to understand shuffle spill, and I did not find any
simple explanation of the spill mechanism. What I put together is:
1. The shuffle spill can happen when the shuffle is
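The spill mechanism pieced together above can be illustrated with a toy model (illustrative only, not Spark's ExternalSorter): records are buffered in memory until the buffer would exceed its budget, then the buffer is sorted and flushed to a spill file on disk.

```python
def aggregate_with_spill(records, memory_budget):
    """Toy spill loop: buffer records in memory, flush a sorted run to
    'disk' whenever the buffer hits its budget (hypothetical helper,
    budget counted in records instead of bytes for simplicity)."""
    buffer, spills = [], []
    for rec in records:
        if len(buffer) >= memory_budget:
            spills.append(sorted(buffer))  # flush one sorted run to disk
            buffer = []
        buffer.append(rec)
    if buffer:
        spills.append(sorted(buffer))      # final flush
    return spills

spills = aggregate_with_spill(range(10), memory_budget=4)
print(len(spills))  # 3 sorted runs: sizes 4, 4, 2
```

In real Spark the spilled runs are later merged, which is why lots of small spills (disk I/O plus merge passes) show up as the "shuffle spill" cost in the UI.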