Hi,
I wanted to get some advice on tuning a Spark application.
For some tasks I see many log entries like this, especially when the inputs are large:

Executor task launch worker-38 ExternalAppendOnlyMap: Thread 239 spilling in-memory map of 5.1 MB to disk (272 times so far)
I understand that this is related to shuffles and joins: data is spilled to disk to prevent OOM errors. What is the recommended way to handle this situation, i.e. how can I "fix" it - increase parallelism? add memory to the cluster? what else? Any ideas would be welcome.
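
For context, these are the kinds of knobs I had in mind (the values below are pure guesses, and left/right just stand for the two sides of one of the joins):

import org.apache.spark.{SparkConf, SparkContext}

// try raising default parallelism and executor memory (numbers are guesses)
val conf = new SparkConf()
  .setAppName("daily-aggregation")
  .set("spark.default.parallelism", "400")
  .set("spark.executor.memory", "8g")    // or pass --executor-memory to spark-submit
val sc = new SparkContext(conf)

// or request more partitions on a specific join
val joined = left.fullOuterJoin(right, 400)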

In general, my app reads N key-value files and iteratively fullOuterJoin-s them (essentially folding them together with fullOuterJoin). Each key is a user id and each value is the aggregated statistics for that user, represented by a simple object. The N files go N days back, so to compute the aggregation for today I can "combine" the daily aggregations.
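
A simplified sketch of what the fold looks like (sc is the usual SparkContext; Stats, its merge method and the file paths are made-up stand-ins for my real aggregation object):

import org.apache.spark.rdd.RDD

// stand-in for the per-user aggregated statistics object
case class Stats(count: Long, sum: Double) {
  def merge(other: Stats): Stats = Stats(count + other.count, sum + other.sum)
}

val N = 30
val paths = (1 to N).map(d => s"/data/stats/day_$d")     // made-up paths, one file per day
val dailyRdds: Seq[RDD[(String, Stats)]] =
  paths.map(p => sc.objectFile[(String, Stats)](p))      // one RDD[(userId, Stats)] per day

// fold the daily aggregations together with fullOuterJoin
val combined: RDD[(String, Stats)] =
  dailyRdds.reduce { (acc, day) =>
    acc.fullOuterJoin(day).mapValues {
      case (Some(a), Some(b)) => a.merge(b)
      case (Some(a), None)    => a
      case (None, Some(b))    => b
      case (None, None)       => Stats(0L, 0.0)           // cannot happen, keeps the match exhaustive
    }
  }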
Thanks in advance,
Igor


