Hey Majd,

I believe Shark sets up cached tables so they can spill to disk, even though 
Spark's default storage level is memory-only. As for those executors, it looks 
like the data distribution across them was unbalanced, possibly due to data 
locality in HDFS (some of the executors may simply have held more of the input 
blocks). One thing you can do to prevent that is to set Spark's locality wait 
for on-disk data to 0 (spark.locality.wait.node=0 and 
spark.locality.wait.rack=0). Spark will still respect memory locality for 
cached blocks, but it won't try to optimize disk locality on HDFS.
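
As a rough sketch of where those properties go, here is how you could set them 
when building the context in a standalone Spark app (assuming the Spark 0.9+ 
SparkConf API; the app name is just a placeholder, and with Shark you would 
typically pass the same properties as Java system properties, e.g. through 
SPARK_JAVA_OPTS, when launching it):

    // Minimal sketch, assuming Spark 0.9+; names are placeholders.
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("locality-wait-example")    // placeholder app name
      .set("spark.locality.wait.node", "0")   // don't wait for node-local (on-disk) data
      .set("spark.locality.wait.rack", "0")   // don't wait for rack-local data
      // spark.locality.wait.process is left at its default, so tasks still
      // prefer executors that already hold the cached in-memory blocks.

    val sc = new SparkContext(conf)

With those waits at 0, the scheduler hands tasks to whatever executor has a 
free slot instead of waiting for a node that holds the HDFS block locally, 
which should spread the load more evenly across your workers.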

Matei

On Jan 13, 2014, at 4:24 AM, mharwida <[email protected]> wrote:

> Hi All,
> 
> I'm creating a cached table in memory via Shark using the command:
> create table tablename_cached as select * from tablename;
> 
> Monitoring this via the Spark UI, I have noticed that data is being written
> to disk even though there is clearly enough available memory on two of the
> worker nodes. Please refer to the attached images: Cass4 and Cass3 each have
> 3 GB of available memory, yet data is being written to disk on the worker
> nodes that have already used all their memory.
> 
> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n502/1.jpg> 
> 
> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n502/2.jpg> 
> 
> Could anyone shed some light on this, please?
> 
> Thanks
> Majd
> 
> 
> 
