change default storage level

2015-07-09 Thread Michal Čizmazia
Is there a way how to change the default storage level?

If not, how can I properly change the storage level wherever necessary, if
my input and intermediate results do not fit into memory?

In this example:

context.wholeTextFiles(...)
.flatMap(s - ...)
.flatMap(s - ...)

Does persist() need to be called after every transformation?

 context.wholeTextFiles(...)
.persist(StorageLevel.MEMORY_AND_DISK)
.flatMap(s - ...)
.persist(StorageLevel.MEMORY_AND_DISK)
.flatMap(s - ...)
.persist(StorageLevel.MEMORY_AND_DISK)

 Thanks!


Re: change default storage level

2015-07-09 Thread Shixiong Zhu
Spark won't store RDDs to memory unless you use a memory StorageLevel. By
default, your input and intermediate results won't be put into memory. You
can call persist if you want to avoid duplicate computation or reading.
E.g.,

val r1 = context.wholeTextFiles(...)
val r2 = r1.flatMap(s - ...)
val r3 = r2.filter(...)...
r3.saveAsTextFile(...)
val r4 = r2.map(...)...
r4.saveAsTextFile(...)

In the avoid example, r2 will be used twice. To speed up the computation,
you can call r2.persist(StorageLevel.MEMORY) to store r2 into memory. Then
r4 will use the data of r2 in memory directly. E.g.,

val r1 = context.wholeTextFiles(...)
val r2 = r1.flatMap(s - ...)
r2.persist(StorageLevel.MEMORY)
val r3 = r2.filter(...)...
r3.saveAsTextFile(...)
val r4 = r2.map(...)...
r4.saveAsTextFile(...)

See
http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence


Best Regards,
Shixiong Zhu

2015-07-09 22:09 GMT+08:00 Michal Čizmazia mici...@gmail.com:

 Is there a way how to change the default storage level?

 If not, how can I properly change the storage level wherever necessary, if
 my input and intermediate results do not fit into memory?

 In this example:

 context.wholeTextFiles(...)
 .flatMap(s - ...)
 .flatMap(s - ...)

 Does persist() need to be called after every transformation?

  context.wholeTextFiles(...)
 .persist(StorageLevel.MEMORY_AND_DISK)
 .flatMap(s - ...)
 .persist(StorageLevel.MEMORY_AND_DISK)
 .flatMap(s - ...)
 .persist(StorageLevel.MEMORY_AND_DISK)

  Thanks!




Re: change default storage level

2015-07-09 Thread Michal Čizmazia
Thanks Shixiong! Your response helped me to understand the role of
persist(). No persist() calls were required indeed. I solved my problem by
setting spark.local.dir to allow more space for Spark temporary folder. It
works automatically. I am seeing logs like this:

Not enough space to cache rdd_0_1 in memory!
Persisting partition rdd_0_1 to disk instead.

Before I was getting:

No space left on device


On 9 July 2015 at 11:57, Shixiong Zhu zsxw...@gmail.com wrote:

 Spark won't store RDDs to memory unless you use a memory StorageLevel. By
 default, your input and intermediate results won't be put into memory. You
 can call persist if you want to avoid duplicate computation or reading.
 E.g.,

 val r1 = context.wholeTextFiles(...)
 val r2 = r1.flatMap(s - ...)
 val r3 = r2.filter(...)...
 r3.saveAsTextFile(...)
 val r4 = r2.map(...)...
 r4.saveAsTextFile(...)

 In the avoid example, r2 will be used twice. To speed up the computation,
 you can call r2.persist(StorageLevel.MEMORY) to store r2 into memory. Then
 r4 will use the data of r2 in memory directly. E.g.,

 val r1 = context.wholeTextFiles(...)
 val r2 = r1.flatMap(s - ...)
 r2.persist(StorageLevel.MEMORY)
 val r3 = r2.filter(...)...
 r3.saveAsTextFile(...)
 val r4 = r2.map(...)...
 r4.saveAsTextFile(...)

 See
 http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence


 Best Regards,
 Shixiong Zhu

 2015-07-09 22:09 GMT+08:00 Michal Čizmazia mici...@gmail.com:

 Is there a way how to change the default storage level?

 If not, how can I properly change the storage level wherever necessary,
 if my input and intermediate results do not fit into memory?

 In this example:

 context.wholeTextFiles(...)
 .flatMap(s - ...)
 .flatMap(s - ...)

 Does persist() need to be called after every transformation?

  context.wholeTextFiles(...)
 .persist(StorageLevel.MEMORY_AND_DISK)
 .flatMap(s - ...)
 .persist(StorageLevel.MEMORY_AND_DISK)
 .flatMap(s - ...)
 .persist(StorageLevel.MEMORY_AND_DISK)

  Thanks!