RE: Intermediate stage will be cached automatically?

2015-06-17 Thread Mark Tse
I think 
https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence 
might shed some light on the behaviour you’re seeing.

Mark

From: canan chen [mailto:ccn...@gmail.com]
Sent: June-17-15 5:57 AM
To: spark users
Subject: Intermediate stage will be cached automatically?

Here's a simple Spark example in which I call RDD#count twice. The first 
call invokes 2 stages, but the second needs only 1 stage. It seems the first 
stage is cached. Is that true? Is there any flag I can use to control 
whether the intermediate stage is cached?

val data = sc.parallelize(1 to 10, 2).map(e => (e % 2, 2)).reduceByKey(_ + _, 2)
println(data.count())
println(data.count())


Intermediate stage will be cached automatically?

2015-06-17 Thread canan chen
Here's a simple Spark example in which I call RDD#count twice. The first
call invokes 2 stages, but the second needs only 1 stage. It seems the first
stage is cached. Is that true? Is there any flag I can use to control
whether the intermediate stage is cached?


val data = sc.parallelize(1 to 10, 2).map(e => (e % 2, 2)).reduceByKey(_ + _, 2)
println(data.count())
println(data.count())


Re: Intermediate stage will be cached automatically?

2015-06-17 Thread Eugen Cepoi
Cache is more general. reduceByKey involves a shuffle step where the data
will be kept in memory and on disk (for whatever doesn't fit in memory). The
shuffle files remain around until the end of the job. The blocks in memory
will be dropped if the memory is needed for other things. This is an
optimisation so that other RDDs that depend on the result of this shuffle
don't have to go through the whole chain again; they just fetch the shuffle
blocks from memory/disk.
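
A minimal sketch of what this buys you (a hypothetical example, not from the
original mail; the summed name is made up): a second job that depends on the
same shuffle only runs the post-shuffle stage, and the UI shows the earlier
stage as skipped.

val summed = sc.parallelize(1 to 10, 2)
  .map(e => (e % 2, 2))
  .reduceByKey(_ + _, 2)

summed.count()                    // 2 stages: the map stage writes shuffle files
summed.mapValues(_ * 10).count()  // map stage skipped: the shuffle blocks are
                                  // fetched from memory/disk instead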

Calling cache in this example gives nearly the same result (I guess there
are some implementation-specific differences). But if there weren't a
shuffle step, cache would explicitly persist this dataset, though not on
disk unless you tell it to.
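
For reference, a short sketch of the explicit options (assuming the data RDD
from the example; StorageLevel is Spark's standard storage-level class):

import org.apache.spark.storage.StorageLevel

// An RDD's storage level can only be set once, so pick one of these:
data.cache()                                // shorthand for persist(MEMORY_ONLY)
data.persist(StorageLevel.MEMORY_AND_DISK)  // spills to disk only if you ask for it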

Eugen

2015-06-17 15:10 GMT+02:00 canan chen ccn...@gmail.com:

 Yes, actually on the Storage UI there's no data cached. But the behavior
 confuses me. If I call the cache method as follows, the behavior is the
 same as without calling the cache method. Why is that?


 val data = sc.parallelize(1 to 10, 2).map(e => (e % 2, 2)).reduceByKey(_ + _, 2)
 data.cache()
 println(data.count())
 println(data.count())





Re: Intermediate stage will be cached automatically?

2015-06-17 Thread canan chen
Yes, actually on the Storage UI there's no data cached. But the behavior
confuses me. If I call the cache method as follows, the behavior is the
same as without calling the cache method. Why is that?


val data = sc.parallelize(1 to 10, 2).map(e => (e % 2, 2)).reduceByKey(_ + _, 2)
data.cache()
println(data.count())
println(data.count())
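
One way to see what cache() does change here (a sketch assuming the shell
session above; getRDDStorageInfo is a developer API on SparkContext): the
stage count stays the same, because the shuffle already lets the second
count skip the first stage, but the cached blocks now show up under Storage.

data.cache()
data.count()                          // still 2 stages; blocks stored on first use
println(sc.getRDDStorageInfo.length)  // 1: the cached RDD is now listed in Storage
data.count()                          // 1 stage, served from the cached blocks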



On Wed, Jun 17, 2015 at 8:45 PM, ayan guha guha.a...@gmail.com wrote:

 It's not cached per se. For example, you will not see this in the Storage
 tab in the UI. However, Spark has read the data and it's in memory right
 now, so the next count call should be very fast.


 Best
 Ayan





 --
 Best Regards,
 Ayan Guha



Re: Intermediate stage will be cached automatically?

2015-06-17 Thread ayan guha
It's not cached per se. For example, you will not see this in the Storage
tab in the UI. However, Spark has read the data and it's in memory right
now, so the next count call should be very fast.
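
A quick way to check this from the shell (a sketch; the time helper below is
not part of Spark, just a throwaway utility):

def time[T](body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(s"took ${(System.nanoTime() - start) / 1e6} ms")
  result
}

time(data.count())  // first count: runs both stages
time(data.count())  // second count: reuses the shuffle output, noticeably faster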


Best
Ayan

On Wed, Jun 17, 2015 at 10:21 PM, Mark Tse mark@d2l.com wrote:

  I think
 https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
 might shed some light on the behaviour you’re seeing.



 Mark







-- 
Best Regards,
Ayan Guha