Here are the threads that describe the problems we're experiencing. These
problems are exacerbated when we use cache/persist:

https://www.mail-archive.com/user@spark.apache.org/msg64987.html
https://www.mail-archive.com/user@spark.apache.org/msg64986.html

So I am looking for a way to reproduce the same effect as in my sample code
without using cache().

If I use myrdd.count(), would that be a good alternative?
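For reference, a minimal sketch of why count() alone may not be equivalent (this assumes a live SparkContext `sc`; `expensiveTransform`, `op1`, and `op2` stand in for your real logic): count() is an action, so it forces the lineage to execute once, but without cache()/persist() Spark discards the computed partitions afterwards, and later actions recompute the lineage from scratch.

```scala
// Assumes a live SparkContext `sc`; expensiveTransform/op1/op2 are placeholders.
val myrdd = sc.textFile("hdfs:///path/to/input").map(expensiveTransform)

myrdd.count()                   // runs the transformations once, then discards them

val result1 = myrdd.map(op1(_)) // actions on result1 recompute the full lineage
val result2 = myrdd.map(op2(_)) // ...and actions on result2 recompute it again
```

So count() by itself would not prevent the duplicate work; it only helps if the RDD is also persisted (which is what you are trying to avoid) or if the intermediate data is written out and read back.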
thanks

________________________________
From: lucas.g...@gmail.com <lucas.g...@gmail.com>
Sent: Tuesday, August 1, 2017 11:23:04 AM
To: jeff saremi
Cc: user@spark.apache.org
Subject: Re: How can i remove the need for calling cache

Hi Jeff, that looks sane to me.  Do you have additional details?

On 1 August 2017 at 11:05, jeff saremi 
<jeffsar...@hotmail.com<mailto:jeffsar...@hotmail.com>> wrote:

Calling cache/persist fails all our jobs (I have posted 2 threads on this).

We're giving up hope of finding a solution, so I'd like to find a workaround:

If I save an RDD to hdfs and read it back, can I use it in more than one 
operation?

Example: (using cache)
// do a whole bunch of transformations on an RDD

myrdd.cache()

val result1 = myrdd.map(op1(_))

val result2 = myrdd.map(op2(_))

// In the above I am assuming that a call to cache() will prevent all previous
// transformations from being calculated twice


I'd like to somehow get result1 and result2 without duplicating work. How can I 
do that?

thanks

Jeff
