RDD immutability

2016-01-19 Thread ddav
Hi,

Certain APIs (map, mapValues) give the developer access to the data stored
in RDDs.
Am I correct in saying that these APIs must never modify the data, but
always return a new object with a copy of the data if the data needs to be
updated for the returned RDD?

Thanks,
Dave.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-immutablility-tp26007.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: RDD immutability

2016-01-19 Thread Marco
It depends on what you mean by "write access".  The RDDs are immutable, so
you can't really change them. When you apply a mapping/filter/groupBy
function, you are creating a new RDD starting from the original one.
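The contract can be sketched without Spark itself; in this plain-Python analogy (PySpark is not assumed to be installed here), a list comprehension stands in for `rdd.map`:

```python
# Plain-Python sketch of the RDD transformation contract: a transformation
# never changes its input; it describes a new dataset derived from the old one.
original = [1, 2, 3]

# Analogous to rdd.map(lambda x: x * 2) -> a *new* RDD; `original` is untouched.
doubled = [x * 2 for x in original]

assert original == [1, 2, 3]   # the source dataset is unchanged
assert doubled == [2, 4, 6]    # the "new RDD" holds the derived values
```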

Kind regards,
Marco

2016-01-19 13:27 GMT+01:00 Dave <dave.davo...@gmail.com>:

> Hi Marco,
>
> Yes, that answers my question. I just wanted to be sure, as the API gave me
> write access to the immutable data, which means it's up to the developer to
> know not to modify the input parameters for these APIs.
>
> Thanks for the response.
> Dave.


Re: RDD immutability

2016-01-19 Thread Sean Owen
It's a good question. You can easily imagine an RDD of classes that
are mutable. Yes, if you modify these objects, the result is pretty
undefined, so don't do that.
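A minimal sketch of that pitfall, in plain Python rather than Spark (the `Record` class and the two helpers are hypothetical, for illustration only):

```python
from dataclasses import dataclass, replace

@dataclass
class Record:
    value: int

records = [Record(1), Record(2)]

# Anti-pattern: the dataset holds references to mutable objects, so a map
# function *could* mutate them in place -- but in Spark the result is
# undefined (recomputation, caching and serialization break the illusion).
def bad_increment(r: Record) -> Record:
    r.value += 1          # mutates the input element -- don't do this
    return r

# Safe pattern: treat elements as read-only and return a fresh object.
def good_increment(r: Record) -> Record:
    return replace(r, value=r.value + 1)   # copy with the field updated

updated = [good_increment(r) for r in records]
assert [r.value for r in records] == [1, 2]   # inputs untouched
assert [r.value for r in updated] == [2, 3]
```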

On Tue, Jan 19, 2016 at 12:27 PM, Dave  wrote:
> Hi Marco,
>
> Yes, that answers my question. I just wanted to be sure, as the API gave me
> write access to the immutable data, which means it's up to the developer to
> know not to modify the input parameters for these APIs.
>
> Thanks for the response.
> Dave.
>
>




Re: RDD immutability

2016-01-19 Thread Dave

Thanks Sean.

On 19/01/16 13:36, Sean Owen wrote:

It's a good question. You can easily imagine an RDD of classes that
are mutable. Yes, if you modify these objects, the result is pretty
undefined, so don't do that.








Re: RDD immutability

2016-01-19 Thread Marco
Hello,

RDDs are immutable by design. The reasons, to quote Sean Owen in this answer
(https://www.quora.com/Why-is-a-spark-RDD-immutable), are the following:

Immutability rules out a big set of potential problems due to updates from
> multiple threads at once. Immutable data is definitely safe to share across
> processes.

They're not just immutable but a deterministic function of their input.
> This plus immutability also means the RDD's parts can be recreated at any
> time. This makes caching, sharing and replication easy.
> These are significant design wins, at the cost of having to copy data
> rather than mutate it in place. Generally, that's a decent tradeoff to
> make: gaining the fault tolerance and correctness with no developer effort
> is worth spending memory and CPU on, since the latter are cheap.
> A corollary: immutable data can as easily live in memory as on disk. This
> makes it reasonable to easily move operations that hit disk to instead use
> data in memory, and again, adding memory is much easier than adding I/O
> bandwidth.
> Of course, an RDD isn't really a collection of data, but just a recipe for
> making data from other data. It is not literally computed by materializing
> every RDD completely. That is, a lot of the "copy" can be optimized away
> too.
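The "recipe for making data from other data" point above can be sketched in plain Python (used here only as an analogy, since Spark itself isn't assumed available): the lineage is just a deterministic function that can re-derive the data from its immutable input at any time.

```python
# Plain-Python analogy for RDD lineage: the "RDD" is a recipe (a function)
# for deriving data from its input, not a materialized collection.
source = [1, 2, 3]

def squared():
    # like rdd.map(lambda x: x * x): describes the derivation, owns no data
    return [x * x for x in source]

# Materialization is repeatable: a lost partition could be recomputed the
# same way, because the input is immutable and the function deterministic.
first = squared()
second = squared()
assert first == second == [1, 4, 9]
```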


I hope it answers your question.

Kind regards,
Marco

2016-01-19 13:14 GMT+01:00 ddav <dave.davo...@gmail.com>:

> Hi,
>
> Certain APIs (map, mapValues) give the developer access to the data stored
> in RDDs.
> Am I correct in saying that these APIs must never modify the data, but
> always return a new object with a copy of the data if the data needs to be
> updated for the returned RDD?
>
> Thanks,
> Dave.
>


Re: RDD immutability

2016-01-19 Thread Dave

Hi Marco,

Yes, that answers my question. I just wanted to be sure, as the API gave
me write access to the immutable data, which means it's up to the
developer to know not to modify the input parameters for these APIs.


Thanks for the response.
Dave.
