Although it feels like you are copying an RDD when you map it, it is not
necessarily literally being copied. Your map function may pass through most
objects unchanged, so there may not be as much overhead as you think.
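To make the pass-through point concrete, here is a minimal sketch that uses a plain Scala Vector as a stand-in for an RDD (an assumption for illustration only; a distributed RDD may still serialize records between stages). Record and PassThroughDemo are hypothetical names, not part of Spark.

```scala
// Hypothetical record type, used only for this illustration.
final case class Record(id: Int, value: String)

object PassThroughDemo {
  val original: Vector[Record] =
    Vector(Record(1, "a"), Record(2, "b"), Record(3, "c"))

  // Only the record with id == 2 is rebuilt; every other element
  // is returned as-is, so no new object is allocated for it.
  val updated: Vector[Record] = original.map { r =>
    if (r.id == 2) r.copy(value = "B") else r
  }

  // `eq` is reference equality on the JVM.
  val firstIsShared: Boolean = original(0) eq updated(0)   // true: same object
  val secondIsShared: Boolean = original(1) eq updated(1)  // false: rebuilt by copy
}
```

The new collection is a new container, but the unchanged elements inside it are the same objects as before.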
I don't think you can avoid a scan of the data unless you can somehow know
that
You cannot modify an RDD in mapPartitions, because RDDs are immutable.
Applying a transformation to an RDD always produces a new RDD.
If you only want to modify a fraction of the RDD, try collecting the list
of new values to the driver, or use a broadcast variable after each
iteration, not to
RDDs are immutable, so if you want to change the value of an RDD then you
have to create another RDD from it by applying some transformation.
Not sure if this is what you are looking for:
val rdd = sc.parallelize(Range(0,100))
val rdd2 = rdd.map(x => {
println("Value : " +
Hi,
I'd like to perform an operation on an RDD that changes the values of ONLY
some items, without making a full copy or a full scan of the data.
This is useful when I need to handle a large RDD and, each time, need to
change only a small fraction of the data while keeping the rest unchanged.
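
For reference, the broadcast-variable suggestion from the replies might look roughly like the sketch below. It assumes a key-value RDD of (Int, Double) pairs, a small map of updates that fits in driver memory, and a local master; SparseUpdateSketch, applyUpdates and the sample values are hypothetical names and data chosen for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SparseUpdateSketch {
  // Returns a NEW RDD; `data` itself is never modified.
  def applyUpdates(sc: SparkContext,
                   data: RDD[(Int, Double)],
                   updates: Map[Int, Double]): RDD[(Int, Double)] = {
    // Ship the small update map to every executor once.
    val updatesBc = sc.broadcast(updates)
    data.map { case kv @ (k, _) =>
      // Rebuild only the pairs that have an update; pass the rest through.
      updatesBc.value.get(k).map(nv => (k, nv)).getOrElse(kv)
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("sparse-update-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val data = sc.parallelize(0 until 100).map(i => (i, i.toDouble))
      val next = applyUpdates(sc, data, Map(3 -> -1.0, 42 -> -2.0))
      // Print only the keys that were updated.
      next.filter { case (k, _) => k == 3 || k == 42 }.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Note that this still visits every element once per iteration; it only avoids rebuilding the values that did not change.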