For example,

val originalRDD: RDD[SomeCaseClass] = ...

// Option 1: objects are copied, setting prop1 in the process
val transformedRDD = originalRDD.map( item => item.copy(prop1 = calculation()) )

// Option 2: objects are reused and modified in place
// (the item must be returned, otherwise map produces RDD[Unit])
val transformedRDD = originalRDD.map { item => item.prop1 = calculation(); item }

I ran a couple of small tests with option 2 and noticed less time spent in 
garbage collection.  The savings didn't add up to much, but with a large 
enough data set they would make a difference.  It also seems that less 
memory would be used.

Potential gotchas:

- Objects in originalRDD are mutated, so you can't assume they are unchanged
- You also can't rely on objects in originalRDD having the new value, because 
originalRDD might be re-calculated
- If originalRDD was a PairRDD, and you modified the keys, it could cause issues
- more?
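On the PairRDD point: a plain-Scala analogy (no Spark) shows why mutating key objects is dangerous. Once a key has been hashed into a structure, mutating it strands the entry; Spark's hash partitioning depends on stable key hashes in the same way:

```scala
import scala.collection.mutable

// A mutable key type, standing in for a key you might modify in a PairRDD.
case class Key(var id: Int)

object MutatedKeyDemo {
  def main(args: Array[String]): Unit = {
    val key = Key(1)
    val map = mutable.HashMap(key -> "value")

    key.id = 2 // mutate the key in place after insertion

    // The entry was bucketed under the old hash, and the stored key
    // no longer equals Key(1), so lookup by the old key fails even
    // though the entry still occupies a slot.
    assert(map.size == 1)
    assert(map.get(Key(1)).isEmpty)
  }
}
```

Lookup by Key(2) is unreliable too, since the entry sits in the bucket chosen by the old hash. A hash-partitioned PairRDD with mutated keys can end up with records on the "wrong" partition in the same way.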

Other than the potential gotchas, is there any reason not to reuse objects 
across RDDs?  Is it a recommended practice for reducing memory usage and 
garbage collection, or not?

Is it safe to do this in code you expect to work on future versions of Spark?

Thanks in advance,

Todd
