Hi, I am pretty new to Spark, and after some experimentation on our pipelines I ran into a weird issue.
The Scala code is as follows:

    val input = sc.newAPIHadoopRDD(...)
    val rdd = input.map(...)
    rdd.cache()
    rdd.saveAsTextFile(...)

I found the saved rdd to consist of 80K+ identical rows. To be more precise, the number of rows is correct, but they all have the same content. The truly weird part is that if I remove rdd.cache(), everything works just fine. I have encountered this issue on a few occasions.
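In case it helps to reproduce, a self-contained version of the job looks roughly like the sketch below. Since the arguments to newAPIHadoopRDD(...) and map(...) are elided above, the input format, paths, and map body here are placeholders; this shows the shape of the pipeline rather than the exact job.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    object CacheRepro {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("cache-repro"))

        // Hypothetical input: plain text read through the new Hadoop mapreduce API.
        val hadoopConf = new Configuration()
        hadoopConf.set("mapreduce.input.fileinputformat.inputdir", "hdfs:///tmp/input")

        val input = sc.newAPIHadoopRDD(
          hadoopConf,
          classOf[TextInputFormat],
          classOf[LongWritable],
          classOf[Text])

        // Placeholder for the elided map(...): turn each record into a line of text.
        val rdd = input.map { case (_, value) => value.toString }

        rdd.cache()                              // removing this line makes the output correct
        rdd.saveAsTextFile("hdfs:///tmp/output") // hypothetical output path

        sc.stop()
      }
    }

Thanks,
Yan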