Hi, I am pretty new to Spark, and after some experimentation on our pipelines I ran into a weird issue.
The Scala code is as follows:

    val input = sc.newAPIHadoopRDD(...)
    val rdd = input.map(...)
    rdd.cache()
    rdd.saveAsTextFile(...)

I found the saved rdd to consist of 80K+ identical rows. To be more precise, the number of rows is correct, but they all have the same content. The truly weird part is that if I remove rdd.cache(), everything works just fine. I have encountered this issue on a few occasions.
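In case it helps to reproduce, a self-contained version of the job looks roughly like the sketch below. Since the arguments to newAPIHadoopRDD(...) and map(...) are elided above, the input format, paths, and map body here are placeholders; this shows the shape of the pipeline rather than the exact job.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    object CacheRepro {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("cache-repro"))

        // Hypothetical input: plain text read through the new Hadoop mapreduce API.
        val hadoopConf = new Configuration()
        hadoopConf.set("mapreduce.input.fileinputformat.inputdir", "hdfs:///tmp/input")

        val input = sc.newAPIHadoopRDD(
          hadoopConf,
          classOf[TextInputFormat],
          classOf[LongWritable],
          classOf[Text])

        // Placeholder for the elided map(...): turn each record into a line of text.
        val rdd = input.map { case (_, value) => value.toString }

        rdd.cache()                              // removing this line makes the output correct
        rdd.saveAsTextFile("hdfs:///tmp/output") // hypothetical output path

        sc.stop()
      }
    }

Thanks,
Yan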