Hi

I am pretty new to Spark, and while experimenting with our pipelines I ran
into this weird issue.

The Scala code is as follows:

val input = sc.newAPIHadoopRDD(...)
val rdd = input.map(...)
rdd.cache()
rdd.saveAsTextFile(...)
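
For concreteness, here is a minimal self-contained version of the same
pattern. The paths and TextInputFormat below are placeholders; our real
job uses a custom InputFormat and a more involved map function:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

// In spark-shell, sc already exists; created here for completeness.
val sc = new SparkContext(new SparkConf().setAppName("cache-repro"))

// Placeholder input path.
val conf = new Configuration()
conf.set("mapreduce.input.fileinputformat.inputdir", "hdfs:///path/to/input")

val input = sc.newAPIHadoopRDD(conf,
  classOf[TextInputFormat],   // placeholder; we use a custom InputFormat
  classOf[LongWritable],
  classOf[Text])

// Keep just the value of each record, as in the snippet above.
val rdd = input.map(_._2)
rdd.cache()
rdd.saveAsTextFile("hdfs:///path/to/output")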

I found the saved rdd to consist of 80K+ identical rows. To be more
precise, the number of rows is right, but they are all identical.

The truly weird part is that if I remove rdd.cache(), everything works
just fine. I have encountered this issue on a few occasions.

Thanks
Yan
