Hi, we have been integrating Spark with our existing code and are seeing an issue.
The issue is that when we use the map below (where func is a function that
processes each element):

    rdd.map(elem => func.apply(elem))

I see in the logs that apply is called several times for the same element
instead of once.
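To make this concrete, here is a minimal, self-contained sketch of the pattern
we are running (MyDataType, Processor, and the println are illustrative
stand-ins, not our real code):

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative stand-ins for our real types.
    case class MyDataType(value: Int)

    class Processor extends Serializable {
      def apply(elem: MyDataType): MyDataType = {
        // This is the log line we see repeated for the same element.
        println(s"apply called for $elem")
        MyDataType(elem.value + 1)
      }
    }

    object Repro {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("repro").setMaster("local[2]"))
        val func = new Processor
        val rdd = sc.parallelize(1 to 5).map(i => MyDataType(i))
        val mapped = rdd.map(elem => func.apply(elem))
        // We run more than one action on the mapped RDD.
        mapped.count()
        mapped.collect()
        sc.stop()
      }
    }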
When I execute the same logic sequentially (see below), everything works fine:

    sparkContext.parallelize(rdd.collect().map(elem => func.apply(elem)))

(The only reason I used sparkContext.parallelize in the line above is that the
enclosing method must return RDD[MyDataType].)
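For reference, the enclosing method looks roughly like this (the name process
and its signature are hypothetical, reusing MyDataType and Processor from the
sketch above):

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Hypothetical wrapper: it must hand back an RDD[MyDataType], which is
    // the only reason for the parallelize call.
    def process(sc: SparkContext, rdd: RDD[MyDataType],
                func: Processor): RDD[MyDataType] = {
      // Sequential path: collect to the driver, map locally, re-distribute.
      // Here apply runs exactly once per element in the logs.
      sc.parallelize(rdd.collect().map(elem => func.apply(elem)))
    }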
Why does this happen? Does the map function require anything special of the RDD?
Thanks,
Xiaobing