Hi, we have tried integrating Spark with our existing code and see some issues.

The issue is that when we use the code below (where func is a function that 
processes elem),

rdd.map { elem => func.apply(elem) }

the log shows that apply is invoked several times for the same element elem 
instead of just once.
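For what it's worth, one common cause of this pattern (I can't confirm it from the log excerpt above) is that RDD transformations are lazy and an uncached RDD is recomputed by every action that touches it. A minimal sketch of that effect, using a plain Python generator as a stand-in for the lazy pipeline (the names calls, func, and mapped are my own illustration, not from the original code):

```python
calls = []

def func(elem):
    # record every invocation so we can see how often it runs
    calls.append(elem)
    return elem * 2

def mapped():
    # like rdd.map(func): builds a lazy pipeline, nothing runs yet
    return (func(e) for e in [1, 2, 3])

# each full traversal (analogous to a Spark action) re-runs func
# over every element, as Spark does for an RDD that is not
# cached/persisted
first = list(mapped())
second = list(mapped())
print(calls)   # func ran twice per element: [1, 2, 3, 1, 2, 3]
```

If that is what is happening here, calling rdd.cache() (or persist()) before the actions would make func run only once per element.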

When I execute this in a sequential way (see below), everything works just fine.

sparkContext.parallelize(rdd.toArray.map(elem => func.apply(elem)))

(The only reason I used sparkContext.parallelize in the line above is that 
the method must return an RDD[MyDataType].)

Why does this happen? Does map require anything special of the RDD?

Thanks,
Xiaobing
