Hello everyone,

I have some confusion about parallelism in Spark and Scala. I am running an
experiment in which I have to read many CSV files from disk, change/process
certain columns, and then write them back to disk.

In my experiments, if I use SparkContext's parallelize method, it does not
seem to have any impact on performance. However, simply using Scala's
parallel collections (through .par) reduces the time almost to half.
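For what it's worth, the parallel-collections half of that comparison can be
reproduced with plain Scala, no Spark needed. The doubling step below is just
a stand-in for the per-file work (and on Scala 2.13+, .par additionally needs
the scala-parallel-collections module):

```scala
object ParDemo {
  // Stand-in for the per-file read/transform/write work
  def process(n: Int): Int = n * 2

  def main(args: Array[String]): Unit = {
    val inputs = (1 to 8).toList
    // .par converts the List to a ParSeq; map then runs process
    // concurrently on the default fork/join thread pool
    val doubled = inputs.par.map(process)
    println(doubled.sum)  // 2 * (1 + ... + 8) = 72
  }
}
```

With CPU-bound work and enough cores, the .par version typically finishes in
a fraction of the sequential time, which matches the halving you observed.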

I am running my experiments in local mode, with the argument local[2] for
the SparkContext.
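For context, here is a minimal sketch of the Spark side of the setup
described above (the app name and file paths are made up for illustration;
it needs spark-core on the classpath to run):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkParDemo {
  def main(args: Array[String]): Unit = {
    // local[2] = run Spark in-process with 2 worker threads
    val conf = new SparkConf().setAppName("csv-rewrite").setMaster("local[2]")
    val sc   = new SparkContext(conf)

    val files = Seq("data/a.csv", "data/b.csv")  // hypothetical paths
    // parallelize turns the local list of paths into an RDD; each task
    // then handles one slice of the file list inside the closure
    sc.parallelize(files).foreach { path =>
      // per-file read / column transform / write would go here
    }
    sc.stop()
  }
}
```

Note that with only two local threads, Spark's task-scheduling overhead can
easily swamp any gain on small files, which may explain why parallelize
showed no improvement.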

My question is: when should I use Scala's parallel collections, and when
should I use SparkContext's parallelize?

Secondly, I have been observing that at the start the average time to
process each file is low, but it increases considerably for later files.
What can be the reason for that? There is not much of a difference in file
size, and each file is no more than 5 MB.

Regards

Raza
