Hello everyone, I have some confusion about parallelism in Spark and Scala. I am running an experiment in which I have to read many CSV files from disk, change/process certain columns, and then write them back to disk.
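Concretely, here is a minimal sketch of what I am comparing (the file list and the per-file transform are placeholders for my real column-processing code; the sketch assumes Scala 2.12, where .par is part of the standard library, and the Spark variant is shown commented out since it additionally needs spark-core on the classpath):

```scala
object ParVsSpark {
  // Placeholder for the real work: read a CSV, rewrite columns, write it back.
  def process(path: String): String = path.toUpperCase

  def main(args: Array[String]): Unit = {
    val files = List("a.csv", "b.csv", "c.csv")

    // Variant 1: Scala parallel collections. .par fans the map out over
    // the available cores of this single JVM.
    // (On Scala 2.13+ this needs the scala-parallel-collections module and
    // `import scala.collection.parallel.CollectionConverters._`.)
    val results = files.par.map(process).toList

    // Sort before printing: parallel execution does not guarantee order.
    println(results.sorted.mkString(","))

    // Variant 2: SparkContext.parallelize, for contrast (requires spark-core):
    //   val sc = new SparkContext(
    //     new SparkConf().setMaster("local[2]").setAppName("compare"))
    //   val sparkResults = sc.parallelize(files).map(process).collect()
    //   sc.stop()
  }
}
```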
In my experiments, if I use only *SparkContext's parallelize* method, it does not seem to have any impact on performance. However, simply using Scala's parallel collections (through par) reduces the time almost by half. I am running my experiments in local mode, with the argument local[2] for the Spark context.

My first question is: when should I use Scala's parallel collections, and when should I use SparkContext's parallelize?

Secondly, I have observed that at the start the average time to process each file is low, but it increases considerably for later files. What could be the reason for that? There is not much difference in file size, and no file is larger than 5 MB.

Regards,
Raza
