Hello guys,

I'm after some advice on Spark performance.

I have a MapReduce job that reads inputs, carries out a simple calculation, and
writes the results to HDFS. I've implemented the same logic in a Spark job.
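
For reference, the Spark side is roughly the following shape. This is only a
minimal sketch: the HDFS paths, the example calculation, and the field layout
are placeholders rather than the actual logic.

    import org.apache.spark.{SparkConf, SparkContext}

    object SimpleCalcJob {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("SimpleCalcJob")
        val sc = new SparkContext(conf)

        // Read the input from HDFS (placeholder path)
        val input = sc.textFile("hdfs:///data/input")

        // Simple per-record calculation
        // (placeholder: parse one numeric field and scale it)
        val results = input.map { line =>
          val fields = line.split(",")
          s"${fields(0)},${fields(1).toDouble * 2.0}"
        }

        // Write the results back to HDFS (placeholder path)
        results.saveAsTextFile("hdfs:///data/output")

        sc.stop()
      }
    }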

When I run both jobs on the same datasets, I get different execution times,
which is expected.

BUT
......
In my case, the MapReduce job is performing much better than Spark.

The difference is that I'm not changing much in the MR job configuration
(memory, cores, etc.), whereas Spark is very flexible, so there is a lot more
to tune. I'm fairly sure my Spark configuration isn't right, and that's why MR
is outperforming it, but I need your advice.

For example:

Test 1:
4.5GB data - the MR job took ~55 seconds, but the Spark job took ~3 minutes
and 20 seconds.

Test 2:
25GB data - MR took 2 minutes and 15 seconds, whereas the Spark job is still
running, and it's already been 15 minutes.


I have a cluster of 15 nodes. The maximum memory that I could allocate to
each executor is 6GB. Therefore, for Test 1, this is the config I used:

--executor-memory 6G --num-executors 4 --driver-memory 6G --executor-cores 2
(I also set "spark.storage.memoryFraction" to 0.3)


For Test 2:
--executor-memory 6G --num-executors 10 --driver-memory 6G --executor-cores 2
(I also set "spark.storage.memoryFraction" to 0.3)
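
If it helps to see the same settings programmatically, the Test 2 flags
correspond to something like the SparkConf below. This is a sketch only: the
app name is a placeholder, and driver memory normally has to be given at
launch time rather than in code.

    import org.apache.spark.SparkConf

    // Programmatic equivalent of the Test 2 spark-submit flags
    val conf = new SparkConf()
      .setAppName("SimpleCalcJob")                 // placeholder name
      .set("spark.executor.memory", "6g")          // --executor-memory 6G
      .set("spark.executor.instances", "10")       // --num-executors 10
      .set("spark.driver.memory", "6g")            // --driver-memory 6G (usually set at submit time)
      .set("spark.executor.cores", "2")            // --executor-cores 2
      .set("spark.storage.memoryFraction", "0.3")  // legacy storage memory fraction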

I've tried many combinations of these settings but couldn't get better
performance. Any suggestions would be much appreciated.
