Shuffling large amounts of data over the network is expensive, yes. The cost is lower if you are running on a single node, where no networking is involved in the repartition (i.e., using Spark as a multithreading engine).
In general you need to do performance testing to see whether a repartition is worth the shuffle time. A common model is to repartition the data once after ingest to achieve good parallelism, and then avoid further shuffles wherever possible.

From: Zhiliang Zhu [mailto:zchl.j...@yahoo.com.INVALID]
Sent: Tuesday, December 08, 2015 5:05 AM
To: User <user@spark.apache.org>
Subject: is repartition very cost

Hi All,

I need to optimize an objective function with some linear constraints using a genetic algorithm. I would like to get as much parallelism as possible out of Spark. Repartition / shuffle may be needed at times; is the repartition API very costly?

Thanks in advance!
Zhiliang
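The "repartition once after ingest" pattern might look like the sketch below. This is a minimal illustration, not Spark mailing-list canon: the input path "events.csv" and the use of defaultParallelism as the partition count are assumptions for the example, and running it requires a Spark runtime on the classpath.

```scala
// Sketch: pay the shuffle cost once after ingest, cache the result,
// and let later stages reuse the evenly partitioned data.
// Assumes Spark is available; "events.csv" is a hypothetical input.
import org.apache.spark.sql.SparkSession

object RepartitionOnce {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // single node: the shuffle stays in-process, no network
      .appName("repartition-once")
      .getOrCreate()

    val df = spark.read.option("header", "true").csv("events.csv")

    // One shuffle here buys balanced partitions for everything downstream.
    val balanced = df
      .repartition(spark.sparkContext.defaultParallelism)
      .cache()

    balanced.count() // action to materialize the cached, repartitioned data

    // ... subsequent jobs operate on `balanced` without further shuffles ...

    spark.stop()
  }
}
```

On a cluster the same code shuffles over the network, which is where the performance testing mentioned above comes in; on a `local[*]` master the repartition is only an in-memory exchange between threads.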