Shuffling large amounts of data over the network is expensive, yes. The cost is 
lower if you are running on a single node, where no networking is involved in 
the repartition (using Spark as a multithreading engine).

In general you need to do performance testing to see if a repartition is worth 
the shuffle time.

A common pattern is to repartition the data once after ingest to achieve 
parallelism, and then avoid further shuffles whenever possible.

From: Zhiliang Zhu [mailto:zchl.j...@yahoo.com.INVALID]
Sent: Tuesday, December 08, 2015 5:05 AM
To: User <user@spark.apache.org>
Subject: is repartition very cost


Hi All,

I need to optimize an objective function with some linear constraints using a 
genetic algorithm, and I would like to get as much parallelism as possible from 
Spark.

repartition / shuffle may sometimes be needed for this; however, is the 
repartition API very costly?

Thanks in advance!
Zhiliang

