Re: is repartition very cost
Each node can have any number of partitions. Spark will try to have a node process partitions that are already on that node for best performance (in the list of tasks in the UI, look under the Locality Level column). As a rule of thumb, you probably want 2-3 times as many partitions as you have executors; this helps distribute the work evenly. You will need to experiment to find the best number for your own case.

If you're reading from a distributed data store (such as HDFS), you should expect the data to already be partitioned. Any time a shuffle is performed, the data is repartitioned into a number of partitions equal to the spark.default.parallelism setting (see http://spark.apache.org/docs/latest/configuration.html), but most operations that cause a shuffle also take an optional parameter to set a different value. If you are using DataFrames, the setting is spark.sql.shuffle.partitions.

I recommend you do not do any explicit partitioning or change these values until you find a need for it. Idle executors are a sign that you may need to repartition.

On Tue, Dec 8, 2015 at 9:35 PM, Zhiliang Zhu wrote:
> Thanks very much for Yong's help.
>
> Sorry, one more question: must different partitions be on different nodes?
> That is, in cluster mode, would each node hold only one partition?
>
> [earlier quoted messages trimmed; see the thread below]
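As a concrete illustration of the settings mentioned above, here is a minimal spark-shell sketch (Spark 1.x, matching the era of this thread); the input path, executor count, and partition numbers are assumptions for illustration:

```scala
// Partitioning usually follows the input layout, e.g. one partition per HDFS block.
val rdd = sc.textFile("hdfs:///data/input")    // assumed input path
println(rdd.partitions.length)                 // how many partitions we actually got

// Rule of thumb from above: 2-3x the number of executors.
val numExecutors = 4                           // assumed cluster size
val repartitioned = rdd.repartition(numExecutors * 3)  // triggers a full shuffle

// Most operations that cause a shuffle also accept an explicit partition count:
val counts = repartitioned
  .flatMap(_.split(" "))
  .map(w => (w, 1))
  .reduceByKey(_ + _, numExecutors * 3)        // shuffle into 12 partitions

// For DataFrames, the shuffle fan-out is controlled by a SQL setting instead:
sqlContext.setConf("spark.sql.shuffle.partitions", "12")
```

Without the explicit count, the reduceByKey shuffle would fall back to spark.default.parallelism, as described above.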
Re: is repartition very cost
Thanks very much for Yong's help.

Sorry, one more question: must different partitions be on different nodes? That is, in cluster mode, would each node hold only one partition?

On Wednesday, December 9, 2015 6:41 AM, "Young, Matthew T" wrote:
> [quoted text trimmed; see "RE: is repartition very cost" below]
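To the question above: no, partitions are not limited to one per node; a single node, or even a single local process, can hold many. A small sketch with illustrative counts, using a standalone SparkContext on two local cores:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Two local cores, but eight partitions: several partitions share each core.
val conf = new SparkConf().setMaster("local[2]").setAppName("partitions-demo")
val sc = new SparkContext(conf)

val rdd = sc.parallelize(1 to 100, 8)  // ask for 8 partitions explicitly
println(rdd.partitions.length)         // 8 partitions on a 2-core "cluster"
sc.stop()
```

Each partition becomes one task, and a node works through as many tasks as it has partitions assigned.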
RE: is repartition very cost
Shuffling large amounts of data over the network is expensive, yes. The cost is lower if you are using a single node, where no networking is involved in the repartition (using Spark as a multithreading engine).

In general you need to do performance testing to see whether a repartition is worth the shuffle time. A common model is to repartition the data once after ingest to achieve parallelism, and to avoid shuffles whenever possible afterwards.

From: Zhiliang Zhu [mailto:zchl.j...@yahoo.com.INVALID]
Sent: Tuesday, December 08, 2015 5:05 AM
To: User
Subject: is repartition very cost

Hi All,

I need to optimize an objective function with some linear constraints using a genetic algorithm, and I would like to parallelize it as much as possible with Spark. repartition / shuffle may sometimes be used in it; however, is the repartition API very costly?

Thanks in advance!
Zhiliang
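The "repartition once after ingest" model can be sketched as follows; the path, partition count, and fitness function are placeholders standing in for the genetic-algorithm workload:

```scala
// Pay the shuffle cost once, right after reading, then cache the result.
val population = sc.textFile("hdfs:///data/population")  // assumed input path
  .repartition(24)     // one up-front shuffle to spread work evenly
  .cache()             // reuse across GA generations without re-reading

// Later generations stick to narrow transformations (map, filter), which
// preserve partitioning and need no further shuffles.
val scored = population.map(individual => (individual, individual.length))

// reduce is an action, not a shuffle: find the fittest individual directly.
val best = scored.reduce((a, b) => if (a._2 >= b._2) a else b)
```

Only operations that must regroup data by key (reduceByKey, join, repartition itself) pay the shuffle cost again.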