Re: is repartition very cost

2015-12-09 Thread Daniel Siegmann
Each node can have any number of partitions. Spark will try to have a node
process partitions which are already on the node for best performance (if
you look at the list of tasks in the UI, look under the locality level
column).

As a rule of thumb, you probably want two to three times as many partitions as
you have executors. This helps distribute the work evenly. You would need
to experiment to find the best number for your own case.
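As a minimal sketch of that rule of thumb (the helper name and `factor` parameter are illustrative, not a Spark API):

```python
# Back-of-the-envelope helper (not a Spark API call): the rule of thumb above
# suggests two to three times as many partitions as executors.
def suggested_partitions(num_executors, factor=3):
    return num_executors * factor

print(suggested_partitions(8))            # 24
print(suggested_partitions(8, factor=2))  # 16
```

You would still tune the actual number empirically, as noted above.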

If you're reading from a distributed data store (such as HDFS), you should
expect the data to already be partitioned. Any time a shuffle is performed
the data will be repartitioned into a number of partitions equal to the
spark.default.parallelism setting (see
http://spark.apache.org/docs/latest/configuration.html), but most
operations which cause a shuffle also take an optional parameter to set a
different value. If you are using DataFrames, the relevant setting is
spark.sql.shuffle.partitions.
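For reference, both settings can be supplied in spark-defaults.conf (or via `--conf` on spark-submit); the values below are purely illustrative, not recommendations:

```properties
# spark-defaults.conf (illustrative values only)
spark.default.parallelism     24
spark.sql.shuffle.partitions  24
```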

I recommend you do not do any explicit partitioning or mess with these
values until you find a need for it. If executors are sitting idle, that's
a sign you may need to repartition.
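To make the shuffle mechanics concrete, here is a toy model in plain Python (not Spark's internal implementation): a hash partitioner routes each record to partition `hash(key) % num_partitions`, which is how records get spread across spark.default.parallelism partitions.

```python
# Toy model of hash partitioning during a shuffle (plain Python, not Spark
# internals). Each record is routed to partition hash(key) % num_partitions.
def hash_partition(records, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

data = [(i, i * i) for i in range(10)]
parts = hash_partition(data, 4)
# Every record lands in exactly one of the 4 partitions.
```

The key property this illustrates: records with the same key always land in the same partition, which is what lets Spark colocate data for joins and aggregations after a shuffle.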


On Tue, Dec 8, 2015 at 9:35 PM, Zhiliang Zhu 
wrote:

> Thanks very much for Young's help.
>
> Sorry, one more issue: must different partitions be on different nodes? That
> is, would each node have only one partition in cluster mode?
>
>
>
> On Wednesday, December 9, 2015 6:41 AM, "Young, Matthew T" <
> matthew.t.yo...@intel.com> wrote:
>
>
> Shuffling large amounts of data over the network is expensive, yes. The
> cost is lower if you are just using a single node where no networking needs
> to be involved to do the repartition (using Spark as a multithreading
> engine).
>
> In general you need to do performance testing to see if a repartition is
> worth the shuffle time.
>
> A common model is to repartition the data once after ingest to achieve
> parallelism and avoid shuffles whenever possible later.
>
> *From:* Zhiliang Zhu [mailto:zchl.j...@yahoo.com.INVALID]
> *Sent:* Tuesday, December 08, 2015 5:05 AM
> *To:* User 
> *Subject:* is repartition very cost
>
>
> Hi All,
>
> I need to optimize an objective function with some linear constraints using
> a genetic algorithm. I would like to get as much parallelism for it as
> possible with Spark.
>
> repartition / shuffle may be used sometimes in it; however, is the
> repartition API very costly?
>
> Thanks in advance!
> Zhiliang
>


Re: is repartition very cost

2015-12-08 Thread Zhiliang Zhu
Thanks very much for Young's help.

Sorry, one more issue: must different partitions be on different nodes? That
is, would each node have only one partition in cluster mode?


On Wednesday, December 9, 2015 6:41 AM, "Young, Matthew T" 
 wrote:
 

Shuffling large amounts of data over the network is expensive, yes. The cost
is lower if you are just using a single node where no networking needs to be
involved to do the repartition (using Spark as a multithreading engine).

In general you need to do performance testing to see if a repartition is
worth the shuffle time.

A common model is to repartition the data once after ingest to achieve
parallelism and avoid shuffles whenever possible later.

From: Zhiliang Zhu [mailto:zchl.j...@yahoo.com.INVALID]
Sent: Tuesday, December 08, 2015 5:05 AM
To: User
Subject: is repartition very cost

Hi All,

I need to optimize an objective function with some linear constraints using a
genetic algorithm. I would like to get as much parallelism for it as possible
with Spark.

repartition / shuffle may be used sometimes in it; however, is the
repartition API very costly?

Thanks in advance!
Zhiliang


RE: is repartition very cost

2015-12-08 Thread Young, Matthew T
Shuffling large amounts of data over the network is expensive, yes. The cost is 
lower if you are just using a single node where no networking needs to be 
involved to do the repartition (using Spark as a multithreading engine).

In general you need to do performance testing to see if a repartition is worth 
the shuffle time.

A common model is to repartition the data once after ingest to achieve 
parallelism and avoid shuffles whenever possible later.
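A toy back-of-the-envelope cost model (my own simplification, not Spark code or a Spark formula) of why repartitioning once after ingest beats reshuffling before every stage: in an all-to-all shuffle across n nodes, roughly (n - 1) / n of the records cross the network, assuming uniform hash partitioning.

```python
# Toy cost model (illustrative assumption, not Spark internals): count
# cross-node record transfers for one shuffle across num_nodes nodes.
def shuffle_cost(num_records, num_nodes):
    # About (n - 1) / n of records land on a different node than they started.
    return num_records * (num_nodes - 1) // num_nodes

records, nodes, passes = 1_000_000, 10, 5
repartition_once = shuffle_cost(records, nodes)             # one shuffle at ingest
shuffle_every_pass = passes * shuffle_cost(records, nodes)  # reshuffle per stage
print(repartition_once, shuffle_every_pass)  # 900000 4500000
```

Under these assumptions, shuffling before every one of five stages moves five times as much data over the network as a single repartition at ingest, which is the motivation for the pattern above.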

From: Zhiliang Zhu [mailto:zchl.j...@yahoo.com.INVALID]
Sent: Tuesday, December 08, 2015 5:05 AM
To: User 
Subject: is repartition very cost


Hi All,

I need to optimize an objective function with some linear constraints using a
genetic algorithm. I would like to get as much parallelism for it as possible
with Spark.

repartition / shuffle may be used sometimes in it; however, is the
repartition API very costly?

Thanks in advance!
Zhiliang