Each node can have any number of partitions. For best performance, Spark will
try to have a node process partitions that are already stored on that node (in
the list of tasks in the UI, see the locality level column).

As a rule of thumb, you probably want 2-3 times as many partitions as you have
executors; this helps distribute the work evenly. You will need to experiment
to find the best number for your own case.

If you're reading from a distributed data store (such as HDFS), you should
expect the data to already be partitioned. Any time a shuffle is performed,
the data is repartitioned into a number of partitions equal to the
spark.default.parallelism setting (see
http://spark.apache.org/docs/latest/configuration.html), but most operations
that cause a shuffle also take an optional parameter to set a different value.
If you are using DataFrames, the equivalent setting is
spark.sql.shuffle.partitions.
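For example (illustrative only; the pair RDD `pairs` and the value 48 are made
up here):

    // Override the shuffle partition count for this one operation:
    val counts = pairs.reduceByKey(_ + _, 48)
    // For DataFrames, the equivalent config:
    sqlContext.setConf("spark.sql.shuffle.partitions", "48")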

I recommend not doing any explicit partitioning or touching these values until
you find a need for it. If executors are sitting idle, that's a sign you may
need to repartition.
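If you want a quick way to see whether the data is spread evenly before
changing anything, one rough check (it scans the whole dataset, so only do
this on something small) is:

    // Number of elements in each partition:
    val sizes = rdd.glom().map(_.length).collect()
    println(sizes.mkString(", "))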


On Tue, Dec 8, 2015 at 9:35 PM, Zhiliang Zhu <zchl.j...@yahoo.com.invalid>
wrote:

> Thanks very much for Yong's help.
>
> Sorry, one more question: must different partitions be on different nodes?
> That is, would each node have only one partition in cluster mode?
>
>
>
> On Wednesday, December 9, 2015 6:41 AM, "Young, Matthew T" <
> matthew.t.yo...@intel.com> wrote:
>
>
> Shuffling large amounts of data over the network is expensive, yes. The
> cost is lower if you are just using a single node where no networking needs
> to be involved to do the repartition (using Spark as a multithreading
> engine).
>
> In general you need to do performance testing to see if a repartition is
> worth the shuffle time.
>
> A common model is to repartition the data once after ingest to achieve
> parallelism and avoid shuffles whenever possible later.
>
> *From:* Zhiliang Zhu [mailto:zchl.j...@yahoo.com.INVALID]
> *Sent:* Tuesday, December 08, 2015 5:05 AM
> *To:* User <user@spark.apache.org>
> *Subject:* is repartition very cost
>
>
> Hi All,
>
> I need to optimize an objective function with some linear constraints using
> a genetic algorithm.
> I would like to get as much parallelism as possible for it with Spark.
>
> repartition / shuffle may sometimes be needed in it; however, is the
> repartition API very costly?
>
> Thanks in advance!
> Zhiliang
>
>
>
>
>
