Not sure this fully answers your question, but the number of partitions
should be at least roughly double your number of cores (surprised not to
see this point in your list), and can easily grow up to 10x before you
start noticing too much overhead – but that's not programmatic, as you
asked ;-).
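
To make that concrete, here's a minimal sketch in the Spark shell
(assuming sc.defaultParallelism reflects your total number of executor
cores, which is usually the case):

    // Rough sketch of the "2x to 10x the core count" heuristic.
    val totalCores = sc.defaultParallelism
    val minPartitions = totalCores * 2   // lower bound: keep every core busy
    val maxPartitions = totalCores * 10  // beyond this, scheduling/shuffle overhead starts to show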

We ran lots of experiments and I'd say that the sweet spot is hit when
"number of records in the input data * maximum record size / number of
partitions" is far smaller than the available memory per core (to give
Spark enough headroom to work properly, for serialization among other
things, as you mentioned in #6).
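
In code, that rule of thumb looks roughly like this (recordCount,
maxRecordSizeBytes and memoryPerCoreBytes are estimates you have to
supply yourself, and the 4x headroom factor is arbitrary):

    // Sketch: keep the per-partition footprint well below the memory available per core.
    def suggestedPartitions(recordCount: Long,
                            maxRecordSizeBytes: Long,
                            memoryPerCoreBytes: Long,
                            headroomFactor: Int = 4): Int = {
      val totalBytes = recordCount * maxRecordSizeBytes
      // "far smaller" than memory per core; headroomFactor encodes how much smaller.
      val targetPartitionBytes = memoryPerCoreBytes / headroomFactor
      math.max(1, math.ceil(totalBytes.toDouble / targetPartitionBytes).toInt)
    }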

Since most of the time we'll have many cores sharing not (and never)
enough memory, it seems worthwhile to partition aggressively. Shuffling
overhead is easier to observe and debug than memory efficiency.
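
For DataFrames specifically (since you're on SparkSQL), the two knobs we
usually touch are repartition() and spark.sql.shuffle.partitions (which
defaults to 200); for example:

    // Applying a chosen partition count to DataFrames (Spark 1.4-era API).
    val numPartitions = 400  // illustrative value, e.g. derived from the heuristics above
    sqlContext.setConf("spark.sql.shuffle.partitions", numPartitions.toString)
    val repartitioned = df.repartition(numPartitions)  // df is any existing DataFrame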

Interested in more accurate answers as well :-)

On 4 September 2015 at 05:20, Isabelle Phan <nlip...@gmail.com> wrote:

> +1
>
> I had the exact same question as I am working on my first Spark
> applications.
> Hope someone can share some best practices.
>
> Thanks!
>
> Isabelle
>
> On Tue, Sep 1, 2015 at 2:17 AM, Romi Kuntsman <r...@totango.com> wrote:
>
>> Hi all,
>>
>> The number of partitions greatly affects the speed and efficiency of
>> computation, in my case with DataFrames/SparkSQL on Spark 1.4.0.
>>
>> Too few partitions with large data cause OOM exceptions.
>> Too many partitions on small data cause a delay due to overhead.
>>
>> How do you programmatically determine the optimal number of partitions
>> and cores in Spark, as a function of:
>>
>>    1. available memory per core
>>    2. number of records in input data
>>    3. average/maximum record size
>>    4. cache configuration
>>    5. shuffle configuration
>>    6. serialization
>>    7. etc?
>>
>> Any general best practices?
>>
>> Thanks!
>>
>> Romi K.
>>
>
>


-- 

*Adrien Mogenet*
Head of Backend/Infrastructure
adrien.moge...@contentsquare.com
(+33)6.59.16.64.22
http://www.contentsquare.com
50, avenue Montaigne - 75008 Paris
