Re: Is the RDD's Partitions determined before hand ?

2015-03-04 Thread Imran Rashid
You can set the number of partitions dynamically -- it's just a parameter to
a method, so you can compute it however you want; it doesn't need to be
some static constant:

val dataSizeEstimate = yourMagicFunctionToEstimateDataSize()
val numberOfPartitions =
  yourConversionFromDataSizeToNumPartitions(dataSizeEstimate)

// or whatever else needs to know the number of partitions
val reducedRDD = someInputRDD.reduceByKey(f, numberOfPartitions)

Of course this means you need to do the work of figuring out those magic
functions, but it's certainly possible.
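
As a rough sketch of what the second "magic function" could look like, here
is one possible conversion from a size estimate to a partition count. The
object name, the ~128 MiB target per partition, and the cap of 10000 are all
assumptions made for illustration; none of them come from Spark itself.

```scala
object PartitionSizing {
  /** Hypothetical helper: turn a data-size estimate (in bytes) into a partition count. */
  def numPartitionsFor(
      dataSizeBytes: Long,
      targetPartitionBytes: Long = 128L * 1024 * 1024, // assumed target: ~128 MiB per partition
      minPartitions: Int = 1,
      maxPartitions: Int = 10000 // assumed cap so bookkeeping overhead stays bounded
  ): Int = {
    require(dataSizeBytes >= 0, "dataSizeBytes must be non-negative")
    require(targetPartitionBytes > 0, "targetPartitionBytes must be positive")
    require(minPartitions >= 1 && maxPartitions >= minPartitions, "invalid bounds")

    // Ceiling division that cannot overflow for large sizes.
    val whole = dataSizeBytes / targetPartitionBytes
    val needed = if (dataSizeBytes % targetPartitionBytes == 0) whole else whole + 1

    // Clamp into [minPartitions, maxPartitions]; safe to narrow to Int afterwards.
    math.min(math.max(needed, minPartitions.toLong), maxPartitions.toLong).toInt
  }
}
```

The result can then be passed wherever a partition count is accepted, e.g.
someInputRDD.reduceByKey(f, PartitionSizing.numPartitionsFor(dataSizeEstimate)).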

I agree with all of Sean's recommendations, but I guess I might put a bit
more emphasis on "The one exception are operations that tend to pull data
into memory."  For me, that has been a very important exception, one
that can come up a lot.  And though in general a lot of partitions makes
sense, there have been recent questions on the user list about folks going
too far, using e.g. 100K partitions and then having the bookkeeping overhead
dominate.  But that's a pretty big number -- you should still be able to
err on the side of too many partitions w/out going that far, I'd imagine.





Re: Is the RDD's Partitions determined before hand ?

2015-03-04 Thread Sean Owen
Parallelism doesn't really affect the throughput as long as it's:

- not less than the number of available execution slots,
- ... and probably some low multiple of them to even out task size effects
- not so high that the bookkeeping overhead dominates

Although you may need to select different scales of parallelism for
different stages (like a join), you shouldn't in general have to
change it according to data size.
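
To make the rule of thumb above concrete, here is a small sketch that derives
a parallelism setting from the number of execution slots. The helper names,
the default multiple of 2, and the cap of 10000 are illustrative assumptions,
not Spark APIs or recommended constants.

```scala
object ParallelismHeuristic {
  /** Execution slots: total cores available to run tasks concurrently. */
  def executionSlots(numExecutors: Int, coresPerExecutor: Int): Int = {
    require(numExecutors >= 1 && coresPerExecutor >= 1, "need at least one core")
    numExecutors * coresPerExecutor
  }

  /**
   * Hypothetical heuristic: a low multiple of the slot count (to even out
   * task-size effects), capped so that bookkeeping overhead does not dominate.
   */
  def suggestedParallelism(slots: Int, multiple: Int = 2, cap: Int = 10000): Int = {
    require(slots >= 1 && multiple >= 1, "slots and multiple must be positive")
    require(cap >= slots, "cap must not be less than the number of slots")
    math.min(slots.toLong * multiple, cap.toLong).toInt
  }
}
```

For example, 10 executors with 4 cores each give 40 slots, so the sketch
suggests 80 partitions with the default multiple.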

However you could count the input size and make parallelism some
function of that if you found that was consistently better.

The one exception are operations that tend to pull data into memory.
You may need more parallelism as scale increases to keep in-memory
data size small enough. There again you usually just err on the side
of 'too much' parallelism, or avoid patterns that can pull a lot of
data into memory, but this is usually the pain point if there is one.

The problem I run into when thinking about this is that I don't think
Spark can do much better, since it doesn't have the info above needed
to decide these things in general. The calling program has to tell it.


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Is the RDD's Partitions determined before hand ?

2015-03-04 Thread Jeff Zhang
Hi Sean,

> If you know a stage needs unusually high parallelism for example you can
> repartition further for that stage.

The problem is that we may not know whether high parallelism is needed. E.g.
for the join operator, high parallelism may only be necessary for datasets
where lots of records join together, while for other datasets high
parallelism may not be necessary if only a few records join together.

So my concern is that being unable to change parallelism dynamically at
runtime may not be flexible enough.




-- 
Best Regards

Jeff Zhang


Re: Is the RDD's Partitions determined before hand ?

2015-03-04 Thread Sean Owen
Hm, what do you mean? You can control, to some extent, the number of
partitions when you read the data, and can repartition if needed.

You can set the default parallelism too so that it takes effect for most
ops that create an RDD. One # of partitions is usually about right for all
work (2x or so the number of execution slots).

If you know a stage needs unusually high parallelism for example you can
repartition further for that stage.
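
A minimal sketch of the above, assuming Spark is on the classpath and a local
master is acceptable for a demo. It sets spark.default.parallelism, lets a
shuffle pick that default up, and overrides it for one operation. The object
name, app name, and the values 8 and 32 are arbitrary choices for the
example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DefaultParallelismDemo {
  /** Returns the partition counts of (input RDD, default shuffle, overridden shuffle). */
  def run(): (Int, Int, Int) = {
    val conf = new SparkConf()
      .setMaster("local[2]")                  // demo-only assumption
      .setAppName("default-parallelism-demo") // arbitrary name
      .set("spark.default.parallelism", "8")
    val sc = new SparkContext(conf)
    try {
      // No partition count given, so the configured default is used.
      val pairs = sc.parallelize(1 to 1000).map(i => (i % 10, 1))
      // Shuffle with no explicit partition count: falls back to the default.
      val summed = pairs.reduceByKey(_ + _)
      // Same shuffle, but with higher parallelism for this one stage.
      val wider = pairs.reduceByKey(_ + _, 32)
      (pairs.partitions.length, summed.partitions.length, wider.partitions.length)
    } finally {
      sc.stop()
    }
  }
}
```

The explicit second argument to reduceByKey only affects that one operation;
everything else keeps using the configured default.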


Re: Is the RDD's Partitions determined before hand ?

2015-03-03 Thread Jeff Zhang
Thanks Sean.

But if the number of partitions of an RDD is determined beforehand, it would
not be flexible to run the same program on different datasets. Although for
the first stage the partitions can be determined by the input data set, for
intermediate stages that is not possible. Users have to create a policy to
repartition or coalesce based on the data set size.





-- 
Best Regards

Jeff Zhang


Re: Is the RDD's Partitions determined before hand ?

2015-03-03 Thread Sean Owen
An RDD has a certain fixed number of partitions, yes. You can't change
an RDD. You can repartition() or coalesce() an RDD to make a new one
with a different number of partitions, possibly requiring a shuffle.
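
To illustrate, here is a small sketch, assuming Spark is on the classpath and
using a local master purely for demonstration. The object name, app name, and
partition counts are made up for the example. Both calls return new RDDs and
leave the original untouched.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RepartitionDemo {
  /** Returns the partition counts of (original, repartitioned, coalesced) RDDs. */
  def partitionCounts(): (Int, Int, Int) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("repartition-demo")
    val sc = new SparkContext(conf)
    try {
      val base = sc.parallelize(1 to 100, 4) // starts with 4 partitions
      val more = base.repartition(8)         // new RDD; always shuffles
      val fewer = base.coalesce(2)           // new RDD; merges partitions without a shuffle by default
      // base is read last to show that it still has its original partitioning.
      (base.partitions.length, more.partitions.length, fewer.partitions.length)
    } finally {
      sc.stop()
    }
  }
}
```

Note that coalesce() without a shuffle can only reduce the number of
partitions; to increase it, use repartition() or coalesce(n, shuffle = true).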





Is the RDD's Partitions determined before hand ?

2015-03-03 Thread Jeff Zhang
I mean, is it possible to change the number of partitions at runtime? Thanks


-- 
Best Regards

Jeff Zhang