Re: dataframes and numPartitions

2015-10-18 Thread Jorge Sánchez
Alex,

If that property doesn't give you the control you need, you can try the coalesce(n) or repartition(n) methods.

As per the API documentation, coalesce will not perform a shuffle, but repartition will.
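
For example, a minimal sketch (assuming an existing DataFrame named df; the
partition counts are arbitrary):

// narrow to fewer partitions without a shuffle
val narrowed = df.coalesce(4)

// increase (or decrease) the partition count, at the cost of a full shuffle
val reshuffled = df.repartition(16)

// check the resulting partition counts
println(narrowed.rdd.partitions.length)   // 4
println(reshuffled.rdd.partitions.length) // 16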

Regards.

2015-10-16 0:52 GMT+01:00 Mohammed Guller:

> You may find the spark.sql.shuffle.partitions property useful. The default
> value is 200.
>
> Mohammed


RE: dataframes and numPartitions

2015-10-15 Thread Mohammed Guller
You may find the spark.sql.shuffle.partitions property useful. The default 
value is 200.
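
For context, this property controls the number of partitions Spark SQL uses
when shuffling data for joins and aggregations. A minimal sketch of setting it
(assuming an existing SQLContext named sqlContext; 8 is an arbitrary value):

// set the shuffle partition count programmatically
sqlContext.setConf("spark.sql.shuffle.partitions", "8")

// equivalently, via a SQL statement
sqlContext.sql("SET spark.sql.shuffle.partitions=8")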

Mohammed

From: Alex Nastetsky [mailto:alex.nastet...@vervemobile.com]
Sent: Wednesday, October 14, 2015 8:14 PM
To: user
Subject: dataframes and numPartitions

A lot of RDD methods take a numPartitions parameter that lets you specify the 
number of partitions in the result. For example, groupByKey.

The DataFrame counterparts don't have a numPartitions parameter; e.g., groupBy 
takes only Columns as parameters.

I understand that the DataFrame API is supposed to be smarter and go through a 
LogicalPlan, and perhaps determine the optimal number of partitions for you, 
but sometimes you want to specify the number of partitions yourself. One such 
use case is when you are preparing to do a "merge" join with another dataset 
that is similarly partitioned with the same number of partitions.
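
For context, a sketch of the contrast being described (names are hypothetical;
8 is an arbitrary partition count):

// RDD API: the partition count of the result can be passed directly
val grouped = pairRdd.groupByKey(8)

// DataFrame API: groupBy takes only Columns, so the number of
// post-shuffle partitions is governed by spark.sql.shuffle.partitions
val counts = df.groupBy("key").count()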