SparkDataFrame.repartition() uses hash partitioning: it guarantees that all 
rows with the same column value go to the same partition, but it does not 
guarantee that each partition contains only a single value, since distinct 
values can hash to the same partition.
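
For example (a minimal sketch; the SparkDataFrame "df" and its column "key" 
are hypothetical names), even requesting exactly as many partitions as there 
are distinct keys can leave some partitions holding several keys and others 
empty, because two keys may hash to the same bucket:

  library(SparkR)

  # Hash-partition df by "key" into 394 partitions. Rows that share a
  # key always land in the same partition, but 394 distinct keys hashed
  # into 394 buckets will almost certainly collide somewhere.
  df2 <- repartition(df, numPartitions = 394L, col = df$key)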

Fortunately, Spark 2.0 comes with gapply() in SparkR: you can apply an R 
function to each group of rows that share a value in the grouping column.
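
A minimal sketch of gapply() (the data and column names here are made up 
for illustration, with a per-group total standing in for your real function):

  library(SparkR)
  sparkR.session()

  df <- createDataFrame(data.frame(key = c(1, 1, 2),
                                   value = c(10, 20, 30)))

  # gapply() ships each group to R as a local data.frame, applies the
  # function, and binds the results into a new SparkDataFrame whose
  # schema must be declared up front.
  result <- gapply(df, "key",
                   function(key, x) {
                     # x holds every row with this key, as an R data.frame
                     data.frame(key, total = sum(x$value))
                   },
                   structType(structField("key", "double"),
                              structField("total", "double")))
  head(result)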

> On Jul 26, 2016, at 06:46, Neil Chang <iam...@gmail.com> wrote:
> 
> Hi,
>   This is a question regarding SparkR in spark 2.0.
> 
> Given a SparkDataFrame, I want to partition it by one column's values. 
> Each value corresponds to one partition: all rows having the same column 
> value should go to the same partition, no more, no less. 
> 
>    It seems the function repartition() doesn't do this: I have 394 unique 
> values, but it just partitions my DataFrame into 200 partitions. If I set 
> numPartitions to 394, some mismatch happens.
> 
> Is it possible to do what I described in SparkR?
> groupBy doesn't work with a UDF at all.
> 
> Or can we first split the DataFrame into a list of smaller ones? If so, 
> what can I use?
> 
> Thanks,
> Neil


