Re: PartitionBy and SortWithinPartitions

2022-06-03 Thread Nikhil Goyal
Hi Enrico,
Thanks for replying. I want to partition by a column and then be able to
sort within those partitions based on another column. DataFrameWriter has
sortBy and bucketBy, but they require creating a new table (they can only be
used with `saveAsTable`, not plain `save`). I could write another job on top
that does the sorting, but that complicates the code. So is there a clever
way to sort records after they have been partitioned?
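
For example, would something along these lines guarantee that every col2
directory ends up sorted by col1? (Just a rough sketch; the output path is a
placeholder.)

`
(df.repartition("col2")                    # all rows of a given col2 value land in one task
   .sortWithinPartitions("col2", "col1")   # order the rows inside each task
   .write.mode("overwrite")
   .partitionBy("col2")                    # one directory per col2 value
   .parquet("/path/to/output"))            # placeholder path
`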

Thanks
Nikhil

On Fri, Jun 3, 2022 at 9:38 AM Enrico Minack wrote:

> Nikhil,
>
> What are you trying to achieve with this in the first place? What are your
> goals? What is the problem with your approach?
>
> Are you concerned about the 1000 files in each written col2-partition?
>
> The write.partitionBy is something different from df.repartition or
> df.coalesce.
>
> The df partitions are sorted *before* partitionBy-writing them.
>
> Enrico
>
>
> On 03.06.22 at 16:13, Nikhil Goyal wrote:
>
> Hi folks,
>
> We are trying to do
> `
> df.coalesce(1000).sortWithinPartitions("col1").write.mode('overwrite').partitionBy("col2").parquet(...)
> `
>
> I do see that coalesce(1000) is applied to every sub-partition. But I
> wanted to know whether sortWithinPartitions(col1) runs after applying
> partitionBy or before. Basically, would Spark first partitionBy col2 and
> then sort by col1, or sort first and then partition?
>
> Thanks
> Nikhil
>
>
>


Re: PartitionBy and SortWithinPartitions

2022-06-03 Thread Enrico Minack

Nikhil,

What are you trying to achieve with this in the first place? What are 
your goals? What is the problem with your approach?


Are you concerned about the 1000 files in each written col2-partition?

The write.partitionBy is something different from df.repartition or
df.coalesce.


The df partitions are sorted *before* partitionBy-writing them.
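
To illustrate with your snippet (just a sketch; the output path is a
placeholder): the sort applies to the 1000 coalesced DataFrame partitions,
and the writer then only routes the already-sorted rows of each partition
into the col2 directories.

`
(df.coalesce(1000)                   # 1000 DataFrame partitions
   .sortWithinPartitions("col1")     # sorts the rows inside each of those 1000 partitions
   .write.mode("overwrite")
   .partitionBy("col2")              # splits each already-sorted partition into col2 directories
   .parquet("/path/to/output"))      # placeholder path

# The plan shows the Sort on top of the Coalesce, before any write happens:
df.coalesce(1000).sortWithinPartitions("col1").explain()
`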

Enrico


On 03.06.22 at 16:13, Nikhil Goyal wrote:

Hi folks,

We are trying to do
`df.coalesce(1000).sortWithinPartitions("col1").write.mode('overwrite').partitionBy("col2").parquet(...)`

I do see that coalesce(1000) is applied to every sub-partition. But I
wanted to know whether sortWithinPartitions(col1) runs after applying
partitionBy or before. Basically, would Spark first partitionBy col2
and then sort by col1, or sort first and then partition?


Thanks
Nikhil




PartitionBy and SortWithinPartitions

2022-06-03 Thread Nikhil Goyal
Hi folks,

We are trying to do
`
df.coalesce(1000).sortWithinPartitions("col1").write.mode('overwrite').partitionBy("col2").parquet(...)
`

I do see that coalesce(1000) is applied to every sub-partition. But I
wanted to know whether sortWithinPartitions(col1) runs after applying
partitionBy or before. Basically, would Spark first partitionBy col2 and
then sort by col1, or sort first and then partition?
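
Put differently, is the "partition" that sortWithinPartitions operates on one
of the 1000 coalesced DataFrame partitions (each of which can contain many
col2 values), or the col2 directory created by partitionBy? A rough sketch of
what I mean (column names as above):

`
from pyspark.sql import functions as F

# Each of the 1000 coalesced partitions can hold many col2 values;
# sortWithinPartitions("col1") orders the rows inside these partitions:
(df.coalesce(1000)
   .sortWithinPartitions("col1")
   .groupBy(F.spark_partition_id().alias("df_partition"))
   .agg(F.countDistinct("col2").alias("col2_values_in_partition"))
   .show())
`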

Thanks
Nikhil