Re: PartitionBy and SortWithinPartitions

2022-06-03 Thread Nikhil Goyal
Hi Enrico,
Thanks for replying. I want to partition by a column and then be able to
sort within those partitions based on another column. DataFrameWriter has
sortBy and bucketBy, but they require creating a new table (they can only be
used with `saveAsTable`, not plain `save`). I could write another job on top
that does the sorting, but that complicates the code. So is there a clever
way to sort records after they have been partitioned?
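
For example, would something along these lines guarantee that every col2
directory ends up sorted by col1? (Just a rough sketch; the output path is a
placeholder.)

`
(df.repartition("col2")                    # all rows of a given col2 value land in one task
   .sortWithinPartitions("col2", "col1")   # order the rows inside each task
   .write.mode("overwrite")
   .partitionBy("col2")                    # one directory per col2 value
   .parquet("/path/to/output"))            # placeholder path
`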

Thanks
Nikhil

On Fri, Jun 3, 2022 at 9:38 AM Enrico Minack wrote:

> Nikhil,
>
> What are you trying to achieve with this in the first place? What are your
> goals? What is the problem with your approach?
>
> Are you concerned about the 1000 files in each written col2-partition?
>
> The write.partitionBy is something different from df.repartition or
> df.coalesce.
>
> The df partitions are sorted *before* partitionBy-writing them.
>
> Enrico
>
>
> On 03.06.22 at 16:13, Nikhil Goyal wrote:
>
> Hi folks,
>
> We are trying to do
> `
> df.coalesce(1000).sortWithinPartitions("col1").write.mode('overwrite').partitionBy("col2").parquet(...)
> `
>
> I do see that coalesce(1000) is applied to every sub-partition. But I
> wanted to know whether sortWithinPartitions(col1) runs after applying
> partitionBy or before. Basically, would Spark first partitionBy col2 and
> then sort by col1, or sort first and then partition?
>
> Thanks
> Nikhil
>
>
>


Re: PartitionBy and SortWithinPartitions

2022-06-03 Thread Enrico Minack

Nikhil,

What are you trying to achieve with this in the first place? What are 
your goals? What is the problem with your approach?


Are you concerned about the 1000 files in each written col2-partition?

The write.partitionBy is something different from df.repartition or
df.coalesce.


The df partitions are sorted *before* partitionBy-writing them.
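
To illustrate with your snippet (just a sketch; the output path is a
placeholder): the sort applies to the 1000 coalesced DataFrame partitions,
and the writer then only routes the already-sorted rows of each partition
into the col2 directories.

`
(df.coalesce(1000)                   # 1000 DataFrame partitions
   .sortWithinPartitions("col1")     # sorts the rows inside each of those 1000 partitions
   .write.mode("overwrite")
   .partitionBy("col2")              # splits each already-sorted partition into col2 directories
   .parquet("/path/to/output"))      # placeholder path

# The plan shows the Sort on top of the Coalesce, before any write happens:
df.coalesce(1000).sortWithinPartitions("col1").explain()
`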

Enrico


On 03.06.22 at 16:13, Nikhil Goyal wrote:

Hi folks,

We are trying to do
`df.coalesce(1000).sortWithinPartitions("col1").write.mode('overwrite').partitionBy("col2").parquet(...)`

I do see that coalesce(1000) is applied to every sub-partition. But I
wanted to know whether sortWithinPartitions(col1) runs after applying
partitionBy or before. Basically, would Spark first partitionBy col2
and then sort by col1, or sort first and then partition?


Thanks
Nikhil




PartitionBy and SortWithinPartitions

2022-06-03 Thread Nikhil Goyal
Hi folks,

We are trying to do
`
df.coalesce(1000).sortWithinPartitions("col1").write.mode('overwrite').partitionBy("col2").parquet(...)
`

I do see that coalesce(1000) is applied to every sub-partition. But I
wanted to know whether sortWithinPartitions(col1) runs after applying
partitionBy or before. Basically, would Spark first partitionBy col2 and
then sort by col1, or sort first and then partition?
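
Put differently, is the "partition" that sortWithinPartitions operates on one
of the 1000 coalesced DataFrame partitions (each of which can contain many
col2 values), or the col2 directory created by partitionBy? A rough sketch of
what I mean (column names as above):

`
from pyspark.sql import functions as F

# Each of the 1000 coalesced partitions can hold many col2 values;
# sortWithinPartitions("col1") orders the rows inside these partitions:
(df.coalesce(1000)
   .sortWithinPartitions("col1")
   .groupBy(F.spark_partition_id().alias("df_partition"))
   .agg(F.countDistinct("col2").alias("col2_values_in_partition"))
   .show())
`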

Thanks
Nikhil