Would it be possible to define a custom OutputFormat (along the lines of MultipleOutputs) and then use `saveAsHadoopFile` to achieve this?
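Something like the old mapred MultipleTextOutputFormat might get close. Here is a minimal sketch of that pattern, with the caveats that this is text output rather than Parquet, the partition column is assumed to be a string, and the names (KeyBasedOutputFormat, SplitByKey, the col=<value> layout, the input/output paths) are all made up for illustration. MultipleOutputFormat keeps one open RecordWriter per generated file name and appends records as they arrive, which is essentially the "multiple file pointers" behavior described below. The original snippet is PySpark, but the OutputFormat subclass has to live on the JVM, so the sketch is in Scala:

```scala
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.sql.SparkSession

// Routes each record into a subdirectory derived from its key,
// e.g. output/col=foo/part-00000. One RecordWriter stays open per
// distinct file name, so records are appended as they arrive -- no sort.
class KeyBasedOutputFormat extends MultipleTextOutputFormat[Any, Any] {

  // Don't write the key into the file; only the value goes out.
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()

  // Build a partition-style directory from the key, keeping the
  // task's original file name ("name" is e.g. "part-00000").
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    s"col=${key.toString}/$name"
}

object SplitByKey {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("SplitByKey").getOrCreate()

    // Hypothetical input; key each row by the partition column and
    // serialize the remaining fields however suits the data.
    spark.read.parquet("input")
      .rdd
      .map(row => (row.getAs[String]("col"), row.mkString(",")))
      .saveAsHadoopFile(
        "output",
        classOf[String],
        classOf[String],
        classOf[KeyBasedOutputFormat])

    spark.stop()
  }
}
```

The trade-off is that `saveAsHadoopFile` is an RDD-level API, so you give up Parquet and the DataFrame writer's committer logic, and each task holds one writer open per distinct key it sees, so memory grows with the key cardinality per task.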
On Thu, Apr 20, 2023 at 1:29 PM Nikhil Goyal <[email protected]> wrote:
> Hi folks,
>
> We are writing a dataframe and doing a partitionBy() on it:
>
>     df.write.partitionBy('col').parquet('output')
>
> The job is running super slow because internally, per partition, it does a
> sort before starting to write to the final location. This sort isn't useful
> in any way, since the number of files will remain the same. I was wondering
> if we could have Spark just open multiple file pointers, keep appending
> data as it arrives, and close all the pointers when it's done. This would
> reduce the memory footprint and speed up the job, since we would eliminate
> the sort. We can implement a custom source, but I can't see a way to really
> control this behavior in the sink. If anyone has any suggestions, please
> let me know.
>
> Thanks
> Nikhil
