Would it be possible to define a custom OutputFormat (along the lines of MultipleOutputs) and then use `saveAsHadoopFile` to achieve this?
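Something like the old mapred MultipleTextOutputFormat might get close. Here is a minimal sketch of that pattern, with the caveats that this is text output rather than Parquet, the partition column is assumed to be a string, and the names (KeyBasedOutputFormat, SplitByKey, the col=<value> layout, the input/output paths) are all made up for illustration. MultipleOutputFormat keeps one open RecordWriter per generated file name and appends records as they arrive, which is essentially the "multiple file pointers" behavior described below. The original snippet is PySpark, but the OutputFormat subclass has to live on the JVM, so the sketch is in Scala:

```scala
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.sql.SparkSession

// Routes each record into a subdirectory derived from its key,
// e.g. output/col=foo/part-00000. One RecordWriter stays open per
// distinct file name, so records are appended as they arrive -- no sort.
class KeyBasedOutputFormat extends MultipleTextOutputFormat[Any, Any] {

  // Don't write the key into the file; only the value goes out.
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()

  // Build a partition-style directory from the key, keeping the
  // task's original file name ("name" is e.g. "part-00000").
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    s"col=${key.toString}/$name"
}

object SplitByKey {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("SplitByKey").getOrCreate()

    // Hypothetical input; key each row by the partition column and
    // serialize the remaining fields however suits the data.
    spark.read.parquet("input")
      .rdd
      .map(row => (row.getAs[String]("col"), row.mkString(",")))
      .saveAsHadoopFile(
        "output",
        classOf[String],
        classOf[String],
        classOf[KeyBasedOutputFormat])

    spark.stop()
  }
}
```

The trade-off is that `saveAsHadoopFile` is an RDD-level API, so you give up Parquet and the DataFrame writer's committer logic, and each task holds one writer open per distinct key it sees, so memory grows with the key cardinality per task.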
On Thu, Apr 20, 2023 at 1:29 PM Nikhil Goyal <[email protected]> wrote:
> Hi folks,
>
> We are writing a dataframe and doing a partitionBy() on it:
>
>     df.write.partitionBy('col').parquet('output')
>
> The job is running super slow because internally, per partition, it does a
> sort before starting to write to the final location. This sort isn't useful
> in any way, since the number of files will remain the same. I was wondering
> if we could have Spark just open multiple file pointers, keep appending
> data as it arrives, and close all the pointers when it's done. This would
> reduce the memory footprint and speed up the job, since we would eliminate
> the sort. We can implement a custom source, but I can't see a way to really
> control this behavior in the sink. If anyone has any suggestions, please
> let me know.
>
> Thanks
> Nikhil
