Why only one file?
I would rather go for files of a specific size, e.g. splitting the data into 
1 GB files. One reason is that if you ever need to transfer the data (e.g. to 
another cloud), having one file of several terabytes is a bad idea.

It depends on your use case, but you might also look at partitioning the output, etc.
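
For illustration, a rough PySpark sketch of that idea (the paths and sizes 
below are made up, not from your setup): instead of coalesce(1), estimate the 
total input size and repartition so each output file comes out around 1 GB.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("merge-parquet").getOrCreate()

    # hypothetical input location holding the small parquet files to merge
    df = spark.read.parquet("s3://my-bucket/raw/")

    # assume the total input size is known or estimated up front
    total_size_gb = 50
    num_files = max(1, int(total_size_gb))  # aim for roughly 1 GB per output file

    # repartition shuffles into evenly sized partitions before writing
    df.repartition(num_files).write.mode("overwrite").parquet("s3://my-bucket/merged/")

Note that repartition does a full shuffle, which coalesce avoids, but it 
spreads the write across tasks instead of funneling everything through one.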

> On 31.08.2020, at 16:17, Tzahi File <tzahi.f...@ironsrc.com> wrote:
> 
> 
> Hi, 
> 
> I would like to develop a process that merges parquet files. 
> My first intention was to develop it with PySpark using coalesce(1), to 
> create only 1 file. 
> This process is going to run on a huge number of files.
> I wanted your advice on the best way to implement it (PySpark isn't a 
> must).  
> 
> 
> Thanks,
> Tzahi
