Try limiting the number of partitions: spark.sql.shuffle.partitions controls how many output files are generated.
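For example, a minimal sketch in Scala, assuming a SparkSession named `spark` and a DataFrame `df` produced by a shuffle (join/aggregation); the config value and output path are just placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-writer")
  // Lower the shuffle parallelism so fewer output files are produced
  // (the default is 200 partitions, i.e. up to 200 files per write).
  .config("spark.sql.shuffle.partitions", "20")
  .getOrCreate()

// ... build df via joins/aggregations ...
// df.write.mode("overwrite").parquet("/data/out")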
On 28 Nov 2016 8:29 p.m., "Kevin Tran" <kevin...@gmail.com> wrote:
> Hi Denny,
> Thank you for your input. I also use 128 MB, but the Spark app still
> generates too many files, each only ~14 KB! That's why I'm asking whether
> there is a solution, in case someone has had the same issue.
>
> Cheers,
> Kevin.
>
> On Mon, Nov 28, 2016 at 7:08 PM, Denny Lee <denny.g....@gmail.com> wrote:
>
>> Generally, yes - you should aim for larger file sizes because of the
>> overhead of opening files. Typical guidance is between 64 MB and 1 GB;
>> personally I usually stick with 128 MB-512 MB with the default snappy
>> codec compression for parquet. A good reference is Vida Ha's presentation
>> Data Storage Tips for Optimal Spark Performance
>> <https://spark-summit.org/2015/events/data-storage-tips-for-optimal-spark-performance/>.
>>
>> On Sun, Nov 27, 2016 at 9:44 PM Kevin Tran <kevin...@gmail.com> wrote:
>>
>>> Hi Everyone,
>>> Does anyone know the best practice for writing parquet files from Spark?
>>>
>>> When the Spark app writes data to parquet, the output directory contains
>>> heaps of very small parquet files (such as
>>> e73f47ef-4421-4bcc-a4db-a56b110c3089.parquet), each only about 15 KB.
>>>
>>> Should it instead write bigger chunks of data (such as 128 MB) with a
>>> proper number of files?
>>>
>>> Has anyone observed performance changes when changing the size of each
>>> parquet file?
>>>
>>> Thanks,
>>> Kevin.
>>>
>>
>
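For anyone landing on this thread later: a minimal sketch in Scala of explicitly coalescing before the write so each parquet file lands near the 128 MB guidance mentioned above. The helper name, the caller-supplied total-size estimate, and the output path are assumptions made up for illustration:

import org.apache.spark.sql.DataFrame

def writeWithTargetFileSize(df: DataFrame, path: String,
                            totalBytes: Long,
                            targetBytes: Long = 128L * 1024 * 1024): Unit = {
  // Rough estimate of how many files are needed to hit the target size.
  val numFiles = math.max(1, (totalBytes / targetBytes).toInt)
  df.coalesce(numFiles)
    .write
    .option("compression", "snappy") // snappy is Spark's default parquet codec
    .mode("overwrite")
    .parquet(path)
}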