So I should have done some back-of-the-napkin math before all of this. You are writing out 800 files, each under 128 MB. If they were all 128 MB, that would be about 100 GB of data being written. I'm not sure how much hardware you have, but the fact that you can shuffle roughly 100 GB to a single thread and write it out in 13 extra minutes actually feels really good for Spark. You are writing out roughly 130 MB/sec of compressed ORC data; it has been a little while since I benchmarked this, but that feels like the right order of magnitude. I would suggest trying to repartition to 10 or 100 threads instead of 1.
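
For reference, here is a minimal sketch of what that could look like in Spark Scala. The table name, partition column, and output path below are placeholders, not names from your job, and I'm writing ORC to match your setup:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("repartition-example")
  .getOrCreate()

// placeholder source; substitute your own DataFrame
val df = spark.table("staging_table")

df.repartition(10)              // 10 (or 100) write tasks instead of 1
  .write
  .mode("overwrite")
  .partitionBy("part_col")      // your Hive partition column
  .orc("/warehouse/tmp_table")  // placeholder output path

A variant is df.repartition(10, df("part_col")), which keeps all rows for a given partition value in a single task, so each partition folder still ends up with one large file while the shuffle and write are spread across multiple tasks.
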
On Tue, Jun 23, 2020 at 4:54 PM Hichki <harish.vs...@gmail.com> wrote:
> Hello Team,
>
> I am new to the Spark environment. I have converted a Hive query to Spark
> Scala. Now I am loading data and doing performance testing. Below are
> details on loading 3 weeks of data. The cluster-level small-file average
> size is set to 128 MB.
>
> 1. The new temp table I am loading into is ORC formatted, as the current
> Hive table is stored as ORC.
> 2. Each Hive table partition folder is 200 MB.
> 3. I am using repartition(1) in the Spark code so that it creates one
> 200 MB part file in each partition folder (to avoid the small-file issue).
> With this, the job completes in 23 to 26 mins.
> 4. If I don't use repartition(), the job completes in 12 to 13 mins, but
> this approach creates 800 part files (each < 128 MB) in each partition
> folder.
>
> I am not quite sure how to reduce processing time and avoid creating small
> files at the same time. Could anyone please help me in this situation?