Hi,

When reducing the number of partitions, it is better to use coalesce() because it does not need to shuffle the data.
dataframe.coalesce(1)

On Tue, Jun 23, 2020, 23:54 Hichki <harish.vs...@gmail.com> wrote:

> Hello Team,
>
> I am new to the Spark environment. I have converted a Hive query to Spark
> Scala. Now I am loading data and doing performance testing. Below are the
> details on loading 3 weeks of data. The cluster-level small-file average
> size is set to 128 MB.
>
> 1. The new temp table I am loading data into is ORC formatted, as the
> current Hive table is stored as ORC.
>
> 2. Each Hive table partition folder is 200 MB in size.
>
> 3. I am using repartition(1) in the Spark code so that it creates one
> 200 MB part file in each partition folder (to avoid the small-file issue).
> With this, the job completes in 23 to 26 minutes.
>
> 4. If I don't use repartition(), the job completes in 12 to 13 minutes.
> But the problem with this approach is that it creates 800 part files
> (each < 128 MB) in each partition folder.
>
> I am not sure how to reduce the processing time and avoid creating small
> files at the same time. Could anyone please help me with this situation?
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
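For reference, here is a minimal sketch of the idea in Spark Scala. The table name and output path are hypothetical placeholders, not taken from your job:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object CoalesceExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("coalesce-example")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical source table; substitute your own Hive table.
    val df = spark.table("source_table")

    // repartition(1) does a full shuffle of all rows into one partition.
    // coalesce(1) instead merges the existing partitions without a shuffle,
    // which is cheaper when you are only *reducing* the partition count.
    df.coalesce(1)
      .write
      .mode(SaveMode.Overwrite)
      .format("orc")
      .save("/tmp/output_path") // hypothetical path

    spark.stop()
  }
}
```

One caveat: coalesce(1) can also collapse the parallelism of the upstream stages, since Spark may run the whole preceding computation in a single task. If the job becomes slow because of that, a middle ground is to coalesce to a small number of partitions sized near your 128 MB target rather than exactly one.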