Is there a join involved in your SQL? If so, have a look at spark.sql.shuffle.partitions. It defaults to 200, which would explain the roughly 200 part files you are seeing.
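
For example, here is a minimal sketch of lowering that setting before running the insert. It assumes `hiveContext` is the same HiveContext as in your snippet, and 6 is only the file count you mentioned, not a recommendation:

```scala
// Assumption: `hiveContext` is the existing HiveContext from the thread.
// Set the number of partitions used after a shuffle (join/aggregation).
// The value "6" is taken from the question; tune it to your data volume.
hiveContext.setConf("spark.sql.shuffle.partitions", "6")

// Equivalent SQL form, if you prefer to keep everything in SQL:
// hiveContext.sql("SET spark.sql.shuffle.partitions=6")
```

This only affects stages that follow a shuffle, and it applies to every later query on the same context, so a value that suits the final insert may be too low for a heavy join earlier in the job.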
Srikanth

On Wed, Jul 8, 2015 at 1:29 AM, Umesh Kacha <umesh.ka...@gmail.com> wrote:

> Hi Srikanth, thanks for the response. I have the following code:
>
> hiveContext.sql("insert into... ").coalesce(6)
>
> The above code does not create 6 part files; it creates around 200 small files.
>
> Please guide. Thanks.
>
> On Jul 8, 2015 4:07 AM, "Srikanth" <srikanth...@gmail.com> wrote:
>
>> Did you do
>>
>> yourRdd.coalesce(6).saveAsTextFile()
>>
>> or
>>
>> yourRdd.coalesce(6)
>> yourRdd.saveAsTextFile()
>> ?
>>
>> Srikanth
>>
>> On Tue, Jul 7, 2015 at 12:59 PM, Umesh Kacha <umesh.ka...@gmail.com> wrote:
>>
>>> Hi, I tried both approaches, df.repartition(6) and df.coalesce(6), but
>>> neither reduces the number of part-xxxxx files. Even after calling these
>>> methods I still see around 200 small part files of about 20 MB each,
>>> which are again ORC files.
>>>
>>> On Tue, Jul 7, 2015 at 12:52 AM, Sathish Kumaran Vairavelu <vsathishkuma...@gmail.com> wrote:
>>>
>>>> Try the coalesce function to limit the number of part files.
>>>>
>>>> On Mon, Jul 6, 2015 at 1:23 PM kachau <umesh.ka...@gmail.com> wrote:
>>>>
>>>>> Hi, I have a couple of Spark jobs which process thousands of files
>>>>> every day. File sizes may vary from MBs to GBs. After the job finishes
>>>>> I usually save the output using the following code:
>>>>>
>>>>> finalJavaRDD.saveAsParquetFile("/path/in/hdfs"); OR
>>>>> dataFrame.write.format("orc").save("/path/in/hdfs") //storing as ORC file as of Spark 1.4
>>>>>
>>>>> The Spark job creates plenty of small part files in the final output
>>>>> directory. As far as I understand, Spark creates one part file for each
>>>>> partition/task; please correct me if I am wrong. How do we control the
>>>>> number of part files Spark creates? Eventually I would like to create a
>>>>> Hive table on top of this Parquet/ORC directory, and I have heard Hive
>>>>> is slow when there is a large number of small files.
>>>>> Please guide, I am new to Spark. Thanks in advance.