The following should work, as long as your table was created using Spark SQL:
    event_wk.repartition(2).write.partitionBy("eventDate").format("parquet").insertInto("event")

If you want to stick with "insert overwrite" for Hive compatibility, you can repartition twice instead of setting the global spark.sql.shuffle.partitions parameter:

    val eventwk = sqlContext.sql("some joins") // the joins still use the global shuffle partition setting
    val eventwkRepartitioned = eventwk.repartition(2)
    eventwkRepartitioned.registerTempTable("event_wk_repartitioned")

Then use "event_wk_repartitioned" in your insert statement. Registering a temp table is cheap.

HTH

On 29 January 2016 at 20:26, Benyi Wang <bewang.t...@gmail.com> wrote:

> I want to insert into a partitioned table using dynamic partitioning, but I
> don’t want to have 200 files per partition, because the files will be
> small in my case.
>
>     sqlContext.sql("""
>       |insert overwrite table event
>       |partition(eventDate)
>       |select
>       |  user,
>       |  detail,
>       |  eventDate
>       |from event_wk
>     """.stripMargin)
>
> The table “event_wk” is registered from a DataFrame via registerTempTable,
> and that DataFrame is built with some joins. If I set
> spark.sql.shuffle.partitions=2, the joins’ performance will be bad because
> that property is global.
>
> I can do something like this:
>
>     event_wk.repartition(2).write.partitionBy("eventDate").format("parquet").save(path)
>
> but then I have to handle adding partitions myself.
>
> Is there a way to control the number of files just for this last insert
> step?
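For completeness, a minimal end-to-end sketch of the temp-table approach above (untested; the partition count of 2 is illustrative, and it assumes sqlContext is a HiveContext with dynamic partitioning enabled, e.g. hive.exec.dynamic.partition.mode=nonstrict):

    // The joins run with the global spark.sql.shuffle.partitions setting,
    // so their parallelism is unaffected.
    val eventwk = sqlContext.sql("some joins")

    // Repartition only the final result: with 2 partitions, each eventDate
    // value is written by at most 2 tasks, i.e. at most 2 files.
    val eventwkRepartitioned = eventwk.repartition(2)
    eventwkRepartitioned.registerTempTable("event_wk_repartitioned")

    // Dynamic-partition insert overwrite, reading from the repartitioned
    // temp table instead of the original one.
    sqlContext.sql(
      """
        |insert overwrite table event
        |partition(eventDate)
        |select user, detail, eventDate
        |from event_wk_repartitioned
      """.stripMargin)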