I want to insert into a partitioned table using dynamic partitioning, but I
don’t want 200 files per partition, because in my case the files would be
very small.
sqlContext.sql( """
|insert overwrite table event
|partition(eventDate)
|select
| user,
| detail,
| eventDate
|from event_wk
""".stripMargin)
The table “event_wk” is registered from a DataFrame via registerTempTable,
and that DataFrame is built with several joins. If I set
spark.sql.shuffle.partitions=2, the joins’ performance will suffer, because
that property seems to be global.
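For context, this is roughly how event_wk is put together; the source tables
("users", "event_details") are placeholders, only the columns match the query
above:

val users   = sqlContext.table("users")
val details = sqlContext.table("event_details")
val event_wk = users
  .join(details, users("user") === details("user"))  // the joins shuffle into
  .select(users("user"), details("detail"),          // spark.sql.shuffle.partitions
          details("eventDate"))                       // tasks (200 by default)
event_wk.registerTempTable("event_wk")  // the table the INSERT above reads from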
I can do something like this:
event_wk.repartition(2).write.partitionBy("eventDate").format("parquet").save(path)
but then I have to handle adding the partitions to the metastore myself.
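Spelled out, that manual route would look roughly like this (the path is a
placeholder for the table location, and the ALTER TABLE loop is my own sketch
of the partition handling):

val tablePath = "/warehouse/event"  // placeholder location of the "event" table
event_wk
  .repartition(2)                   // at most 2 files per eventDate partition
  .write
  .partitionBy("eventDate")
  .format("parquet")
  .save(tablePath)                  // overwrite/append mode handling omitted
// every eventDate written above then has to be added to the metastore by hand
val dates = event_wk.select("eventDate").distinct().collect().map(_.get(0).toString)
dates.foreach { d =>
  sqlContext.sql(
    s"ALTER TABLE event ADD IF NOT EXISTS PARTITION (eventDate='$d') " +
    s"LOCATION '$tablePath/eventDate=$d'")
}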
Is there a way to control the number of files just for this last insert step?