I want to insert into a partitioned table using dynamic partitioning, but I
don’t want 200 files per partition, because in my case the files would be
very small.
sqlContext.sql( """
|insert overwrite table event
|partition(eventDate)
|select
| user,
| detail,
| eventDate
|from event_wk
""".stripMargin)
The table “event_wk” is registered from a DataFrame via registerTempTable,
and that DataFrame is built with several joins. If I set
spark.sql.shuffle.partitions=2, the joins’ performance will suffer, because
that property seems to be global.
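For context, this is roughly how event_wk is put together; the source tables
("users", "event_details") are placeholders, only the columns match the query
above:

val users   = sqlContext.table("users")
val details = sqlContext.table("event_details")
val event_wk = users
  .join(details, users("user") === details("user"))  // the joins shuffle into
  .select(users("user"), details("detail"),          // spark.sql.shuffle.partitions
          details("eventDate"))                       // tasks (200 by default)
event_wk.registerTempTable("event_wk")  // the table the INSERT above reads from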
I can do something like this:
event_wk.repartition(2).write.partitionBy("eventDate").format("parquet").save(path)
but then I have to handle adding the partitions to the metastore myself.
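Spelled out, that manual route would look roughly like this (the path is a
placeholder for the table location, and the ALTER TABLE loop is my own sketch
of the partition handling):

val tablePath = "/warehouse/event"  // placeholder location of the "event" table
event_wk
  .repartition(2)                   // at most 2 files per eventDate partition
  .write
  .partitionBy("eventDate")
  .format("parquet")
  .save(tablePath)                  // overwrite/append mode handling omitted
// every eventDate written above then has to be added to the metastore by hand
val dates = event_wk.select("eventDate").distinct().collect().map(_.get(0).toString)
dates.foreach { d =>
  sqlContext.sql(
    s"ALTER TABLE event ADD IF NOT EXISTS PARTITION (eventDate='$d') " +
    s"LOCATION '$tablePath/eventDate=$d'")
}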
Is there a way to control the number of files just for this last insert step?