Re: Write only one output file in Spark SQL

2017-08-11 Thread Chetan Khatri
What you can do is create the Hive table with a partition column, for example a date, then repartition the DataFrame on that column:

    val finalDf = dataFrame.repartition(dataFrame.col("date-column"))

and later run:

    insert overwrite table tablename partition(date-column) select * from tempview

That would work as expected.
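A sketch of this suggestion in PySpark, since the rest of the thread uses Python. It assumes a running SparkSession with Hive support, a source view named tempRaw, and a Hive table mydb.mytable partitioned on date_col; the table and column names are made up for illustration and a live cluster is required, so this is an outline rather than a tested recipe:

```python
# Hypothetical sketch of the repartition-then-insert approach.
# Requires a running Spark session with Hive support; all table and
# column names here are illustrative.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("one-file-per-partition")
         .enableHiveSupport()
         .getOrCreate())

df = spark.table("tempRaw")

# Repartitioning by the Hive partition column means each partition value
# is handled by one task, so each Hive partition gets one output file.
final_df = df.repartition(df["date_col"])
final_df.createOrReplaceTempView("tempview")

spark.sql("""
    INSERT OVERWRITE TABLE mydb.mytable PARTITION (date_col)
    SELECT * FROM tempview
""")
```

The key point is that the number of files per Hive partition tracks the number of tasks writing into it, which the repartition controls.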

Re: Write only one output file in Spark SQL

2017-08-11 Thread KhajaAsmath Mohammed
We set spark.sql.shuffle.partitions to 4, but in HDFS it is ending up with 200 files: 4 files actually contain data and the rest are zero bytes. My only requirement is for the Hive insert overwrite query from the Spark temporary table to run fast and end up with fewer files instead of more.
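The 200 files match Spark's default spark.sql.shuffle.partitions value of 200, which suggests the setting never reached the session that ran the insert. A sketch of setting it on the same HiveContext used for the query (context and table names assumed, and a live cluster is required to actually run this):

```python
# Sketch: apply the shuffle-partition setting to the SAME context that
# runs the INSERT, then verify it took effect. `hc` is the HiveContext
# from the thread; the table names are hypothetical.
hc.setConf("spark.sql.shuffle.partitions", "4")
assert hc.getConf("spark.sql.shuffle.partitions") == "4"

hc.sql("INSERT OVERWRITE TABLE mydb.mytable SELECT * FROM tempRaw")
```

If the setting is only applied after the query plan runs, or on a different context, the shuffle before the write still uses 200 tasks, which explains the 196 empty files.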

Re: Write only one output file in Spark SQL

2017-08-11 Thread Lukas Bradley
Please show the write() call, and the results in HDFS. What are all the files you see?

Re: Write only one output file in Spark SQL

2017-08-11 Thread KhajaAsmath Mohammed
tempTable = union_df.registerTempTable("tempRaw")

create = hc.sql('CREATE TABLE IF NOT EXISTS blab.pyspark_dpprq (vin string, utctime timestamp, description string, descriptionuom string, providerdesc string, dt_map string, islocation string, latitude double, longitude double, speed double, value

Re: Write only one output file in Spark SQL

2017-08-11 Thread Daniel van der Ende
Hi Asmath, Could you share the code you're running? Daniel

Write only one output file in Spark SQL

2017-08-11 Thread KhajaAsmath Mohammed
Hi, I am using Spark SQL to write data back to HDFS and it is resulting in multiple output files. I tried setting spark.sql.shuffle.partitions=1, but it resulted in very slow performance. I also tried coalesce and repartition, but the issue remains. Any suggestions? Thanks, Asmath
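For readers landing on this thread: the usual way to force a single output file is to coalesce to one partition immediately before the write, at the cost of funneling all data through a single task. A sketch, assuming a DataFrame named union_df and a hypothetical output path; it needs a live Spark session to run:

```python
# Sketch only; `union_df` and the output path are placeholders.
# coalesce(1) merges partitions without a full shuffle (unlike
# repartition(1)), but the single writing task makes this viable
# only for modest output sizes.
single = union_df.coalesce(1)
single.write.mode("overwrite").parquet("hdfs:///tmp/single_file_output")
```

For large outputs, repartitioning by the Hive partition column (as suggested earlier in the thread) scales better than forcing one global file.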