Re: Write only one output file in Spark SQL
What you can do is have Hive create a partition column, for example a date column, repartition the DataFrame on it:

val finalDf = dataFrame.repartition(dataFrame.col("date-column"))

and later run:

INSERT OVERWRITE TABLE tablename PARTITION (date-column) SELECT * FROM tempview

That would work as expected.

On 11-Aug-2017 11:03 PM, "KhajaAsmath Mohammed" wrote:

> We had spark.sql.shuffle.partitions set to 4, but in HDFS it ends up with
> 200 files; 4 files actually have data and the rest have zero bytes.
>
> My only requirement is for the Hive INSERT OVERWRITE query from the Spark
> temporary table to run fast and produce fewer files, instead of many files
> with zero bytes.
>
> I am using a Spark SQL Hive INSERT OVERWRITE query, not the write() method
> on the DataFrame, as that is not supported on a Kerberos cluster in Spark
> 1.6.
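The suggestion above can be sketched in PySpark to match the HiveContext code later in this thread. This is a minimal sketch, not tested code: the table and column names are placeholders, and the Spark calls are shown commented out since they need a live cluster.

```python
# Sketch of the repartition-before-insert approach, assuming Spark 1.6's
# HiveContext as used elsewhere in this thread. Names are hypothetical.
# Repartitioning on the Hive partition column makes each shuffle task hold
# rows for one partition value, so the dynamic INSERT OVERWRITE writes one
# file per Hive partition instead of one (possibly empty) file per task.

# final_df = df.repartition(df["date_column"])   # column-based repartition
# final_df.registerTempTable("tempview")

# Dynamic partition inserts usually need this Hive setting first:
set_stmt = "SET hive.exec.dynamic.partition.mode = nonstrict"
insert_stmt = ("INSERT OVERWRITE TABLE tablename PARTITION (date_column) "
               "SELECT * FROM tempview")
# hc.sql(set_stmt)
# hc.sql(insert_stmt)
print(insert_stmt)
```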
Re: Write only one output file in Spark SQL
We had spark.sql.shuffle.partitions set to 4, but in HDFS it ends up with 200 files; 4 files actually have data and the rest have zero bytes.

My only requirement is for the Hive INSERT OVERWRITE query from the Spark temporary table to run fast and produce fewer files, instead of many files with zero bytes.

I am using a Spark SQL Hive INSERT OVERWRITE query, not the write() method on the DataFrame, as that is not supported on a Kerberos cluster in Spark 1.6.

On Fri, Aug 11, 2017 at 12:23 PM, Lukas Bradley wrote:

> Please show the write() call, and the results in HDFS. What are all the
> files you see?
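One way to get fewer files while staying on the Hive-query path is to shrink the DataFrame's partitions before registering the temp table, since the insert writes one file per task of its final stage. A minimal sketch, assuming the same union_df/hc names as the code quoted later in this thread; the Spark calls are commented out since they need a cluster:

```python
# Sketch: coalesce before registerTempTable so the INSERT OVERWRITE runs
# with that many tasks and writes that many files, none of them empty.
# coalesce() merges existing partitions without a full shuffle, so it is
# usually much cheaper than forcing spark.sql.shuffle.partitions=1.
target_files = 4                               # desired number of output files

# coalesced = union_df.coalesce(target_files)  # no shuffle, unlike repartition
# coalesced.registerTempTable("tempRaw")
insert_stmt = "INSERT OVERWRITE TABLE blab.pyspark_dpprq SELECT * FROM tempRaw"
# hc.sql(insert_stmt)
print(insert_stmt)
```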
Re: Write only one output file in Spark SQL
Please show the write() call, and the results in HDFS. What are all the files you see?

On Fri, Aug 11, 2017 at 1:10 PM, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> wrote:

> tempTable = union_df.registerTempTable("tempRaw")
>
> create = hc.sql('CREATE TABLE IF NOT EXISTS blab.pyspark_dpprq (vin string, utctime timestamp, description string, descriptionuom string, providerdesc string, dt_map string, islocation string, latitude double, longitude double, speed double, value string)')
>
> insert = hc.sql('INSERT OVERWRITE TABLE blab.pyspark_dpprq SELECT * FROM tempRaw')
Re: Write only one output file in Spark SQL
tempTable = union_df.registerTempTable("tempRaw")

create = hc.sql('CREATE TABLE IF NOT EXISTS blab.pyspark_dpprq (vin string, utctime timestamp, description string, descriptionuom string, providerdesc string, dt_map string, islocation string, latitude double, longitude double, speed double, value string)')

insert = hc.sql('INSERT OVERWRITE TABLE blab.pyspark_dpprq SELECT * FROM tempRaw')

On Fri, Aug 11, 2017 at 11:00 AM, Daniel van der Ende <daniel.vandere...@gmail.com> wrote:

> Hi Asmath,
>
> Could you share the code you're running?
>
> Daniel
Re: Write only one output file in Spark SQL
Hi Asmath,

Could you share the code you're running?

Daniel

On Fri, 11 Aug 2017, 17:53 KhajaAsmath Mohammed, <mdkhajaasm...@gmail.com> wrote:

> Hi,
>
> I am using Spark SQL to write data back to HDFS, and it is resulting in
> multiple output files.
>
> I tried changing spark.sql.shuffle.partitions=1, but it resulted in very
> slow performance.
>
> I also tried coalesce and repartition, still the same issue. Any
> suggestions?
>
> Thanks,
> Asmath
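Since both coalesce and repartition come up in the question above, here is a plain-Python illustration (no Spark required) of why coalesce is the cheaper of the two: it merges existing partitions rather than reshuffling every row. The round-robin merge below is a simplification for illustration; Spark's actual grouping is locality-aware.

```python
def coalesce_partitions(partitions, n):
    """Simplified model of DataFrame.coalesce(n): merge the existing
    partitions down to at most n without moving rows one by one across
    the cluster (i.e. no shuffle). Round-robin grouping for illustration;
    Spark's real implementation groups partitions by executor locality."""
    if n >= len(partitions):
        return partitions
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)
    return merged

# Five shuffle partitions, two of them empty -- the zero-byte-file case:
parts = [[1], [2], [3], [], []]
print(coalesce_partitions(parts, 2))   # -> [[1, 3], [2]]: no empty outputs
```

With a real DataFrame, `df.coalesce(2)` would similarly fold the empty partitions into non-empty ones, so the insert writes two populated files instead of five files with three of them empty.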