Re: Write only one output file in Spark SQL

Chetan Khatri Fri, 11 Aug 2017 10:41:33 -0700

What you can do is create the Hive table with a partition column, for
example a date, then use
val finalDf = dataFrame.repartition(dataFrame.col("date-column")),
register finalDf as the temp view, and later run
insert overwrite table tablename partition(date-column) select * from tempview


That should work as expected.
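
A minimal PySpark sketch of the approach above, assuming hc is a
HiveContext as in the original post and the target table was created with
a partition column. The function name, the default temp view name, and the
column name dt used in the usage note are placeholders, not names from the
thread. It also assumes the partition column is the last column of the
DataFrame, since SELECT * in a dynamic partition insert takes the partition
value from the trailing column, and that the two Hive dynamic partition
settings shown need to be enabled on the cluster.

```python
def insert_overwrite_partitioned(hc, df, table, partition_col,
                                 temp_view="tempview"):
    """Repartition df by partition_col, then INSERT OVERWRITE into a
    Hive table partitioned by that column.

    Assumptions (not from the thread): hc is a HiveContext, df is a
    DataFrame whose last column is partition_col, and the column name
    needs no quoting in SQL.
    """
    # Dynamic partition inserts normally require these Hive settings.
    hc.setConf("hive.exec.dynamic.partition", "true")
    hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

    # Rows sharing a partition value are shuffled into the same task,
    # so each Hive partition is written by a single task.
    final_df = df.repartition(df[partition_col])
    final_df.registerTempTable(temp_view)

    statement = "INSERT OVERWRITE TABLE {0} PARTITION ({1}) SELECT * FROM {2}".format(
        table, partition_col, temp_view)
    hc.sql(statement)
    return statement
```

A call such as insert_overwrite_partitioned(hc, union_df, "mydb.mytable", "dt")
would then issue the insert against a hypothetical table created with
PARTITIONED BY (dt string).
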
On 11-Aug-2017 11:03 PM, "KhajaAsmath Mohammed" <mdkhajaasm...@gmail.com>
wrote:

> We had spark.sql.partitions set to 4, but in HDFS it ends up with 200
> files: 4 of them actually contain data and the rest are zero
> bytes.
>
> My only requirement is that the Hive insert overwrite query from a
> Spark temporary table runs fast and ends up with fewer files, rather
> than many zero-byte files.
>
> I am using a Spark SQL Hive insert overwrite query rather than the write
> method on the DataFrame, as that is not supported in Spark 1.6 on a
> Kerberos cluster.
>
>
> On Fri, Aug 11, 2017 at 12:23 PM, Lukas Bradley <lukasbrad...@gmail.com>
> wrote:
>
>> Please show the write() call, and the results in HDFS.  What are all the
>> files you see?
>>
>> On Fri, Aug 11, 2017 at 1:10 PM, KhajaAsmath Mohammed <
>> mdkhajaasm...@gmail.com> wrote:
>>
>>> tempTable = union_df.registerTempTable("tempRaw")
>>>
>>> create = hc.sql('''CREATE TABLE IF NOT EXISTS blab.pyspark_dpprq (vin
>>> string, utctime timestamp, description string, descriptionuom string,
>>> providerdesc string, dt_map string, islocation string, latitude double,
>>> longitude double, speed double, value string)''')
>>>
>>> insert = hc.sql('''INSERT OVERWRITE TABLE blab.pyspark_dpprq SELECT * FROM
>>> tempRaw''')
>>>
>>>
>>>
>>>
>>> On Fri, Aug 11, 2017 at 11:00 AM, Daniel van der Ende <
>>> daniel.vandere...@gmail.com> wrote:
>>>
>>>> Hi Asmath,
>>>>
>>>> Could you share the code you're running?
>>>>
>>>> Daniel
>>>>
>>>> On Fri, 11 Aug 2017, 17:53 KhajaAsmath Mohammed, <
>>>> mdkhajaasm...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>>
>>>>> I am using Spark SQL to write data back to HDFS, and it is producing
>>>>> multiple output files.
>>>>>
>>>>>
>>>>>
>>>>> I tried setting spark.sql.shuffle.partitions=1, but it
>>>>> resulted in very slow performance.
>>>>>
>>>>>
>>>>>
>>>>> I also tried coalesce and repartition, but still had the same issue.
>>>>> Any suggestions?
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Asmath
>>>>>
>>>>
>>>
>>
>
