Re: Write only one output file in Spark SQL

2017-08-11 Thread Chetan Khatri
What you can do is create the Hive table partitioned by a column, for example
a date, repartition the DataFrame on that column, e.g.
val finalDf = df.repartition(df.col("date-column")), and later run
INSERT OVERWRITE TABLE tablename PARTITION (date-column) SELECT * FROM tempview.

That would work as expected.
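
The suggestion above can be sketched roughly as follows. This is a minimal
sketch, not code from the thread: the HiveContext `hc`, the column name
`event_date`, the table `mydb.target_partitioned`, and the helper name are all
hypothetical, and the column overload of repartition shown here is the PySpark
2.x form of the Scala repartition(col) call mentioned above (on 1.6 it may
need the Scala API).

```python
def insert_by_date_partition(df, hc):
    """Repartition on the (hypothetical) date column, then run a
    dynamic-partition INSERT OVERWRITE so each date's rows are written
    together instead of being spread over every shuffle partition."""
    # Group rows for the same date into the same partitions before writing:
    final_df = df.repartition(df["event_date"])
    final_df.registerTempTable("tempview")
    # Dynamic partitioning must be enabled for PARTITION (event_date):
    hc.sql("SET hive.exec.dynamic.partition=true")
    hc.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
    hc.sql("INSERT OVERWRITE TABLE mydb.target_partitioned "
           "PARTITION (event_date) SELECT * FROM tempview")
```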
On 11-Aug-2017 11:03 PM, "KhajaAsmath Mohammed" wrote:



Re: Write only one output file in Spark SQL

2017-08-11 Thread KhajaAsmath Mohammed
We had spark.sql.shuffle.partitions set to 4, but in HDFS it is ending up
with 200 files; only 4 of them actually have data and the rest are zero bytes.

My only requirement is for the Hive INSERT OVERWRITE query from the Spark
temporary table to run fast and end up with fewer files, instead of many
zero-byte files.

I am using a Spark SQL Hive INSERT OVERWRITE query, not the write() method on
the DataFrame, as that is not supported in Spark 1.6 on a Kerberos cluster.
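
The 200 files with only 4 non-empty line up with how Spark writes one output
file per shuffle partition (spark.sql.shuffle.partitions defaults to 200):
if the shuffle key only hashes into a few partitions, the remaining
partitions are written out as zero-byte files. A pure-Python sketch of the
same hash-partitioning effect (all keys and counts here are hypothetical,
not data from the thread):

```python
SHUFFLE_PARTITIONS = 200

# Suppose the shuffle key has only 4 distinct values:
keys = ["2017-08-08", "2017-08-09", "2017-08-10", "2017-08-11"]
rows = [(k, i) for k in keys for i in range(1000)]

# Assign each row to a bucket the way a hash partitioner would:
partitions = [[] for _ in range(SHUFFLE_PARTITIONS)]
for key, value in rows:
    partitions[hash(key) % SHUFFLE_PARTITIONS].append((key, value))

# At most 4 buckets can be non-empty; the other ~196 "files" stay empty.
non_empty = sum(1 for p in partitions if p)
print(non_empty)
```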


On Fri, Aug 11, 2017 at 12:23 PM, Lukas Bradley wrote:



Re: Write only one output file in Spark SQL

2017-08-11 Thread Lukas Bradley
Please show the write() call, and the results in HDFS.  What are all the
files you see?

On Fri, Aug 11, 2017 at 1:10 PM, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> wrote:



Re: Write only one output file in Spark SQL

2017-08-11 Thread KhajaAsmath Mohammed
tempTable = union_df.registerTempTable("tempRaw")

create = hc.sql('CREATE TABLE IF NOT EXISTS blab.pyspark_dpprq (vin string,
utctime timestamp, description string, descriptionuom string, providerdesc
string, dt_map string, islocation string, latitude double, longitude
double, speed double, value string)')

insert = hc.sql('INSERT OVERWRITE TABLE blab.pyspark_dpprq SELECT * FROM
tempRaw')
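
One way to get fewer output files from the pipeline above is to reduce the
number of partitions before registering the temp table. A minimal sketch,
assuming the same `union_df` and HiveContext `hc` as in the code above; the
helper name and the file count of 4 are hypothetical:

```python
def write_few_files(union_df, hc, num_files=4):
    """Coalesce before registering the temp table so the Hive
    INSERT OVERWRITE writes num_files files instead of one file per
    shuffle partition (200 by default)."""
    # coalesce() is a narrow dependency: it merges partitions without a
    # full shuffle, unlike repartition().
    small_df = union_df.coalesce(num_files)
    small_df.registerTempTable("tempRaw")
    hc.sql('INSERT OVERWRITE TABLE blab.pyspark_dpprq SELECT * FROM tempRaw')
```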




On Fri, Aug 11, 2017 at 11:00 AM, Daniel van der Ende <daniel.vandere...@gmail.com> wrote:



Re: Write only one output file in Spark SQL

2017-08-11 Thread Daniel van der Ende
Hi Asmath,

Could you share the code you're running?

Daniel

On Fri, 11 Aug 2017, 17:53 KhajaAsmath Mohammed, wrote:

> Hi,
>
> I am using Spark SQL to write data back to HDFS and it is resulting in
> multiple output files.
>
> I tried changing spark.sql.shuffle.partitions=1, but it resulted in very
> slow performance.
>
> I also tried coalesce and repartition; still the same issue. Any
> suggestions?
>
> Thanks,
>
> Asmath