Re: Union of multiple data frames

2018-04-06 Thread Alessandro Solimando
Hello Cesar,
could you share some details: the number of columns, the average number of
rows in the DataFrames, the time spent computing the plan with all the
unions, and the time needed to perform the action?
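
One way to collect the two timings asked about here is to wrap each step in a small timer. The sketch below is an illustration only: `timed` is a hypothetical helper, not a Spark API, and plain Scala collections stand in for the DataFrames so that it runs without Spark. With real data the first measured step would be the `reduce(_ union _)` call and the second an action such as `count()`.

```scala
object TimingSketch {
  // Hypothetical helper (not from the thread): evaluates `block` once and
  // returns its result together with the elapsed wall-clock time in ms.
  def timed[A](block: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = block
    val elapsedMs = (System.nanoTime() - start) / 1e6
    (result, elapsedMs)
  }

  def main(args: Array[String]): Unit = {
    // Stand-ins for DataFrames (assumption): each inner Seq plays one frame.
    val parts: Seq[Seq[Int]] = Seq(Seq(1, 2), Seq(3), Seq(4, 5, 6))

    // Step 1: build the combined value (with Spark: the chain of unions).
    val (unioned, buildMs) = timed(parts.reduce(_ ++ _))

    // Step 2: run something over it (with Spark: an action like count()).
    val (rowCount, actionMs) = timed(unioned.size)

    println(f"build: $buildMs%.3f ms, action: $actionMs%.3f ms, rows: $rowCount")
  }
}
```

Spark evaluates lazily, so some planning work may only happen once the action runs; the split between the two numbers is therefore approximate.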

Thanks,
Alessandro



Re: Union of multiple data frames

2018-04-05 Thread Cesar
Thanks for your answers.

The suggested method works when the number of DataFrames is small.

However, I am trying to union more than 30 DataFrames, and creating the plan
takes longer than the execution itself, which should not be the case.
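
One idea that might be worth measuring for a large number of inputs is to combine them pairwise in rounds instead of one after another, so that the nesting depth of the result grows roughly with log2(n) instead of n. The sketch below is only an illustration of that shape: `balancedReduce` is a hypothetical helper, not a Spark API, and strings stand in for DataFrames so the nesting is visible. Whether this shortens Spark's planning time is an assumption that nobody in this thread has confirmed.

```scala
object BalancedReduce {
  // Hypothetical helper: combines neighbours pairwise, round after round.
  // `op` is assumed to be associative; element order is preserved.
  def balancedReduce[A](xs: Seq[A])(op: (A, A) => A): A = {
    require(xs.nonEmpty, "cannot reduce an empty sequence")
    if (xs.size == 1) xs.head
    else {
      val nextRound = xs.grouped(2).map {
        case Seq(a, b) => op(a, b)
        case Seq(a)    => a // an odd element carries over to the next round
      }.toVector
      balancedReduce(nextRound)(op)
    }
  }

  def main(args: Array[String]): Unit = {
    val names = Seq("a", "b", "c", "d")
    // Sequential combination nests to the left, one level per element.
    println(names.reduce((l, r) => s"($l $r)"))
    // Pairwise combination keeps the nesting shallow.
    println(balancedReduce(names)((l, r) => s"($l $r)"))
  }
}
```

With DataFrames the call would presumably look like `balancedReduce(dfs)(_ union _)`, where `dfs` is the Seq of DataFrames with identical schemas.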

Thanks!
--
Cesar



Re: Union of multiple data frames

2018-04-05 Thread Andy Davidson

Hi Cesar

I have used Brandon's approach in the past without any problems

Andy




Re: Union of multiple data frames

2018-04-05 Thread Brandon Geise
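Maybe something like:

// Seed with the first DataFrame: emptyDataFrame has no columns, so a
// union with it fails on the column-count check.
var finalDF = dfs.head
for (df <- dfs.tail) {
    finalDF = finalDF.union(df)
}

Where dfs is a non-empty Seq of DataFrames.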
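
For comparison, a loop of this kind builds the same left-nested chain of unions as the `reduce(_ union _)` call in the original question. The sketch below illustrates that with a stand-in model: the `union` function on strings is hypothetical and only records how the calls nest, and the loop is assumed to start from the first element of the sequence.

```scala
object LoopVersusReduce {
  // Stand-in for a plan node (assumption): records the nesting as text.
  def union(left: String, right: String): String = s"Union($left, $right)"

  // Accumulate with a mutable variable, one union per remaining element.
  def viaLoop(dfs: Seq[String]): String = {
    var acc = dfs.head
    for (df <- dfs.tail) acc = union(acc, df)
    acc
  }

  // The same accumulation expressed as a reduce.
  def viaReduce(dfs: Seq[String]): String =
    dfs.reduce((l, r) => union(l, r))

  def main(args: Array[String]): Unit = {
    val dfs = Seq("df1", "df2", "df3")
    println(viaLoop(dfs))
    println(viaReduce(dfs))
  }
}
```

Both functions produce the same nesting, so the loop would not be expected to change how long the plan takes to build.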

 




Union of multiple data frames

2018-04-05 Thread Cesar
The following code works for small n, but not for large n (>20):

val dfUnion = Seq(df1,df2,df3,...dfn).reduce(_ union _)
dfUnion.show()

By not working, I mean that Spark takes a lot of time to create the
execution plan.

*Is there a more efficient way to perform a union of multiple data frames?*


thanks
-- 
Cesar Flores