Re: Union of multiple data frames
Hello Cesar,

can you add some details, such as the number of columns, the average number of rows in the DFs, the time spent computing the plan with all the unions, and the time needed to perform the action?

Thanks,
Alessandro

On 5 April 2018 at 23:22, Cesar <ces...@gmail.com> wrote:

> Thanks for your answers.
>
> The suggested method works when the number of Data Frames is small.
>
> However, I am trying to union >30 Data Frames, and the time to create the
> plan is taking longer than the execution, which should not be the case.
>
> Thanks!
> --
> Cesar
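The measurements requested above could be gathered with a small timing helper. The following is only a sketch: PlanTiming and timed are hypothetical names, and the Spark calls are shown in comments because they need a live session and the dfs from this thread. Accessing queryExecution.executedPlan is a developer-facing way to force planning without running a job; the action timed afterwards may still include some planning of its own, so the split is approximate.

```scala
object PlanTiming {
  /** Runs `body` once and returns its result together with the elapsed
    * wall-clock time in milliseconds. */
  def timed[A](body: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = body
    val elapsedMs = (System.nanoTime() - start) / 1e6
    (result, elapsedMs)
  }
}

// Hypothetical usage, assuming `dfs: Seq[DataFrame]` as in this thread:
//   val (dfUnion, buildMs) = PlanTiming.timed(dfs.reduce(_ union _))
//   val (_, planMs)   = PlanTiming.timed(dfUnion.queryExecution.executedPlan)
//   val (_, actionMs) = PlanTiming.timed(dfUnion.count())
//   println(s"columns=${dfUnion.columns.length} " +
//     s"build=$buildMs ms plan=$planMs ms action=$actionMs ms")
```

Reporting the three numbers separately shows whether the time goes into building the chain of unions, into planning, or into the job itself.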
Re: Union of multiple data frames
Thanks for your answers.

The suggested method works when the number of Data Frames is small.

However, I am trying to union more than 30 Data Frames, and creating the plan takes longer than the execution itself, which should not be the case.

Thanks!
--
Cesar

On Thu, Apr 5, 2018 at 1:29 PM, Andy Davidson <a...@santacruzintegration.com> wrote:

> Hi Cesar
>
> I have used Brandon's approach in the past without any problem
>
> Andy
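One workaround that might be worth trying when planning dominates is to do the union at the RDD level, so that the query planner sees a single scan instead of a nested chain of unions. This is a sketch under assumptions, not something confirmed in this thread: RddUnion and unionViaRdd are hypothetical names, Spark is assumed to be on the classpath, and all DataFrames are assumed to share exactly the same schema. Going through RDD[Row] has its own cost (rows are converted out of Spark's internal format), so it would need to be measured against the plain union.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object RddUnion {
  /** Unions DataFrames through their underlying RDDs and rebuilds a single
    * DataFrame from the result. Hypothetical helper for illustration. */
  def unionViaRdd(spark: SparkSession, dfs: Seq[DataFrame]): DataFrame = {
    require(dfs.nonEmpty, "need at least one DataFrame")
    val schema = dfs.head.schema
    // Strict check: column names, types and nullability must all match.
    require(dfs.forall(_.schema == schema), "all DataFrames must share one schema")
    // SparkContext.union combines many RDDs in a single step.
    val rows = spark.sparkContext.union(dfs.map(_.rdd))
    spark.createDataFrame(rows, schema)
  }
}
```

With the dfs from this thread the call would be RddUnion.unionViaRdd(spark, dfs).show().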
Re: Union of multiple data frames
Hi Cesar

I have used Brandon's approach in the past without any problem

Andy

From: Brandon Geise <brandonge...@gmail.com>
Date: Thursday, April 5, 2018 at 11:23 AM
To: Cesar <ces...@gmail.com>, "user @spark" <user@spark.apache.org>
Subject: Re: Union of multiple data frames

> Maybe something like
>
>     // Start from the first DataFrame: spark.sqlContext.emptyDataFrame has no
>     // columns, so a union with it fails on the column-count check.
>     var finalDF = dfs.head
>     for (df <- dfs.tail) {
>       finalDF = finalDF.union(df)
>     }
>
> Where dfs is a non-empty Seq of dataframes.
Re: Union of multiple data frames
Maybe something like

    // Start from the first DataFrame: spark.sqlContext.emptyDataFrame has no
    // columns, so a union with it fails on the column-count check.
    var finalDF = dfs.head
    for (df <- dfs.tail) {
      finalDF = finalDF.union(df)
    }

Where dfs is a non-empty Seq of dataframes.

From: Cesar <ces...@gmail.com>
Date: Thursday, April 5, 2018 at 2:17 PM
To: user <user@spark.apache.org>
Subject: Union of multiple data frames

> The following code works for small n, but not for large n (>20):
>
>     val dfUnion = Seq(df1,df2,df3,...dfn).reduce(_ union _)
>     dfUnion.show()
>
> By not working, I mean that Spark takes a lot of time to create the
> execution plan.
>
> Is there a more optimal way to perform a union of multiple data frames?
>
> thanks
> --
> Cesar Flores
Union of multiple data frames
The following code works for small n, but not for large n (>20):

    val dfUnion = Seq(df1,df2,df3,...dfn).reduce(_ union _)
    dfUnion.show()

By not working, I mean that Spark takes a lot of time to create the execution plan.

*Is there a more optimal way to perform a union of multiple data frames?*

thanks
--
Cesar Flores
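One idea that may help with the reduce shown in the question is to combine the DataFrames pairwise as a balanced tree instead of left to right. A left-to-right reduce nests the plan n-1 levels deep, while a balanced reduce keeps the nesting near log2(n). Whether this actually shortens planning is an assumption that would have to be measured. BalancedReduce and balancedReduce are hypothetical names, and strings stand in for DataFrames here so that the sketch runs without Spark.

```scala
object BalancedReduce {
  /** Combines the items pairwise as a balanced tree rather than as a
    * left-deep chain. Requires a non-empty sequence. */
  def balancedReduce[A](items: Seq[A])(combine: (A, A) => A): A = {
    require(items.nonEmpty, "balancedReduce needs at least one element")
    if (items.size == 1) items.head
    else {
      // Split in half, reduce each half, then combine the two results.
      val (left, right) = items.splitAt(items.size / 2)
      combine(balancedReduce(left)(combine), balancedReduce(right)(combine))
    }
  }

  def main(args: Array[String]): Unit = {
    // Strings stand in for DataFrames so the nesting is visible.
    val shape = balancedReduce(Seq("df1", "df2", "df3", "df4"))((l, r) => s"($l U $r)")
    println(shape) // ((df1 U df2) U (df3 U df4))
  }
}
```

With Spark on the classpath the same helper would be called as BalancedReduce.balancedReduce(dfs)(_ union _), where dfs is the Seq of data frames from the question.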