Re: what is the optimized way to combine multiple dataframes into one dataframe ?

2016-11-16 Thread Deepak Sharma
Can you try caching the individual dataframes and then union them?
It may save you time.
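
A rough sketch of that suggestion, assuming the four inputs share the same column order (union matches columns by position). The object name, the helper cacheAndUnion, the sample rows, and the column types are made up for illustration; the thread only gives the column names. Caching mainly pays off when the inputs are reused by more than one action, so it may not help if each dataframe is read only once.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.storage.StorageLevel

object CacheThenUnion {
  // Mark every input for persistence, then fold the inputs together with union.
  // persist is lazy: the data is materialized by the first action that runs.
  def cacheAndUnion(dfs: Seq[DataFrame]): DataFrame = {
    require(dfs.nonEmpty, "need at least one DataFrame")
    dfs.map(_.persist(StorageLevel.MEMORY_AND_DISK)).reduce(_ union _)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cache-then-union")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-ins for df1..df4 from the question.
    val cols = Seq("client_id", "product_id", "interest")
    val df1 = Seq((1, 10, 0.5)).toDF(cols: _*)
    val df2 = Seq((2, 20, 0.7)).toDF(cols: _*)
    val df3 = Seq((3, 30, 0.1)).toDF(cols: _*)
    val df4 = Seq((4, 40, 0.9)).toDF(cols: _*)

    val combined = cacheAndUnion(Seq(df1, df2, df3, df4))
    println(combined.count()) // the action that triggers the work

    spark.stop()
  }
}
```

Folding a Seq with reduce also avoids writing out the chain by hand when the number of dataframes changes.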

Thanks
Deepak

On Wed, Nov 16, 2016 at 12:35 PM, Devi P.V wrote:

> Hi all,
>
> I have 4 data frames with three columns,
>
> client_id,product_id,interest
>
> I want to combine these 4 dataframes into one dataframe. I used union as
> follows:
>
> df1.union(df2).union(df3).union(df4)
>
> But it is time consuming for big data. What is the optimized way of doing
> this using Spark 2.0 and Scala?
>
>
> Thanks
>


RE: what is the optimized way to combine multiple dataframes into one dataframe ?

2016-11-15 Thread Shreya Agarwal
If you are reading all these datasets from files in persistent storage,
functions like sc.textFile can take folders or glob patterns as input and read
all of the matching files into a single RDD. You can then convert that RDD to a
dataframe.
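
A minimal sketch of that approach, assuming the files are comma-separated text with no header row. The Interest case class, its field types, the helper names, and the example path are all assumptions; the thread only names the three columns.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.util.Try

object ReadAllAtOnce {
  // Assumed record layout; the field types are a guess.
  case class Interest(client_id: String, product_id: String, interest: Double)

  // Parse one comma-separated line; malformed lines yield None.
  def parseLine(line: String): Option[Interest] =
    line.split(",", -1).map(_.trim) match {
      case Array(c, p, i) => Try(Interest(c, p, i.toDouble)).toOption
      case _              => None
    }

  // pathPattern may be a directory, a glob, or a comma-separated list of paths.
  def load(spark: SparkSession, pathPattern: String): DataFrame = {
    import spark.implicits._
    spark.sparkContext
      .textFile(pathPattern)
      .flatMap(line => parseLine(line).toList) // silently drops unparsable lines
      .toDF()
  }
}
```

Usage would look like ReadAllAtOnce.load(spark, "/data/interest/*.csv"), where the path is hypothetical. This reads everything into one dataframe up front, so no union is needed afterwards.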

When you say it is time consuming with union, how are you measuring that? Did
you compare having all of the data in one DF against keeping it split across
several? Are you seeing a non-linear slowdown in operations after the union
with a linear increase in data size?
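
One way to measure this, as a sketch only: union is a lazy transformation, so the call itself returns almost immediately and the cost only shows up when an action runs. The helper names below are made up, and count() is just a stand-in for whatever action the real job performs.

```scala
import org.apache.spark.sql.DataFrame

object UnionTiming {
  // Run a block and return its result with the elapsed wall-clock milliseconds.
  def timed[A](block: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = block
    (result, (System.nanoTime() - start) / 1000000L)
  }

  // Time the same action on the separate inputs and on their union.
  // Neither side is cached here, so both timings include reading the inputs.
  def compare(dfs: Seq[DataFrame]): Unit = {
    val (_, separateMs) = timed(dfs.map(_.count()).sum)
    val (_, unionMs)    = timed(dfs.reduce(_ union _).count())
    println(s"separate counts: $separateMs ms, count after union: $unionMs ms")
  }
}
```

If the two timings are close, the union itself is probably not the bottleneck and the time is going into reading or computing the inputs.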

From: Devi P.V
Sent: Tuesday, November 15, 2016 11:06 PM
To: user @spark
Subject: what is the optimized way to combine multiple dataframes into one 
dataframe ?

Hi all,

I have 4 data frames with three columns,

client_id,product_id,interest

I want to combine these 4 dataframes into one dataframe. I used union as
follows:

df1.union(df2).union(df3).union(df4)

But it is time consuming for big data. What is the optimized way of doing this
using Spark 2.0 and Scala?


Thanks