No, you don't need 30 dataframes and self-joins. Convert the list of columns to a
list of corr expressions, and then pass that list to the agg function.
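A minimal sketch of that suggestion, meant to be pasted into spark-shell or a Scala worksheet. The sample rows, the `corr_<name>` aliases, and the local session settings are assumptions made for illustration; only `groupBy`, `agg`, and `functions.corr` come from the thread.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.corr

// Local session for illustration only (in spark-shell this returns the existing one).
val spark = SparkSession.builder().master("local[1]").appName("corr-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data, not the data from the thread.
val df: DataFrame = Seq(
  (1, 1.0, 6.0, 1.0),
  (1, 2.0, 4.0, 2.0),
  (1, 3.0, 2.0, 3.0),
  (2, 1.0, 1.0, 2.0),
  (2, 2.0, 2.0, 4.0),
  (2, 3.0, 3.0, 6.0)
).toDF("groupid", "datacol1", "datacol2", "corr_col")

// Names of the columns to correlate against corr_col.
val dataCols: Seq[String] = Seq("datacol1", "datacol2")

// One corr(...) aggregate expression per data column, aliased so the output is readable.
val corrExprs: Seq[Column] = dataCols.map(c => corr(c, "corr_col").alias(s"corr_$c"))

// agg takes (Column, Column*), so the list is split into head and tail.
val result: DataFrame = df.groupBy("groupid").agg(corrExprs.head, corrExprs.tail: _*)
result.show()
```

This runs a single grouped aggregation regardless of how many data columns there are, so no joins are involved.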
From: "ckgppl_...@sina.cn"
Reply-To: "ckgppl_...@sina.cn"
Date: Wednesday, March 16, 2022 at 8:16 AM
To: Enrico Minack , Sean Owen
Cc: user
Subject: [EXTERNAL] Re: Re: Re: Re: calculate correlation between multiple
columns and one specific column after groupby the spark data frame
Thanks, Enrico.
I just found that I need to group the data frame first and then calculate the
correlation, so I end up with a list of data frames, not columns.
So I used the following solution:
1. Use the following code to create a mutable data frame df_all, using the
first datacol to calculate the correlation:
df.groupBy("groupid").agg(functions.corr("datacol1", "corr_col"))
2. Iterate over the remaining datacol columns, creating a temporary data frame
in each iteration. In each iteration, join df_all with the temporary data frame
on the groupid column, then drop the duplicated groupid column.
3. After the iteration, I get a data frame that contains all the
correlation data.
I still need to verify the data to make sure it is valid.
Liang
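The three steps above can be sketched roughly as follows. This is an illustration under assumptions: the sample rows, the `corrFor` helper, the `dfAll` name, and the `corr_<name>` aliases are hypothetical, and a fold replaces the mutable variable. Joining on `Seq("groupid")` keeps a single groupid column, so there is no duplicate to drop afterwards.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.corr

// Local session for illustration only.
val spark = SparkSession.builder().master("local[1]").appName("corr-join-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data, not the data from the thread.
val df: DataFrame = Seq(
  (1, 1.0, 6.0, 1.0),
  (1, 2.0, 4.0, 2.0),
  (1, 3.0, 2.0, 3.0),
  (2, 1.0, 1.0, 2.0),
  (2, 2.0, 2.0, 4.0),
  (2, 3.0, 3.0, 6.0)
).toDF("groupid", "datacol1", "datacol2", "corr_col")

val dataCols: Seq[String] = Seq("datacol1", "datacol2")

// Hypothetical helper: per-group correlation of one data column with corr_col.
def corrFor(c: String): DataFrame =
  df.groupBy("groupid").agg(corr(c, "corr_col").alias(s"corr_$c"))

// Step 1: start from the first data column.
// Steps 2 and 3: join in one temporary data frame per remaining column.
val dfAll: DataFrame = dataCols.tail.foldLeft(corrFor(dataCols.head)) { (acc, c) =>
  acc.join(corrFor(c), Seq("groupid")) // usingColumns join keeps one groupid column
}
dfAll.show()
```

Each extra column adds another aggregation and another join to the plan, which is why a single `agg` call over a list of expressions is usually the simpler choice.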
- Original Message -
From: Enrico Minack
To: ckgppl_...@sina.cn, Sean Owen
Cc: user
Subject: Re: Re: Re: calculate correlation between multiple columns and one
specific column after groupby the spark data frame
Date: March 16, 2022, 19:53
If you have a list of Columns called `columns`, you can pass them to the `agg`
method as:
agg(columns.head, columns.tail: _*)
Enrico
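The head/tail split is needed because `agg` takes one required Column followed by a varargs list. The plain-Scala sketch below shows the same call shape without Spark; `aggLike` and the string values are hypothetical stand-ins, not part of the Spark API.

```scala
// Hypothetical stand-in with the same parameter shape as agg(expr, exprs*).
def aggLike(first: String, rest: String*): Seq[String] = first +: rest

val columns: List[String] = List(
  "corr(datacol1, corr_col)",
  "corr(datacol2, corr_col)",
  "corr(datacol3, corr_col)"
)

// `: _*` expands the tail of the list into the varargs parameter.
val passed: Seq[String] = aggLike(columns.head, columns.tail: _*)

// An empty list has no head, so guard with headOption when the list may be empty.
val safe: Option[Seq[String]] = columns.headOption.map(h => aggLike(h, columns.tail: _*))
```

With a list that might be empty, calling `.head` directly throws, so the `headOption` form is the safer variant.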
On 16.03.22 at 08:02, ckgppl_...@sina.cn wrote:
Thanks, Sean. I modified the code and have generated a list of columns.
I am now working on converting the list of columns into a new data frame. It
seems that there is no direct API to do this.
- Original Message -
From: Sean Owen <sro...@gmail.com>
To: ckgppl_...@sina.cn
Cc: user <user@spark.apache.org>
Subject: Re: calculate correlation between multiple columns and one specific column
after groupby the spark data frame
Date: March 16, 2022, 11:55
Are you just trying to avoid writing the function call 30 times? Just put this
in a loop over all the columns instead, which adds a new corr col every time to
a list.
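A rough sketch of such a loop, assuming the data columns can be recognised by a `datacol` name prefix (an assumption for illustration, as are the empty data frame and the `corr_<name>` aliases). Only `df.columns` is consulted, so no data is needed to build the list.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.corr
import scala.collection.mutable.ListBuffer

// Local session for illustration only.
val spark = SparkSession.builder().master("local[1]").appName("corr-loop-sketch").getOrCreate()
import spark.implicits._

// Hypothetical empty data frame with the column layout described in the thread.
val df = Seq.empty[(Int, Double, Double, Double, Double)]
  .toDF("groupid", "datacol1", "datacol2", "datacol3", "corr_col")

// Loop over all columns and add one corr expression per data column to the list.
val corrCols = ListBuffer.empty[Column]
for (name <- df.columns if name.startsWith("datacol")) {
  corrCols += corr(name, "corr_col").alias(s"corr_$name")
}

corrCols.foreach(println)
```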
On Tue, Mar 15, 2022, 10:30 PM <ckgppl_...@sina.cn> wrote:
Hi all,
I am stuck at a correlation calculation problem. I have a dataframe like below:
groupid  datacol1  datacol2  datacol3  datacol*  corr_col
1        1         2         3         4         5
1        2         3         4         6         5
2        4         2         1         7         5
2        8         9         3         2         5
3        7         1         2         3         5
3        3         5         3         1         5
I want to calculate the correlation between each datacol column and the
corr_col column within each groupid.
So I used the following Spark Scala API code:
df.groupBy("groupid").agg(
  functions.corr("datacol1", "corr_col"),
  functions.corr("datacol2", "corr_col"),
  functions.corr("datacol3", "corr_col"),
  functions.corr("datacol*", "corr_col"))
This is very inefficient to write. If I have 30 datacol columns, I need to type
functions.corr 30 times to calculate the correlations.
I have searched, and it seems that functions.corr doesn't accept a List/Array
parameter, and df.agg doesn't accept a function as a parameter.
Is there any Spark Scala API code that can do this job efficiently?
Thanks
Liang