Thanks, Enrico.

I just found that I need to group the data frame first and then calculate the
correlation, so I end up with a list of data frames rather than a list of
columns. I used the following solution.

First, create a mutable data frame df_all, using the first datacol column to
calculate the correlation:

  df.groupBy("groupid").agg(functions.corr("datacol1", "corr_col"))

Then iterate over all remaining datacol columns, creating a temporary data
frame in each iteration. In each iteration, join df_all with the temporary data
frame on the groupid column, then drop the duplicated groupid column.

After the iteration, df_all contains all the correlation data.

I still need to verify the data to make sure it is valid.
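
The join-based approach described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the local SparkSession, the sample values, the `corr_` alias prefix, and the helper name `corrFrame` are all hypothetical and not from the thread. It uses a fold instead of a mutable variable, and joins on `Seq("groupid")` so that only one groupid column is kept.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.corr

// Local session for illustration only (assumption).
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("corr-join-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data; corr_col varies so the correlations are defined.
val df = Seq(
  ("00001", 1.0, 2.0, 1.0),
  ("00001", 2.0, 1.0, 2.0),
  ("00001", 3.0, 0.0, 3.0),
  ("00002", 4.0, 2.0, 8.0),
  ("00002", 8.0, 4.0, 4.0)
).toDF("groupid", "datacol1", "datacol2", "corr_col")

// Assumes every data column name starts with "datacol".
val dataCols: Seq[String] = df.columns.filter(_.startsWith("datacol")).toSeq

// One small aggregated frame per data column, keyed by groupid.
def corrFrame(c: String): DataFrame =
  df.groupBy("groupid").agg(corr(c, "corr_col").as(s"corr_$c"))

// Start from the first column's frame, then join each remaining frame on groupid.
// Joining on Seq("groupid") keeps a single groupid column in the result.
val dfAll: DataFrame = dataCols.tail.foldLeft(corrFrame(dataCols.head)) { (acc, c) =>
  acc.join(corrFrame(c), Seq("groupid"))
}
```

With 30 data columns this runs 30 aggregations and 29 joins, so a single `agg` call over a list of columns is likely the cheaper option.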
Liang

----- Original Message -----
From: Enrico Minack <[email protected]>
To: [email protected], Sean Owen <[email protected]>
Cc: user <[email protected]>
Subject: Re: Re: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame
Date: 2022-03-16 19:53
If you have a list of Columns called `columns`, you can pass them to the
`agg` method as:

  agg(columns.head, columns.tail: _*)
Enrico
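
For context, here is a minimal end-to-end sketch of that call. The local SparkSession, the sample values, the `startsWith("datacol")` filter, and the `corr_` aliases are assumptions made for illustration; only the `agg(columns.head, columns.tail: _*)` pattern comes from the message above.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.corr

// Local session for illustration only (assumption).
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("corr-agg-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data; corr_col varies so the correlations are defined.
val df = Seq(
  ("00001", 1.0, 2.0, 1.0),
  ("00001", 2.0, 1.0, 2.0),
  ("00001", 3.0, 0.0, 3.0),
  ("00002", 4.0, 2.0, 8.0),
  ("00002", 8.0, 4.0, 4.0)
).toDF("groupid", "datacol1", "datacol2", "corr_col")

// One aliased corr(...) Column per data column.
val columns: Seq[Column] = df.columns
  .filter(_.startsWith("datacol"))
  .map(c => corr(c, "corr_col").as(s"corr_$c"))
  .toSeq

// agg takes one Column plus varargs, hence the head / tail split.
val result = df.groupBy("groupid").agg(columns.head, columns.tail: _*)
```

The result has one row per groupid and one `corr_*` column per data column, computed in a single aggregation.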
On 16.03.22 at 08:02, [email protected] wrote:
Thanks, Sean. I modified the code and have generated a list of columns.
I am now working on converting the list of columns to a new data frame. It
seems that there is no direct API to do this.
----- Original Message -----
From: Sean Owen <[email protected]>
To: [email protected]
Cc: user <[email protected]>
Subject: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame
Date: 2022-03-16 11:55
Are you just trying to avoid writing the function call 30
times? Just put this in a loop over all the columns instead,
which adds a new corr col every time to a list.
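
A rough sketch of such a loop, assuming the data columns are named datacol1 through datacol30 (hypothetical names) and aliasing each result with a `corr_` prefix (also an assumption):

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.corr

// Hypothetical column names; in practice they could be taken from df.columns.
val dataCols: Seq[String] = (1 to 30).map(i => s"datacol$i")

// Build one corr(...) Column per data column instead of writing 30 calls by hand.
val corrCols: Seq[Column] = dataCols.map(c => corr(c, "corr_col").as(s"corr_$c"))
```

Nothing is computed at this point; the list only describes the aggregations to run later.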
On Tue, Mar 15, 2022, 10:30 PM <[email protected]> wrote:
Hi all,

I am stuck on a correlation calculation problem. I have a dataframe like
below:

groupid  datacol1  datacol2  datacol3  datacol*  corr_col
00001    1         2         3         4         5
00001    2         3         4         6         5
00002    4         2         1         7         5
00002    8         9         3         2         5
00003    7         1         2         3         5
00003    3         5         3         1         5

I want to calculate the correlation between each datacol column and the
corr_col column within each groupid. So I used the following Spark Scala API
code:

  df.groupBy("groupid").agg(
    functions.corr("datacol1", "corr_col"),
    functions.corr("datacol2", "corr_col"),
    functions.corr("datacol3", "corr_col"),
    functions.corr("datacol*", "corr_col"))

This is very inefficient. If I have 30 datacol columns, I need to type
functions.corr 30 times to calculate the correlations.

I have searched, and it seems that functions.corr doesn't accept a List/Array
parameter, and df.agg doesn't accept a function as a parameter.

Is there any Spark Scala API code that can do this job efficiently?

Thanks
Liang