If you have a list of Columns called `columns`, you can pass them to the
`agg` method as:
agg(columns.head, columns.tail: _*)
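For example, a minimal sketch (the dataCols list of column names here is hypothetical; substitute your actual 30 data columns):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.corr

// Hypothetical list of data-column names to correlate with corr_col.
val dataCols = Seq("datacol1", "datacol2", "datacol3")
val columns: Seq[Column] = dataCols.map(c => corr(c, "corr_col"))

// RelationalGroupedDataset.agg is declared as agg(expr: Column, exprs: Column*),
// so the list is split into a head argument and a varargs tail.
val result = df.groupBy("groupid").agg(columns.head, columns.tail: _*)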
Enrico
On 16.03.22 at 08:02, ckgppl_...@sina.cn wrote:
Thanks, Sean. I modified the code and have generated a list of columns.
I am now working on converting that list of columns into a new data
frame; it seems there is no direct API to do this.
----- Original Message -----
From: Sean Owen <sro...@gmail.com>
To: ckgppl_...@sina.cn
Cc: user <user@spark.apache.org>
Subject: Re: calculate correlation between multiple columns and one specific
column after groupby the spark data frame
Date: 2022-03-16 11:55
Are you just trying to avoid writing the function call 30 times? Just
put this in a loop over all the columns, appending a new corr column to
a list each time.
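For example, a rough sketch (assuming the data columns share the "datacol"
name prefix):

import org.apache.spark.sql.functions

// Build the list of corr expressions in a loop instead of writing each by hand.
val corrCols = df.columns
  .filter(_.startsWith("datacol"))        // assumption: prefix selects the data columns
  .map(c => functions.corr(c, "corr_col"))
  .toSeq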
On Tue, Mar 15, 2022, 10:30 PM <ckgppl_...@sina.cn> wrote:
Hi all,
I am stuck on a correlation calculation problem. I have a
dataframe like the one below:
groupid  datacol1  datacol2  datacol3  datacol*  corr_col
00001    1         2         3         4         5
00001    2         3         4         6         5
00002    4         2         1         7         5
00002    8         9         3         2         5
00003    7         1         2         3         5
00003    3         5         3         1         5
I want to calculate the correlation between each datacol column
and the corr_col column, per groupid.
So I used the following Spark Scala API code:
df.groupBy("groupid").agg(
  functions.corr("datacol1", "corr_col"),
  functions.corr("datacol2", "corr_col"),
  functions.corr("datacol3", "corr_col"),
  functions.corr("datacol*", "corr_col"))
This is very inefficient: if I have 30 datacol columns, I need to
write functions.corr 30 times.
From what I have found, it seems that functions.corr doesn't accept a
List/Array parameter, and df.agg doesn't accept a function as a
parameter.
Is there any Spark Scala API that can do this job efficiently?
Thanks
Liang