Re: Re: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame

2022-03-16 Thread Enrico Minack
If you have a list of Columns called `columns`, you can pass them to the 
`agg` method as:


  agg(columns.head, columns.tail: _*)
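
For example, a minimal sketch that puts this together with Sean's loop 
suggestion below (selecting the data columns by their "datacol" prefix is 
an assumption based on the example dataframe in the original question):

  import org.apache.spark.sql.{Column, DataFrame, functions}

  // Build one corr Column per data column, then splat the list
  // into agg via head/tail.
  def corrByGroup(df: DataFrame): DataFrame = {
    val columns: Seq[Column] = df.columns
      .filter(_.startsWith("datacol"))          // hypothetical prefix filter
      .map(c => functions.corr(c, "corr_col"))
      .toSeq
    df.groupBy("groupid").agg(columns.head, columns.tail: _*)
  }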

Enrico


On 16.03.22 at 08:02, ckgppl_...@sina.cn wrote:

Thanks, Sean. I modified the code and have generated a list of columns.
I am working on converting the list of columns to a new data frame. It 
seems that there is no direct API to do this.


----- Original Message -----
From: Sean Owen 
To: ckgppl_...@sina.cn
Cc: user 
Subject: Re: calculate correlation between multiple columns and one specific 
column after groupby the spark data frame

Date: 2022-03-16 11:55

Are you just trying to avoid writing the function call 30 times? Just 
put this in a loop over all the columns instead, adding a new corr 
column to a list each time.
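
A minimal sketch of that loop (the column names are assumptions taken 
from the example dataframe below):

  import org.apache.spark.sql.{Column, functions}

  // Accumulate one corr expression per data column.
  var corrCols = List.empty[Column]
  for (name <- Seq("datacol1", "datacol2", "datacol3")) {
    corrCols = corrCols :+ functions.corr(name, "corr_col")
  }
  // corrCols can then be passed to agg in one call.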


On Tue, Mar 15, 2022, 10:30 PM  wrote:

Hi all,

I am stuck on a correlation calculation problem. I have a
dataframe like the one below:

groupid  datacol1  datacol2  datacol3  datacol*  corr_col
1        1         2         3         4         5
1        2         3         4         6         5
2        4         2         1         7         5
2        8         9         3         2         5
3        7         1         2         3         5
3        3         5         3         1         5

I want to calculate the correlation between each datacol column
and the corr_col column, per groupid.
So I used the following Spark Scala API code:

df.groupBy("groupid").agg(
  functions.corr("datacol1", "corr_col"),
  functions.corr("datacol2", "corr_col"),
  functions.corr("datacol3", "corr_col"),
  functions.corr("datacol*", "corr_col"))

This is very inefficient: if I have 30 datacol columns, I need to
write the functions.corr call 30 times.

I have searched; it seems that functions.corr doesn't accept a
List/Array parameter, and df.agg doesn't accept a function as a
parameter.

Is there any Spark Scala API that can do this job efficiently?

Thanks

Liang


