Re: Re: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame

2022-03-16 Thread Enrico Minack
If you have a list of Columns called `columns`, you can pass them to the 
`agg` method as:


  agg(columns.head, columns.tail: _*)
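
For example, a minimal sketch that puts this together with Sean's loop 
suggestion below (selecting the data columns by their "datacol" prefix is 
an assumption based on the example dataframe in the original question):

  import org.apache.spark.sql.{Column, DataFrame, functions}

  // Build one corr Column per data column, then splat the list
  // into agg via head/tail.
  def corrByGroup(df: DataFrame): DataFrame = {
    val columns: Seq[Column] = df.columns
      .filter(_.startsWith("datacol"))          // hypothetical prefix filter
      .map(c => functions.corr(c, "corr_col"))
      .toSeq
    df.groupBy("groupid").agg(columns.head, columns.tail: _*)
  }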

Enrico


On 16.03.22 at 08:02, ckgppl_...@sina.cn wrote:

Thanks, Sean. I modified the code and have generated a list of columns.
I am working on converting the list of columns to a new data frame. It 
seems that there is no direct API to do this.


----- Original Message -----
From: Sean Owen 
To: ckgppl_...@sina.cn
Cc: user 
Subject: Re: calculate correlation between multiple columns and one specific 
column after groupby the spark data frame

Date: 2022-03-16 11:55

Are you just trying to avoid writing the function call 30 times? Just 
put this in a loop over all the columns instead, adding a new corr 
column to a list each time.
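
A minimal sketch of that loop (the column names are assumptions taken 
from the example dataframe below):

  import org.apache.spark.sql.{Column, functions}

  // Accumulate one corr expression per data column.
  var corrCols = List.empty[Column]
  for (name <- Seq("datacol1", "datacol2", "datacol3")) {
    corrCols = corrCols :+ functions.corr(name, "corr_col")
  }
  // corrCols can then be passed to agg in one call.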


On Tue, Mar 15, 2022, 10:30 PM  wrote:

Hi all,

I am stuck on a correlation calculation problem. I have a
dataframe like the one below:

groupid  datacol1  datacol2  datacol3  datacol*  corr_col
1        1         2         3         4         5
1        2         3         4         6         5
2        4         2         1         7         5
2        8         9         3         2         5
3        7         1         2         3         5
3        3         5         3         1         5

I want to calculate the correlation between each datacol column
and the corr_col column, per groupid.
So I used the following Spark Scala API code:

df.groupBy("groupid").agg(
  functions.corr("datacol1", "corr_col"),
  functions.corr("datacol2", "corr_col"),
  functions.corr("datacol3", "corr_col"),
  functions.corr("datacol*", "corr_col"))

This is very inefficient: if I have 30 datacol columns, I need to
write the functions.corr call 30 times.

I have searched; it seems that functions.corr doesn't accept a
List/Array parameter, and df.agg doesn't accept a function as a
parameter.

Is there any Spark Scala API that can do this job efficiently?

Thanks

Liang


