Re: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame
If you have a list of Columns called `columns`, you can pass them to the `agg` method as:

  agg(columns.head, columns.tail: _*)

Enrico

On 16.03.22 at 08:02, ckgppl_...@sina.cn wrote:

Thanks, Sean. I modified the code and have generated a list of columns. I am now working on converting a list of columns to a new data frame. It seems that there is no direct API to do this.

----- Original Message -----
From: Sean Owen
To: ckgppl_...@sina.cn
Cc: user
Subject: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame
Date: 2022-03-16 11:55

Are you just trying to avoid writing the function call 30 times? Just put this in a loop over all the columns instead, which adds a new corr column to a list every time.

On Tue, Mar 15, 2022, 10:30 PM wrote:

Hi all,

I am stuck at a correlation calculation problem. I have a dataframe like below:

  groupid  datacol1  datacol2  datacol3  datacol*  corr_col
  1        1         2         3         4         5
  1        2         3         4         6         5
  2        4         2         1         7         5
  2        8         9         3         2         5
  3        7         1         2         3         5
  3        3         5         3         1         5

I want to calculate the correlation between each datacol column and the corr_col column within each groupid. So I used the following Spark Scala API code:

  df.groupBy("groupid").agg(
    functions.corr("datacol1", "corr_col"),
    functions.corr("datacol2", "corr_col"),
    functions.corr("datacol3", "corr_col"),
    functions.corr("datacol*", "corr_col"))

This is very inefficient. If I have 30 datacol columns, I need to write functions.corr 30 times to calculate the correlations. From what I have found, functions.corr doesn't accept a List/Array parameter, and df.agg doesn't accept a function as a parameter. Is there any Spark Scala API that can do this job efficiently?

Thanks

Liang
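Putting the two suggestions together: generate the corr(...) aggregates in a loop over the column names, then pass the resulting list to agg via the head/tail split Enrico describes. A minimal sketch, assuming a local SparkSession; the sample data and column names (datacol1 etc., corr_col) are taken from the thread's table, everything else is illustrative:

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.corr

object CorrByGroup {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("corr-by-group")
      .getOrCreate()
    import spark.implicits._

    // Sample data following the thread's example table
    val df = Seq(
      (1, 1, 2, 3, 5),
      (1, 2, 3, 4, 5),
      (2, 4, 2, 1, 5),
      (2, 8, 9, 3, 5),
      (3, 7, 1, 2, 5),
      (3, 3, 5, 3, 5)
    ).toDF("groupid", "datacol1", "datacol2", "datacol3", "corr_col")

    // Build one corr(...) aggregate per data column instead of
    // writing functions.corr out 30 times by hand
    val dataCols = df.columns.filter(_.startsWith("datacol"))
    val corrCols: Seq[Column] =
      dataCols.map(c => corr(c, "corr_col").alias(s"corr_$c"))

    // agg takes a first Column plus varargs, hence head + tail: _*
    val result = df.groupBy("groupid").agg(corrCols.head, corrCols.tail: _*)
    result.show()

    spark.stop()
  }
}
```

Note that with the sample data above corr_col is constant within every group, so the correlations themselves are undefined; the point of the sketch is only the pattern of building the aggregate list programmatically.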
Re: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame
Thanks, Sean. I modified the code and have generated a list of columns. I am now working on converting a list of columns to a new data frame. It seems that there is no direct API to do this.

----- Original Message -----
From: Sean Owen
To: ckgppl_...@sina.cn
Cc: user
Subject: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame
Date: 2022-03-16 11:55

Are you just trying to avoid writing the function call 30 times? Just put this in a loop over all the columns instead, which adds a new corr column to a list every time.

On Tue, Mar 15, 2022, 10:30 PM wrote:

Hi all,

I am stuck at a correlation calculation problem. I have a dataframe like below:

  groupid  datacol1  datacol2  datacol3  datacol*  corr_col
  1        1         2         3         4         5
  1        2         3         4         6         5
  2        4         2         1         7         5
  2        8         9         3         2         5
  3        7         1         2         3         5
  3        3         5         3         1         5

I want to calculate the correlation between each datacol column and the corr_col column within each groupid. So I used the following Spark Scala API code:

  df.groupBy("groupid").agg(
    functions.corr("datacol1", "corr_col"),
    functions.corr("datacol2", "corr_col"),
    functions.corr("datacol3", "corr_col"),
    functions.corr("datacol*", "corr_col"))

This is very inefficient. If I have 30 datacol columns, I need to write functions.corr 30 times to calculate the correlations. From what I have found, functions.corr doesn't accept a List/Array parameter, and df.agg doesn't accept a function as a parameter. Is there any Spark Scala API that can do this job efficiently?

Thanks

Liang