If you have a list of Columns called `columns`, you can pass them to the
`agg` method as:
agg(columns.head, columns.tail: _*)
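For example, a minimal sketch (the dataCols list of column names here is hypothetical; substitute your actual 30 data columns):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.corr

// Hypothetical list of data-column names to correlate with corr_col.
val dataCols = Seq("datacol1", "datacol2", "datacol3")
val columns: Seq[Column] = dataCols.map(c => corr(c, "corr_col"))

// RelationalGroupedDataset.agg is declared as agg(expr: Column, exprs: Column*),
// so the list is split into a head argument and a varargs tail.
val result = df.groupBy("groupid").agg(columns.head, columns.tail: _*)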
Enrico
On 16.03.22 at 08:02, ckgppl_...@sina.cn wrote:
Thanks, Sean. I modified the code and have generated a list of columns.
I am now working on converting that list of columns into a new data
frame; it seems there is no direct API to do this.
----- Original Message -----
From: Sean Owen <sro...@gmail.com>
To: ckgppl_...@sina.cn
Cc: user <user@spark.apache.org>
Subject: Re: calculate correlation between multiple columns and one specific
column after groupby the spark data frame
Date: 2022-03-16 11:55
Are you just trying to avoid writing the function call 30 times? Just
put this in a loop over all the columns, appending a new corr column to
a list each time.
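For example, a rough sketch (assuming the data columns share the "datacol"
name prefix):

import org.apache.spark.sql.functions

// Build the list of corr expressions in a loop instead of writing each by hand.
val corrCols = df.columns
  .filter(_.startsWith("datacol"))        // assumption: prefix selects the data columns
  .map(c => functions.corr(c, "corr_col"))
  .toSeq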
On Tue, Mar 15, 2022, 10:30 PM <ckgppl_...@sina.cn> wrote:
Hi all,
I am stuck on a correlation calculation problem. I have a
dataframe like the one below:
groupid  datacol1  datacol2  datacol3  datacol*  corr_col
00001    1         2         3         4         5
00001    2         3         4         6         5
00002    4         2         1         7         5
00002    8         9         3         2         5
00003    7         1         2         3         5
00003    3         5         3         1         5
I want to calculate the correlation between each datacol column
and the corr_col column, per groupid.
So I used the following Spark Scala API code:
df.groupBy("groupid").agg(
  functions.corr("datacol1", "corr_col"),
  functions.corr("datacol2", "corr_col"),
  functions.corr("datacol3", "corr_col"),
  functions.corr("datacol*", "corr_col"))
This is very inefficient: if I have 30 datacol columns, I need to
write functions.corr 30 times.
From what I have found, it seems that functions.corr doesn't accept a
List/Array parameter, and df.agg doesn't accept a function as a
parameter.
Is there any Spark Scala API that can do this job efficiently?
Thanks
Liang