If you have a list of Columns called `columns`, you can pass them to the `agg` method as:

  agg(columns.head, columns.tail: _*)
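
As a minimal sketch of how this fits together with the loop suggested further down the thread: the example below builds one `corr` column per data column and passes the resulting list to `agg`. The sample rows, the column names (`groupid`, `datacol1` to `datacol3`, `corr_col`), the `corr_` alias prefix, and the helper name `corrColumns` are assumptions made for illustration, not taken from the original poster's code.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, corr}

object GroupedCorrelation {

  // Build one aggregate Column per data column: corr(dataCol, target).
  // The "corr_" alias prefix is an arbitrary choice for this sketch.
  def corrColumns(dataCols: Seq[String], target: String): List[Column] =
    dataCols.map(c => corr(col(c), col(target)).as(s"corr_$c")).toList

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("grouped-correlation")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data; the column names are assumed.
    val df = Seq(
      ("00001", 1.0, 2.0, 3.0, 5.0),
      ("00001", 2.0, 3.0, 4.0, 6.0),
      ("00001", 4.0, 1.0, 2.0, 9.0),
      ("00002", 4.0, 2.0, 1.0, 7.0),
      ("00002", 8.0, 9.0, 3.0, 2.0),
      ("00002", 6.0, 5.0, 7.0, 4.0)
    ).toDF("groupid", "datacol1", "datacol2", "datacol3", "corr_col")

    // Pick up every data column by name prefix instead of listing them by hand.
    val dataCols = df.columns.filter(_.startsWith("datacol")).toSeq
    val columns = corrColumns(dataCols, "corr_col")

    // agg takes (Column, Column*), hence the head/tail split.
    // columns.head throws on an empty list, so this assumes at least one data column.
    val result = df.groupBy("groupid").agg(columns.head, columns.tail: _*)
    result.show()

    spark.stop()
  }
}
```

The result is an ordinary data frame with one row per `groupid` and one `corr_...` column per data column, so no separate step is needed to turn the list of columns into a data frame.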


On 16.03.22 at 08:02, ckgppl_...@sina.cn wrote:
Thanks, Sean. I modified the code and have generated a list of columns.
I am now working on converting the list of columns into a new data frame. It seems that there is no direct API to do this.

----- Original Message -----
From: Sean Owen <sro...@gmail.com>
Cc: user <user@spark.apache.org>
Subject: Re: calculate correlation between multiple columns and one specific column after groupby the spark data frame
Date: March 16, 2022, 11:55

Are you just trying to avoid writing the function call 30 times? Just put this in a loop over all the columns instead, adding a new corr column to a list on each iteration.

On Tue, Mar 15, 2022, 10:30 PM <ckgppl_...@sina.cn> wrote:

    Hi all,

    I am stuck on a correlation calculation problem. I have a
    dataframe like the one below:

    groupid  datacol1  datacol2  datacol3  datacol*  corr_col
    00001    1         2         3         4         5
    00001    2         3         4         6         5
    00002    4         2         1         7         5
    00002    8         9         3         2         5
    00003    7         1         2         3         5
    00003    3         5         3         1         5

    I want to calculate the correlation between each datacol column
    and the corr_col column, per groupid.
    So I used the following Spark Scala API code:

    This is very inefficient. If I have 30 datacol columns, I need to
    write functions.corr 30 times to calculate the correlations.

    I have searched, and it seems that functions.corr doesn't accept a
    List/Array parameter, and df.agg doesn't accept a function to be

    So is there any Spark Scala API code that can do this job efficiently?


