> On Mar 16, 2022, at 7:38 AM, <ckgppl_...@sina.cn> wrote:
>
> Thanks, Jayesh and all. I finally got the correlation data frame by passing a
> list of functions to agg.
> I think the documentation for functions that generate a column could use a
> more detailed description.
>
> Liang
>
> ----- Original Message -----
> From: "Lalwani, Jayesh" <jlalw...@amazon.com>
> To: "ckgppl_...@sina.cn" <ckgppl_...@sina.cn>, Enrico Minack
> <i...@enrico.minack.dev>, Sean Owen <sro...@gmail.com>
> Cc: user <user@spark.apache.org>
> Subject: Re: calculate correlation between multiple columns and one specific
> column after groupby the spark data frame
> Date: March 16, 2022, 20:49
>
> No, you don't need 30 dataframes and self-joins. Convert the list of columns
> to a list of functions, and then pass the list of functions to the agg
> function.
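The suggestion above can be sketched as follows (a minimal sketch against the Spark Scala API; `df` and the column names come from the thread, while `dataCols`, `corrExprs`, and `result` are illustrative names):

```scala
import org.apache.spark.sql.functions.corr

// One corr expression per data column, each paired with corr_col.
val dataCols = (1 to 30).map(i => s"datacol$i")
val corrExprs = dataCols.map(c => corr(c, "corr_col"))

// agg takes one Column plus varargs, hence the head/tail split.
val result = df.groupBy("groupid").agg(corrExprs.head, corrExprs.tail: _*)
```

This computes all correlations in a single aggregation over one shuffle, instead of one aggregation plus one join per column.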
>
>
>
>
>
> From: "ckgppl_...@sina.cn" <ckgppl_...@sina.cn>
> Reply-To: "ckgppl_...@sina.cn" <ckgppl_...@sina.cn>
> Date: Wednesday, March 16, 2022 at 8:16 AM
> To: Enrico Minack <i...@enrico.minack.dev>, Sean Owen <sro...@gmail.com>
> Cc: user <user@spark.apache.org>
> Subject: [EXTERNAL] 回复:Re: 回复:Re: calculate correlation
> between_multiple_columns_and_one_specific_column_after_groupby_the_spark_data_frame
>
>
>
>
>
>
> Thanks, Enrico.
>
> I just found that I need to group the data frame and then calculate the
> correlation, so I get a list of data frames, not columns.
>
> So I used the following solution:
>
> 1. Use the following code to create a mutable data frame df_all, using the
> first datacol to calculate the correlation:
> df.groupBy("groupid").agg(functions.corr("datacol1", "corr_col"))
>
> 2. Iterate over the remaining datacol columns, creating a temp data frame in
> each iteration. In each iteration, join df_all with the temp data frame on
> the groupid column, then drop the duplicated groupid column.
>
> 3. After the iteration, I get a data frame that contains all the
> correlation data.
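The join-based steps above can be sketched as follows (a minimal sketch against the Spark Scala API; `df` and the column names come from the thread, `dfAll` is an illustrative name; joining on `Seq("groupid")` keeps a single groupid column, so no explicit drop is needed):

```scala
import org.apache.spark.sql.{DataFrame, functions}

// Step 1: seed df_all with the per-group correlation of the first data column.
var dfAll: DataFrame = df.groupBy("groupid")
  .agg(functions.corr("datacol1", "corr_col"))

// Step 2: for each remaining data column, compute its per-group correlation
// and join it back on groupid.
for (c <- (2 to 30).map(i => s"datacol$i")) {
  val tmp = df.groupBy("groupid").agg(functions.corr(c, "corr_col"))
  dfAll = dfAll.join(tmp, Seq("groupid")) // one groupid column survives the join
}

// Step 3: dfAll now holds one correlation column per datacol.
```

Note that each iteration adds a join (and shuffle) to the plan, which is why the single-agg approach suggested later in the thread is preferable.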
>
>
>
>
> I need to verify the data to make sure it is valid.
>
>
>
>
> Liang
>
> ----- Original Message -----
> From: Enrico Minack <i...@enrico.minack.dev>
> To: ckgppl_...@sina.cn, Sean Owen <sro...@gmail.com>
> Cc: user <user@spark.apache.org>
> Subject: Re: calculate correlation between multiple columns and one specific
> column after groupby the spark data frame
> Date: March 16, 2022, 19:53
>
>
>
> If you have a list of Columns called `columns`, you can pass them to the
> `agg` method as:
>
>
>
> agg(columns.head, columns.tail: _*)
>
>
>
> Enrico
>
>
>
>
>
> On 16.03.22 at 08:02, ckgppl_...@sina.cn wrote:
>
> Thanks, Sean. I modified the code and have generated a list of columns.
>
> I am working on converting the list of columns into a new data frame. It
> seems that there is no direct API to do this.
>
>
>
> ----- Original Message -----
> From: Sean Owen <sro...@gmail.com>
> To: ckgppl_...@sina.cn
> Cc: user <user@spark.apache.org>
> Subject: Re: calculate correlation between multiple columns and one specific
> column after groupby the spark data frame
> Date: March 16, 2022, 11:55
>
>
>
> Are you just trying to avoid writing the function call 30 times? Just put
> this in a loop over all the columns instead, adding a new corr column to a
> list each time.
>
> On Tue, Mar 15, 2022, 10:30 PM <ckgppl_...@sina.cn> wrote:
>
> Hi all,
>
>
>
> I am stuck at a correlation calculation problem. I have a dataframe like
> below:
>
> groupid  datacol1  datacol2  datacol3  datacol*  corr_col
> 00001    1         2         3         4         5
> 00001    2         3         4         6         5
> 00002    4         2         1         7         5
> 00002    8         9         3         2         5
> 00003    7         1         2         3         5
> 00003    3         5         3         1         5
>
> I want to calculate the correlation between each datacol column and the
> corr_col column, for each groupid.
>
> So I used the following Spark Scala API code:
>
> df.groupBy("groupid").agg(
>   functions.corr("datacol1", "corr_col"),
>   functions.corr("datacol2", "corr_col"),
>   functions.corr("datacol3", "corr_col"),
>   functions.corr("datacol*", "corr_col"))
>
>
>
> This is very inefficient. If I have 30 datacol columns, I need to write
> functions.corr 30 times to calculate the correlations.
>
> From what I have found, functions.corr doesn't accept a List/Array parameter,
> and df.agg doesn't accept a function as a parameter.
>
> Is there any Spark Scala API that can do this job efficiently?
>
>
>
> Thanks
>
>
>
> Liang
>
>
>