
Re: Reply: Re: Reply: Re: calculate correlation between_multiple_columns_and_one_specific_column_after_groupby_the_spark_data_frame

2022-03-16 Thread Lalwani, Jayesh

No, you don’t need 30 dataframes and self joins. Convert the list of columns to 
a list of aggregate functions (one corr expression per column), then pass that 
list to the agg function.
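
A minimal sketch of that suggestion, assuming the column names used elsewhere in this thread (groupid, datacol*, corr_col). The object name GroupedCorrelations, the helper correlations, the corr_ alias prefix and the toy rows are hypothetical, introduced only for illustration. The toy data is made up because corr_col is constant in the sample table from the original question, and a correlation against a constant column is not defined.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.corr

object GroupedCorrelations {
  // Hypothetical helper: one corr(dataCol, target) aggregate per data column,
  // all applied in a single groupBy/agg pass (no joins, no extra dataframes).
  def correlations(df: DataFrame, groupCol: String,
                   dataCols: Seq[String], target: String): DataFrame = {
    require(dataCols.nonEmpty, "at least one data column is required")
    val aggs: Seq[Column] = dataCols.map(c => corr(c, target).alias(s"corr_$c"))
    // agg is declared as agg(expr: Column, exprs: Column*), hence head + splatted tail.
    df.groupBy(groupCol).agg(aggs.head, aggs.tail: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("grouped-corr").getOrCreate()
    import spark.implicits._

    // Hypothetical toy data; corr_col varies so the correlation is defined.
    val df = Seq(
      (1, 1.0, 2.0, 1.0), (1, 2.0, 3.0, 2.0), (1, 3.0, 1.0, 4.0),
      (2, 4.0, 2.0, 3.0), (2, 8.0, 9.0, 5.0), (2, 6.0, 5.0, 4.5)
    ).toDF("groupid", "datacol1", "datacol2", "corr_col")

    // Pick up every datacol* column by name instead of listing 30 of them.
    val dataCols = df.columns.filter(_.startsWith("datacol")).toSeq
    correlations(df, "groupid", dataCols, "corr_col").show()
    spark.stop()
  }
}
```

The result has one row per groupid and one corr_<column> column per data column, which is the shape the join-based approach quoted below builds up step by step.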

From: "ckgppl_...@sina.cn" 
Reply-To: "ckgppl_...@sina.cn" 
Date: Wednesday, March 16, 2022 at 8:16 AM
To: Enrico Minack , Sean Owen 
Cc: user 
Subject: [EXTERNAL] Reply: Re: Reply: Re: calculate correlation 
between_multiple_columns_and_one_specific_column_after_groupby_the_spark_data_frame


Thanks, Enrico.
I just found that I need to group the data frame and then calculate the 
correlation, so I will end up with a list of data frames, not columns.
So I used the following solution:
1.   Use the following code to create a mutable data frame df_all, using the 
first datacol to calculate the correlation:
df.groupBy("groupid").agg(functions.corr("datacol1","corr_col"))
2.   Iterate over the remaining datacol columns, creating a temp data frame in 
each iteration. In each iteration, join df_all with the temp data frame on 
the groupid column, then drop the duplicated groupid column.
3.   After the iteration, df_all contains all the correlation data.

I need to verify the data to make sure it is valid.

Liang
- Original Message -
From: Enrico Minack 
To: ckgppl_...@sina.cn, Sean Owen 
Cc: user 
Subject: Re: Reply: Re: calculate correlation 
between_multiple_columns_and_one_specific_column_after_groupby_the_spark_data_frame
Date: 2022-03-16 19:53

If you have a list of Columns called `columns`, you can pass them to the `agg` 
method as:

  agg(columns.head, columns.tail: _*)

Enrico
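
The head/tail form is needed because the grouped agg method takes one required Column followed by a varargs Column*, so a list cannot be passed as a single argument. The sketch below illustrates the calling pattern in plain Scala, with no Spark dependency; the object VarargsSplat and its agg stand-in are hypothetical and only mirror that parameter shape, with String in place of Column.

```scala
object VarargsSplat {
  // Hypothetical stand-in with the same parameter shape as Spark's
  // agg(expr: Column, exprs: Column*): one required argument, then varargs.
  def agg(first: String, rest: String*): Seq[String] = first +: rest

  // Passing a whole list: the head fills the required parameter,
  // and `tail: _*` expands the remaining elements into the varargs.
  def passAll(columns: Seq[String]): Seq[String] =
    agg(columns.head, columns.tail: _*)

  def main(args: Array[String]): Unit = {
    val columns = Seq("corr_datacol1", "corr_datacol2", "corr_datacol3")
    println(passAll(columns))
  }
}
```

Note that columns.head throws on an empty list, so the list should be checked for emptiness before calling agg this way.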

On 16.03.22 at 08:02, ckgppl_...@sina.cn wrote:
Thanks, Sean. I modified the code and have generated a list of columns.
I am now working on converting the list of columns to a new data frame. It seems 
that there is no direct API to do this.

- Original Message -
From: Sean Owen <sro...@gmail.com>
To: ckgppl_...@sina.cn
Cc: user <user@spark.apache.org>
Subject: Re: calculate correlation between multiple columns and one specific column 
after groupby the spark data frame
Date: 2022-03-16 11:55

Are you just trying to avoid writing the function call 30 times? Just put this 
in a loop over all the columns instead, which adds a new corr col every time to 
a list.
On Tue, Mar 15, 2022, 10:30 PM <ckgppl_...@sina.cn> wrote:
Hi all,

I am stuck on a correlation calculation problem. I have a dataframe like the one below:

groupid  datacol1  datacol2  datacol3  datacol*  corr_col
1        1         2         3         4         5
1        2         3         4         6         5
2        4         2         1         7         5
2        8         9         3         2         5
3        7         1         2         3         5
3        3         5         3         1         5

I want to calculate the correlation between each datacol column and the corr_col 
column, per groupid.
So I used the following Spark Scala API code:
df.groupBy("groupid").agg(
  functions.corr("datacol1", "corr_col"),
  functions.corr("datacol2", "corr_col"),
  functions.corr("datacol3", "corr_col"),
  functions.corr("datacol*", "corr_col"))

This is very tedious to write. If I have 30 datacol columns, I need to write 
functions.corr 30 times to calculate the correlations.

I have searched, and it seems that functions.corr doesn't accept a List/Array 
parameter, and df.agg doesn't accept a function as a parameter.
Is there any Spark Scala API that can do this job efficiently?

Thanks

Liang

Reply: Re: Reply: Re: calculate correlation between_multiple_columns_and_one_specific_column_after_groupby_the_spark_data_frame

2022-03-16 Thread ckgppl_yan

Thanks, Enrico.
I just found that I need to group the data frame and then calculate the 
correlation, so I will end up with a list of data frames, not columns.
So I used the following solution:
1.   Use the following code to create a mutable data frame df_all, using the 
first datacol to calculate the correlation:
df.groupBy("groupid").agg(functions.corr("datacol1","corr_col"))
2.   Iterate over the remaining datacol columns, creating a temp data frame in 
each iteration. In each iteration, join df_all with the temp data frame on 
the groupid column, then drop the duplicated groupid column.
3.   After the iteration, df_all contains all the correlation data.

I need to verify the data to make sure it is valid.
Liang
