Re: Re: Re: Re: calculate correlation between_multiple_columns_and_one_specific_column_after_groupby_the_spark_data_frame

2022-03-16 Thread Lalwani, Jayesh
No, you don't need 30 dataframes and self-joins. Convert a list of columns to a 
list of functions, and then pass the list of functions to the agg function.
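
A minimal sketch of that idea (assuming a dataframe df with the groupid, 
datacol* and corr_col columns described further down this thread):

  import org.apache.spark.sql.functions

  // Every column except the grouping and target columns is a data column.
  val dataCols = df.columns.filter(c => c != "groupid" && c != "corr_col")

  // Turn each column name into a corr aggregate expression.
  val corrCols = dataCols.map(c => functions.corr(c, "corr_col"))

  // agg takes (Column, Column*), so pass the list as head plus varargs.
  val result = df.groupBy("groupid").agg(corrCols.head, corrCols.tail: _*)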


From: "ckgppl_...@sina.cn" 
Reply-To: "ckgppl_...@sina.cn" 
Date: Wednesday, March 16, 2022 at 8:16 AM
To: Enrico Minack , Sean Owen 
Cc: user 
Subject: [EXTERNAL] Re: Re: Re: Re: calculate correlation 
between_multiple_columns_and_one_specific_column_after_groupby_the_spark_data_frame




Thanks, Enrico.
I just found that I need to group the data frame and then calculate the 
correlation, so I will get a list of dataframes, not columns.
So I used the following solution (see the sketch after this list):
1.   Use the following code to create a mutable data frame df_all. I used the 
first datacol to calculate the correlation:
df.groupBy("groupid").agg(functions.corr("datacol1", "corr_col"))
2.   Iterate over all remaining datacol columns, creating a temp data frame in 
each iteration. In each iteration, join df_all with the temp data frame on 
the groupid column, then drop the duplicated groupid column.
3.   After the iterations, I get the dataframe which contains all the 
correlation data.
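
A rough sketch of those three steps (a reconstruction, assuming the df and 
column names from the original message below; joining with Seq("groupid") 
avoids the duplicated groupid column). Note that each iteration adds a join, 
which is why the single-agg approach suggested in the replies is cheaper:

  import org.apache.spark.sql.functions

  val dataCols = Seq("datacol1", "datacol2", "datacol3")  // remaining names elided

  // Step 1: correlation for the first data column.
  var df_all = df.groupBy("groupid").agg(functions.corr(dataCols.head, "corr_col"))

  // Step 2: one extra join per remaining data column.
  for (c <- dataCols.tail) {
    val tmp = df.groupBy("groupid").agg(functions.corr(c, "corr_col"))
    df_all = df_all.join(tmp, Seq("groupid"))  // join on groupid, no duplicate column
  }
  // Step 3: df_all now holds one correlation column per data column.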


I need to verify the data to make sure it is valid.


Liang
- Original Message -
From: Enrico Minack 
To: ckgppl_...@sina.cn, Sean Owen 
Cc: user 
Subject: Re: Re: Re: calculate correlation 
between_multiple_columns_and_one_specific_column_after_groupby_the_spark_data_frame
Date: 2022-03-16 19:53

If you have a list of Columns called `columns`, you can pass them to the `agg` 
method as:

  agg(columns.head, columns.tail: _*)

Enrico


On 16.03.22 at 08:02, ckgppl_...@sina.cn wrote:
Thanks, Sean. I modified the codes and have generated a list of columns.
I am working on converting the list of columns into a new data frame. It seems 
that there is no direct API to do this.

- Original Message -
From: Sean Owen <sro...@gmail.com>
To: ckgppl_...@sina.cn
Cc: user <user@spark.apache.org>
Subject: Re: calculate correlation between multiple columns and one specific column 
after groupby the spark data frame
Date: 2022-03-16 11:55

Are you just trying to avoid writing the function call 30 times? Just put it 
in a loop over all the columns instead, adding a new corr column to a list 
each time; see the sketch below.
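
A minimal sketch of that loop (assuming hypothetical column names 
datacol1..datacol30, matching the example below):

  import org.apache.spark.sql.Column
  import org.apache.spark.sql.functions

  // Build the list of corr columns in a loop instead of writing corr 30 times.
  var corrCols = List.empty[Column]
  for (i <- 1 to 30) {
    corrCols :+= functions.corr(s"datacol$i", "corr_col")
  }
  val result = df.groupBy("groupid").agg(corrCols.head, corrCols.tail: _*)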
On Tue, Mar 15, 2022, 10:30 PM <ckgppl_...@sina.cn> wrote:
Hi all,


I am stuck on a correlation calculation problem. I have a dataframe like the one below:
groupid | datacol1 | datacol2 | datacol3 | datacol* | corr_col
   1    |    1     |    2     |    3     |    4     |    5
   1    |    2     |    3     |    4     |    6     |    5
   2    |    4     |    2     |    1     |    7     |    5
   2    |    8     |    9     |    3     |    2     |    5
   3    |    7     |    1     |    2     |    3     |    5
   3    |    3     |    5     |    3     |    1     |    5

I want to calculate the correlation between each datacol column and the 
corr_col column, grouped by groupid.
So I used the following Spark Scala API code:
df.groupBy("groupid").agg(
  functions.corr("datacol1", "corr_col"),
  functions.corr("datacol2", "corr_col"),
  functions.corr("datacol3", "corr_col"),
  functions.corr("datacol*", "corr_col"))

This is very inefficient. If I have 30 datacol columns, I need to write 
functions.corr 30 times to calculate the correlations.

I have searched, and it seems that functions.corr doesn't accept a List/Array 
parameter, and df.agg doesn't accept a function as a parameter.
Is there any Spark Scala API code that can do this job efficiently?

Thanks

Liang



