Hi all,
I am stuck on a correlation calculation problem. I have a dataframe like the one below:

groupid   datacol1   datacol2   datacol3   datacol*   corr_col
00001     1          2          3          4          5
00001     2          3          4          6          5
00002     4          2          1          7          5
00002     8          9          3          2          5
00003     7          1          2          3          5
00003     3          5          3          1          5

I want to calculate the correlation between every datacol column and the corr_col column, grouped by groupid. So I used the following Spark Scala API code:
import org.apache.spark.sql.functions

df.groupBy("groupid").agg(
  functions.corr("datacol1", "corr_col"),
  functions.corr("datacol2", "corr_col"),
  functions.corr("datacol3", "corr_col"),
  functions.corr("datacol*", "corr_col"))
This is very inefficient: if I have 30 datacol columns, I have to write out functions.corr 30 times. From what I have found, functions.corr doesn't accept a List/Array parameter, and df.agg doesn't accept a function as a parameter. Is there any Spark Scala API code that can do this job more efficiently?
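
To make it concrete, what I am imagining is something like the sketch below, where the list of column names is built once and then expanded into agg (colNames and corrExprs are just placeholder names I made up), if such a thing is possible:

import org.apache.spark.sql.functions.corr

// Placeholder list of the datacol names; in reality there are ~30 of them.
val colNames = Seq("datacol1", "datacol2", "datacol3")

// Build one corr(<datacol>, "corr_col") expression per column.
val corrExprs = colNames.map(c => corr(c, "corr_col"))

// agg takes (Column, Column*), so expand the Seq with head/tail.
val result = df.groupBy("groupid").agg(corrExprs.head, corrExprs.tail: _*)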
Thanks
Liang