Alternatively, you can try the Apache SystemML library (http://systemml.apache.org/) for covariance computation on Spark.
Regards,
Sourav

On Mon, Dec 28, 2015 at 11:29 PM, Sun, Rui <rui....@intel.com> wrote:

> Spark does not support computing the cov matrix now, but there is a PR
> for it. Maybe you can try it:
> https://issues.apache.org/jira/browse/SPARK-11057
>
> *From:* zhangjp [mailto:592426...@qq.com]
> *Sent:* Tuesday, December 29, 2015 3:21 PM
> *To:* Felix Cheung; Andy Davidson; Yanbo Liang
> *Cc:* user
> *Subject:* Re: how to use sparkR or spark MLlib load csv file on hdfs
> then calculate covariance
>
> Now I have a huge number of columns, about 5k-20k, so if I want to
> calculate the covariance matrix, which is the best or most common method?
>
> ------------------ Original Message ------------------
>
> *From:* "Felix Cheung" <felixcheun...@hotmail.com>
>
> *Sent:* Tuesday, December 29, 2015 12:45 PM
>
> *To:* "Andy Davidson" <a...@santacruzintegration.com>; "zhangjp"
> <592426...@qq.com>; "Yanbo Liang" <yblia...@gmail.com>
>
> *Cc:* "user" <user@spark.apache.org>
>
> *Subject:* Re: how to use sparkR or spark MLlib load csv file on hdfs
> then calculate covariance
>
> Make sure you add the csv spark package, as in this example, so that the
> source parameter in R read.df will work:
>
> https://spark.apache.org/docs/latest/sparkr.html#from-data-sources
>
> _____________________________
> From: Andy Davidson <a...@santacruzintegration.com>
> Sent: Monday, December 28, 2015 10:24 AM
> Subject: Re: how to use sparkR or spark MLlib load csv file on hdfs then
> calculate covariance
> To: zhangjp <592426...@qq.com>, Yanbo Liang <yblia...@gmail.com>
> Cc: user <user@spark.apache.org>
>
> Hi Yanbo,
>
> I use spark.csv to load my data set. I work with both Java and Python. I
> would recommend you print the first couple of rows and also print the
> schema to make sure your data is loaded as you expect. You might find the
> following code example helpful. You may need to programmatically set the
> schema, depending on what your data looks like.
>
>     public class LoadTidyDataFrame {
>         static DataFrame fromCSV(SQLContext sqlContext, String file) {
>             DataFrame df = sqlContext.read()
>                 .format("com.databricks.spark.csv")
>                 .option("inferSchema", "true")
>                 .option("header", "true")
>                 .load(file);
>             return df;
>         }
>     }
>
> *From:* Yanbo Liang <yblia...@gmail.com>
> *Date:* Monday, December 28, 2015 at 2:30 AM
> *To:* zhangjp <592426...@qq.com>
> *Cc:* "user @spark" <user@spark.apache.org>
> *Subject:* Re: how to use sparkR or spark MLlib load csv file on hdfs
> then calculate covariance
>
> Load csv file:
>
>     df <- read.df(sqlContext, "file-path",
>                   source = "com.databricks.spark.csv", header = "true")
>
> Calculate covariance:
>
>     cov <- cov(df, "col1", "col2")
>
> Cheers
> Yanbo
>
> 2015-12-28 17:21 GMT+08:00 zhangjp <592426...@qq.com>:
>
> Hi all,
>
> I want to use SparkR or Spark MLlib to load a csv file on HDFS and then
> calculate the covariance. How can I do it?
>
> Thanks.
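
To tie the thread together, below is a minimal end-to-end sketch in Java of
the MLlib route: load the csv with the spark-csv package, sanity-check it as
Andy recommends, then compute the full covariance matrix with MLlib's
RowMatrix.computeCovariance(). This is a sketch under stated assumptions,
not a tested program: it assumes a Spark 1.x cluster, that every column is
inferred as a numeric type, and the class name, app name, and HDFS path are
placeholders.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.mllib.linalg.Matrix;
    import org.apache.spark.mllib.linalg.Vector;
    import org.apache.spark.mllib.linalg.Vectors;
    import org.apache.spark.mllib.linalg.distributed.RowMatrix;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SQLContext;

    public class CsvCovariance {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("CsvCovariance");
            JavaSparkContext sc = new JavaSparkContext(conf);
            SQLContext sqlContext = new SQLContext(sc);

            // Load the csv; requires the spark-csv package on the
            // classpath. The path is hypothetical.
            DataFrame df = sqlContext.read()
                .format("com.databricks.spark.csv")
                .option("header", "true")
                .option("inferSchema", "true")
                .load("hdfs:///path/to/data.csv");

            // Sanity-check the load, as recommended above.
            df.printSchema();
            df.show(5);

            // Turn each row into an MLlib dense vector; assumes every
            // column holds a numeric value.
            JavaRDD<Vector> rows = df.javaRDD().map(
                new Function<Row, Vector>() {
                    @Override
                    public Vector call(Row row) {
                        double[] values = new double[row.size()];
                        for (int i = 0; i < values.length; i++) {
                            values[i] = ((Number) row.get(i)).doubleValue();
                        }
                        return Vectors.dense(values);
                    }
                });

            // computeCovariance() returns a local n-by-n matrix on the
            // driver, so memory grows with the square of the column count.
            RowMatrix mat = new RowMatrix(rows.rdd());
            Matrix cov = mat.computeCovariance();
            System.out.println(cov);

            sc.stop();
        }
    }

Note that computeCovariance() collects an n-by-n local matrix to the driver;
at 20k columns that is roughly 3 GB of doubles, so the 5k end of zhangjp's
range is far more practical with this approach. The spark-csv package must
be supplied at launch, e.g. with spark-submit --packages
com.databricks:spark-csv_2.10:1.3.0 (adjust the version to your build).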