Spark does not support computing a covariance matrix yet, but there is a PR for it. You could try it: https://issues.apache.org/jira/browse/SPARK-11057
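For what it's worth, MLlib already exposes a full covariance matrix through RowMatrix.computeCovariance(). Below is a minimal Java sketch (not from this thread), assuming a DataFrame df whose columns are all numeric. Note that the result is a local k x k matrix collected on the driver, so at 5k-20k columns it alone occupies roughly 200 MB to 3.2 GB of doubles.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.linalg.distributed.RowMatrix;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;

// df is assumed to contain only numeric columns;
// convert each Row into an MLlib dense Vector.
JavaRDD<Vector> vectors = df.javaRDD().map(row -> {
    double[] values = new double[row.length()];
    for (int i = 0; i < row.length(); i++) {
        values[i] = ((Number) row.get(i)).doubleValue();
    }
    return Vectors.dense(values);
});

// RowMatrix computes the covariance of all column pairs in one pass.
RowMatrix mat = new RowMatrix(vectors.rdd());
Matrix cov = mat.computeCovariance();  // k x k local matrix on the driver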
From: zhangjp [mailto:592426...@qq.com]
Sent: Tuesday, December 29, 2015 3:21 PM
To: Felix Cheung; Andy Davidson; Yanbo Liang
Cc: user
Subject: Re: how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance

I now have a huge number of columns, about 5k-20k. If I want to calculate a covariance matrix, what is the best or most common method?

------------------ Original Message ------------------
From: "Felix Cheung" <felixcheun...@hotmail.com>
Sent: Tuesday, December 29, 2015, 12:45 PM
To: "Andy Davidson" <a...@santacruzintegration.com>; "zhangjp" <592426...@qq.com>; "Yanbo Liang" <yblia...@gmail.com>
Cc: "user" <user@spark.apache.org>
Subject: Re: how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance

Make sure you add the spark-csv package, as in this example, so that the source parameter in R's read.df works: https://spark.apache.org/docs/latest/sparkr.html#from-data-sources

_____________________________
From: Andy Davidson <a...@santacruzintegration.com>
Sent: Monday, December 28, 2015 10:24 AM
Subject: Re: how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance
To: zhangjp <592426...@qq.com>, Yanbo Liang <yblia...@gmail.com>
Cc: user <user@spark.apache.org>

Hi Yanbo

I use spark-csv to load my data set. I work with both Java and Python. I would recommend you print the first couple of rows and also print the schema to make sure your data is loaded as you expect. You might find the following code example helpful. You may need to set the schema programmatically, depending on what your data looks like.

public class LoadTidyDataFrame {
    static DataFrame fromCSV(SQLContext sqlContext, String file) {
        DataFrame df = sqlContext.read()
            .format("com.databricks.spark.csv")
            .option("inferSchema", "true")
            .option("header", "true")
            .load(file);
        return df;
    }
}

From: Yanbo Liang <yblia...@gmail.com>
Date: Monday, December 28, 2015 at 2:30 AM
To: zhangjp <592426...@qq.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance

Load csv file:
df <- read.df(sqlContext, "file-path", source = "com.databricks.spark.csv", header = "true")

Calculate covariance:
cov <- cov(df, "col1", "col2")

Cheers
Yanbo

2015-12-28 17:21 GMT+08:00 zhangjp <592426...@qq.com>:

hi all,
    I want to use sparkR or spark MLlib to load a csv file on hdfs and then calculate covariance. How do I do it? Thanks.
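For completeness, here is the Java equivalent of Yanbo's R snippet, assuming Spark 1.4+ (where DataFrame.stat().cov(...) is available), an sqlContext already in scope, and placeholder path and column names:

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

// Load the csv file from HDFS through the spark-csv package (placeholder path).
DataFrame df = sqlContext.read()
    .format("com.databricks.spark.csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .load("hdfs:///path/to/data.csv");

// Pairwise covariance of two numeric columns.
double c = df.stat().cov("col1", "col2");

Note that df.stat().cov computes one column pair at a time; for a full matrix over thousands of columns, the RowMatrix approach sketched at the top of the thread (or the SPARK-11057 PR) is the better fit.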