Spark does not support computing a covariance matrix yet, but there is a PR for it. You could try it: https://issues.apache.org/jira/browse/SPARK-11057
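For what it's worth, MLlib already exposes a full covariance matrix through RowMatrix.computeCovariance(). Below is a minimal Java sketch (not from this thread), assuming a DataFrame df whose columns are all numeric. Note that the result is a local k x k matrix collected on the driver, so at 5k-20k columns it alone occupies roughly 200 MB to 3.2 GB of doubles.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.linalg.distributed.RowMatrix;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;

// df is assumed to contain only numeric columns;
// convert each Row into an MLlib dense Vector.
JavaRDD<Vector> vectors = df.javaRDD().map(row -> {
    double[] values = new double[row.length()];
    for (int i = 0; i < row.length(); i++) {
        values[i] = ((Number) row.get(i)).doubleValue();
    }
    return Vectors.dense(values);
});

// RowMatrix computes the covariance of all column pairs in one pass.
RowMatrix mat = new RowMatrix(vectors.rdd());
Matrix cov = mat.computeCovariance();  // k x k local matrix on the driver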
From: zhangjp [mailto:592426...@qq.com]
Sent: Tuesday, December 29, 2015 3:21 PM
To: Felix Cheung; Andy Davidson; Yanbo Liang
Cc: user
Subject: Re: how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance

I now have a huge number of columns, about 5k-20k. If I want to calculate a covariance matrix, what is the best or most common method?

------------------ Original Message ------------------
From: "Felix Cheung" <felixcheun...@hotmail.com>
Sent: Tuesday, December 29, 2015, 12:45 PM
To: "Andy Davidson" <a...@santacruzintegration.com>; "zhangjp" <592426...@qq.com>; "Yanbo Liang" <yblia...@gmail.com>
Cc: "user" <user@spark.apache.org>
Subject: Re: how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance

Make sure you add the spark-csv package, as in this example, so that the source parameter in R's read.df works: https://spark.apache.org/docs/latest/sparkr.html#from-data-sources

_____________________________
From: Andy Davidson <a...@santacruzintegration.com>
Sent: Monday, December 28, 2015 10:24 AM
Subject: Re: how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance
To: zhangjp <592426...@qq.com>, Yanbo Liang <yblia...@gmail.com>
Cc: user <user@spark.apache.org>

Hi Yanbo

I use spark-csv to load my data set. I work with both Java and Python. I would recommend you print the first couple of rows and also print the schema to make sure your data is loaded as you expect. You might find the following code example helpful. You may need to set the schema programmatically, depending on what your data looks like.

public class LoadTidyDataFrame {
    static DataFrame fromCSV(SQLContext sqlContext, String file) {
        DataFrame df = sqlContext.read()
            .format("com.databricks.spark.csv")
            .option("inferSchema", "true")
            .option("header", "true")
            .load(file);
        return df;
    }
}

From: Yanbo Liang <yblia...@gmail.com>
Date: Monday, December 28, 2015 at 2:30 AM
To: zhangjp <592426...@qq.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance

Load csv file:
df <- read.df(sqlContext, "file-path", source = "com.databricks.spark.csv", header = "true")

Calculate covariance:
cov <- cov(df, "col1", "col2")

Cheers
Yanbo

2015-12-28 17:21 GMT+08:00 zhangjp <592426...@qq.com>:

hi all,
    I want to use sparkR or spark MLlib to load a csv file on hdfs and then calculate covariance. How do I do it? Thanks.
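For completeness, here is the Java equivalent of Yanbo's R snippet, assuming Spark 1.4+ (where DataFrame.stat().cov(...) is available), an sqlContext already in scope, and placeholder path and column names:

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

// Load the csv file from HDFS through the spark-csv package (placeholder path).
DataFrame df = sqlContext.read()
    .format("com.databricks.spark.csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .load("hdfs:///path/to/data.csv");

// Pairwise covariance of two numeric columns.
double c = df.stat().cov("col1", "col2");

Note that df.stat().cov computes one column pair at a time; for a full matrix over thousands of columns, the RowMatrix approach sketched at the top of the thread (or the SPARK-11057 PR) is the better fit.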