Hi Yanbo,

I use spark-csv to load my data sets; I work with both Java and Python. I would recommend printing the first couple of rows and also printing the schema to make sure your data is loaded as you expect. You might find the following code example helpful. You may need to set the schema programmatically, depending on what your data looks like.
public class LoadTidyDataFrame {
    static DataFrame fromCSV(SQLContext sqlContext, String file) {
        DataFrame df = sqlContext.read()
            .format("com.databricks.spark.csv")
            .option("inferSchema", "true")
            .option("header", "true")
            .load(file);
        return df;
    }
}

From: Yanbo Liang <yblia...@gmail.com>
Date: Monday, December 28, 2015 at 2:30 AM
To: zhangjp <592426...@qq.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance

> Load csv file:
> df <- read.df(sqlContext, "file-path", source = "com.databricks.spark.csv", header = "true")
>
> Calculate covariance:
> cov <- cov(df, "col1", "col2")
>
> Cheers
> Yanbo
>
>
> 2015-12-28 17:21 GMT+08:00 zhangjp <592426...@qq.com>:
>> hi all,
>> I want to use sparkR or spark MLlib to load a csv file on hdfs and then calculate covariance. How to do it?
>> thks.
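In case it helps, here is a minimal sketch of what setting the schema programmatically (instead of relying on inferSchema) can look like, and of computing the covariance from Java via df.stat().cov(). The class name, and the column names "col1"/"col2" with DoubleType, are placeholders for illustration; substitute your actual columns and types.

```java
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class LoadWithExplicitSchema {

    // Build an explicit schema rather than relying on inferSchema.
    // "col1"/"col2" and DoubleType are hypothetical; adjust to your data.
    static StructType buildSchema() {
        return DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField("col1", DataTypes.DoubleType, true),
            DataTypes.createStructField("col2", DataTypes.DoubleType, true)
        });
    }

    static DataFrame fromCSV(SQLContext sqlContext, String file) {
        DataFrame df = sqlContext.read()
            .format("com.databricks.spark.csv")
            .option("header", "true")
            .schema(buildSchema())   // explicit schema overrides inference
            .load(file);
        df.printSchema();            // sanity-check the loaded schema
        df.show(5);                  // and the first few rows
        return df;
    }

    // Sample covariance between two numeric columns (Spark >= 1.4).
    static double covariance(DataFrame df) {
        return df.stat().cov("col1", "col2");
    }
}
```

With an explicit schema, a type mismatch shows up as nulls or a load error immediately, instead of as a silently wrong string column when you later try the covariance.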