`mean()` and `variance()` are not defined in `Vector`. You can use the mean and variance implementation from commons-math3 (http://commons.apache.org/proper/commons-math/javadocs/api-3.4.1/index.html) if you don't want to implement them.

-Xiangrui

On Fri, Feb 6, 2015 at 12:50 PM, SK <skrishna...@gmail.com> wrote:
> Hi,
>
> I have a dataset in csv format and I am trying to standardize the features
> before using k-means clustering. The data does not have any labels but has
> the following format:
>
> s1, f12,f13,...
> s2, f21,f22,...
>
> where s is a string id, and f is a floating point feature value.
> To perform feature standardization, I need to compute the mean and
> variance/std deviation of the features values in each element of the RDD
> (i.e each row). However, the summary Statistics library in Spark MLLib
> provides only a colStats() method that provides column-wise mean and
> variance. I tried to compute the mean and variance per row, using the code
> below but got a compilation error that there is no mean() or variance()
> method for a tuple or Vector object. Is there a Spark library to compute the
> row-wise mean and variance for an RDD, where each row (i.e. element) of the
> RDD is a Vector or tuple of N feature values?
>
> thanks
>
> My code for standardization is as follows:
>
> //read the data
> val data=sc.textFile(file_name)
>     .map(_.split(","))
>
> // extract the features. For this example I am using only 2 features, but the data has more features
> val features = data.map(d=> Vectors.dense(d(1).toDouble, d(2).toDouble))
>
> val std_features = features.map(f=> {
>     val fmean = f.mean()                      // Error: NO MEAN() for a Vector or Tuple object
>     val fstd = scala.math.sqrt(f.variance())  // Error: NO variance() for a Vector or Tuple object
>     for (i <- 0 to f.length)                  // standardize the features
>     {   var fs = 0.0
>         if (fstd >0.0)
>             fs = (f(i) - fmean)/fstd
>         fs
>     }
>   }
> )
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-feature-standardization-tp21539.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
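
For anyone who would rather implement the two functions than pull in commons-math3, here is a minimal sketch of row-wise standardization in plain Scala. It is only an illustration under stated assumptions: the object name `RowStandardize` and its methods `mean`, `variance` and `standardize` are hypothetical helpers, not part of Spark or MLlib; each row is assumed to be a non-empty `Array[Double]`; and the variance is the population variance (divide by n), which may not be the convention you want.

```scala
// Hypothetical helper, not part of Spark or MLlib.
// Assumes every row is a non-empty Array[Double].
object RowStandardize {

  /** Arithmetic mean of one row's feature values. */
  def mean(xs: Array[Double]): Double = xs.sum / xs.length

  /** Population variance of one row (divides by n, not n - 1). */
  def variance(xs: Array[Double]): Double = {
    val m = mean(xs)
    xs.map(x => (x - m) * (x - m)).sum / xs.length
  }

  /** Shift the row to zero mean and scale it to unit standard deviation.
    * A row with zero spread is mapped to all zeros, mirroring the
    * `fstd > 0.0` guard in the quoted snippet.
    */
  def standardize(xs: Array[Double]): Array[Double] = {
    val m  = mean(xs)
    val sd = math.sqrt(variance(xs))
    if (sd > 0.0) xs.map(x => (x - m) / sd) else xs.map(_ => 0.0)
  }

  def main(args: Array[String]): Unit = {
    val row = Array(1.0, 2.0, 3.0, 4.0)
    println(standardize(row).mkString(", "))
  }
}
```

Inside a Spark job this could be applied per row with something like `features.map(v => Vectors.dense(RowStandardize.standardize(v.toArray)))`, since an MLlib `Vector` exposes `toArray` and `Vectors.dense` accepts an `Array[Double]`. Note also that the loop bound `0 to f.length` in the quoted code is inclusive and would index one past the last element; mapping over the array avoids the index entirely.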

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org