Hi, I am writing a logistic regression program with Spark, based on the SparkLR example.
Say I have a data set of 10000 DataPoints, where DataPoint is a case class as defined in the SparkLR example:

    case class DataPoint(x: Vector, y: Double)

To divide the data set into two parts, a training set and a test set, I tried the code below:

    val trainingSet = points.sample(false, 0.6, 7)
    val testSet = points.subtract(trainingSet)

where points is an RDD[DataPoint] containing the 10000 points. sample works well: trainingSet.count gives a number around 6000. But testSet.count gives 10000, not the expected ~4000. It seems that subtract can't work with a custom class such as DataPoint here.

Two questions:

1) What is the best way to divide data with a given ratio, say 6/4, especially when the data is not a primitive type but some custom class?

2) Why doesn't subtract work? Should an ordering / compare be implemented for the DataPoint class? I have also checked the SubtractedRDD class, but without background on the Spark source code I cannot tell what the problem is:
https://github.com/mesos/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/SubtractedRDD.scala

Any help is highly appreciated! Thank you in advance. =)

Hao

--
REN Hao
Exchange student at the Ecole Polytechnique Fédérale de Lausanne (EPFL)
Computer Science
Student at the Université de Technologie de Compiègne (UTC)
Computer Engineering - Data Mining
Tel: +33 06 14 54 57 24 / +41 07 86 47 52 69
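[Editor's note on question 1: one approach is to tag each point exactly once with a seeded Bernoulli draw and keep the two complementary sides; newer Spark versions also offer RDD.randomSplit(Array(0.6, 0.4), seed) for exactly this. A minimal pure-Scala sketch of the idea — the Point class and field types here are simplified placeholders, not the SparkLR definitions:]

```scala
import scala.util.Random

// Placeholder for the SparkLR DataPoint; field types are simplified.
case class Point(x: Array[Double], y: Double)

// Tag every point once, so the two returned sides are exact complements
// and their sizes always sum to points.size.
def split(points: Seq[Point], trainFraction: Double, seed: Long)
    : (Seq[Point], Seq[Point]) = {
  val rng = new Random(seed)
  val tagged = points.map(p => (p, rng.nextDouble() < trainFraction))
  (tagged.collect { case (p, true)  => p },   // ~trainFraction of the data
   tagged.collect { case (p, false) => p })   // the rest
}
```

[In Spark the same tagging can be done in one pass with mapPartitionsWithIndex, seeding each partition's RNG from the partition index so both resulting filters see the same draws.]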
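[Editor's note on question 2: subtract should not need an Ordering; it matches elements through their equals/hashCode (internally it cogroups on the element itself as a key). A plausible explanation for the unchanged count of 10000 is that the Vector inside DataPoint compares by reference rather than by contents, so no element of points ever matches an element of trainingSet. A small sketch of that failure mode, using Array[Double] as a stand-in for a field without structural equality:]

```scala
// Case-class equality delegates to each field's equals. Array[Double]
// (like any class that does not override equals/hashCode) compares by
// reference, so two identical-looking points are never equal.
case class RefPoint(x: Array[Double], y: Double)

// A collection type with structural (element-wise) equality fixes this.
case class EqPoint(x: Seq[Double], y: Double)

val broken = RefPoint(Array(1.0, 2.0), 3.0) == RefPoint(Array(1.0, 2.0), 3.0)
val fixed  = EqPoint(Seq(1.0, 2.0), 3.0) == EqPoint(Seq(1.0, 2.0), 3.0)
println(broken)  // false: reference equality on the arrays
println(fixed)   // true: element-wise equality
```

[If that is indeed the cause, overriding equals and hashCode on the Vector class, or storing the coordinates in a structurally-compared type, should make subtract behave as expected.]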
