Hi, I am writing a logistic regression program with Spark, based on the SparkLR example.
Say I have a data set of 10000 DataPoints, where DataPoint is a case class as defined in the SparkLR example:

    case class DataPoint(x: Vector, y: Double)

To divide the data set into two parts, a training set and a test set, I tried the code below:

    val trainingSet = points.sample(false, 0.6, 7)
    val testSet = points.subtract(trainingSet)

where points is an RDD[DataPoint] containing the 10000 points. sample works well: trainingSet.count gives a number around 6000. But testSet.count gives 10000, not the expected ~4000. It seems that subtract can't work with a custom class such as DataPoint here.

Two questions:

1) What is the best way to divide data with a given ratio, say 6/4, especially when the data is not a primitive type but some custom class?

2) Why doesn't subtract work? Should an ordering / compare be implemented for the DataPoint class? I have also checked the SubtractedRDD class, but without background on the Spark source code I cannot tell what the problem is:
https://github.com/mesos/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/SubtractedRDD.scala

Any help is highly appreciated! Thank you in advance. =)

Hao

--
REN Hao
Exchange student at the Ecole Polytechnique Fédérale de Lausanne (EPFL)
Computer Science
Student at the Université de Technologie de Compiègne (UTC)
Computer Engineering - Data Mining
Tel: +33 06 14 54 57 24 / +41 07 86 47 52 69
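[Editor's note on question 1: one approach is to tag each point exactly once with a seeded Bernoulli draw and keep the two complementary sides; newer Spark versions also offer RDD.randomSplit(Array(0.6, 0.4), seed) for exactly this. A minimal pure-Scala sketch of the idea — the Point class and field types here are simplified placeholders, not the SparkLR definitions:]

```scala
import scala.util.Random

// Placeholder for the SparkLR DataPoint; field types are simplified.
case class Point(x: Array[Double], y: Double)

// Tag every point once, so the two returned sides are exact complements
// and their sizes always sum to points.size.
def split(points: Seq[Point], trainFraction: Double, seed: Long)
    : (Seq[Point], Seq[Point]) = {
  val rng = new Random(seed)
  val tagged = points.map(p => (p, rng.nextDouble() < trainFraction))
  (tagged.collect { case (p, true)  => p },   // ~trainFraction of the data
   tagged.collect { case (p, false) => p })   // the rest
}
```

[In Spark the same tagging can be done in one pass with mapPartitionsWithIndex, seeding each partition's RNG from the partition index so both resulting filters see the same draws.]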
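[Editor's note on question 2: subtract should not need an Ordering; it matches elements through their equals/hashCode (internally it cogroups on the element itself as a key). A plausible explanation for the unchanged count of 10000 is that the Vector inside DataPoint compares by reference rather than by contents, so no element of points ever matches an element of trainingSet. A small sketch of that failure mode, using Array[Double] as a stand-in for a field without structural equality:]

```scala
// Case-class equality delegates to each field's equals. Array[Double]
// (like any class that does not override equals/hashCode) compares by
// reference, so two identical-looking points are never equal.
case class RefPoint(x: Array[Double], y: Double)

// A collection type with structural (element-wise) equality fixes this.
case class EqPoint(x: Seq[Double], y: Double)

val broken = RefPoint(Array(1.0, 2.0), 3.0) == RefPoint(Array(1.0, 2.0), 3.0)
val fixed  = EqPoint(Seq(1.0, 2.0), 3.0) == EqPoint(Seq(1.0, 2.0), 3.0)
println(broken)  // false: reference equality on the arrays
println(fixed)   // true: element-wise equality
```

[If that is indeed the cause, overriding equals and hashCode on the Vector class, or storing the coordinates in a structurally-compared type, should make subtract behave as expected.]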
