I do not believe the order of points in a distributed RDD is guaranteed
in any way. For a simple test, you can add a last column that holds a
row id (make it a double and put it in the feature vector). Printing the
RDD back will not necessarily give you the points in file order, and the
id tells you which input line each point came from. If you don't want to
go that far, examine the full feature vector carefully: if the points
are simply out of order, points 12 and 14 should differ from your input
csv in the feature vector as well as the label.
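
To make the id-column check concrete, here is a minimal sketch in plain Python. It does not use Spark: a seeded shuffle stands in for the arbitrary order an RDD may come back in, and the names parse_with_id and restore_file_order are hypothetical helpers made up for this illustration, not MLlib functions. The rows are the four sample lines from your message.

```python
import random

# Sample rows in the "label,f1,f2,..." layout from the quoted message.
CSV_LINES = [
    "1,5.1,3.5,1.4,0.2",
    "1,4.9,3,1.4,0.2",
    "1,4.7,3.2,1.3,0.2",
    "1,4.6,3.1,1.5,0.2",
]


def parse_with_id(row_id, line):
    """Parse one CSV line into (label, features), appending row_id as the last feature.

    Hypothetical helper for this sketch; not part of MLlib.
    """
    values = [float(v) for v in line.split(",")]
    label, features = values[0], values[1:]
    return label, features + [float(row_id)]


def restore_file_order(points):
    """Sort points by the trailing id feature to recover the original file order."""
    return sorted(points, key=lambda p: p[1][-1])


# Tag every row with its line number before anything can reorder it.
tagged = [parse_with_id(i, line) for i, line in enumerate(CSV_LINES)]

# Assumption: a seeded shuffle stands in for a distributed collection
# that returns its elements in arbitrary order.
scrambled = tagged[:]
random.Random(0).shuffle(scrambled)

# Compare each point with the file line that its id points to.
mismatches = []
for label, features in scrambled:
    row_id = int(features[-1])
    expected = [float(v) for v in CSV_LINES[row_id].split(",")]
    if [label] + features[:-1] != expected:
        mismatches.append(row_id)

print("mismatched rows:", mismatches)
```

If the mismatch list is empty, the labels were read correctly and only the order changed. The id column is for debugging only and should be dropped again before training, since it would otherwise be treated as a real feature.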
On Thu, Jul 10, 2014 at 6:28 PM, SK skrishna...@gmail.com wrote:
Hi,
I have a csv data file, which I have organized in the following format to
be read as a LabeledPoint (following the example in
mllib/data/sample_tree_data.csv):
1,5.1,3.5,1.4,0.2
1,4.9,3,1.4,0.2
1,4.7,3.2,1.3,0.2
1,4.6,3.1,1.5,0.2
The first column is the binary label (1 or 0) and the remaining columns are
features. I am using the Logistic Regression classifier in MLlib to create a
model based on the training data and predict the (binary) class of the test
data. I use MLUtils.loadLabeledData to read the data file. My prediction
accuracy is quite low (compared to the results I got for the same data from
R), so I tried to debug by first verifying that the LabeledData is being
read correctly.
I find that some of the labels are not read correctly. For example, the
first 40 points of the training data have a class of 1, whereas the training
data read by loadLabeledData has label 0 for point 12 and point 14. I would
like to know if this is because of the distributed algorithm that MLlib uses
or if there is something wrong with the format I have above.
thanks
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/incorrect-labels-being-read-by-MLUtils-loadLabeledData-tp9356.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.