Hi,

I have a CSV data file, which I have organized in the following format to
be read as LabeledPoints (following the example in
mllib/data/sample_tree_data.csv):

1,5.1,3.5,1.4,0.2
1,4.9,3,1.4,0.2
1,4.7,3.2,1.3,0.2
1,4.6,3.1,1.5,0.2

The first column is the binary label (1 or 0) and the remaining columns are
features. I am using the logistic regression classifier in MLlib to build a
model from the training data and predict the (binary) class of the test
data. I use MLUtils.loadLabeledData to read the data file. My prediction
accuracy is quite low (compared to the results I got for the same data from
R), so I tried to debug by first verifying that the labeled data is being
read correctly.
I find that some of the labels are not read correctly. For example, the
first 40 points of the training data have a class of 1, whereas the training
data read by loadLabeledData has label 0 for point 12 and point 14. I would
like to know if this is because of the distributed algorithm that MLlib uses
or if there is something wrong with the format I have above.
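
For reference, a minimal sketch of the kind of label check described above,
written in plain Python and independent of Spark. It parses the
comma-separated format shown earlier by hand and reports which of the first
N lines carry an unexpected label. The function names parse_labeled_line and
find_label_mismatches are hypothetical helpers for illustration only; they
are not part of MLlib and do not reproduce what loadLabeledData does
internally.

```python
from typing import List, Tuple


def parse_labeled_line(line: str) -> Tuple[float, List[float]]:
    """Parse 'label,f1,f2,...' into (label, features).

    Assumes the first comma-separated field is the label and every
    remaining field is a numeric feature, as in the sample rows above.
    """
    parts = [p.strip() for p in line.split(",")]
    label = float(parts[0])
    features = [float(p) for p in parts[1:]]
    return label, features


def find_label_mismatches(lines: List[str], expected_label: float,
                          limit: int) -> List[int]:
    """Return 1-based positions, within the first `limit` lines,
    whose parsed label differs from `expected_label`."""
    mismatches = []
    for position, line in enumerate(lines[:limit], start=1):
        label, _ = parse_labeled_line(line)
        if label != expected_label:
            mismatches.append(position)
    return mismatches


# The four sample rows from the message, all labeled 1.
sample = [
    "1,5.1,3.5,1.4,0.2",
    "1,4.9,3,1.4,0.2",
    "1,4.7,3.2,1.3,0.2",
    "1,4.6,3.1,1.5,0.2",
]

# An empty list means every checked line has the expected label.
print(find_label_mismatches(sample, expected_label=1.0, limit=len(sample)))
```

Running the same check on the raw file (with limit=40) and on the points
returned by the loader would show whether the unexpected labels are already
present in the file or only appear after loading.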

Thanks