Hi, I have a CSV data file, which I have organized in the following format to be read as LabeledPoint data (following the example in mllib/data/sample_tree_data.csv):
1,5.1,3.5,1.4,0.2
1,4.9,3,1.4,0.2
1,4.7,3.2,1.3,0.2
1,4.6,3.1,1.5,0.2

The first column is the binary label (1 or 0) and the remaining columns are features. I am using the logistic regression classifier in MLlib to build a model from the training data and predict the (binary) class of the test data. I use MLUtils.loadLabeledData to read the data file.

My prediction accuracy is quite low compared to the results I got for the same data in R, so I tried to debug by first verifying that the labeled data is being read correctly. I find that some of the labels are not read correctly. For example, the first 40 points of the training data all have class 1, whereas the training data read by loadLabeledData has label 0 for point 12 and point 14.

I would like to know whether this is caused by the distributed algorithm that MLlib uses, or whether there is something wrong with the format shown above.

Thanks
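
For reference, here is a minimal sketch (plain Python, no Spark) of the kind of check that can confirm what the labels in the file should be, independently of the loader. It assumes every line is fully comma-separated with the label in the first column, as in the sample rows above. `parse_line` and `label_mismatches` are hypothetical helper names used only for illustration; they are not MLlib functions.

```python
from typing import Iterable, List, Tuple


def parse_line(line: str) -> Tuple[float, List[float]]:
    """Split one comma-separated row into (label, features).

    Assumes the label is the first field and all fields are numeric.
    """
    fields = [float(tok) for tok in line.strip().split(",")]
    return fields[0], fields[1:]


def label_mismatches(lines: Iterable[str], expected_label: float) -> List[int]:
    """Return the 1-based positions of rows whose label differs from expected_label."""
    return [
        i
        for i, line in enumerate(lines, start=1)
        if line.strip() and parse_line(line)[0] != expected_label
    ]


# The four sample rows from the message, all of which should have label 1.
sample = [
    "1,5.1,3.5,1.4,0.2",
    "1,4.9,3,1.4,0.2",
    "1,4.7,3.2,1.3,0.2",
    "1,4.6,3.1,1.5,0.2",
]

parsed = [parse_line(row) for row in sample]
mismatches = label_mismatches(sample, expected_label=1.0)

print(parsed[0])   # label and features of the first row
print(mismatches)  # positions of rows whose label is not 1
```

Comparing this output row by row with the labels in the loaded data should show whether the discrepancy comes from how the file is parsed or from somewhere else.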