incorrect labels being read by MLUtils.loadLabeledData()

2014-07-10 Thread SK

Hi,

I have a CSV data file, which I have organized in the following format to
be read as a LabeledPoint (following the example in
mllib/data/sample_tree_data.csv):

1,5.1,3.5,1.4,0.2
1,4.9,3,1.4,0.2
1,4.7,3.2,1.3,0.2
1,4.6,3.1,1.5,0.2

The first column is the binary label (1 or 0) and the remaining columns are
features. I am using the Logistic Regression classifier in MLlib to create a
model based on the training data and predict the (binary) class of the test
data. I use MLUtils.loadLabeledData to read the data file. My prediction
accuracy is quite low (compared to the results I got for the same data from
R), so I tried to debug by first verifying that the labeled data is being
read correctly.
I find that some of the labels are not read correctly. For example, the
first 40 points of the training data have a class of 1, whereas the training
data read by loadLabeledData has label 0 for point 12 and point 14. I would
like to know if this is because of the distributed algorithm that MLlib uses
or if there is something wrong with the format I have above.

Thanks

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/incorrect-labels-being-read-by-MLUtils-loadLabeledData-tp9356.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: incorrect labels being read by MLUtils.loadLabeledData()

2014-07-10 Thread Yana Kadiyska

I do not believe the order of points in a distributed RDD is in any
way guaranteed. For a simple test, you can always add a last column
which is an id (make it a double and throw it in the feature vector).
Printing the RDD back will not necessarily give you the points in file
order. If you don't want to go that far, you can always examine the full
feature vector carefully -- if this is only a reordering, points 12 and 14
should differ from your input csv in the feature vector as well as the label.
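
To make the id-column test concrete, here is a rough sketch in plain
Python (no Spark involved). It is only an illustration of the idea, not
how MLUtils works internally. The helper names parse_line, with_row_id
and mismatched_ids are made up for this example, the last sample row
(label 0) is invented so that both classes appear, and the shuffle just
stands in for rows coming back in an arbitrary order.

```python
import random


def parse_line(line):
    """Split 'label,f1,f2,...' into (label, features) as floats."""
    values = [float(v) for v in line.strip().split(",")]
    return values[0], values[1:]


def with_row_id(lines):
    """Append each row's 0-based file position to its features as a float id."""
    rows = []
    for i, line in enumerate(lines):
        label, features = parse_line(line)
        rows.append((label, features + [float(i)]))
    return rows


def mismatched_ids(original_lines, loaded_rows):
    """Return the ids whose label or features differ from the source line."""
    expected = {i: parse_line(l) for i, l in enumerate(original_lines)}
    bad = []
    for label, features in loaded_rows:
        row_id = int(features[-1])  # the id travels with the row
        if (label, features[:-1]) != expected[row_id]:
            bad.append(row_id)
    return sorted(bad)


lines = [
    "1,5.1,3.5,1.4,0.2",
    "1,4.9,3,1.4,0.2",
    "1,4.7,3.2,1.3,0.2",
    "1,4.6,3.1,1.5,0.2",
    "0,7.0,3.2,4.7,1.4",  # invented row, not from the original file
]

loaded = with_row_id(lines)
random.Random(0).shuffle(loaded)  # stand-in for an arbitrary return order

# Compare by id, not by position in the returned collection.
print(mismatched_ids(lines, loaded))
```

If the list printed at the end is empty, every row still carries its own
label and features, and any "wrong label at point 12" seen by position
is just a reordering. A non-empty list would point at a real parsing
problem for those ids.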

