I can reproduce this issue, so it looks like a bug in Random Forest. I will try to find some clues.
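
For anyone else trying to narrow this down, here is a minimal sketch of a first sanity check, in plain Python rather than PySpark. It assumes the true labels and the predictions have already been collected into ordinary lists; the names `labels`, `preds`, `confusion_counts` and `diagnose` are made up for illustration, and the thresholds in `diagnose` are only a rough heuristic. The idea is that flipped labels should show up as mostly off-diagonal counts, while a model that always predicts the same class leaves one predicted column empty.

```python
from collections import Counter


def confusion_counts(labels, preds):
    """Count (true_label, predicted_label) pairs for a binary classifier."""
    counts = Counter((int(y), int(p)) for y, p in zip(labels, preds))
    # Always return all four cells, including the empty ones.
    return {(y, p): counts.get((y, p), 0) for y in (0, 1) for p in (0, 1)}


def diagnose(counts):
    """Rough heuristic reading of the confusion counts (illustrative only)."""
    predicted_0 = counts[(0, 0)] + counts[(1, 0)]
    predicted_1 = counts[(0, 1)] + counts[(1, 1)]
    if predicted_0 == 0 or predicted_1 == 0:
        return "single-class"      # the model always predicts one label
    correct = counts[(0, 0)] + counts[(1, 1)]
    wrong = counts[(0, 1)] + counts[(1, 0)]
    if wrong > correct:
        return "possibly-flipped"  # more wrong than right: check the labels
    return "plausible"


# Toy data only, not the tweets from the thread.
labels = [1, 1, 1, 0, 0, 0, 0]
preds = [0.0] * 7
counts = confusion_counts(labels, preds)
print(counts, diagnose(counts))
```

Running the same check on the output of each classifier should make it easier to see whether RF behaves differently in kind or only in degree.
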

2015-08-05 1:34 GMT+08:00 Patrick Lam <pkph...@gmail.com>:

> Yes, I rechecked and the label is correct. As you can see in the code
> posted, I used the exact same features for all the classifiers, so unless RF
> somehow switches the labels, it should be correct.
>
> I have posted a sample dataset and sample code to reproduce what I'm
> getting here:
>
> https://github.com/pkphlam/spark_rfpredict
>
> On Tue, Aug 4, 2015 at 6:42 AM, Yanbo Liang <yblia...@gmail.com> wrote:
>
>> It looks like the predicted result is just the opposite of what is
>> expected, so could you check whether the label is right?
>> Or could you share a few data samples that can help to reproduce this
>> output?
>>
>> 2015-08-03 19:36 GMT+08:00 Barak Gitsis <bar...@similarweb.com>:
>>
>>> Hi,
>>> I've run into some poor RF behavior, although not as pronounced as yours.
>>> It would be great to get more insight into this one.
>>>
>>> Thanks!
>>>
>>> On Mon, Aug 3, 2015 at 8:21 AM pkphlam <pkph...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> This might be a long shot, but has anybody run into very poor predictive
>>>> performance using RandomForest with MLlib? Here is what I'm doing:
>>>>
>>>> - Spark 1.4.1 with PySpark
>>>> - Python 3.4.2
>>>> - ~30,000 Tweets of text
>>>> - 12289 1s and 15956 0s
>>>> - Whitespace tokenization and then hashing trick for feature selection
>>>>   using 10,000 features
>>>> - Run RF with 100 trees and maxDepth of 4 and then predict using the
>>>>   features from all the 1s observations.
>>>>
>>>> So in theory, I should get predictions of close to 12289 1s (especially
>>>> if the model overfits). But I'm getting exactly 0 1s, which sounds
>>>> ludicrous to me and makes me suspect something is wrong with my code or
>>>> I'm missing something. I notice similar behavior (although not as
>>>> extreme) if I play around with the settings. But I'm getting normal
>>>> behavior with other classifiers, so I don't think it's my setup that's
>>>> the problem.
>>>>
>>>> For example:
>>>>
>>>> >>> lrm = LogisticRegressionWithSGD.train(lp, iterations=10)
>>>> >>> logit_predict = lrm.predict(predict_feat)
>>>> >>> logit_predict.sum()
>>>> 9077
>>>>
>>>> >>> nb = NaiveBayes.train(lp)
>>>> >>> nb_predict = nb.predict(predict_feat)
>>>> >>> nb_predict.sum()
>>>> 10287.0
>>>>
>>>> >>> rf = RandomForest.trainClassifier(lp, numClasses=2,
>>>> ...     categoricalFeaturesInfo={}, numTrees=100, seed=422)
>>>> >>> rf_predict = rf.predict(predict_feat)
>>>> >>> rf_predict.sum()
>>>> 0.0
>>>>
>>>> This code was all run back to back, so I didn't change anything in
>>>> between. Does anybody have a possible explanation for this?
>>>>
>>>> Thanks!
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Extremely-poor-predictive-performance-with-RF-in-mllib-tp24112.html
>>>>
>>> --
>>> *-Barak*
>>>
>
> --
> Patrick Lam
> Institute for Quantitative Social Science, Harvard University
> http://www.patricklam.org
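
For readers unfamiliar with the feature setup described in the original post (whitespace tokenization followed by the hashing trick with 10,000 features), a minimal plain-Python sketch of the idea follows. It is only an illustration: `hash_features` is a hypothetical helper, and `zlib.crc32` is used here as a stable stand-in hash, which is not necessarily what MLlib's own hashing does internally.

```python
import zlib


def hash_features(text, num_features=10000):
    """Map whitespace-separated tokens to a sparse {index: count} vector."""
    vec = {}
    for token in text.split():
        # crc32 is deterministic across runs; it is an assumption of this
        # sketch, not the hash function used by any particular library.
        idx = zlib.crc32(token.encode("utf-8")) % num_features
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec


# Toy input only, not data from the thread.
tweet = "spark rf predicts zero ones zero"
features = hash_features(tweet)
print(features)
```

Each tweet touches only a handful of the 10,000 indices, so the resulting vectors are very sparse, which is worth keeping in mind when comparing how different classifiers use them.
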