I can reproduce this issue, so it looks like a bug in Random Forest. I will
try to find some clues.

2015-08-05 1:34 GMT+08:00 Patrick Lam <pkph...@gmail.com>:

> Yes, I rechecked and the label is correct. As you can see in the code
> posted, I used the exact same features for all the classifiers so unless rf
> somehow switches the labels, it should be correct.
>
> I have posted a sample dataset and sample code to reproduce what I'm
> getting here:
>
> https://github.com/pkphlam/spark_rfpredict
>
> On Tue, Aug 4, 2015 at 6:42 AM, Yanbo Liang <yblia...@gmail.com> wrote:
>
>> It looks like the predicted result is just the opposite of what's
>> expected, so could you check whether the labels are right?
>> Or could you share some of the data so we can reproduce this output?
>>
>> 2015-08-03 19:36 GMT+08:00 Barak Gitsis <bar...@similarweb.com>:
>>
>>> Hi,
>>> I've run into some poor RF behavior too, although not as pronounced as
>>> yours. It would be great to get more insight into this one.
>>>
>>> Thanks!
>>>
>>> On Mon, Aug 3, 2015 at 8:21 AM pkphlam <pkph...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> This might be a long shot, but has anybody run into very poor predictive
>>>> performance using RandomForest with MLlib? Here is what I'm doing:
>>>>
>>>> - Spark 1.4.1 with PySpark
>>>> - Python 3.4.2
>>>> - ~30,000 Tweets of text
>>>> - 12289 1s and 15956 0s
>>>> - Whitespace tokenization and then the hashing trick for feature
>>>> extraction, using 10,000 features
>>>> - Run RF with 100 trees and maxDepth of 4 and then predict using the
>>>> features from all the 1s observations.
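>>>> The tokenize-and-hash step above can be sketched without Spark. MLlib's
>>>> HashingTF does essentially this: hash each token into one of N buckets
>>>> and count occurrences (MLlib returns sparse vectors; the dense version
>>>> here is just for illustration, and the names are made up):

```python
def hash_features(text, num_features=10000):
    """Whitespace tokenization followed by the hashing trick:
    each token is hashed into one of num_features buckets and
    the bucket counts form the feature vector."""
    vec = [0] * num_features
    for token in text.split():
        idx = hash(token) % num_features  # bucket index for this token
        vec[idx] += 1
    return vec

features = hash_features("spark random forest random")
```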
>>>>
>>>> So in theory, I should get predictions of close to 12289 1s (especially
>>>> if
>>>> the model overfits). But I'm getting exactly 0 1s, which sounds
>>>> ludicrous to
>>>> me and makes me suspect something is wrong with my code or I'm missing
>>>> something. I notice similar behavior (although not as extreme) if I play
>>>> around with the settings. But I'm getting normal behavior with other
>>>> classifiers, so I don't think it's my setup that's the problem.
>>>>
>>>> For example:
>>>>
>>>> >>> lrm = LogisticRegressionWithSGD.train(lp, iterations=10)
>>>> >>> logit_predict = lrm.predict(predict_feat)
>>>> >>> logit_predict.sum()
>>>> 9077
>>>>
>>>> >>> nb = NaiveBayes.train(lp)
>>>> >>> nb_predict = nb.predict(predict_feat)
>>>> >>> nb_predict.sum()
>>>> 10287.0
>>>>
>>>> >>> rf = RandomForest.trainClassifier(lp, numClasses=2,
>>>> >>> categoricalFeaturesInfo={}, numTrees=100, seed=422)
>>>> >>> rf_predict = rf.predict(predict_feat)
>>>> >>> rf_predict.sum()
>>>> 0.0
>>>>
>>>> This code was all run back to back so I didn't change anything in
>>>> between.
>>>> Does anybody have a possible explanation for this?
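>>>> One hedged guess (not verified against the MLlib source here): for
>>>> classification, RandomForestModel.predict takes a majority vote over
>>>> the trees, so if every shallow tree (maxDepth 4 over 10,000 hashed
>>>> features) falls back to the majority class, the 15956 0s, then the
>>>> vote is 0 for every point and the sum is exactly 0. A toy illustration
>>>> of the voting step, with made-up per-tree outputs:

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Aggregate per-tree class votes the way a random-forest
    classifier typically does: the most common class wins."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical: all 100 shallow trees default to the majority class 0,
# so the forest predicts 0 regardless of the input features.
all_zero = majority_vote([0] * 100)     # -> 0
mixed = majority_vote([1, 1, 0, 1, 0])  # -> 1
```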
>>>>
>>>> Thanks!
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Extremely-poor-predictive-performance-with-RF-in-mllib-tp24112.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>>
>>> --
>>> *-Barak*
>>>
>>
>>
>
>
> --
> Patrick Lam
> Institute for Quantitative Social Science, Harvard University
> http://www.patricklam.org
>
