Class imbalance can be an issue for some algorithms, but decision forests should in general cope reasonably well with imbalanced classes. By default, however, positive and negative classes are treated 'equally', and that may not reflect reality in some cases. Upsampling the under-represented class is a crude but effective way to counter this.
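To make the upsampling idea concrete, here is a minimal sketch in plain Python rather than Spark. The helper name `upsample_minority`, the `(feature, label)` row layout and the toy 97:3 data are all assumptions made up for illustration. It resamples the smaller class with replacement until both classes are the same size, and it is meant to be applied to the training split only.

```python
import random

def upsample_minority(rows, label_of, seed=0):
    """Hypothetical helper: resample the smaller class with replacement
    until both classes have the same number of rows."""
    rng = random.Random(seed)
    pos = [r for r in rows if label_of(r) == 1]
    neg = [r for r in rows if label_of(r) == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        # Only one class present: nothing to balance against.
        return list(rows)
    # Draw extra minority rows (with replacement) to close the gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced

# Toy training set with a 97:3 split: (feature, label) pairs (made-up data).
train = [(i, 1 if i < 30 else 0) for i in range(1000)]
balanced = upsample_minority(train, label_of=lambda r: r[1])
n_pos = sum(1 for _, y in balanced if y == 1)
print(len(balanced), n_pos)
```

Every original row is kept and only duplicates of minority rows are added, so the test split should stay untouched to keep the evaluation honest.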
The model depends on the training data, and therefore on its class distribution, and the ROC curve in turn depends on the model and the data it is evaluated on. There is no inherent relationship between the class balance and the ROC curve, though. AUC for a random-guessing classifier should be ~0.5 whatever the class ratio; 0.8 is generally good. I could believe that this doesn't change much just because you changed parameters or representation.

This isn't really a Spark question per se, so you might get some other answers on the Data Science or Stats StackExchange.

On Mon, Aug 15, 2016 at 5:11 AM, Zhiliang Zhu <zchl.j...@yahoo.com.invalid> wrote:
> Hi All,
>
> Here I have a lot of data with around 1,000,000 rows; 97% of them are negative
> class and 3% of them are positive class.
> I applied the Random Forest algorithm to build the model and predict the testing
> data.
>
> For the data preparation,
> i. first randomly split all the data into training data and testing data by
> 0.7 : 0.3
> ii. leave the testing data unchanged, so its negative and positive class ratio
> is still 97:3
> iii. try to make the training data negative and positive class ratio
> 50:50, by sampling within the different classes
> iv. train the RF model on the training data and predict the testing data
>
> When modifying algorithm parameters and feature work (PCA etc.), it seems that
> the AUC on the testing data is always above 0.8, or much higher ...
>
> Then I got somewhat confused... It seems that the model or AUC depends a
> lot on the original data distribution...
> In effect, I would like to know, for this data distribution, what the AUC
> would be for a random guess?
> What would the AUC be for any kind of data distribution?
>
> Thanks in advance~~
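
On the random-guess question, a quick simulation can serve as a sanity check. This is a minimal sketch in plain Python, not Spark code; the function names `auc` and `random_guess_auc` and the sample size are assumptions for illustration. It scores every row with a uniform random number and computes AUC from ranks, once at a 3% positive rate and once at 50%.

```python
import random

def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random
    negative (ties count half), computed from average ranks."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Find the run of tied scores starting at position i.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1..j
        rank_sum_pos += avg_rank * sum(1 for k in range(i, j) if pairs[k][1] == 1)
        i = j
    n_pos = sum(labels)
    n_neg = n - n_pos
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

def random_guess_auc(pos_rate, n=20000, seed=0):
    """Simulate labels at the given positive rate and score them at random."""
    rng = random.Random(seed)
    labels = [1 if rng.random() < pos_rate else 0 for _ in range(n)]
    scores = [rng.random() for _ in range(n)]  # scores ignore the labels
    return auc(labels, scores)

for rate in (0.03, 0.5):
    print(rate, round(random_guess_auc(rate), 3))
```

Both runs should land close to 0.5, up to sampling noise, which is the sense in which the random-guess baseline does not depend on the class ratio.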