I tried GBDTs with both Python's sklearn and Spark's local stand-alone MLlib implementation, using default settings, on a binary classification problem. I kept numIterations and the loss function the same in both cases. The features are all real-valued and continuous. However, the AUC from the MLlib implementation was far worse than sklearn's. These were the parameters for sklearn's classifier:
```python
GradientBoostingClassifier(
    init=None, learning_rate=0.001, loss='deviance', max_depth=8,
    max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
    min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100,
    random_state=None, subsample=1.0, verbose=0, warm_start=False)
```

I wanted to check whether there is a way to inspect and set these parameters in MLlib, or whether MLlib also assumes the same (fairly standard) defaults. Any pointers on tracking down the difference would be helpful.
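For reference, here is my best guess at how the sklearn settings would map onto `pyspark.mllib.tree.GradientBoostedTrees.trainClassifier`; the parameter names on the MLlib side and the claim that the defaults differ (learningRate 0.1, maxDepth 3) are my assumptions from reading the docs, which is part of what I'd like confirmed:

```python
# Assumed mapping from my sklearn settings to MLlib trainClassifier kwargs.
sklearn_params = {"loss": "deviance", "learning_rate": 0.001,
                  "n_estimators": 100, "max_depth": 8}

mllib_kwargs = {
    "loss": "logLoss",       # sklearn 'deviance' ~ MLlib log loss (assumption)
    "learningRate": 0.001,   # MLlib default appears to be 0.1, so set explicitly
    "numIterations": 100,    # corresponds to sklearn's n_estimators
    "maxDepth": 8,           # MLlib default appears to be 3, so set explicitly
}

# Intended call (needs a SparkContext and an RDD of LabeledPoint):
# from pyspark.mllib.tree import GradientBoostedTrees
# model = GradientBoostedTrees.trainClassifier(
#     training_rdd, categoricalFeaturesInfo={}, **mllib_kwargs)
```

If the defaults really do differ like this, leaving MLlib at its defaults while sklearn uses learning_rate=0.001 and max_depth=8 could plausibly explain part of the AUC gap, but I haven't verified that.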