I tried GBDTs with both Python's sklearn and Spark's local stand-alone MLlib implementation, using default settings, on a binary classification problem. I kept numIterations and the loss function the same in both cases. The features are all real-valued and continuous. However, the AUC from the MLlib implementation was far worse than sklearn's. These were the parameters for sklearn's classifier:
```python
GradientBoostingClassifier(
    init=None, learning_rate=0.001, loss='deviance', max_depth=8,
    max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
    min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100,
    random_state=None, subsample=1.0, verbose=0, warm_start=False)
```

I wanted to check whether there is a way to inspect and set these parameters in MLlib, or whether MLlib also assumes the same (fairly standard) defaults. Any pointers on tracking down the difference would be helpful.
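For reference, here is my best guess at how the sklearn settings would map onto `pyspark.mllib.tree.GradientBoostedTrees.trainClassifier`; the parameter names on the MLlib side and the claim that the defaults differ (learningRate 0.1, maxDepth 3) are my assumptions from reading the docs, which is part of what I'd like confirmed:

```python
# Assumed mapping from my sklearn settings to MLlib trainClassifier kwargs.
sklearn_params = {"loss": "deviance", "learning_rate": 0.001,
                  "n_estimators": 100, "max_depth": 8}

mllib_kwargs = {
    "loss": "logLoss",       # sklearn 'deviance' ~ MLlib log loss (assumption)
    "learningRate": 0.001,   # MLlib default appears to be 0.1, so set explicitly
    "numIterations": 100,    # corresponds to sklearn's n_estimators
    "maxDepth": 8,           # MLlib default appears to be 3, so set explicitly
}

# Intended call (needs a SparkContext and an RDD of LabeledPoint):
# from pyspark.mllib.tree import GradientBoostedTrees
# model = GradientBoostedTrees.trainClassifier(
#     training_rdd, categoricalFeaturesInfo={}, **mllib_kwargs)
```

If the defaults really do differ like this, leaving MLlib at its defaults while sklearn uses learning_rate=0.001 and max_depth=8 could plausibly explain part of the AUC gap, but I haven't verified that.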