I'm running Spark v1.3.1, and when I run the following against my dataset:

    model = GradientBoostedTrees.trainRegressor(trainingData,
        categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3)
The job fails with the following traceback:

    Traceback (most recent call last):
      File "/Users/drake/fd/spark/mltest.py", line 73, in <module>
        model = GradientBoostedTrees.trainRegressor(trainingData, categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3)
      File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/tree.py", line 553, in trainRegressor
        loss, numIterations, learningRate, maxDepth)
      File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/tree.py", line 438, in _train
        loss, numIterations, learningRate, maxDepth)
      File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/common.py", line 120, in callMLlibFunc
        return callJavaFunc(sc, api, *args)
      File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/common.py", line 113, in callJavaFunc
        return _java2py(sc, func(*args))
      File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
      File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
    15/05/20 16:40:12 INFO BlockManager: Removing block rdd_32_95
    py4j.protocol.Py4JJavaError: An error occurred while calling o69.trainGradientBoostedTreesModel.
    : java.lang.IllegalArgumentException: requirement failed: DecisionTree requires maxBins (= 32) >= max categories in categorical features (= 1895)
        at scala.Predef$.require(Predef.scala:233)
        at org.apache.spark.mllib.tree.impl.DecisionTreeMetadata$.buildMetadata(DecisionTreeMetadata.scala:128)
        at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:138)
        at org.apache.spark.mllib.tree.DecisionTree.run(DecisionTree.scala:60)
        at org.apache.spark.mllib.tree.GradientBoostedTrees$.org$apache$spark$mllib$tree$GradientBoostedTrees$$boost(GradientBoostedTrees.scala:150)
        at org.apache.spark.mllib.tree.GradientBoostedTrees.run(GradientBoostedTrees.scala:63)
        at org.apache.spark.mllib.tree.GradientBoostedTrees$.train(GradientBoostedTrees.scala:96)
        at org.apache.spark.mllib.api.python.PythonMLLibAPI.trainGradientBoostedTreesModel(PythonMLLibAPI.scala:595)

So it's complaining about maxBins. If I supply maxBins=1900 and re-run:

    model = GradientBoostedTrees.trainRegressor(trainingData,
        categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3,
        maxBins=1900)

I get:

    Traceback (most recent call last):
      File "/Users/drake/fd/spark/mltest.py", line 73, in <module>
        model = GradientBoostedTrees.trainRegressor(trainingData, categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3, maxBins=1900)
    TypeError: trainRegressor() got an unexpected keyword argument 'maxBins'

Now it claims to know nothing of maxBins. The same call against DecisionTree or RandomForest (with maxBins=1900) works just fine, so this looks like a bug in the Python API for GradientBoostedTrees.

Suggestions?

-Don

--
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
800-733-2143
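P.S. One workaround I'm looking at until maxBins is exposed here: shrink the cardinality of the offending categorical feature(s) so the default maxBins of 32 is enough, e.g. by hashing category ids down into fewer buckets. A rough sketch (the dict values and bucket count below are made-up placeholders, and hashing merges colliding categories, so it trades some signal for trainability):

```python
# Default maxBins in MLlib's tree trainers, per the error message above.
DEFAULT_MAX_BINS = 32

# categoricalFeaturesInfo maps feature index -> number of categories.
# Placeholder values for illustration; the real dict comes from my data.
catFeatures = {0: 4, 3: 1895, 7: 12}

# Find the features that trip the "maxBins >= max categories" check.
offenders = {i: n for i, n in catFeatures.items() if n > DEFAULT_MAX_BINS}
print(offenders)  # -> {3: 1895}

def bucketize(value, n_buckets=DEFAULT_MAX_BINS):
    """Collapse an original category id into one of n_buckets ids."""
    return int(value) % n_buckets

# Applied to the training RDD it would look roughly like:
# from pyspark.mllib.regression import LabeledPoint
# trainingData = trainingData.map(lambda lp: LabeledPoint(
#     lp.label,
#     [bucketize(v) if j in offenders else v
#      for j, v in enumerate(lp.features)]))
# for i in offenders:
#     catFeatures[i] = DEFAULT_MAX_BINS
```

Not ideal, since distinct categories get merged, but it should at least let the GBT job run with the default bins.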