One more comment: 1895 is a lot of categories for a single feature. If it makes sense for your data, training will run faster if you either group the categories into fewer, coarser ones or split the 1895 categories across a few features that each have fewer categories.
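To make the grouping idea concrete, here is a minimal sketch in plain Python (no Spark needed). It keeps the most frequent categories and folds everything else into one shared "other" bucket, so the feature's arity stays under whatever maxBins you can use. The helper names (build_category_index, encode) and the toy data are made up for illustration; whether lumping rare categories together is acceptable depends on your data.

```python
from collections import Counter


def build_category_index(values, max_categories):
    """Map the most frequent categories to 0..max_categories-2.

    Every other category shares one extra "other" index, so the encoded
    feature never has more than max_categories distinct values.
    (Hypothetical helper, not part of MLlib.)
    """
    if max_categories < 2:
        raise ValueError("max_categories must be at least 2")
    counts = Counter(values)
    # Keep the (max_categories - 1) most frequent categories.
    kept = [cat for cat, _ in counts.most_common(max_categories - 1)]
    index = {cat: i for i, cat in enumerate(kept)}
    other = len(index)  # shared index for all remaining categories
    return index, other


def encode(value, index, other):
    """Return the grouped category index for one raw value."""
    return index.get(value, other)


# Toy data: five distinct categories squeezed into at most three.
values = ["a"] * 5 + ["b"] * 3 + ["c", "d", "e"]
index, other = build_category_index(values, max_categories=3)
encoded = [encode(v, index, other) for v in values]

# Arity to declare for this feature in categoricalFeaturesInfo.
arity = other + 1
```

With real data you would pass something like max_categories=32 (the default maxBins shown in the error below) and put the resulting arity into categoricalFeaturesInfo for that feature.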
On Wed, May 20, 2015 at 3:17 PM, Burak Yavuz <brk...@gmail.com> wrote:

> Could you please open a JIRA for it? The maxBins input is missing for the
> Python Api.
>
> Is it possible if you can use the current master? In the current master,
> you should be able to use trees with the Pipeline Api and DataFrames.
>
> Best,
> Burak
>
> On Wed, May 20, 2015 at 2:44 PM, Don Drake <dondr...@gmail.com> wrote:
>
>> I'm running Spark v1.3.1 and when I run the following against my dataset:
>>
>> model = GradientBoostedTrees.trainRegressor(trainingData,
>>     categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3)
>>
>> The job will fail with the following message:
>>
>> Traceback (most recent call last):
>>   File "/Users/drake/fd/spark/mltest.py", line 73, in <module>
>>     model = GradientBoostedTrees.trainRegressor(trainingData,
>>     categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3)
>>   File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/tree.py", line 553, in trainRegressor
>>     loss, numIterations, learningRate, maxDepth)
>>   File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/tree.py", line 438, in _train
>>     loss, numIterations, learningRate, maxDepth)
>>   File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/common.py", line 120, in callMLlibFunc
>>     return callJavaFunc(sc, api, *args)
>>   File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/common.py", line 113, in callJavaFunc
>>     return _java2py(sc, func(*args))
>>   File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
>>   File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
>> 15/05/20 16:40:12 INFO BlockManager: Removing block rdd_32_95
>> py4j.protocol.Py4JJavaError: An error occurred while calling o69.trainGradientBoostedTreesModel.
>> : java.lang.IllegalArgumentException: requirement failed: DecisionTree requires maxBins (= 32) >= max categories in categorical features (= 1895)
>>   at scala.Predef$.require(Predef.scala:233)
>>   at org.apache.spark.mllib.tree.impl.DecisionTreeMetadata$.buildMetadata(DecisionTreeMetadata.scala:128)
>>   at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:138)
>>   at org.apache.spark.mllib.tree.DecisionTree.run(DecisionTree.scala:60)
>>   at org.apache.spark.mllib.tree.GradientBoostedTrees$.org$apache$spark$mllib$tree$GradientBoostedTrees$$boost(GradientBoostedTrees.scala:150)
>>   at org.apache.spark.mllib.tree.GradientBoostedTrees.run(GradientBoostedTrees.scala:63)
>>   at org.apache.spark.mllib.tree.GradientBoostedTrees$.train(GradientBoostedTrees.scala:96)
>>   at org.apache.spark.mllib.api.python.PythonMLLibAPI.trainGradientBoostedTreesModel(PythonMLLibAPI.scala:595)
>>
>> So, it's complaining about the maxBins, if I provide maxBins=1900 and
>> re-run it:
>>
>> model = GradientBoostedTrees.trainRegressor(trainingData,
>>     categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3, maxBins=1900)
>>
>> Traceback (most recent call last):
>>   File "/Users/drake/fd/spark/mltest.py", line 73, in <module>
>>     model = GradientBoostedTrees.trainRegressor(trainingData,
>>     categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3, maxBins=1900)
>> TypeError: trainRegressor() got an unexpected keyword argument 'maxBins'
>>
>> It now says it knows nothing of maxBins.
>>
>> If I run the same command against DecisionTree or RandomForest (with
>> maxBins=1900) it works just fine.
>>
>> Seems like a bug in GradientBoostedTrees.
>>
>> Suggestions?
>>
>> -Don
>>
>> --
>> Donald Drake
>> Drake Consulting
>> http://www.drakeconsulting.com/
>> 800-733-2143